Abstract
This paper aims to assist the administration departments of medical organisations in making the right decision when selecting a suitable multiclass classification model for acute leukaemia. We propose a framework that aids these departments in evaluating, benchmarking and ranking available multiclass classification models to select the best one. Medical organisations have continuously faced evaluation and benchmarking challenges in such endeavours, especially when no single model is superior. Moreover, the improper selection of a multiclass classification model for acute leukaemia may be costly for medical organisations; for example, when a patient dies, the organisation may be sued for incidents in which the model failed to deliver its intended outcome. The evaluation and benchmarking of multiclass classification models are challenging processes because of multiple and conflicting evaluation criteria. This study structured a decision matrix (DM) based on the crossover of two groups of evaluation criteria and 22 multiclass classification models. The matrix was then evaluated with a dataset comprising 72 samples of acute leukaemia, which includes 5327 genes. Subsequently, multi-criteria decision-making (MCDM) techniques were used in the benchmarking and ranking of the multiclass classification models. The MCDM techniques used include the integrated best-worst method (BWM) and VIKOR. BWM was applied for the weight calculation of the evaluation criteria, whereas VIKOR was used to benchmark and rank the classification models. VIKOR was also employed in two decision-making contexts: individual and group decision making, with internal and external group aggregation. Results showed the following: (1) the integration of BWM and VIKOR is effective in solving the benchmarking/selection problem of multiclass classification models.
(2) The ranks of the classification models obtained from internal and external VIKOR group decision making were almost the same; based on both, the best multiclass classification model was 'Bayes.NaiveBayesUpdateable' and the worst was 'Trees.LMT'. (3) Significant differences were identified among the group scores in the objective validation, indicating that the ranking results of internal and external VIKOR group decision making were valid.
Introduction
Medical informatics is the intersection of information science, computer science, and health care [1,2,3,4,5,6,7,8,9,10]. This field deals with the resources, devices, and methods required to optimize the acquisition, storage, retrieval, and use of information in health [11,12,13,14,15,16,17,18,19,20,21,22,23,24]. The decisions of the administration departments of medical organisations are critical, particularly decisions regarding the selection of automated solutions for the diagnosis and detection of complex diseases, such as acute leukaemia [25]. The importance of selecting appropriate automated solutions can be attributed to their extensive use [26]. Automated solutions based on artificial intelligence techniques can provide rapid acute leukaemia diagnosis and classification and increase the reliability and accuracy of diagnostic results [26,27,28,29,30,31,32]. Many physicians, cancer treatment centres and hospitals have started using automated models for acute leukaemia classification to address the several potential limitations of manual analysis [26, 29, 30]. However, despite the increasing number of automated classification models, finding models that deliver highly accurate results in a short time and without error remains challenging [33]. Therefore, the administration departments of health organisations have been facing difficulties in evaluating and benchmarking automated classification models for acute leukaemia and determining the best model, especially when no single model is superior [29, 33, 34]. Moreover, evaluating and comparing different classification models is difficult in the presence of multiple evaluation criteria [35, 36]. Given the existence of different classification models for acute leukaemia, the health sector has difficulty deciding which model should be used. 
The processes required for evaluating and benchmarking automated classification models for dangerous medical cases are crucial to identifying the classification model that delivers the best results [27]. These processes are crucial because the selection of an incorrect classification model can lead to the loss of a patient's life, legal accountability and financial costs for health organisations. For example, when a model incorrectly identifies non-cancer cells as cancerous, the surgery and diagnostic tests the patient has to undergo may adversely affect his or her mental health. Conversely, when a model incorrectly identifies cancer cells as non-cancerous, the disease remains untreated, and the patient may die as a result. Both cases negatively affect the reputation and performance of healthcare organisations. Therefore, determining the most efficient technique for selecting a suitable classification model for acute leukaemia is necessary. Given that these models are expensive and directly related to human health, they must be evaluated and benchmarked [35]. The evaluation and benchmarking of multiclass classification procedures for acute leukaemia remain challenging [29]. The tasks involved in evaluating and benchmarking automated models for acute leukaemia are difficult decision-making tasks and require numerous measurements [34]. Two basic groups of criteria are commonly utilised in the evaluation and benchmarking of acute leukaemia multiclass classification models: (1) time complexity and (2) reliability. The reliability group has a set of sub-criteria (TP, TN, FP, FN, ave-accuracy, precisionμ, precisionM, recallM, fscore and error rate) [37, 38]. Snousy et al.
considered accuracy the main requirement for the best classification model and compared nine classification models on the accuracy criterion [33]. Despite the importance of the remaining criteria [39,40,41], several studies [32, 42,43,44,45,46] adopted only the classification accuracy criterion for the evaluation and benchmarking of classification models. However, the quality assessment of acute leukaemia classification models requires additional attention, and other aspects must be considered in the evaluation processes [33]. According to Rawat et al., although accuracy is the most widely used metric, it treats each class with equal importance and neglects the differences among the types of classes [32]. However, in real cases, particularly those related to medicine, the distinction among certain classes is important. In [47,48,49], True Positive, True Negative, False Positive and False Negative rates were used as key criteria for evaluation and benchmarking, but other requirements that might affect classification performance were neglected. In [35], the calculation of time complexity showed that classification can be time consuming, and a high computational cost slows down classification [50]. Misha et al. indicated that the dataset size should be considered in the classification task because a large dataset affects processing time; this condition is known as time complexity [35]. Ludwig et al. stated that in the scope of cancer data analysis, speed and accuracy are the main aspects that must be considered in evaluating the efficiency of classification models [51]. Classification tasks are considered good if they deliver results with low computational time whilst simultaneously improving classification accuracy [52].
In other words, the main requirements that must be considered when developing any acute leukaemia multiclass classification model are (1) time complexity and (2) reliability: reliability should be high, and the time complexity for producing the output should be low [52]. However, these are competing requirements [53]; that is, high reliability cannot be obtained simultaneously with low time complexity. Thus, developers usually focus on either increasing reliability or decreasing time complexity. If a highly reliable multiclass classification model is required, then time must be sacrificed, and vice versa. This trade-off and conflict among the evaluation criteria are reflected in the evaluation and benchmarking process, and benchmarking among multiple criteria under such trade-off and conflict is difficult [54]. Reliability and time complexity should both be measured in the evaluation of any classification model. However, the current approaches for comparing novel and previous models in all the reviewed studies do not focus on the evaluation and benchmarking criteria; they emphasise only one evaluation aspect and neglect the rest because they are not sufficiently flexible to deal with the conflict or trade-off among the various criteria [33]. Conflict and trade-off are the first issue faced in the evaluation and benchmarking of multiclass classification models. The second issue is the importance of each criterion. The evaluation of acute leukaemia multiclass classification models involves a set of criteria, and the importance of each criterion is distinct and depends on the objectives of the developed model. That is, the importance of one evaluation criterion might be boosted in exchange for the low importance of another criterion based on the model objectives [34].
Therefore, trade-off and conflict exist between the evaluation and benchmarking criteria because the importance of each criterion differs across models [55]. The third issue emerges when the benchmarking process is conducted on the basis of multiple simultaneous criteria and sub-criteria [56,57,58]. This approach is considered difficult because of the trade-off among the criteria and their varying importance; moreover, the values of the reliability criteria set depend on the confusion matrix, which contains four parameters: True Positive, False Positive, True Negative and False Negative [47, 59]. These four parameters are prone to value loss in experiments, which affects the values of the remaining criteria in the reliability group. Despite the criticism of these parameters, studies still use them for the evaluation of multiclass classification models [56,57,58, 60]. Furthermore, the current evaluation and benchmarking tools have limitations: they cannot entirely cover the measurements required by multiclass classification models, and they are limited in the overall parameter calculation of the reliability group, in the comparison and matching between additional classification methods, and in ranking the models according to performance [61,62,63]. On the basis of the preceding discussion, the evaluation and benchmarking of multiclass classification models of acute leukaemia is defined as a multi-criteria problem. Therefore, an integrated and comprehensive platform covering all aspects of performance in the evaluation and benchmarking of multiclass classification models for acute leukaemia should be developed. This integrated platform will serve as a tool that supports the decisions of the administrators of medical organisations in evaluating and benchmarking the available alternatives and identifying the best model.
The main objective of the current paper is to propose a framework for evaluating and benchmarking multiclass classification models for acute leukaemia. The remaining parts of this article are divided into seven sections: the 'Related studies' section reviews the related literature. The 'Multi-criteria decision making' section presents the theoretical background of the recommended solution. The 'Methodology' section reports the evaluation and benchmarking framework for multiclass classification models. The results and discussion are reported in the 'Results and discussion' section. The 'Validation' section discusses the validation results for the proposed framework. The 'Limitations and future study' section highlights the limitations of the proposed framework and future studies. The 'Conclusion' section presents the conclusion of the research.
Related studies
The selection of a suitable classification model for acute leukaemia is considered a challenge faced by medical institutions, especially those with specialisation in cancer treatment. The essence of the challenge lies in the capacity of the selected model to allow a precise and immediate acute leukaemia classification.
Previous literature distinctly explained that classification tasks of acute leukaemia differ with respect to the accuracy of the results provided and overall performance. Similarly, no previous classification model has been considered superior [29, 33, 34]. Many studies have discussed the development of automated models for acute leukaemia analysis, the way these models are used and the benefits that health organisations could gain from using them [29, 32, 34, 47, 49, 64,65,66,67,68,69]. However, studies that aim to evaluate and benchmark the available classification models and determine the best one are limited. The existing academic literature on the evaluation and benchmarking of acute leukaemia multiclass classification models is scarce and scattered; some studies are limited to the evaluation and benchmarking of one aspect of performance. In [70], automated microscopy was analysed with the DM96TM: its performance in identifying blood cells was explored, and its accuracy was compared with that of the manual method and the XE-2100TM. Snousy et al. compared nine classification models from the decision tree family in terms of accuracy and examined the experimental effects of different feature selection methods on accuracy [33]. ALL-IDB, a public image dataset of peripheral blood samples from normal people and patients with leukaemia, was proposed in [27]; the dataset, which is specifically designed for comparing and evaluating segmentation and classification algorithms, provides data for supervised classification and segmentation. In [71], three automatic detection approaches for leukaemic cells were compared: the first is based on a support vector machine, the second on a neural network and the third on Gaussian mixture model estimation. The comparison relied on three criteria, namely, accuracy, precision and recall.
In addition, in [39], two classification schemes were compared in terms of segmentation quality and the effect of various segmentations on classification results: the first scheme is based on a support vector machine, whereas the second is based on a random forest. Evaluation and benchmarking methods must be utilised to cover all main requirements and substantively determine the performance and quality of classification models for acute leukaemia. Saritha et al. assured that the automated classification model has high accuracy and efficiency, in addition to reduced processing time and a small error rate; suitable treatment can be provided to patients with the early identification of leukaemia [52]. Despite the substantial effort in the evaluation and benchmarking of acute leukaemia classification tasks, no study has provided an integrated solution that covers the key evaluation criteria for evaluating and benchmarking multiclass classification models and that helps the administrators of medical organisations and various users determine a suitable model. This study attempts to fill this evaluation and benchmarking research gap with respect to acute leukaemia classification tasks.
Multi-criteria decision making (MCDM)
Numerous definitions of MCDM are available in the academic literature. Keeney and Raiffa [72] defined MCDM as an extension of decision theory that covers any decision with multiple objectives. MCDM is used as a methodology to aid in cases such as assessing alternatives on individual, often conflicting criteria that are combined into one overall appraisal [73,74,75,76,77]. Among the other definitions, [78] defined MCDM as an umbrella term describing a collection of formal approaches that take explicit account of multiple criteria to assist individuals or groups in exploring important decisions [79,80,81,82,83,84]. Among the most well-known decision techniques, MCDM is known for its decision-making capabilities, enabling it to address complicated decision problems whilst handling multiple criteria [85, 86]. Furthermore, MCDM provides a systematic method for addressing decision problems on the basis of multiple criteria [86,87,88,89,90], with the goal of helping decision makers deal with this kind of problem [91]. The MCDM procedure often relies on approaches of a quantitative and qualitative nature and frequently concentrates on dealing with multiple and conflicting criteria simultaneously [92, 93]. MCDM can also increase decision quality in more effective and rational ways than traditional processes [94]. Furthermore, MCDM aims to categorise suitable alternatives among a group of available ones, rank the alternatives according to performance in decreasing order [95,96,97,98,99] and, finally, select among these alternatives [100,101,102,103,104,105,106]. Suitable alternatives are scored based on these goals. Any MCDM solution requires essential components, namely, the decision or evaluation matrix together with the decision criteria [107].
A decision matrix must be created from n criteria and m alternatives, where each intersection of an alternative and a criterion is specified as x_ij. Therefore, the matrix (x_ij)_{m×n} is expressed as follows:
where A_1, A_2, …, A_m are the possible alternatives to be ranked by the decision makers (i.e. the classification models); C_1, C_2, …, C_n are the criteria against which the performance of each alternative is evaluated; x_ij is the rating of alternative A_i with respect to criterion C_j; and W_j is the weight of criterion C_j. Special processes must be performed to score the alternatives, including normalisation, maximisation indicators and the addition of weights, depending on the method. For example, D is the decision matrix utilised in scoring the performance of alternative A_i based on criterion C_j. Enhancing the decision-making process is important and is possible by involving decision makers and stakeholders, as well as by using appropriate decision-making methods to handle multi-criteria problems. Healthcare is one of the domains in which MCDM is extensively utilised [93, 108]. Improving decision making in healthcare is possible through a systematic method and by determining the best decision through different MCDM methods [109, 110], especially because many decisions in the healthcare and medical fields are complex and unstructured [108]. Numerous MCDM techniques have been developed, and the most commonly used are the best-worst method (BWM), weighted product method (WPM), hierarchical adaptive weighting (HAW), simple additive weighting (SAW), multiplicative exponential weighting (MEW), weighted sum model (WSM), analytic network process (ANP), analytic hierarchy process (AHP), technique for order of preference by similarity to ideal solution (TOPSIS) and VlseKriterijumska Optimizacija I Kompromisno Resenje (VIKOR), which uses different notations [1, 73, 79,80,81, 108, 110,111,112,113,114,115,116,117,118,119,120,121]. The available MCDM techniques are diverse, and this diversity makes the selection of a suitable technique difficult.
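In standard MCDM notation, the decision matrix D, together with the weight vector W, can be written as:

```latex
D = \begin{array}{c|cccc}
        & C_1    & C_2    & \cdots & C_n    \\ \hline
  A_1   & x_{11} & x_{12} & \cdots & x_{1n} \\
  A_2   & x_{21} & x_{22} & \cdots & x_{2n} \\
  \vdots& \vdots & \vdots & \ddots & \vdots \\
  A_m   & x_{m1} & x_{m2} & \cdots & x_{mn}
\end{array},
\qquad
W = (w_1, w_2, \ldots, w_n), \quad \sum_{j=1}^{n} w_j = 1
```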
Each technique has its own limitations and strengths [81, 109, 112, 122, 123]. Thus, selecting the most suitable MCDM method is important. To the best of our knowledge, none of the analysed methods have been used to rank multiclass classification models for acute leukaemia. In our previous work [87], we found that BWM and VIKOR are two of the best MCDM methods.
The current study utilised the best-worst method because it can provide more consistent results than AHP and other MCDM weighting methods. Moreover, BWM requires fewer pairwise comparisons than other methods [112, 124,125,126]. The pairwise comparison in BWM also focuses on reference comparisons; that is, it captures the preference of the most important criterion over all the other criteria, in addition to the preference of all the other criteria over the least important criterion [111, 112, 127]. Conversely, MCDM methods are frequently used to rank alternatives, and the most common is VIKOR. The method utilises a compromise-priority approach for multiple response optimisation [110, 128, 129]. VIKOR is based on an aggregating function that represents 'closeness to the ideal', and its ranking index is based on a particular measure of 'closeness' to the ideal solution. Furthermore, VIKOR can rank the alternatives and accurately and rapidly determine the best one [128]. The trend in recent VIKOR studies has changed, and VIKOR is now usually integrated with another MCDM method. The reviewed studies identified and provided different examples of applying VIKOR with BWM to improve the consistency of subjective weights; such an integration of VIKOR with BWM yields a robust method. Given the advantages of the two methods in overcoming the uncertainties associated with the problem described in [130,131,132,133,134,135,136], using VIKOR and BWM is easy and clear even for those with no background in MCDM [136]. Utilising VIKOR in different cases (e.g. individuals and groups) has been recommended. Two main cases of decision making are emphasised: the first is decision making based on a single decision maker; the second involves many decision makers and is called group decision making (GDM), in which individuals collectively select alternatives from the ones presented to them.
The decision is not attributed to any single group member because individual and social processes, such as social influence, contribute to the outcome. GDM techniques systematically collect and combine elements from experts, including their knowledge and judgement from different fields. In a group case, the judgement criteria of each expert, which require subjective judgement, are provided, and the same expert assigns a weight to every criterion [110, 137]. Finally, the evaluation and benchmarking of acute leukaemia multiclass classification models suggest the need to integrate the BWM and VIKOR methods: weights are assigned to the criteria (reliability and time complexity) according to BWM on the basis of expert evaluation, and VIKOR is recommended for ranking the multiclass classification models.
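To make the ranking half of this integration concrete, the sketch below implements the standard VIKOR steps (group utility S, individual regret R and compromise index Q) in Python. The decision matrix, weights and criterion directions here are hypothetical toy values for illustration only, not results from the paper:

```python
import numpy as np

def vikor(X, weights, benefit, v=0.5):
    """Rank alternatives with VIKOR (lower Q means a better rank).

    X       : (m, n) decision matrix, rows = alternatives, cols = criteria
    weights : criterion weights summing to 1 (e.g. obtained from BWM)
    benefit : boolean array, True where larger values are better
    v       : weight of the 'majority rule' strategy (0.5 = consensus)
    """
    X = np.asarray(X, dtype=float)
    f_best = np.where(benefit, X.max(axis=0), X.min(axis=0))   # ideal values
    f_worst = np.where(benefit, X.min(axis=0), X.max(axis=0))  # anti-ideal values
    span = np.where(f_best == f_worst, 1.0, f_best - f_worst)  # avoid div by zero
    D = weights * (f_best - X) / span   # weighted normalised distances to the ideal
    S = D.sum(axis=1)                   # group utility of each alternative
    R = D.max(axis=1)                   # individual regret of each alternative
    Q = (v * (S - S.min()) / max(S.max() - S.min(), 1e-12)
         + (1 - v) * (R - R.min()) / max(R.max() - R.min(), 1e-12))
    return S, R, Q

# hypothetical toy matrix: 3 models, criteria = [accuracy (benefit), time (cost)]
X = [[0.95, 12.0], [0.90, 5.0], [0.85, 3.0]]
S, R, Q = vikor(X, np.array([0.7, 0.3]), np.array([True, False]))
ranking = list(np.argsort(Q))  # model indices, best first
print(ranking)
```

In a GDM setting, this computation would be repeated per expert (external aggregation) or on an aggregated matrix (internal aggregation), matching the two contexts the paper describes.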
Methodology
This section introduces the evaluation and benchmarking methodology for the automated multiclass classification models and presents the procedures and steps of the proposed framework. The output is a ranking of the multiclass classification models based on the set of criteria, with BWM and VIKOR used for weighting and ranking, respectively. All the conceptual elements of the present study are illustrated in Fig. 1.
Construction of decision matrix
The decision matrix is the main component of the evaluation and benchmarking framework. Its main parts are the decision criteria and the alternatives. In the present case, the criteria represent the metrics used for measuring the quality of the multiclass classification models. The next subsection describes the procedures followed to develop and evaluate the multiclass classification models and construct the decision matrix.
Data source
The acute leukaemia microarray dataset proposed by [138] was adopted in this study. The dataset is publicly available, recognised for its popularity and among the most frequently utilised in the academic literature [139,140,141]. The dataset has three categories of acute leukaemia: acute myelogenous leukaemia (AML), ALL B cell and ALL T cell. It comprises 5327 genes and 72 samples, of which 25 are AML, 38 are ALL-B and 9 are ALL-T types.
Development of multiclass classification models
Developing multiclass classification models requires a three-step process. Firstly, the target dataset is prepared, which includes the selection of relevant features. Secondly, training (the learning process), in which a classifier is established through machine learning, is achieved by analysing the instances of a training dataset; each instance is assumed to belong to a predefined class. Thirdly, the machine learning algorithms are executed on another independent dataset, known as the testing dataset, with the aim of estimating machine learning performance. If the performance of a multiclass classification model appears to be 'acceptable', then the model can be utilised for future classification cases in which the class label is unknown. Ultimately, the multiclass classification models that supply acceptable results can be considered acceptable multiclass classification models.
Microarray data generally have a small sample size (dozens of samples) and high dimensionality (thousands of genes). Nevertheless, only a small part of these genes affects the classification results; most genes have no classification value. Irrelevant genes, apart from their negative effects on classification performance, can cause conflict in the classification model. Moreover, given that irrelevant genes can lead to over-fitting, reducing the number of genes has a positive effect: it minimises the computational input and improves the overall classification performance and results [28, 142, 143]. In this study, the genes that are highly relevant to the classification classes, known as informative genes, are selected. The chi-square (X2) method [33] was used for the individual evaluation of the features. The X2 value is computed as follows [28, 142, 143]:
X2(a) = Σ_{v∈V} Σ_{i=1}^{n} [A_i(a = v) − E_i(a = v)]^2 / E_i(a = v),

where V is the set of possible values of feature a, n is the number of classes, A_i(a = v) is the number of samples in the ith class with a = v and E_i(a = v) is the expected value of A_i(a = v); E_i(a = v) = P(a = v) P(c_i) N, where P(a = v) is the probability of a = v, P(c_i) is the probability of one sample being labelled with the ith class and N is the total number of samples [33].
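The score above can be computed directly from its definition. The following is a minimal Python sketch on hypothetical discretised features, not the authors' Weka-based implementation; a real microarray feature would first be discretised into the value set V:

```python
import numpy as np

def chi_square_score(feature, labels):
    """Chi-square relevance of one (discretised) feature to the class labels,
    following X2 = sum_v sum_i (A_i(a=v) - E_i(a=v))^2 / E_i(a=v)."""
    N = len(labels)
    score = 0.0
    for v in np.unique(feature):
        p_v = np.mean(feature == v)                       # P(a = v)
        for c in np.unique(labels):
            A = np.sum((feature == v) & (labels == c))    # observed count A_i(a=v)
            E = p_v * np.mean(labels == c) * N            # expected count E_i(a=v)
            if E > 0:
                score += (A - E) ** 2 / E
    return score

# toy check: a perfectly informative binary feature vs. an uninformative one
labels  = np.array([0, 0, 0, 1, 1, 1])
good    = np.array([0, 0, 0, 1, 1, 1])
useless = np.array([0, 1, 0, 1, 0, 1])
print(chi_square_score(good, labels), chi_square_score(useless, labels))
```

Features are then ranked by this score and the top genes retained as the informative set.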
A total of 22 multiclass classification models are built based on 22 well-known machine learning algorithms available in the Weka software, which have been extensively used in prior studies [33, 42, 51, 60, 142, 144,145,146] and demonstrated satisfactory results when used in the classification of microarray datasets. These algorithms include the following: Rules.ZeroR, Bayes.BayesNet, Bayes.NaiveBayesUpdateable, Lazy.IBK, Meta.AdaboostM1, Meta.Bagging, Meta.FilteredClassifier, Meta.LogitBoost, Trees.J48, Trees.REPTree, Trees.RandomTree, Trees.RandomForest, Rules.DecisionTable, Rules.PART, Meta.RandomCommittee, Trees.LMT, Trees.HoeffdingTree, Lazy.KStar, Functions.SMO, Functions.SimpleLogistic, Bayes.NaiveBayes and Trees.DecisionStump. The dataset is divided into two parts to develop the multiclass classification models: one part is utilised for training, and the other for testing. The training set is used to train the machine learning algorithms, and the testing set is used to test the trained algorithms. The test dataset is classified into three categories, namely, AML, ALL-B and ALL-T, using the 22 multiclass classification models.
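The train/test benchmarking loop can be sketched as follows. Since the paper uses Weka, the scikit-learn classifiers here are only rough stand-ins for a few of the 22 algorithms (e.g. GaussianNB for Bayes.NaiveBayes), and the synthetic 72-sample, 3-class dataset is a placeholder for the microarray data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# stand-in for the 72-sample, 3-class data (AML, ALL-B, ALL-T)
X, y = make_classification(n_samples=72, n_features=50, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

models = {
    "RandomForest": RandomForestClassifier(random_state=0),
    "NaiveBayes": GaussianNB(),
    "J48-like tree": DecisionTreeClassifier(random_state=0),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)                       # training step
    results[name] = accuracy_score(y_test, model.predict(X_test))  # testing step
print(results)  # one score per model; such rows feed the decision matrix
```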
Establishment and evaluation of the decision matrix
The establishment of the decision matrix is dependent on the crossover between the evaluation criteria, namely, Ave accuracy, error rate, precisionM, precisionμ, recallM, FP, FN, TP, TN, fscore and time complexity, and the 22 developed multiclass classification models. Figure 2 presents the structure of the proposed decision matrix.
Figure 2 shows the structure of the proposed decision matrix; the top row represents the main evaluation criteria, and the first column on the left lists the developed multiclass classification models as alternatives. The values (data) in this DM denote the evaluation results of all developed multiclass classification models according to all evaluation criteria. Each multiclass classification model is evaluated against all evaluation criteria, where the matrix of parameters, relationship of parameters, parameter behaviour and error rate represent the four sub-criteria sets in the reliability group. Firstly, the matrix of parameters (TP, TN, FN and FP) is generated; these parameters represent the basic sub-criteria in the reliability group of criteria. Given that this study addresses a multiclass classification problem, the one-versus-all approach is used in the calculation of the reliability set of criteria. Accordingly, the multiclass confusion matrix is converted into three confusion matrices, each of which describes the parameters for a certain class of acute leukaemia (AML, ALL-B and ALL-T). Based on the three confusion matrices, the remaining sub-criteria within the reliability group are calculated for each matrix using their specific formulas. Therefore, the values for each multiclass classification model are calculated separately to generate the input of the decision matrix. Finally, the calculation of time complexity is based on the time consumed by two elements: the input of the dataset sample and the output of the result. The calculation relies on the number and size of samples, as indicated in the following equation:
T = T_i + T_o,

where T_o is the processing time to obtain the outputs and T_i is the time of inputting the sample. The time complexity is calculated by the Weka software through the experimental process. As mentioned in Section 2, the three specific issues encountered by the proposed decision matrix are as follows: (1) trade-off and conflict among the evaluation criteria, (2) multiple evaluation criteria and (3) the importance of the criteria, given that a weight difference is observed between the main criteria and the sub-criteria. MCDM is used to address these issues, as presented in the next section.
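The one-versus-all conversion described above can be sketched as follows; the 3-class confusion matrix used here is hypothetical and not taken from the paper's experiments:

```python
import numpy as np

def one_vs_all_counts(conf):
    """Convert a k-class confusion matrix (rows = actual, cols = predicted)
    into per-class TP, FP, FN and TN, as in the one-versus-all approach."""
    conf = np.asarray(conf)
    total = conf.sum()
    out = {}
    for c in range(conf.shape[0]):
        TP = conf[c, c]                  # correctly predicted as class c
        FN = conf[c, :].sum() - TP       # class c predicted as something else
        FP = conf[:, c].sum() - TP       # other classes predicted as class c
        TN = total - TP - FN - FP        # everything else
        out[c] = dict(TP=int(TP), FP=int(FP), FN=int(FN), TN=int(TN))
    return out

# hypothetical 72-sample matrix for the classes AML, ALL-B, ALL-T
conf = [[30, 2, 1],
        [1, 20, 1],
        [0, 1, 16]]
print(one_vs_all_counts(conf))
```

From these per-class counts, the remaining reliability sub-criteria (precision, recall, fscore, ave-accuracy, error rate) follow by their usual formulas.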
Development of the evaluation and benchmarking framework
The proposed evaluation and benchmarking framework is developed based on MCDM techniques, specifically the integration of BWM and VIKOR for weighting the criteria and ranking the alternatives in the proposed decision matrix to select the best one. The steps are presented below.
Development of evaluation and benchmarking/selection integrated methods of BWM and VIKOR using MCDM
The suitable methods for benchmarking and ranking multiclass classification models are BWM and VIKOR. The VIKOR method is a mathematical model recommended for ranking and solving specific issues related to (1) trade-off and conflict and (2) multi-evaluation criteria encountered by the proposed decision matrix. BWM is also used for weighting the criteria to solve (3) the importance of criteria in relation to the proposed decision matrix.
Accordingly, the combination of BWM and VIKOR methods is justified for benchmarking and ranking the multiclass classification models.
Calculation of the weights of criteria based on BWM method
Assigning proper weights to the evaluation criteria using BWM requires several steps. The BWM procedure includes the following steps [112, 147]:
-
Step 1.
Determining a set of decision criteria
For BWM, the first step is to determine the criteria set, C1, C2, …, Cn, which should be considered by the decision maker when selecting the best alternative. In the present study, the set of criteria is obtained from the analysis conducted in the literature.
-
Step 2.
Determination of the best and worst criteria
The best criterion is the most desirable or most important decision criterion, whereas the worst criterion is the least desirable or least important one. This step involves identifying the best and the worst criteria from the perspective of the three decision makers/evaluators. Appendix 1 Section 2 presents the BWM comparison questions and the list of experts.
-
Step 3.
Conduct the pairwise comparison between the best criterion and the other criteria
The pairwise comparison process occurs between the identified best criterion and the other criteria. The aim of this step is to determine the preference of the best criterion over all the other criteria. The value must be determined by an evaluator/expert and must be from 1 to 9 to represent the importance of the best criterion over the other criteria. This step results in a vector identified as 'Best-to-Others', AB = (aB1, aB2, …, aBn),
where aBj indicates the importance of the best criterion B over criterion j, and aBB = 1.
-
Step 4.
Pairwise comparison process between the other criteria and the worst criterion
The aim of this comparison is to identify the preference of all the criteria over the least important criterion. An evaluator/expert determines the importance of all the criteria over the worst criterion, using the numbers from 1 to 9 to indicate the importance. The result of this step is a vector recognised as 'Others-to-Worst', represented as AW = (a1W, a2W, …, anW), where ajW represents the preference of criterion j over the worst criterion W. Clearly, aWW = 1. The two types of reference comparisons, namely, Best-to-Others and Others-to-Worst, are illustrated in Fig. 3.
-
Step 5.
Elicit the optimal weights (w*1, w*2, …, w*n)
The optimal weights for the criteria are those where, for each pair wB/wj and wj/wW, wB/wj = aBj and wj/wW = ajW.
To fulfil these conditions for all j, a solution where the maximum absolute differences |wB/wj − aBj| and |wj/wW − ajW| are minimised for all j must be obtained. Considering the non-negativity and sum conditions for the weights, the following problem is created:
min maxj {|wB/wj − aBj|, |wj/wW − ajW|}
s.t. Σj wj = 1, wj ≥ 0 for all j.
The aforementioned problem can be transferred to the following linear problem:
min ξ
s.t. |wB − aBj wj| ≤ ξ and |wj − ajW wW| ≤ ξ for all j, Σj wj = 1, wj ≥ 0 for all j.
By solving the last problem, the optimal weights (w*1, w*2, …, w*n) and ξ* are obtained. The value of ξ* reflects the reliability of the outcomes, depending on the extent of consistency in the comparisons. A value close to zero represents high consistency and, thus, high reliability [112, 126, 127, 148]. After that, the consistency ratio is calculated by using ξ* and the corresponding consistency index (CI) as follows (Table 1): CR = ξ*/CI.
As proposed by [112], the smaller the ξ* is, the more consistent the comparison vectors are.
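The linear BWM model above can be solved with a generic LP solver. The sketch below is an illustration, not the paper's implementation; the comparison vectors are hypothetical and chosen to be fully consistent, so ξ* should come out as 0:

```python
import numpy as np
from scipy.optimize import linprog

def bwm_weights(a_best, a_worst, best, worst):
    """Linear BWM model: minimise xi subject to |w_best - aBj*w_j| <= xi,
    |w_j - ajW*w_worst| <= xi, sum(w) = 1 and w >= 0."""
    n = len(a_best)
    c = np.zeros(n + 1)
    c[-1] = 1.0                                # objective: minimise xi (last variable)
    A_ub, b_ub = [], []
    for j in range(n):
        if j != best:                          # |w_best - aBj * w_j| <= xi
            row = np.zeros(n + 1)
            row[best], row[j], row[-1] = 1.0, -a_best[j], -1.0
            A_ub.append(row.copy()); b_ub.append(0.0)
            row = -row; row[-1] = -1.0
            A_ub.append(row); b_ub.append(0.0)
        if j != worst and j != best:           # |w_j - ajW * w_worst| <= xi
            row = np.zeros(n + 1)
            row[j], row[worst], row[-1] = 1.0, -a_worst[j], -1.0
            A_ub.append(row.copy()); b_ub.append(0.0)
            row = -row; row[-1] = -1.0
            A_ub.append(row); b_ub.append(0.0)
    A_eq = [[1.0] * n + [0.0]]                 # weights must sum to 1; xi excluded
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0])
    return res.x[:n], res.x[-1]

# Hypothetical comparisons over 3 criteria: criterion 1 is best, criterion 2 is worst
weights, xi = bwm_weights(a_best=[2, 1, 8], a_worst=[4, 8, 1], best=1, worst=2)
print(np.round(weights, 4), round(xi, 4))
```

With these consistent vectors the unique optimum is w = (4/13, 8/13, 1/13) with ξ* = 0; inconsistent vectors would yield ξ* > 0 and a correspondingly larger CR.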
Ranking the multiclass classification models based on VIKOR method
Owing to its suitability for cases with many alternatives and multiple conflicting criteria, VIKOR is used to rank the multiclass classification models. VIKOR can provide rapid results and determine the most suitable option at the same time. The weights of all the criteria are gathered from BWM and utilised in VIKOR. The decision alternatives are ranked in ascending order based on the values of the weighted criteria. The VIKOR steps are presented below [149, 150].
-
Step 1: Identify the best f*i and worst f−i values of all criterion functions, i = 1, 2, …, n. If the ith function represents a benefit, then f*i = maxj fij and f−i = minj fij; if it represents a cost, then f*i = minj fij and f−i = maxj fij.
-
Step 2:
Based on the BWM method, the weights of the criteria are computed. A set of weights w = w1, w2, w3, ⋯, wj, ⋯, wn from the decision maker is accommodated in the DM; the weights sum to 1. The resulting matrix is computed as demonstrated in the following equation.
This process will produce a weighted matrix as follows:
-
Step 3:
Compute the values of Sj and Rj, j = 1, 2, 3, …, J, i = 1, 2, 3, …, n, by using the following equations:
Sj = Σi wi (f*i − fij)/(f*i − f−i),
Rj = maxi [wi (f*i − fij)/(f*i − f−i)],
where wi indicates the criterion weights expressing their relative importance.
-
Step 4:
Compute the values of Qj, j = 1, 2, ⋯, J, by the following relation:
Qj = v(Sj − S*)/(S− − S*) + (1 − v)(Rj − R*)/(R− − R*),
where S* = minj Sj, S− = maxj Sj, R* = minj Rj and R− = maxj Rj, and
v is introduced as the weight of the strategy of ‘the majority of criteria’ (or ‘the maximum group utility’); here, v = 0.5.
-
Step 5:
The alternatives can now be ranked by sorting the values of S, R and Q in ascending order. Optimal performance is indicated by the lowest value.
-
Step 6:
Propose as a compromise solution the alternative (a′) that ranks best by the measure Q (minimum) if the following two conditions are satisfied:
-
C1.
‘Acceptable advantage’: Q(a′′) − Q(a′) ≥ DQ,
where (a′′) is the alternative in the second position of the ranking list by Q, DQ = 1/(J − 1) and J is the number of alternatives.
-
C2.
‘Stability’ is acceptable in the decision-making context. Alternative a′ must also be ranked best by S and/or R. This compromise solution is stable within the decision-making process, which can be ‘voting by majority rule’ (v > 0.5), ‘by consensus’ (v ≈ 0.5) or ‘with veto’ (v < 0.5). Here, v is the weight of the decision-making strategy of ‘the majority of criteria’ (or ‘the maximum group utility’). The Q value indicates which multiclass classification model has higher values of the evaluation criteria than the others; according to this technique, the multiclass classification models with high values of the evaluation criteria will have the lowest Q values. Two main decision-making contexts will be applied: individual decision making and GDM. In the former, decision making is based on a single decision maker, whereas GDM is based on multiple decision makers/experts. GDM will be performed in two ways: internal aggregation and external aggregation. Figure 4 illustrates the procedures followed to apply the two types of aggregation.
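The VIKOR steps above can be sketched in code. The decision matrix, weights and criterion directions below are hypothetical and only illustrate the S, R and Q computations and the C1 ('acceptable advantage') check:

```python
import numpy as np

def vikor(matrix, weights, benefit, v=0.5):
    """VIKOR sketch. matrix[j, i]: value of alternative j on criterion i;
    benefit[i] is True when larger values are better. Returns S, R, Q."""
    X = np.asarray(matrix, dtype=float)
    f_best = np.where(benefit, X.max(axis=0), X.min(axis=0))   # f*_i
    f_worst = np.where(benefit, X.min(axis=0), X.max(axis=0))  # f-_i
    span = np.where(f_best == f_worst, 1.0, f_best - f_worst)  # guard against /0
    D = weights * (f_best - X) / span      # weighted normalised distances
    S = D.sum(axis=1)                      # group utility
    R = D.max(axis=1)                      # individual regret

    def scale(a):
        rng = a.max() - a.min()
        return (a - a.min()) / (rng if rng else 1.0)

    Q = v * scale(S) + (1 - v) * scale(R)  # lower Q = better alternative
    return S, R, Q

# Hypothetical matrix: 3 alternatives evaluated on 2 benefit criteria
matrix = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
S, R, Q = vikor(matrix, weights=np.array([0.6, 0.4]),
                benefit=np.array([True, True]))
order = np.argsort(Q)                      # ascending: best alternative first
DQ = 1.0 / (len(Q) - 1)                    # C1 threshold: DQ = 1/(J - 1)
c1_holds = Q[order[1]] - Q[order[0]] >= DQ
print(np.round(Q, 4), order, c1_holds)
```

For these numbers, Q ≈ (0.1667, 0.25, 1.0), so alternative 0 ranks best; its advantage over the runner-up is below DQ = 0.5, so C1 fails and a set of compromise solutions would be proposed.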
Figure 4 shows that the internal GDM is calculated by using the arithmetic mean of the final weights of the three experts’ preferences to eliminate the possible variation among them. VIKOR is then applied based on final weights obtained from the arithmetic mean of the three experts. By contrast, external aggregation is calculated by using the arithmetic mean of the Q values for each expert’s ranking, and then the final Q values depend on external group ranking.
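The two aggregation routes differ only in where the arithmetic mean is taken; the expert weights and Q values below are hypothetical placeholders:

```python
import numpy as np

# Hypothetical final BWM weights of three experts (rows) over three criteria (columns)
expert_weights = np.array([[0.5, 0.3, 0.2],
                           [0.4, 0.4, 0.2],
                           [0.6, 0.2, 0.2]])

# Internal aggregation: average the experts' weights first,
# then run VIKOR once with the mean weight vector.
group_weights = expert_weights.mean(axis=0)    # -> approx [0.5, 0.3, 0.2]

# External aggregation: run VIKOR once per expert,
# then average the resulting Q values per alternative.
q_per_expert = np.array([[0.00, 0.40, 1.00],   # hypothetical Q values, one row per expert
                         [0.10, 0.30, 1.00],
                         [0.00, 0.50, 0.90]])
q_external = q_per_expert.mean(axis=0)         # final external group ranking uses these means
print(np.argsort(q_external))                  # alternatives ranked best first
```

Internal aggregation removes inter-expert variance before ranking, whereas external aggregation preserves each expert's ranking and reconciles them only at the end.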
Results and discussion
This section presents the results of the proposed framework for evaluating and benchmarking the multiclass classification models of acute leukaemia. Section 5.1 presents the data in the decision matrix. Section 5.2 presents the results of the developed benchmarking framework, with the BWM results in subsection 5.2.1 showing the weights of the main criteria and sub-criteria, and the VIKOR results in subsection 5.2.2. Section 5.3 presents the validation processes and results.
Data presentation in decision matrix
The results obtained from the evaluation of the 22 multiclass classification models are presented in this section. The implementation of these 22 multiclass classification models generated four parameters (TP, TN, FP, FN), which are the fundamental values for calculating the remaining values of the reliability criteria group. The values of the time complexity criterion were calculated as described earlier. The values of the reliability group of criteria and the time complexity criterion were used as input to fill the decision matrix. Table 2 illustrates the completed decision matrix.
Table 2 shows that each multiclass classification model has been evaluated based on 11 evaluation criteria. The next section will discuss in detail the results of integration between the BWM and VIKOR.
Results of the framework of evaluation and benchmarking multiclass classification models
The results of the proposed benchmarking framework are presented in two subsections: the first presents the weight results obtained using BWM, and the second presents the results of using VIKOR. The VIKOR subsection is divided into the individual context and the group context; the group context includes the results of internal and external aggregation, which are described in detail in subsequent sections.
Results for weight using BWM method
In this section, the BWM results are presented and explained. Three experts were asked to state their evaluation and benchmarking preferences on the criteria of multiclass classification models via the BWM comparison questions. Table 3 presents the weighting results of the main criteria and their sub-criteria for the first expert. Appendix 2 (Tables 21 and 22) shows the detailed results of the other two experts.
R: Reliability, TM: Time Complexity, MOP: Matrix of Parameters, ROP: Relationship of Parameters, BOP: Behaviour of Parameters, ER: Error Rate, TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative. Table 3 and Appendix 2 (Tables 21 and 22) present the weighted results of the three experts based on BWM. For the evaluation and benchmarking criteria, the best and worst criteria are identified, the best criterion is compared with the other criteria, and the other criteria are compared with the worst criterion. Lastly, the linear model of BWM is solved according to Eqs. (6, 7) in Sect. 4.2.1.1 to obtain the weights, and Eq. (8) is used to calculate the consistency ratio of each expert's preferences. To calculate the global weights of each criterion for the three experts, the BWM method derives the local weights of each criteria group at each level, as shown in Table 3 and Appendix 2 (Tables 21 and 22), which explain the importance of each criterion with respect to its parent. Consequently, the global weights of each criterion are obtained; each global weight explains the criterion's importance with respect to the goal for each expert. Firstly, the weight of each criterion was determined by comparing the criteria based on BWM; these weights are called 'local weights'. To find the global weights with respect to the goal, each criterion's parent weight and its associated local weight were multiplied, as presented in Table 4.
Table 4 presents the overall local and global weights of the 11 evaluation and benchmarking criteria for the three experts. The overall CR of each of the three experts is less than 0.1, an acceptable ratio. These global weights have been used in our benchmarking framework because they represent the importance of the criteria with respect to the goal. Table 4 shows that the first expert assigned the maximum global weight, 0.201, to true positive; the minimum weights, 0.035 each, were obtained by precisionM and recallM. The second expert assigned the maximum weight, 0.500, to the time complexity criterion; the minimum weight, 0.011, was obtained by ave-accuracy. The third expert assigned the maximum weight, 0.200, to time complexity; the minimum weight, 0.015, was obtained by true negative. The final weight results are used in applying the VIKOR method in the next section.
Ranking’s results of VIKOR method
The results after the ranking of the multiclass classification models based on weighted evaluation criteria are presented in this section. Individual decision making and GDM contexts are explained. The results of the individual and group VIKOR decision-making contexts are presented in the following subsections.
-
VIKOR Results of Individual Context for Different Experts’ Weights
VIKOR is utilised to rank the alternatives based on the decision matrix presented in Table 2 and the weights presented in Table 4. The rankings reflect the importance of the evaluation criteria from the viewpoint of each expert. The VIKOR technique depends on the Q value in ranking the alternatives: the alternative with the lowest Q value is considered the best, whereas the alternative with the highest Q value is considered the worst. Table 5 shows the VIKOR ranking results according to the weights of the first expert. Tables 23 and 24 in Appendix 3 show the VIKOR results of the two other experts.
Table 5 and Appendix 3 (Tables 23 and 24) present the three VIKOR ranking results based on the experts' weights. In the first ranking, 'Bayes.NaiveBayesUpdateable' had the lowest Q value of 0.0358 and was thus the best multiclass classification model, whereas 'RandomTree' had the highest Q value of 1 and was thus the worst. In the second ranking, 'Bayes.NaiveBayes' had the lowest Q value of 0 and was thus the best model, whereas 'Rule.Decision Table' had the highest Q value of 1 and was thus the worst. In the third ranking, the lowest Q value was 0 for 'Bayes.NaiveBayesUpdateable', which was considered the best model, whereas the highest Q value was 0.9956 for 'Rules.part', which was considered the worst. Differences in the weights provided by the experts affected the ranking scores. Figure 5 shows the variance among the VIKOR results.
Figure 5 demonstrates the final VIKOR rankings of the three experts. Ten classification models were selected from each expert's ranking results [2]: the classification models with the best scores received the highest ranking (first five classification models), whereas the classification models with the worst scores received the lowest ranking (last five classification models).
The first five classification models with the highest ranking vary with the weights provided by the experts. According to the weights provided by expert one (A) and expert three (C), the Bayes.NaiveBayesUpdateable and BayesNet models appeared in the first and second indices, respectively. By contrast, the first and second indices based on the weights provided by expert two (B) were Bayes.NaiveBayes and RandomTree. RandomForest and Decision Stump appeared in the third and fourth indices based on the weights provided by experts (A) and (C), whereas these two classification models did not appear in the first five indices according to the second expert. Rules.part and Rule.zero were in the third and fourth indices based on the weights provided by expert (B). Meta.AdaboostM1 was in the fifth index according to the weights given by expert (A), whereas Rule.zero appeared in the fifth index based on the weights obtained from experts (B) and (C).
The last five classification models, those with the lowest ranking, also vary with the weights provided by the experts. RandomTree is the worst model, with index 22, according to expert (A), whereas the same model was the third worst classification model based on expert (C). The worst one according to expert (B) is Rule.Decision Table; in addition, the same model was the fifth worst model according to experts (A) and (C). Rules.part appeared as the worst classification model based on expert (C) and the second worst classification model according to expert (A). Trees.LMT was the second worst classification model according to expert (B) and the fourth worst classification model according to experts (A) and (C). Tree.j48 is the third worst model according to expert (A) and the second worst model according to expert (C). Lastly, Meta.AdaboostM1 and Meta.logitboost were the fourth and fifth worst classification models, respectively, based on expert (B).
The results of the individual context clearly show variances among the rankings of the three experts. Therefore, the utilisation of the group VIKOR decision-making context, which aims to provide a ranking of alternatives that considers all decision makers, is necessary. The following sections present the results of the group VIKOR decision-making context.
-
Group VIKOR with Internal and External Aggregation
To extend VIKOR into a group decision environment, two approaches were used: (1) internal and (2) external aggregation, both of which depend on multiple decision makers. Internal GDM results are calculated by using the arithmetic mean of the final weights of the three experts' preferences to eliminate the variance between them; VIKOR is then applied based on the resulting mean weights. By contrast, external aggregation results are calculated by finding the arithmetic mean of the Q values of each expert's ranking results; the final Q values then determine the external group ranking. Table 6 illustrates the overall VIKOR ranking results with internal and external group decision making for the 22 multiclass classification models.
As shown in Table 6, the best/first three classification models, in order, are Bayes.NaiveBayesUpdateable, BayesNet and Decision Stump. The worst/last two classification models based on the results of internal and external GDM are Trees.LMT and Rule.Decision Table. The classification models with the same order in both internal and external decision making are Meta.RandomCommittee, Lazy.IBK, Meta.logitboost and Bayes.NaiveBayes, in the order 8, 13, 14 and 15, respectively. By contrast, some classification models are ranked differently between the internal and external group decision making. The order of those classification models based on the internal ranking is as follows: REPTree, Rule.zero, RandomForest, Kstar, Meta.Bagging, Meta.AdaboostM1, Functions.SIMPLE.logistic, Functions.Smo, Meta.filteredclassifier, Trees.HoeffdingTree, RandomTree, Rules.part and Tree.j48, in the order 4, 5, 6, 7, 9, 10, 11, 12, 16, 17, 18 and 19, respectively. The order of the same classification models based on the external ranking is REPTree, Rule.zero, RandomForest, Kstar, Meta.Bagging, Meta.AdaboostM1, Functions.SIMPLE.logistic, Functions.Smo, Meta.filteredclassifier, Trees.HoeffdingTree, RandomTree, Rules.part and Tree.j48, in the order 5, 7, 4, 9, 10, 6, 12, 11, 17, 16, 20, 18 and 19, respectively. Therefore, the first/best three index classification models in both internal and external GDM are equal, and the last/worst two index classification models are equal as well. Some classification models in the medium score indices were equal, whereas the rest showed different score indices. From this point forward, the internal and external aggregation decision-making ranks are considered the final ranking results and are used in the validation processes. The next section describes the validation results in detail.
Validation processes and results
The selection of a multiclass classification model is considered a difficult task because it relies on multiple conflicting criteria; differences in accuracy, performance and other features make the task difficult. The results of the proposed benchmarking framework are validated by utilising objective validation.
Objective validation
The statistical methods of mean and standard deviation (SD) were used in this study to ensure that the multiclass classification models were ranked systematically according to the proposed benchmarking framework. Towards this goal, three groups were created and separated on the basis of the ranking results of the multiclass classification models [2, 82]. Each group's results are expressed as mean ± SD. The mean is the average of the results; it is calculated by dividing the sum of the observed results by their number, as in the following equation:
x̄ = (1/n) Σi xi
SD is used to determine the amount of dispersion or variation in a set of values and is calculated by the following equation:
SD = sqrt((1/(n − 1)) Σi (xi − x̄)²)
The utilisation of mean ± SD ensures that the three sets of multiclass classification models are subject to systematic ordering. To validate the ranking results by using the above test, the scores of the multiclass classification models were divided into three groups based on the ranking results obtained from the proposed benchmarking framework. An equal number of models (seven) is included in each of the first and second groups, and eight classification models are included in the third group, depending on the scoring values from the ranking results [2]. For this process, the two statistical methods must show that the first group achieved the lowest scoring values when both the mean and SD are measured; to validate the results, the first group is assumed to have a lower mean and SD than the other two groups. The mean and SD of the second group must be lower than or equal to those of the third group and, at the same time, higher than those of the first group. Accordingly, the mean and SD of the third group must be higher than those of the first group and higher than or equal to those of the second group. The results of the first group must be statistically proven to be the lowest among the three groups, in line with the systematic ranking results.
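The grouped mean ± SD check can be sketched as follows, using hypothetical ranking scores split 7/7/8 as described above:

```python
import statistics

def group_stats(scores):
    """Mean and sample standard deviation of one group's scores."""
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical Q-style scores of 22 ranked models (ascending: best model first)
scores = [round(i / 21, 4) for i in range(22)]
groups = [scores[:7], scores[7:14], scores[14:]]   # 7 / 7 / 8 split

stats = [group_stats(g) for g in groups]
for k, (m, sd) in enumerate(stats, start=1):
    print(f'group {k}: mean = {m:.4f}, SD = {sd:.4f}')

# Systematic ranking requires non-decreasing group means down the list
assert stats[0][0] < stats[1][0] < stats[2][0]
```

With a genuinely systematic ranking, the first group's mean stays lowest and the third group's highest, mirroring the validation logic applied to Table 7.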
Validation results
This section presents the validation of the internal and external GDM rankings. In this research, objective validation processes are used. The validation of the ranking results of the multiclass classification models has been performed by dividing the ranking results into three groups: the first two groups are equal, with seven models each, and the third group has eight models. The mean ± SD has been calculated for each group to ensure that the multiclass classification models undergo a systematic ranking. After the normalisation and weighting of the raw data of the first, second and third groups of multiclass classification models, the validation results for internal and external GDM are presented in Table 7.
Table 7 shows the results of validation for internal aggregation GDM. The first group has a lower mean ± SD than the second group for all features except error rate (M = 0.0951 ± 0.0319 in the first group; M = 0.0721 ± 0.0327 in the second group). For the second group, the mean ± SD is lower than that of the third group for all features except error rate (M = 0.0721 ± 0.0327 in the second group; M = 0.0450 ± 0.0231 in the third group). Accordingly, the first group has a lower value than the second group, and the second group has a lower value than the third group. Regarding the results of validation for external aggregation GDM, the mean ± SD of the first group is lower than that of the second group for all features except error rate (M = 0.1010 ± 0.0272 in the first group; M = 0.0662 ± 0.0309 in the second group). The mean ± SD of the second group is lower than that of the third group for all features except error rate (M = 0.0662 ± 0.0309 in the second group; M = 0.0450 ± 0.0231 in the third group). Accordingly, the first group has a lower value than the second group, and the second group has a lower value than the third group. Therefore, the internal and external GDM rankings are valid and undergo systematic ranking.
Research limitation and future study
The proposed evaluation and benchmarking framework can address the evaluation and benchmarking issues of multiclass classification models. However, it cannot deal with classification models that work under multi-labelled or hierarchical cases because the evaluation criteria and the procedures used to calculate them differ for those cases. The future study directions are as follows:
-
The proposed framework can evaluate and benchmark the multiclass classification models that classify other types of leukaemia.
-
The new framework can be applied to classification models in applications that involve multi-labelled or hierarchical classification by proposing new decision matrices that include the related evaluation criteria for multi-labelled or hierarchical classification models.
Conclusion
Studies related to the automated detection and classification of acute leukaemia have been notably increasing. Nevertheless, studies relevant to the evaluation and benchmarking of automated detection and classification tasks are scarce and have unaddressed limitations. Several aspects associated with the evaluation and benchmarking of automated detection and classification warrant further analysis and investigation. Towards this end, a comprehensive review of research on the automated classification of acute leukaemia was conducted while considering its evaluation and benchmarking aspects, with the aim of identifying open challenges, research issues and gaps linked to the process of evaluation and benchmarking. After a thorough review of the studies, a serious gap was identified: previous studies failed to perform evaluation and benchmarking for all major detection and classification requirements. Evaluation and benchmarking were partially performed, which rendered the results incomplete because they failed to reflect the overall performance of detection and classification. This weakness makes it challenging to compare numerous detection and classification systems or models to determine which is the best because the evaluation criteria vary and are incomplete. Moreover, all the major criteria and sub-criteria aimed at benchmarking multiclass detection and classification were reviewed. To address the challenges, resolve the issues and fill the research gap, we proposed an evaluation and benchmarking framework based on MCDM techniques. Its goal is to evaluate and benchmark acute leukaemia multiclass classification models. The procedures and steps of the proposed framework are described. The decision matrix was constructed based on the crossover between the evaluation criteria and 22 multiclass classification models.
The proposed framework for evaluation and benchmarking is developed based on an integration of BWM and VIKOR. The ranking results of the classification models are based on three experts' opinions on criterion preference. Firstly, VIKOR was applied in the individual context to provide a ranking for each expert, though the results showed variances among the three experts' rankings. Therefore, VIKOR with GDM was applied, including the internal and external aggregation methods, which showed almost similar performance. Lastly, the validation of the results was achieved objectively in this research. The statistical results indicate that the ranking results of the multiclass classification models based on internal and external aggregation GDM undergo a systematic ranking.
References
Salman, O., Zaidan, A., Zaidan, B., Naserkalid, and Hashim, M., Novel methodology for triage and prioritizing using “big data” patients with chronic heart diseases through telemedicine environmental.Int. J. Inf. Technol. Decis. Mak. 16(05):1211–1245, 2017.
Kalid, N. et al., Based on real time remote health monitoring systems: A new approach for prioritization “large scales data” patients with chronic heart diseases using body sensors and communication technology.J. Med. Syst. 42(4):69, 2018.
Mohsin, A. H. et al., Based medical systems for patient’s authentication: Towards a new verification secure framework using CIA standard.J. Med. Syst. 43(7):192, 2019.
Mohsin, A. H. et al., Real-time medical systems based on human biometric steganography: A systematic review.J. Med. Syst. 42(12):245, 2018.
Mohsin, A. H. et al., Real-time remote health monitoring systems using body sensor information and finger vein biometric verification: A multi-layer systematic review.J. Med. Syst. 42(12):238, 2018.
Albahri, O. S. et al., Systematic review of real-time remote health monitoring system in triage and priority-based sensor technology: Taxonomy, open challenges, motivation and recommendations.J. Med. Syst. 42(5), 2018.
Abdulnabi, M. et al., A distributed framework for health information exchange using smartphone technologies.J. Biomed. Inform. 69:230–250, 2017.
Zaidan, A. A. et al., Challenges, alternatives, and paths to sustainability: Better public health promotion using social networking pages as key tools.J. Med. Syst. 39(2):7, 2015.
Mat Kiah, M. L. et al., Design and develop a video conferencing framework for real-time telemedicine applications using secure group-based communication architecture.J. Med. Syst. 38(10):133, 2014.
Shuwandy, M. L. et al., Sensor-based mHealth authentication for real-time remote healthcare monitoring system: A multilayer systematic review.J. Med. Syst. 43(2):33, 2019.
Talal, M. et al., Smart home-based IoT for real-time and secure remote health monitoring of triage and priority system using body sensors: Multi-driven systematic review.J. Med. Syst. 43(3):42, 2019.
Zaidan, B. B. et al., A security framework for Nationwide health information exchange based on telehealth strategy.J. Med. Syst. 39(5):51, 2015.
Hussain, M. et al., The landscape of research on smartphone medical apps: Coherent taxonomy, motivations, open challenges and recommendations.Comput. Methods Prog. Biomed. 122(3):393–408, 2015.
Zaidan, B. B. et al., Impact of data privacy and confidentiality on developing telemedicine applications: A review participates opinion and expert concerns.Int. J. Pharmacol. 7(3):382–387, 2011.
Kiah, M. L. M. et al., MIRASS: Medical informatics research activity support system using information mashup network. J. Med. Syst. 38(4):37, 2014.
Mohsin, A. H. et al., Based Blockchain-PSO-AES techniques in finger vein biometrics: A novel verification secure framework for patient authentication. Comput. Stand. Interfaces, 2019.
Hussain, M. et al., Conceptual framework for the security of mobile health applications on android platform. Telematics Inform. 35(5):1335, 2018.
Hussain, M. et al., A security framework for mHealth apps on android platform. Comput. Secur. 75:191–217, 2018.
Iqbal, S. et al., Real-time-based E-health systems: Design and implementation of a lightweight key management protocol for securing sensitive information of patients. Health Technol. (Berl):1–19, 2018.
Alanazi, H. O. et al., Meeting the security requirements of electronic medical records in the ERA of high-speed computing. J. Med. Syst. 39(1):165, 2015.
Nabi, M. S. A. et al., Suitability of using SOAP protocol to secure electronic medical record databases transmission. Int. J. Pharmacol. 6(6):959–964, 2010.
Kiah, M. L. M. et al., An enhanced security solution for electronic medical records based on AES hybrid technique with SOAP/XML and SHA-1. J. Med. Syst. 37(5):9971, 2013.
Nabi, M. S. et al., Suitability of adopting S/MIME and OpenPGP email messages protocol to secure electronic medical records. In: Second International Conference on Future Generation Communication Technologies (FGCT 2013), 2013, 93–97.
Kiah, M. L. M. et al., Open source EMR software: Profiling, insights and hands-on analysis. Comput. Methods Prog. Biomed. 117(2):360–382, 2014.
Alsalem, M. A. et al., A review of the automated detection and classification of acute leukaemia: Coherent taxonomy, datasets, validation and performance measurements, motivation, open challenges and recommendations. Comput. Methods Prog. Biomed. 158:93–112, 2018.
Srisukkham, W., Zhang, L., Neoh, S. C., Todryk, S., and Lim, C. P., Intelligent leukaemia diagnosis with bare-bones PSO based feature optimization. Appl. Soft Comput. 56:405–419, 2017.
Labati, R. D., Piuri, V., and Scotti, F., All-IDB: The acute lymphoblastic leukemia image database for image processing. In: 2011 18th IEEE International Conference on Image Processing, 2011.
Lei, X., and Chen, Y., Multiclass classification of microarray data samples with flexible neural tree. In: 2012 Spring Congress on Engineering and Technology, 2012, 1–4.
Agaian, S., Madhukar, M., and Chronopoulos, A. T., Automated screening system for acute myelogenous leukemia detection in blood microscopic images. IEEE Syst. J. 8:995–1004, 2014.
Mohapatra, S., Patra, D., and Satpathi, S., Image analysis of blood microscopic images for acute leukemia detection. In: 2010 International Conference on Industrial Electronics, Control and Robotics, 2010, 215–219.
Bagasjvara, R. G., Candradewi, I., Hartati, S., and Harjoko, A., Automated detection and classification techniques of acute leukemia using image processing: A review. In: 2016 2nd International Conference on Science and Technology-Computer (ICST), 2016, 35–43.
Rawat, J., Singh, A., Bhadauria, H. S., and Virmani, J., Computer aided diagnostic system for detection of leukemia using microscopic images. Procedia Computer Science 70:748–756, 2015.
Snousy, M. B. A., El-Deeb, H. M., Badran, K., and Khlil, I. A. A., Suite of decision tree-based classification algorithms on cancer gene expression data. Egyptian Informatics Journal 12:73–82, 2011.
Goutam, D., and Sailaja, S., Classification of acute myelogenous leukemia in blood microscopic images using supervised classifier. In: 2015 IEEE International Conference on Engineering and Technology (ICETECH), 2015, 1–5.
Mishra, S., Majhi, B., Sa, P. K., and Sharma, L., Gray level co-occurrence matrix and random forest based acute lymphoblastic leukemia detection. Biomedical Signal Processing and Control 33:272–280, 2017.
Nguyen, T., and Nahavandi, S., Modified AHP for gene selection and cancer classification using Type-2 fuzzy logic. IEEE Trans. Fuzzy Syst. 24:273–287, 2016.
Hossin, M., and Sulaiman, M., A review on evaluation metrics for data classification evaluations. International Journal of Data Mining & Knowledge Management Process 5:1, 2015.
Sokolova, M., and Lapalme, G., A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 45:427–437, 2009.
Krappe, S., Benz, M., Wittenberg, T., Haferlach, T., and Munzenmayer, C., Automated classification of bone marrow cells in microscopic images for diagnosis of leukemia: A comparison of two classification schemes with respect to the segmentation quality. In: Hadjiiski, L. M., Tourassi, G. D. (Eds), Medical Imaging 2015: Computer-Aided Diagnosis. Vol. 9414, 2015.
Cui, Y., Zheng, C.-H., Yang, J., and Sha, W., Sparse maximum margin discriminant analysis for feature extraction and gene selection on gene expression data. Comput. Biol. Med. 43:933–941, 2013.
Mohapatra, P., Chakravarty, S., and Dash, P. K., Microarray medical data classification using kernel ridge regression and modified cat swarm optimization based gene selection system. Swarm and Evolutionary Computation 28:144–160, 2016.
Wang, H.-Q., Wong, H.-S., Zhu, H., and Yip, T. T. C., A neural network-based biomarker association information extraction approach for cancer classification. J. Biomed. Inform. 42:654–666, 2009.
Zhang, L., and Xiaojuan, H., Multiple SVM-RFE for multi-class gene selection on DNA microarray data. In: 2015 International Joint Conference on Neural Networks (IJCNN), 2015, 1–6.
Yongqiang, D., Bin, H., Yun, S., Chengsheng, M., Jing, C., Xiaowei, Z. et al., Feature selection of high-dimensional biomedical data using improved SFLA for disease diagnosis. In: 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2015, 458–463.
Salem, H., Attiya, G., and El-Fishawy, N., Gene expression profiles based human cancer diseases classification. In: 2015 11th International Computer Engineering Conference (ICENCO), 2015, 181–187.
Campos, L. M. d., Cano, A., Castellano, J. G., and Moral, S., Bayesian networks classifiers for gene-expression data. In: 2011 11th International Conference on Intelligent Systems Design and Applications, 2011, 1200–1206.
Bhattacharjee, R., and Saini, L. M., Detection of acute lymphoblastic leukemia using watershed transformation technique. In: 2015 International Conference on Signal Processing, Computing and Control (ISPCC), 2015, 383–386.
Chandra, B., and Gupta, M., Robust approach for estimating probabilities in Naïve–Bayes classifier for gene expression data. Expert Syst. Appl. 38:1293–1298, 2011.
Singhal, V., and Singh, P., Local binary pattern for automatic detection of acute lymphoblastic leukemia. In: 2014 Twentieth National Conference on Communications (NCC), 2014, 1–5.
Rashid, S., and Maruf, G. M., An adaptive feature reduction algorithm for cancer classification using wavelet decomposition of serum proteomic and DNA microarray data. In: 2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW), 2011, 305–312.
Ludwig, S. A., Jakobovic, D., and Picek, S., Analyzing gene expression data: Fuzzy decision tree algorithm applied to the classification of cancer data. In: 2015 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 2015, 1–8.
Saritha, M., Prakash, B. B., Sukesh, K., and Shrinivas, B., Detection of blood cancer in microscopic images of human blood samples: A review. In: 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), 2016, 596–600.
Tai, W. L., Hu, R. M., Hsiao, H. C. W., Chen, R. M., and Tsai, J. J. P., Blood cell image classification based on hierarchical SVM. In: 2011 IEEE International Symposium on Multimedia, 2011, 129–136.
Kumar, P. G., Aruldoss Albert Victoire, T., Renukadevi, P., and Devaraj, D., Design of fuzzy expert system for microarray data classification using a novel genetic swarm algorithm. Expert Syst. Appl. 39:1811–1821, 2012.
He, Y., and Hui, S. C., Exploring ant-based algorithms for gene expression data analysis. Artif. Intell. Med. 47:105–119, 2009.
Yusen, Z., and Liangyun, R., Two feature selections for analysis of microarray data. In: 2010 IEEE Fifth International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA), 2010, 1259–1262.
Rosa, J. L. D., Magpantay, A. E. A., Gonzaga, A. C., and Solano, G. A., Cluster center genes as candidate biomarkers for the classification of leukemia. In: IISA 2014, the 5th International Conference on Information, Intelligence, Systems and Applications, 2014, 124–129.
Lu, X., Peng, X., Liu, P., Deng, Y., Feng, B., and Liao, B., A novel feature selection method based on CFS in cancer recognition. In: 2012 IEEE 6th International Conference on Systems Biology (ISB), 2012, 226–231.
Kumar, M., and Kumar Rath, S., Classification of microarray using MapReduce based proximal support vector machine classifier. Knowl.-Based Syst. 89:584–602, 2015.
Dash, S., Hill-climber based fuzzy-rough feature extraction with an application to cancer classification. In: 13th International Conference on Hybrid Intelligent Systems (HIS 2013), 2013, 28–34.
Wahbeh, A. H., Al-Radaideh, Q. A., Al-Kabi, M. N., and Al-Shawakfa, E. M., A comparison study between data mining tools over some classification methods. Int. J. Adv. Comput. Sci. Appl., Special Issue on Artificial Intelligence:18–26, 2011.
Rangra, K., and Bansal, D. K. L., Comparative study of data mining tools. International Journal of Advanced Research in Computer Science and Software Engineering 4(6), 2014.
Yas, Q. M., Zaidan, A. A., Zaidan, B. B., Rahmatullah, B., and Karim, H. A., Comprehensive insights into evaluation and benchmarking of real-time skin detectors: Review, open issues & challenges, and recommended solutions. Measurement 114:243–260, 2018.
Wang, Z., and Palade, V., A comprehensive fuzzy-based framework for cancer microarray data gene expression analysis. In: 2007 IEEE 7th International Symposium on BioInformatics and BioEngineering, 2007, 1003–1010.
Nazlibilek, S., Karacor, D., Ercan, T., Sazli, M. H., Kalender, O., and Ege, Y., Automatic segmentation, counting, size determination and classification of white blood cells. Measurement 55:58–65, 2014.
Bhattacharjee, R., and Saini, L. M., Robust technique for the detection of acute lymphoblastic leukemia. In: 2015 IEEE Power, Communication and Information Technology Conference (PCITC), 2015, 657–662.
Torkaman, A., Charkari, N. M., Aghaeipour, M., and Hajati, E., A recommender system for detection of leukemia based on cooperative game. In: 2009 17th Mediterranean Conference on Control and Automation, 2009, 1126–1130.
Escalante, H. J., Montes-y-Gómez, M., González, J. A., Gómez-Gil, P., Altamirano, L., Reyes, C. A. et al., Acute leukemia classification by ensemble particle swarm model selection. Artif. Intell. Med. 55:163–175, 2012.
Madhloom, H. T., Kareem, S. A., and Ariffin, H., A robust feature extraction and selection method for the recognition of lymphocytes versus acute lymphoblastic leukemia. In: 2012 International Conference on Advanced Computer Science Applications and Technologies (ACSAT), 2012, 330–335.
Cornet, E., Perol, J. P., and Troussard, X., Performance evaluation and relevance of the CellaVision (TM) DM96 system in routine analysis and in patients with malignant hematological diseases. Int. J. Lab. Hematol. 30:536–542, 2008.
Rota, P., Groeneveld-Krentz, S., and Reiter, M., On automated flow cytometric analysis for MRD estimation of acute lymphoblastic leukaemia: A comparison among different approaches. In: 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2015, 438–441.
Keeney, R. L., and Raiffa, H., Decisions with Multiple Objectives: Preferences and Value Trade-Offs. Cambridge: Cambridge University Press, 1993.
Zaidan, A., Zaidan, B., Al-Haiqi, A., Kiah, M. L. M., Hussain, M., and Abdulnabi, M., Evaluation and selection of open-source EMR software packages based on integrated AHP and TOPSIS. J. Biomed. Inform. 53:390–404, 2015.
Khatari, M. et al., Multi-criteria evaluation and benchmarking for active queue management methods: Open issues, challenges and recommended pathway solutions. Int. J. Inf. Technol. Decis. Mak.:S0219622019300039, 2019.
Zaidan, A. A. et al., Multi-criteria analysis for OS-EMR software selection problem: A comparative study. Decis. Support. Syst. 78:15–27, 2015.
Zaidan, B. B. et al., A new digital watermarking evaluation and benchmarking methodology using an external group of evaluators and multi-criteria analysis based on ‘large-scale data’. Softw. Pract. Exp. 47(10):1365–1392, 2017.
Yas, Q. M. et al., Towards on develop a framework for the evaluation and benchmarking of skin detectors based on artificial intelligent models using multi-criteria decision-making techniques. Int. J. Pattern Recognit. Artif. Intell. 31(03):1759002, 2017.
Belton, V., and Stewart, T., Multiple Criteria Decision Analysis: An Integrated Approach. Boston: Kluwer Academic Publishers, 2002.
Zaidan, B., Zaidan, A., Abdul Karim, H., and Ahmad, N., A new approach based on multi-dimensional evaluation and benchmarking for data hiding techniques. Int. J. Inf. Technol. Decis. Mak.:1–42, 2017.
Zaidan, B., and Zaidan, A., Software and hardware FPGA-based digital watermarking and steganography approaches: Toward new methodology for evaluation and benchmarking using multi-criteria decision-making techniques. Journal of Circuits, Systems and Computers 26(07):1750116, 2017.
Abdullateef, B. N., Elias, N. F., Mohamed, H., Zaidan, A., and Zaidan, B., An evaluation and selection problems of OSS-LMS packages. SpringerPlus 5(1):248, 2016.
Qader, M. A. et al., A methodology for football players selection problem based on multi-measurements criteria analysis. Measurement 111:38–50, 2017.
Rahmatullah, B. et al., Multi-complex attributes analysis for optimum GPS baseband receiver tracking channels selection. In: 2017 4th International Conference on Control, Decision and Information Technologies, CoDIT 2017. Vol. 2017, 2017, 1084–1088.
Jumaah, F. M. et al., Technique for order performance by similarity to ideal solution for solving complex situations in multi-criteria optimization of the tracking channels of GPS baseband telecommunication receivers. Telecommun. Syst.:1–19, 2018.
Petrovic-Lazarevic, S., and Abraham, A., Hybrid fuzzy-linear programming approach for multi criteria decision making problems. Neural Parallel & Scientific Comp. 11:53–68, 2003.
Malczewski, J., GIS and Multicriteria Decision Analysis. New York: Wiley, 1999.
Alsalem, M., Zaidan, A., Zaidan, B., Hashim, M., Albahri, O., Albahri, A. et al., Systematic review of an automated multiclass detection and classification system for acute leukaemia in terms of evaluation and benchmarking, open challenges, issues and methodological aspects. J. Med. Syst. 42(11):204, 2018.
Yas, Q. M. et al., Comprehensive insights into evaluation and benchmarking of real-time skin detectors: Review, open issues & challenges, and recommended solutions. Measurement 114:243–260, 2018.
Zaidan, B. B., and Zaidan, A. A., Comparative study on the evaluation and benchmarking information hiding approaches based multi-measurement analysis using TOPSIS method with different normalisation, separation and context techniques. Measurement 117:277–294, 2018.
Zaidan, A. A. et al., A review on smartphone skin cancer diagnosis apps in evaluation and benchmarking: Coherent taxonomy, open issues and recommendation pathway solution. Health Technol. (Berl) 8(4):223–238, 2018.
Zionts, S., MCDM-if not a Roman numeral, then what? Interfaces 9:94–101, 1979.
Baltussen, R., and Niessen, L., Priority setting of health interventions: The need for multi-criteria decision analysis. Cost Effectiveness and Resource Allocation 4:1, 2006.
Thokala, P., Devlin, N., Marsh, K., Baltussen, R., Boysen, M., Kalo, Z. et al., Multiple criteria decision analysis for health care decision making—An introduction: Report 1 of the ISPOR MCDA emerging good practices task force. Value Health 19:1–13, 2016.
Oliveira, M., Fontes, D. B., and Pereira, T., Multicriteria decision making: A case study in the automobile industry. Annals of Management Science 3:109, 2014.
Tariq, I. et al., MOGSABAT: A metaheuristic hybrid algorithm for solving multi-objective optimisation problems. Neural Comput. & Applic. 30:1–15, 2018.
Enaizan, O. et al., Electronic medical record systems: Decision support examination framework for individual, security and privacy concerns using multi-perspective analysis. Health Technol.:1–18, 2018.
Salih, M. M. et al., Survey on fuzzy TOPSIS state-of-the-art between 2007–2017. Comput. Oper. Res. 104:207–227, 2019.
Kalid, N. et al., Based real time remote health monitoring systems: A review on patients prioritization and related "big data" using body sensors information and communication technology. J. Med. Syst. 42(2):30, 2018.
Jumaah, F. M. et al., Decision-making solution based multi-measurement design parameter for optimization of GPS receiver tracking channels in static and dynamic real-time positioning multipath environment. Measurement 118:83–95, 2018.
Jadhav, A., and Sonar, R., Analytic hierarchy process (AHP), weighted scoring method (WSM), and hybrid knowledge based system (HKBS) for software selection: A comparative study. In: 2009 Second International Conference on Emerging Trends in Engineering & Technology, 2009, 991–997.
Albahri, A. S. et al., Real-time fault-tolerant mHealth system: Comprehensive review of healthcare services, open issues, challenges and methodological aspects. J. Med. Syst. 42(8):137, 2018.
Albahri, O. S. et al., Real-time remote health-monitoring systems in a medical centre: A review of the provision of healthcare services-based body sensor information, open challenges and methodological aspects. J. Med. Syst. 42(9):164, 2018.
Talal, M. et al., Comprehensive review and analysis of anti-malware apps for smartphones. Telecommun. Syst.:1–53, 2019.
Zaidan, A. A. et al., Based multi-agent learning neural network and Bayesian for real-time IoT skin detectors: A new evaluation and benchmarking methodology. Neural Comput. & Applic., 2019.
Albahri, A. S. et al., Based multiple heterogeneous wearable sensors: A smart real-time health monitoring structured for hospitals distributor. IEEE Access 7:37269–37323, 2019.
Albahri, O. S. et al., Fault-tolerant mHealth framework in the context of IoT-based real-time wearable health data sensors. IEEE Access 7:50052–50080, 2019.
Whaiduzzaman, M., Gani, A., Anuar, N. B., Shiraz, M., Haque, M. N., and Haque, I. T., Cloud service selection using multicriteria decision analysis. Sci. World J. 2014:459375, 2014.
Aruldoss, M., Lakshmi, T. M., and Venkatesan, V. P., A survey on multi criteria decision making methods and its applications. American Journal of Information Systems 1:31–43, 2013.
Singh, A., and Malik, S. K., Major MCDM techniques and their application-a review. IOSR Journal of Engineering 4(5):15–25, 2014.
Opricovic, S., and Tzeng, G.-H., Compromise solution by MCDM methods: A comparative analysis of VIKOR and TOPSIS. Eur. J. Oper. Res. 156:445–455, 2004.
Guo, S., and Zhao, H., Fuzzy best-worst multi-criteria decision-making method and its applications. Knowl.-Based Syst. 121:23–31, 2017.
Rezaei, J., Best-worst multi-criteria decision-making method. Omega 53:49–57, 2015.
Tavana, M., and Hatami-Marbini, A., A group AHP-TOPSIS framework for human spaceflight mission planning at NASA. Expert Syst. Appl. 38:13588–13603, 2011.
Zaidan, A. A., Zaidan, B. B., Albahri, O. S., Alsalem, M. A., Albahri, A. S., Yas, Q. M. et al., A review on smartphone skin cancer diagnosis apps in evaluation and benchmarking: Coherent taxonomy, open issues and recommendation pathway solution. Health Technol. 8:223–238, 2018.
Azeez, D., Ali, M. A. M., Gan, K. B., and Saiboon, I., Comparison of adaptive neuro-fuzzy inference system and artificial neutral networks model to categorize patients in the emergency department. SpringerPlus 2:416, 2013.
Ashour, O. M., and Okudan, G. E., Fuzzy AHP and utility theory based patient sorting in emergency departments. International Journal of Collaborative Enterprise 1:332–358, 2010.
Mills, A. F., A simple yet effective decision support policy for mass-casualty triage. Eur. J. Oper. Res. 253:734–745, 2016.
Adunlin, G., Diaby, V., and Xiao, H., Application of multicriteria decision analysis in health care: A systematic review and bibliometric analysis. Health Expect. 18:1894–1905, 2015.
Jumaah, F., Zadain, A., Zaidan, B., Hamzah, A., and Bahbibi, R., Decision-making solution based multi-measurement design parameter for optimization of GPS receiver tracking channels in static and dynamic real-time positioning multipath environment. Measurement 118:83–95, 2018.
Yas, Q. M., Zaidan, A., Zaidan, B., Rahmatullah, B., and Karim, H. A., Comprehensive insights into evaluation and benchmarking of real-time skin detectors: Review, open issues & challenges, and recommended solutions. Measurement 114:243–260, 2018.
Nilsson, H., Nordström, E.-M., and Öhman, K., Decision support for participatory forest planning using AHP and TOPSIS. Forests 7:100, 2016.
Kornyshova, E., and Salinesi, C., MCDM techniques selection approaches: State of the art. In: 2007 IEEE Symposium on Computational Intelligence in Multi-Criteria Decision-Making, 2007, 22–29.
Kaya, İ., Çolak, M., and Terzi, F., Use of MCDM techniques for energy policy and decision-making problems: A review. Int. J. Energy Res. 42:2344–2372, 2018.
Wan Ahmad, W. N. K., Rezaei, J., Sadaghiani, S., and Tavasszy, L. A., Evaluation of the external forces affecting the sustainability of oil and gas supply chain using best worst method. J. Clean. Prod. 153:242–252, 2017.
Gupta, H., and Barua, M. K., Supplier selection among SMEs on the basis of their green innovation ability using BWM and fuzzy TOPSIS. J. Clean. Prod. 152:242–258, 2017.
Rezaei, J., Best-worst multi-criteria decision-making method: Some properties and a linear model. Omega 64:126–130, 2016.
Yang, Q., Zhang, Z., You, X., and Chen, T., Evaluation and classification of overseas talents in China based on the BWM for intuitionistic relations. Symmetry 8:137, 2016.
Opricovic, S., and Tzeng, G.-H., Extended VIKOR method in comparison with outranking methods. Eur. J. Oper. Res. 178:514–529, 2007.
Mahjouri, M., Ishak, M. B., Torabian, A., Abd Manaf, L., Halimoon, N., and Ghoddusi, J., Optimal selection of iron and steel wastewater treatment technology using integrated multi-criteria decision-making techniques and fuzzy logic. Process Saf. Environ. Prot. 107:54–68, 2017.
Ren, J., Selection of sustainable prime mover for combined cooling, heat, and power technologies under uncertainties: An interval multicriteria decision-making approach. Int. J. Energy Res. 42(8):2655–2669, 2018.
Gupta, H., Evaluating service quality of airline industry using hybrid best worst method and VIKOR. J. Air Transp. Manag. 68:35–47, 2018.
Serrai, W., Abdelli, A., Mokdad, L., and Hammal, Y., An efficient approach for web service selection. In: 2016 IEEE Symposium on Computers and Communication (ISCC), 2016, 167–172.
Shojaei, P., Seyed Haeri, S. A., and Mohammadi, S., Airports evaluation and ranking model using Taguchi loss function, best-worst method and VIKOR technique. J. Air Transp. Manag. 68:4–13, 2018.
Serrai, W., Abdelli, A., Mokdad, L., and Hammal, Y., Towards an efficient and a more accurate web service selection using MCDM methods. J. Comput. Sci. 22:253–267, 2017.
Pamučar, D., Petrović, I., and Ćirović, G., Modification of the best–worst and MABAC methods: A novel approach based on interval-valued fuzzy-rough numbers. Expert Syst. Appl. 91:89–106, 2018.
Tian, Z.-p., Wang, J.-q., and Zhang, H.-y., An integrated approach for failure mode and effects analysis based on fuzzy best-worst, relative entropy, and VIKOR methods. Appl. Soft Comput. 72:636–646, 2018.
Chiu, W.-Y., Tzeng, G.-H., and Li, H.-L., A new hybrid MCDM model combining DANP with VIKOR to improve e-store business. Knowl.-Based Syst. 37:48–61, 2013.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P. et al., Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286:531–537, 1999.
Zhou, C., Wan, L., and Liang, Y., A hybrid algorithm of minimum spanning tree and nearest neighbor for classifying human cancers. In: Advanced Computer Theory and Engineering (ICACTE), 2010 3rd International Conference on, 2010, V5-585–V5-589.
Chakraborty, S., Simultaneous cancer classification and gene selection with Bayesian nearest neighbor method: An integrated approach. Computational Statistics & Data Analysis 53:1462–1474, 2009.
Chunbao, Z., Liming, W., and Yanchun, L., A hybrid algorithm of minimum spanning tree and nearest neighbor for classifying human cancers. In: 2010 3rd International Conference on Advanced Computer Theory and Engineering (ICACTE), 2010, V5-585–V5-589.
Horng, J.-T., Wu, L.-C., Liu, B.-J., Kuo, J.-L., Kuo, W.-H., and Zhang, J.-J., An expert system to classify microarray gene expression data using gene selection by decision tree. Expert Syst. Appl. 36:9072–9081, 2009.
Garro, B. A., Rodríguez, K., and Vazquez, R. A., Designing artificial neural networks using differential evolution for classifying DNA microarrays. In: 2017 IEEE Congress on Evolutionary Computation (CEC), 2017, 2767–2774.
Al-Sahaf, H., Song, A., and Zhang, M., Hybridisation of genetic programming and nearest neighbour for classification. In: 2013 IEEE Congress on Evolutionary Computation, 2013, 2650–2657.
Deegalla, S., and Boström, H., Improving fusion of dimensionality reduction methods for nearest neighbor classification. In: 2009 International Conference on Machine Learning and Applications, 2009, 771–775.
Hasan, A., and Akhtaruzzaman, A. M., High dimensional microarray data classification using correlation based feature selection. In: 2012 International Conference on Biomedical Engineering (ICoBE), 2012, 319–321.
Huang, P. H., and Moh, T.-t., A non-linear non-weight method for multi-criteria decision making. Ann. Oper. Res. 248:239–251, 2017.
Aboutorab, H., Saberi, M., Asadabadi, M. R., Hussain, O., and Chang, E., ZBWM: The Z-number extension of best worst method and its application for supplier development. Expert Syst. Appl. 107:115–125, 2018.
Almahdi, E. M. et al., Based mobile patient monitoring systems: A prioritization framework using multi-criteria decision making techniques. J. Med. Syst. 43, 2019.
Almahdi, E. M. et al., Mobile patient monitoring systems from a benchmarking aspect: Challenges, open issues and recommended solutions. J. Med. Syst. 43, 2019.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethical approval
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institution and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article is part of the Topical Collection on Systems-Level Quality Improvement
Appendices
Appendix 1: Pairwise comparisons
Section 1: Expert questionnaire
Dear Dr.,
The aim of this questionnaire is to compare preferences among the evaluation metrics of multiclass classification models for acute leukaemia, in order to determine the importance of each metric. This questionnaire is part of the research activities at Universiti Pendidikan Sultan Idris (UPSI), Malaysia.
Background:
Name:
Years of experience:
E-Mail:
Position:
Prior to answering the questions, it is important to understand the criteria being assessed in order to arrive at a decision.
These criteria are used to measure the performance of a trained model on the test dataset. The evaluation criteria for acute leukaemia were divided into two main groups: (1) the reliability group and (2) time complexity.
The reliability group includes four subgroups of criteria: (1) matrix of parameters, with four metrics (the confusion matrix: true positive, true negative, false negative, false positive); (2) relationship of parameters, with its metrics (average accuracy, precision (micro), precision (macro) and recall (macro)); (3) behaviour of parameters (F-score); and (4) error rate. Fig. 6 illustrates these levels:
Comparison questions
Comparison measurement scale
The comparisons (relative importance) between criteria are measured on a numerical scale from 1 to 9, as shown in Table 8. Please use this scale in your comparisons.
1. Main Criteria
A. Reliability: the degree to which a parameter value can be trusted; it is one of the two main criteria in this study. This criterion includes four subsections, which are discussed in the next stage.
B. Time Complexity: the time consumed in processing the input and output sample images, i.e., the time the algorithm requires to complete the classification task.
Questions
1.1. Could you indicate which of these two criteria you find the MOST important and which the LEAST important? In Table 9, please mark the cell in front of the MOST important criterion and the cell in front of the LEAST important criterion.
You have selected X criterion as the most important criterion.
1.2. Please determine your preference for the criterion (X) you selected as most important over the least important criterion, using the 1-to-9 measurement scale.
In Table 10, write the criterion you selected as most important (X) in the green cell and the least important criterion in the grey cell, and then enter your preference value.
2. The sub-criteria (Level 2)
A. Matrix of parameters:
It provides the statistics for the numbers of correct and incorrect predictions made by a classification system, compared with the actual classifications of the samples in the test data.
B. Relationship of parameters:
The relationship of parameters comprises metrics typically used to measure the quality ratio of a classifier; these are discussed in the next stage.
C. Behaviour of parameters:
The behaviour of parameters (F-score) measures the harmonic mean of precision and recall; it is discussed in the next stage.
D. Error rate:
Error rate within the dataset: the proportion of samples misclassified; the aim is to obtain the minimum error rate during the training and validation processes applied in machine learning.
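The two quantities above can be stated compactly. A minimal sketch (the function names are illustrative, not taken from the paper):

```python
def f_score(precision: float, recall: float) -> float:
    """Behaviour of parameters: the F-score is the harmonic
    mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def error_rate(misclassified: int, total: int) -> float:
    """Error rate: fraction of samples misclassified during
    training/validation."""
    return misclassified / total
```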
Questions
2.1. Could you indicate which of these criteria (sub-criteria, Level 2) you consider the MOST important and which the LEAST important? In Table 11, please mark the cell in front of the MOST important criterion and the cell in front of the LEAST important criterion.
You have selected X criterion as the MOST important criterion and Y criterion as the LEAST important criterion
2.2. Please determine your preference for the criterion (X) over the other criteria, using the 1-to-9 measurement scale.
In Table 12, write the criterion you selected as most important (X) in the green cell and the other criteria in the grey cells, and then enter your preference values.
2.3. You have selected criterion Y as the LEAST important criterion.
Please determine your preference for each of the other criteria over the criterion (Y), using the 1-to-9 measurement scale.
In Table 13, write the criterion you selected as least important (Y) in the green cell and the other criteria in the grey cells, and then enter your preference values.
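The best-to-others and others-to-worst vectors elicited by questions of this form are exactly the inputs of the BWM weight calculation. A minimal sketch of the linear BWM model (Rezaei, 2016) follows; it assumes SciPy is available, and the example preference values are illustrative, not those of the experts:

```python
import numpy as np
from scipy.optimize import linprog

def bwm_weights(best, worst, a_best, a_worst):
    """Linear BWM: minimise xi subject to |w_best - a_best[j]*w_j| <= xi
    and |w_j - a_worst[j]*w_worst| <= xi, with the weights summing to 1."""
    n = len(a_best)
    c = np.zeros(n + 1)
    c[n] = 1.0  # objective: minimise xi (the last variable)
    A_ub, b_ub = [], []
    for j in range(n):
        for i1, v1, i2, v2 in ((best, 1.0, j, -a_best[j]),     # w_best - a_best[j]*w_j
                               (j, 1.0, worst, -a_worst[j])):  # w_j - a_worst[j]*w_worst
            for sign in (1.0, -1.0):  # both sides of the absolute value
                row = np.zeros(n + 1)
                row[i1] += sign * v1
                row[i2] += sign * v2
                row[n] = -1.0
                A_ub.append(row)
                b_ub.append(0.0)
    A_eq = [np.append(np.ones(n), 0.0)]  # sum of weights = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (n + 1), method="highs")
    return res.x[:n], res.x[n]

# Example: reliability (best) preferred 4-to-1 over time complexity (worst).
w, xi = bwm_weights(0, 1, [1, 4], [4, 1])
print(w)  # approximately [0.8, 0.2], with consistency indicator xi near 0
```

A fully consistent comparison vector yields xi = 0; larger xi values signal inconsistent expert judgements.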
3. The sub-criteria (A) of matrix of parameters (Level 3)
True positive | The number of elements correctly classified as positive by the test. When cancer cells are correctly identified |
True negative | The number of elements correctly classified as negative by the test. When non-cancer cells are correctly identified |
False positive | The number of elements classified as positive by the test, but they are not. When non-cancer cells are identified as cancerous |
False negative | The number of elements classified as negative by the test, but they are not. When cancer cells are identified as noncancerous |
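For a multiclass problem, the four counts in the table above are obtained per class in a one-vs-rest fashion. A minimal sketch (the helper name and class labels are illustrative):

```python
def confusion_counts(y_true, y_pred, cls):
    """One-vs-rest TP, TN, FP, FN for class `cls` in a multiclass setting."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    tn = len(y_true) - tp - fp - fn  # everything else is a true negative
    return tp, tn, fp, fn

# Illustrative labels only (the dataset's actual classes are not assumed here).
y_true = ["ALL", "AML", "MLL", "ALL", "AML"]
y_pred = ["ALL", "MLL", "MLL", "AML", "AML"]
print(confusion_counts(y_true, y_pred, "ALL"))  # -> (1, 3, 0, 1)
```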
Questions
3.1. Could you indicate which of these criteria (sub-criteria A, Level 3) you consider the MOST important and which the LEAST important? In Table 14, please mark the cell in front of the MOST important criterion and the cell in front of the LEAST important criterion.
You have selected X criterion as the MOST important criterion and Y criterion as the LEAST important criterion
3.2. Please determine your preference for the criterion (X) over the other criteria, using the 1-to-9 measurement scale.
In Table 15, write the criterion you selected as most important (X) in the green cell and the other criteria in the grey cells, and then enter your preference values.
3.3. You have selected criterion Y as the LEAST important criterion.
Please determine your preference for each of the other criteria over the criterion (Y), using the 1-to-9 measurement scale.
In Table 16, write the criterion you selected as least important (Y) in the green cell and the other criteria in the grey cells, and then enter your preference values.
4. The sub-criteria (B) of relationship of parameters (Level 3)
Average Accuracy | The average effectiveness of all classes |
Precision(micro) | Measures the positive patterns correctly predicted out of the total predicted positive patterns, pooled over all classes (agreement of the data class labels with those of the classifier) |
Precision(macro) | The average per-class agreement of the data class labels with those of the classifier |
Recall(Macro) | Recall is used to measure the fraction of positive patterns that are correctly classified |
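The micro and macro averages above differ only in where the averaging happens: micro pools the per-class counts before dividing, while macro averages the per-class ratios. A minimal sketch, using hypothetical three-class labels (the class names are illustrative, not data from the study):

```python
def per_class_counts(y_true, y_pred, classes):
    """Per-class (TP, FP, FN) counts for a multiclass problem."""
    return {
        c: (
            sum(t == c and p == c for t, p in zip(y_true, y_pred)),  # TP
            sum(t != c and p == c for t, p in zip(y_true, y_pred)),  # FP
            sum(t == c and p != c for t, p in zip(y_true, y_pred)),  # FN
        )
        for c in classes
    }

def precision_micro(counts):
    """Pool TP and FP over all classes, then divide."""
    tp = sum(tp for tp, _, _ in counts.values())
    fp = sum(fp for _, fp, _ in counts.values())
    return tp / (tp + fp)

def precision_macro(counts):
    """Average the per-class precision ratios."""
    return sum(tp / (tp + fp) for tp, fp, _ in counts.values()) / len(counts)

def recall_macro(counts):
    """Average the per-class recall ratios."""
    return sum(tp / (tp + fn) for tp, _, fn in counts.values()) / len(counts)

counts = per_class_counts(
    y_true=["ALL", "ALL", "AML", "AML", "normal", "normal"],
    y_pred=["ALL", "AML", "AML", "normal", "normal", "ALL"],
    classes=["ALL", "AML", "normal"],
)
# For this balanced toy example, micro precision, macro precision
# and macro recall all equal 0.5.
```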
Questions
-
4.1.
Could you indicate which one of these criteria (sub-criteria B, Level 3) you consider the MOST important and which one you find the LEAST important? Please mark, in Table 17, the cell in front of the MOST important criterion and the cell in front of the LEAST important criterion.
You have selected the X criterion as the MOST important criterion and the Y criterion as the LEAST important criterion.
-
4.2.
Please determine your preference of the criterion (X) over each of the other criteria using a 1-to-9 measurement scale.
Please write the X criterion that you selected as the MOST important criterion in the green cell and the other criteria in the grey cells in Table 18, and then write your preference values.
-
4.3.
You have selected the Y criterion as the LEAST important criterion.
Please determine your preference of each of the other criteria over the Y criterion using a 1-to-9 measurement scale.
Please write the Y criterion that you selected as the LEAST important criterion in the green cell and the other criteria in the grey cells in Table 19, and then write your preference values.
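The two preference vectors elicited above (best-to-others and others-to-worst) are the inputs BWM needs to derive criterion weights. A minimal sketch of the linear BWM model, solved as a small linear programme; the expert answers below are hypothetical, fully consistent values for the four sub-criteria (B), not responses from the study:

```python
import numpy as np
from scipy.optimize import linprog

def bwm_weights(best, worst, best_to_others, others_to_worst):
    """Linear BWM model: minimise xi subject to
    |w_best - a_Bj * w_j| <= xi and |w_j - a_jW * w_worst| <= xi,
    with sum(w) = 1 and w >= 0.  Variables are [w_1..w_n, xi]."""
    n = len(best_to_others)
    c = np.zeros(n + 1)
    c[-1] = 1.0                      # objective: minimise xi (last variable)
    A_ub, b_ub = [], []
    for j in range(n):
        if j != best:                # |w_best - a_Bj * w_j| <= xi
            for sign in (1.0, -1.0):
                row = np.zeros(n + 1)
                row[best] = sign
                row[j] = -sign * best_to_others[j]
                row[-1] = -1.0
                A_ub.append(row); b_ub.append(0.0)
        if j != worst:               # |w_j - a_jW * w_worst| <= xi
            for sign in (1.0, -1.0):
                row = np.zeros(n + 1)
                row[j] = sign
                row[worst] = -sign * others_to_worst[j]
                row[-1] = -1.0
                A_ub.append(row); b_ub.append(0.0)
    A_eq = np.ones((1, n + 1))       # weights sum to 1 (xi excluded)
    A_eq[0, -1] = 0.0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (n + 1))
    return res.x[:n], res.x[-1]      # (weights, optimal xi as consistency indicator)

# Hypothetical answers: best = Average Accuracy (index 0),
# worst = Recall (macro) (index 3).
w, xi = bwm_weights(best=0, worst=3,
                    best_to_others=[1, 2, 4, 8],
                    others_to_worst=[8, 4, 2, 1])
# These answers are fully consistent, so xi* is 0 and
# w is approximately [8/15, 4/15, 2/15, 1/15].
```

A smaller optimal xi indicates a more consistent set of expert answers; inconsistent 1-to-9 judgements yield xi > 0.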
Should you have any inquiry or wish to know the results, please contact:
Mohammed Assim Mohammed Ali
Email: Mohammed.asum@gmail.com
Mobile phone: 0060189810357
……. Thanks for Your Time …….
Section 2: List of experts
Appendix 2 results of the BWM method for second and third experts
Appendix 3 results of VIKOR for second and third experts
Cite this article
Alsalem, M.A., Zaidan, A.A., Zaidan, B.B. et al. Multiclass Benchmarking Framework for Automated Acute Leukaemia Detection and Classification Based on BWM and Group-VIKOR. J Med Syst 43, 212 (2019). https://doi.org/10.1007/s10916-019-1338-x