Introduction

Medical informatics is the intersection of information science, computer science, and health care [1,2,3,4,5,6,7,8,9,10]. This field deals with the resources, devices, and methods required to optimize the acquisition, storage, retrieval, and use of information in health [11,12,13,14,15,16,17,18,19,20,21,22,23,24]. The decisions of the administration departments of medical organisations are critical, particularly those regarding the selection of automated solutions for the diagnosis and detection of complex diseases, such as acute leukaemia [25]. The importance of selecting appropriate automated solutions can be attributed to their extensive use [26]. Automated solutions based on artificial intelligence techniques can provide rapid acute leukaemia diagnosis and classification and increase the reliability and accuracy of diagnostic results [26,27,28,29,30,31,32]. Many physicians, cancer treatment centres and hospitals have started using automated models for acute leukaemia classification to address several potential limitations of manual analysis [26, 29, 30]. However, despite the increasing number of automated classification models, finding models that deliver highly accurate results in a short time and without error remains challenging [33]. Therefore, the administration departments of health organisations face difficulties in evaluating and benchmarking automated classification models for acute leukaemia and determining the best model, especially when no single model is superior [29, 33, 34]. Moreover, evaluating and comparing different classification models is difficult in the presence of multiple evaluation criteria [35, 36]. Given the existence of different classification models for acute leukaemia, the health sector has difficulty deciding which model should be used.
The processes required for the evaluation and benchmarking of automated classification models for dangerous medical cases are crucial to identifying the classification model that delivers the best results [27]. These processes are crucial because the selection of an incorrect classification model can lead to the loss of a patient’s life, legal accountability and even financial costs for health organisations. For example, when a model incorrectly identifies non-cancer cells as cancerous, the surgery and diagnostic tests the patient has to undergo may adversely affect his or her mental health. Conversely, when a model incorrectly identifies cancer cells as non-cancerous, the disease remains untreated, and the patient may die as a result. Both cases have a negative impact on the reputation and performance of healthcare organisations. Therefore, determining the most efficient technique for selecting a suitable classification model for acute leukaemia is necessary. Given that these models are not cheap and directly relate to the medical care of humans, they must be evaluated and benchmarked [35]. The procedures related to the evaluation and benchmarking of multiclass classification of acute leukaemia remain challenging [29]. The tasks involved in evaluating and benchmarking automated models for acute leukaemia are difficult decision-making tasks and require numerous measurements [34]. Two basic groups of criteria are commonly utilised in the evaluation and benchmarking of acute leukaemia multiclass classification models: (1) time complexity and (2) reliability. The reliability group has a set of sub-criteria (TP, TN, FP, FN, ave-accuracy, precisionμ, precisionM, recallM, fscore and error rate) [37, 38]. Snousy et al. 
considered the main requirements for the best classification model in terms of accuracy [33]; nine classification models were compared on the basis of the accuracy criterion in their study. Despite the importance of the remaining criteria [39,40,41], several studies [32, 42,43,44,45,46] adopted only the classification accuracy criterion for the evaluation and benchmarking of classification models. However, the quality assessment of acute leukaemia classification models requires additional attention, and other aspects must be considered in the evaluation process [33]. According to Rawat et al., although accuracy is the most widely used metric, it treats every class with equal importance and neglects the differences among the types of classes [32]. However, in real cases, particularly those related to medicine, the distinction among certain classes is important. In [47,48,49], True Positive, True Negative, False Positive and False Negative were used as key criteria for evaluation and benchmarking, but other requirements that might affect classification performance were neglected. In [35], the calculation of time complexity was found to be time consuming for classification. High computational cost slows down classification [50]. Misha et al. indicated that dataset size should be considered in the classification task because a large dataset affects processing time; this condition is known as time complexity [35]. Ludwig et al. stated that in the scope of cancer data analysis, speed and accuracy are the main aspects that must be considered in evaluating the efficiency of classification models [51]. Classification tasks are considered good if results are delivered with low computational time whilst classification accuracy is simultaneously improved [52]. 
In other words, the main requirements that must be considered when developing any acute leukaemia multiclass classification model are (1) time complexity and (2) reliability. Reliability should be high, and the time complexity of producing the output should be low [52]. However, these are competing requirements [53]; that is, high reliability cannot be obtained simultaneously with low time complexity. Thus, developers usually focus on either increasing reliability or decreasing time complexity. If a highly reliable multiclass classification model is required, then time must be sacrificed, and vice versa. This trade-off and conflict among the evaluation criteria are reflected in the evaluation and benchmarking process: conflicts among criteria arise in the comparison, and the benchmarking process is affected. Consequently, benchmarking over multiple criteria is difficult under trade-off and conflict [54]. Reliability and time complexity should both be measured in the evaluation of any classification model. However, the current approaches for comparing novel and previous models in all the reviewed studies do not cover the full set of evaluation and benchmarking criteria; they emphasise one evaluation aspect and neglect the rest because they are not sufficiently flexible to deal with the conflict or trade-off among the various criteria [33]. Conflict and trade-off are the first issue faced in the evaluation and benchmarking of multiclass classification models. The second issue is the importance of each criterion. The evaluation of acute leukaemia multiclass classification models involves a set of criteria, and the importance of each criterion is distinct and depends on the objectives of the developed model. That is, the importance of one evaluation criterion might be boosted in exchange for the low importance of another criterion, depending on model objectives [34]. 
Therefore, trade-off and conflict exist between evaluation and benchmarking criteria because the importance of each criterion differs across models [55]. The third issue emerges when the benchmarking process is conducted on the basis of multiple criteria and sub-criteria simultaneously [56,57,58]. This approach is difficult because of the trade-off among the criteria and their varying importance; moreover, the values of the reliability criteria set depend on the confusion matrix, which contains four parameters: True Positive, False Positive, True Negative and False Negative [47, 59]. The four parameters are prone to losing values in experiments, which affects the remaining values of the other criteria in the reliability group. Despite the criticism of these parameters, studies still use them for the evaluation of multiclass classification models [56,57,58, 60]. Furthermore, the current evaluation and benchmarking tools have limitations. These tools cannot entirely cover the measurements required by multiclass classification models. Moreover, these tools are limited in terms of calculating all the parameters of the reliability group, comparing additional classification methods and matching between classification methods because they cannot rank the models according to performance [61,62,63]. On the basis of the preceding discussion, the problem of evaluating and benchmarking multiclass classification models for acute leukaemia is defined as a multi-criteria problem. Therefore, an integrated and comprehensive platform covering all aspects of performance in the evaluation and benchmarking of multiclass classification models for acute leukaemia should be developed. This integrated platform will serve as a tool that supports the decisions of the administrators of medical organisations in evaluating and benchmarking the available alternatives and identifying the best model. 
The main objective of the current paper is to propose a framework for evaluating and benchmarking multiclass classification models for acute leukaemia. The remainder of this article is divided into seven sections: the ‘Related studies’ section presents a review of the related literature. The ‘Multi-criteria decision making’ section presents the theoretical background of the recommended solution. The ‘Methodology’ section reports the evaluation and benchmarking framework for multiclass classification models. The results and discussion are reported in the ‘Results and discussion’ section. The ‘Validation’ section deliberates the validation results for the proposed framework. The ‘Limitations and future study’ section highlights the limitations of the proposed framework and future studies. The ‘Conclusion’ section concludes the research.

Related studies

The selection of a suitable classification model for acute leukaemia is considered a challenge faced by medical institutions, especially those with specialisation in cancer treatment. The essence of the challenge lies in the capacity of the selected model to allow a precise and immediate acute leukaemia classification.

Previous literature has distinctly explained that the classification tasks of acute leukaemia differ with respect to the accuracy of the provided results and overall performance; similarly, no previous classification model has been considered superior [29, 33, 34]. Many studies have discussed the development of automated models for acute leukaemia analysis, as well as the way the models are used and the benefits that health organisations could gain from using them [29, 32, 34, 47, 49, 64,65,66,67,68,69]. However, studies that aim to evaluate and benchmark the available classification models and determine the best one are limited. The existing academic literature on the evaluation and benchmarking of acute leukaemia multiclass classification models is scarce and scattered; some studies are limited to the evaluation and benchmarking of only one aspect of performance. In [70], automated microscopy with the DM96TM was analysed for determining blood cells, and its accuracy was compared with that of the manual method and the XE-2100TM. Snousy et al. compared nine classification models of the decision tree family in terms of accuracy and examined the experimental effects of different feature selection methods on accuracy [33]. ALL-IDB, a public image dataset of peripheral blood samples from normal people and patients with leukaemia, was proposed in [27]; the dataset, which is particularly designed for comparing and evaluating segmentation and classification algorithms, supports supervised classification and segmentation. In [71], three automatic detection approaches for leukaemic cells were compared. The first approach is based on a support vector machine, the second on a neural network and the third on Gaussian mixture model estimation. The comparison relied on three criteria, namely, accuracy, precision and recall. 
In addition to the effect of various segmentations on classification results, in [39], two classification schemes were compared in terms of segmentation quality. The first scheme is based on a support vector machine, whereas the second is based on random forest. Evaluation and benchmarking methods must cover all the main requirements and substantively determine the performance and quality of classification models for acute leukaemia. Saritha et al. ensured that their automated classification model has high accuracy and efficiency in addition to reduced processing time and a small error rate; suitable treatment can be provided to patients with the early identification of leukaemia [52]. Despite the substantial effort in the evaluation and benchmarking of acute leukaemia classification tasks, no study has provided an integrated solution that covers the key evaluation criteria for evaluating and benchmarking multiclass classification models and helps the administrators of medical organisations and various users to determine a suitable model. This study attempts to fill the evaluation and benchmarking research gap with respect to acute leukaemia classification tasks.

Multi-criteria decision making (MCDM)

Numerous MCDM definitions are available in the academic literature. Keeney and Raiffa [72] defined MCDM as an extension of decision theory that covers any decision with multiple objectives. MCDM is used as a methodology to aid individuals in assessing alternatives, often under conflicting criteria that are combined into one overall appraisal [73,74,75,76,77]. Among the other definitions of MCDM, [78] defined MCDM as an umbrella term describing a collection of formal approaches that take explicit account of multiple criteria to assist individuals or groups in exploring important decisions [79,80,81,82,83,84]. Among the most well-known decision techniques, MCDM is known for its decision-making capabilities, enabling it to address complicated decision problems whilst handling multiple criteria [85, 86]. Furthermore, MCDM provides a systematic method for addressing decision problems on the basis of multiple criteria [86,87,88,89,90]. The goal is to help decision makers deal with this kind of problem [91]. The MCDM procedure often relies on approaches of a quantitative and qualitative nature and frequently concentrates on simultaneously dealing with multiple and conflicting criteria [92, 93]. MCDM can also increase decision quality in more effective and rational ways than traditional processes [94]. Furthermore, MCDM aims to identify suitable alternatives among a group of available ones and rank the alternatives in decreasing order of performance [95,96,97,98,99], and finally to select the best of these alternatives [100,101,102,103,104,105,106]. Suitable alternatives are scored on the basis of these goals. Essential terms are required in any MCDM solution, namely, the decision or evaluation matrix and the decision criteria [107]. 
The decision matrix must be created using elements including n criteria and m alternatives. The intersection of each alternative and criterion is specified as x_ij. Therefore, the matrix (x_ij)_(m×n) is expressed as follows:

$$ {\displaystyle \begin{array}{l}\kern2.00em {C}_1\kern1.25em {C}_2\kern1em \cdots \kern1em {C}_n\\ {}D=\begin{array}{c}{A}_1\\ {}{A}_2\\ {}\vdots \\ {}{A}_m\end{array}\left[\begin{array}{cccc}{x}_{11}& {x}_{12}& \cdots & {x}_{1n}\\ {}{x}_{21}& {x}_{22}& \cdots & {x}_{2n}\\ {}\vdots & \vdots & \ddots & \vdots \\ {}{x}_{m1}& {x}_{m2}& \cdots & {x}_{mn}\end{array}\right], \end{array}} $$

where A_1, A_2, …, A_m are the possible alternatives to be ranked by the decision makers (i.e. the classification models); C_1, C_2, …, C_n are the criteria against which the performance of each alternative is evaluated; x_ij is the rating of alternative A_i with respect to criterion C_j and W_j is the weight of criterion C_j. Special processes must be accomplished to score the alternatives, including normalisation, a maximisation indicator and the addition of weights, depending on the method. For example, suppose that D is the decision matrix utilised to score the performance of alternative A_i on the basis of criterion C_j. Enhancing the decision-making process is important and possible by involving decision makers and stakeholders and by using appropriate decision-making methods to handle multi-criteria problems. Healthcare is one of the domains in which MCDM is extensively utilised [93, 108]. Decision making in healthcare can be improved through a systematic method and by determining the best decision through different MCDM methods [109, 110]; notably, many decisions in the healthcare and medical fields are complex and unstructured [108]. Numerous MCDM techniques have been developed, and the most commonly used are the best-worst method (BWM), weighted product method (WPM), hierarchical adaptive weighting (HAW), simple additive weighting (SAW), multiplicative exponential weighting (MEW), weighted sum model (WSM), analytic network process (ANP), analytic hierarchy process (AHP), technique for order of preference by similarity to ideal solution (TOPSIS) and VlseKriterijumska Optimizacija I Kompromisno Resenje (VIKOR), each of which uses different notations [1, 73, 79,80,81, 108, 110,111,112,113,114,115,116,117,118,119,120,121]. The available MCDM techniques are diverse, and this diversity makes the selection of suitable techniques difficult. 
Each technique has its own limitations and strengths [81, 109, 112, 122, 123]. Thus, selecting the most suitable MCDM method is important. To the best of our knowledge, none of the analysed methods has been used to rank multiclass classification models for acute leukaemia. In our previous work [87], we found that BWM and VIKOR are two of the best MCDM methods.

The current study utilised the best-worst method because it can provide more consistent results than AHP and other MCDM weighting methods. Moreover, BWM requires fewer pairwise comparisons than other methods [112, 124,125,126]. The pairwise comparison in BWM also focuses on reference comparisons: it elicits the preference of the most important criterion over all the other criteria, in addition to the preference of all the other criteria over the least important criterion [111, 112, 127]. Conversely, MCDM methods are frequently used to rank alternatives, and the most common is VIKOR. This method utilises a compromise-priority approach for multiple response optimisation [110, 128, 129]. VIKOR is based on an aggregating function that represents ‘closeness to the ideal’, and its ranking index is based on a particular measure of ‘closeness’ to the ideal solution. Furthermore, VIKOR can rank the alternatives and accurately and rapidly determine the best one [128]. The style of recent VIKOR studies has changed, and VIKOR is now usually integrated with another MCDM method. The reviewed studies identified and provided different examples of applying VIKOR with BWM to improve the consistency of subjective weights. Such an integration of VIKOR with BWM realises a robust method. Given the advantages of the two methods in overcoming the uncertainties associated with the problem described in [130,131,132,133,134,135,136], using VIKOR and BWM is easy and clear even for those with no background in MCDM [136]. Utilising VIKOR in different cases (e.g. individual and group) has been recommended. Two main cases of decision making are basically emphasised: the first case is decision making based on a single decision maker; the second involves many decision makers and is called group decision making (GDM), in which individuals collectively select alternatives from the ones presented to them. 
The decision is not attributed to any single group member because individual and social processes, such as social influence, contribute to the outcome. GDM techniques systematically collect and combine elements from experts in different fields, including their knowledge and judgement. In the group case, each expert provides a subjective judgement on the criteria and assigns a weight to every criterion [110, 137]. Finally, the evaluation and benchmarking of acute leukaemia multiclass classification suggests a need to integrate the BWM and VIKOR methods: weights are assigned to the criteria (reliability and time complexity) using BWM on the basis of expert evaluation, and VIKOR is recommended for ranking the multiclass classification models.

Methodology

This section introduces the evaluation and benchmarking methodology for the automated multiclass classification models and presents the procedures and steps of the proposed framework. The output is a ranking of the multiclass classification models based on the set of criteria, with BWM and VIKOR used for weighting and ranking, respectively. The overall conceptual elements of the present study are illustrated in Fig. 1.

Fig. 1

Benchmarking methodology of the multiclass classification models for acute leukaemia

Construction of decision matrix

The decision matrix is the main component of the evaluation and benchmarking framework. Its main parts are the decision criteria and the alternatives. In the present case, the criteria represent the metrics used for measuring the quality of multiclass classification models. The next subsection describes the procedures followed to develop and evaluate the multiclass classification models and construct the decision matrix.

Data source

The acute leukaemia microarray dataset proposed by [138] was adopted in this study. The dataset is popular and frequently utilised in the academic literature [139,140,141] and is publicly available. The dataset covers three categories of acute leukaemia: acute myelogenous leukaemia (AML), ALL B cell and ALL T cell. It comprises 5327 genes and 72 samples, of which 38 are AML, 9 are ALL-B and 25 are ALL-T types.

Development of multiclass classification models

Developing multiclass classification models requires a three-step process. Firstly, the target dataset is prepared, which includes the selection of relevant features. Secondly, training (the learning process), in which a class model is established through machine learning, is achieved by analysing the instances of a training dataset; each instance is assumed to belong to a predefined class. Thirdly, the machine learning algorithms are executed on another independent dataset, also known as the testing dataset, with the aim of estimating machine learning performance. If the performance of a multiclass classification model appears to be ‘acceptable’, then the model can be utilised for future classification cases in which the class label is unknown. Ultimately, the multiclass classification models that supply acceptable results can be considered acceptable multiclass classification models.
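The three steps above can be sketched as follows, using scikit-learn as a stand-in for the Weka workflow the paper uses; the synthetic dataset and the choice of a random forest classifier are illustrative assumptions, not the paper's actual data or algorithm set.

```python
# Sketch of the three-step model-development process described above.
# The synthetic data only illustrates the workflow (72 samples, 3 classes,
# mimicking the shape of the AML / ALL-B / ALL-T problem).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Step 1: prepare the target dataset (including feature selection upstream).
X, y = make_classification(n_samples=72, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Step 2: training -- learn a class model from labelled instances.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Step 3: estimate performance on an independent testing dataset; if the
# result is 'acceptable', the model can classify unseen, unlabelled cases.
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.3f}")
```

The stratified split mirrors the paper's division of the dataset into a training part and a testing part.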

Microarray data generally contain a small number of samples (dozens) and high dimensionality (thousands of genes). Nevertheless, classification results are affected by only a few of these genes; most genes have no classification value. Irrelevant genes, apart from their negative effects on classification performance, can cause conflict in the classification model. Moreover, given that irrelevant genes can lead to over-fitting, a positive effect can be attained by reducing the number of genes; this approach minimises the computed input and improves the overall performance and results of the classification [28, 142, 143]. In this study, the genes that are highly relevant to the classification classes, known as informative genes, are selected. The chi-square (X2) method [33] was used for the individual evaluation of the features. The X2 value is computed as follows [28, 142, 143]:

$$ {x}^2(a)=\sum \limits_{v\in V}\sum \limits_{i=1}^n\frac{{\left[{A}_i\left(a=v\right)-{E}_i\left(a=v\right)\right]}^2}{E_i\left(a=v\right)}, $$
(1)

where V is the set of possible values of a, n is the number of classes, A_i(a = v) is the number of samples in the ith class with a = v and E_i(a = v) is the expected value of A_i(a = v); E_i(a = v) = P(a = v) P(c_i) N, where P(a = v) is the probability of a = v, P(c_i) is the probability of one sample being labelled with the ith class and N is the total number of samples [33].
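Equation (1) can be sketched directly in numpy for a discretised gene. This is a minimal illustrative implementation of the formula above, not the paper's Weka-based pipeline, and the toy feature values are made up.

```python
import numpy as np

def chi_square_score(a, y):
    """X^2 statistic of a discrete feature `a` against class labels `y`,
    following Eq. (1): sum over values v and classes i of
    [A_i(a=v) - E_i(a=v)]^2 / E_i(a=v), with E_i(a=v) = P(a=v) P(c_i) N."""
    a, y = np.asarray(a), np.asarray(y)
    N = len(y)
    score = 0.0
    for v in np.unique(a):
        p_v = np.mean(a == v)                        # P(a = v)
        for c in np.unique(y):
            observed = np.sum((a == v) & (y == c))   # A_i(a = v)
            expected = p_v * np.mean(y == c) * N     # E_i(a = v)
            score += (observed - expected) ** 2 / expected
    return score

# Toy example: a binarised 'gene' that perfectly separates two classes
# scores higher than an uninformative one, so it would be kept as informative.
y = np.array([0, 0, 0, 1, 1, 1])
informative = np.array([1, 1, 1, 0, 0, 0])
noise = np.array([1, 0, 1, 0, 1, 0])
print(chi_square_score(informative, y), chi_square_score(noise, y))
```

Genes are then ranked by their X^2 score, and the top-scoring (informative) genes are retained.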

A total of 22 multiclass classification models are built based on 22 well-known machine learning algorithms available in the Weka software, which have been extensively used in prior studies [33, 42, 51, 60, 142, 144,145,146] and have demonstrated satisfactory results in the classification of microarray datasets. These algorithms are Rules.ZeroR, BayesNet, Bayes.NaiveBayesUpdateable, Lazy.IBk, Meta.AdaBoostM1, Meta.Bagging, Meta.FilteredClassifier, Meta.LogitBoost, Trees.J48, REPTree, RandomTree, RandomForest, Rules.DecisionTable, Rules.PART, Meta.RandomCommittee, Trees.LMT, Trees.HoeffdingTree, KStar, Functions.SMO, Functions.SimpleLogistic, Bayes.NaiveBayes and DecisionStump. The dataset is divided into two parts to develop the multiclass classification models: the first part is utilised for training, and the other is used for testing. The training set is used to train the machine learning algorithms, and the testing set is utilised to test the trained algorithms. The test dataset is classified into three categories, namely, AML, ALL-B and ALL-T, using the 22 multiclass classification models.

Establishment and evaluation of the decision matrix

The establishment of the decision matrix is dependent on the crossover between the evaluation criteria, namely, Ave accuracy, error rate, precisionM, precisionμ, recallM, FP, FN, TP, TN, fscore and time complexity, and the 22 developed multiclass classification models. Figure 2 presents the structure of the proposed decision matrix.

Fig. 2

Structure of decision matrix

Figure 2 shows the structure of the proposed decision matrix; the top row represents the main evaluation criteria, and the first column on the left represents the different developed multiclass classification models as alternatives. The values (data) in this decision matrix denote the evaluation results of all developed multiclass classification models according to all evaluation criteria. Each multiclass classification model is evaluated based on all evaluation criteria, where the matrix of parameters, relationship of parameters, parameter behaviour and error rate represent the four sub-criteria sets in the reliability group. Firstly, the matrix of parameters (TP, TN, FN and FP) is generated; these parameters represent the basic sub-criteria in the reliability group. Given that this study addresses the multiclass classification problem, the one-versus-all approach is used in calculating the reliability set of criteria. Accordingly, the multiclass confusion matrix is converted into three confusion matrices, each of which describes the parameters for a certain class of acute leukaemia (AML, ALL-B and ALL-T). Based on the three confusion matrices, the remaining sub-criteria within the reliability group are calculated for each matrix by using specific formulas. Therefore, the values for each multiclass classification model are separately calculated to generate the input of the decision matrix. Finally, the calculation of time complexity is based on the time consumed by two elements: the input of the dataset sample and the result output. The calculation relies on the number and size of samples, as indicated in the following equation:

$$ {T}_{process}={T}_o-{T}_i $$
(3)

where T_o is the processing time to obtain outputs, and T_i is the time of inputting the sample. The time complexity is calculated by the Weka software through the experimental process. As mentioned above, the three specific issues encountered by the proposed decision matrix are (1) trade-off and conflict among the evaluation criteria, (2) multiple evaluation criteria and (3) the importance of criteria, given that a weight difference is observed between the main criteria and the sub-criteria. MCDM is used to address these issues, as presented in the next section.
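The one-versus-all conversion described above can be sketched as follows: each class's TP, FP, FN and TN are read off the multiclass confusion matrix, and reliability sub-criteria are then derived using standard macro/micro definitions (an assumption, since the paper does not spell out its exact formulas here). The matrix values are illustrative, not the paper's experimental results.

```python
import numpy as np

def one_vs_all_counts(cm):
    """Decompose a multiclass confusion matrix (rows = actual class,
    columns = predicted class) into per-class TP, FP, FN and TN,
    following the one-versus-all conversion."""
    cm = np.asarray(cm)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp   # predicted as the class but actually not
    fn = cm.sum(axis=1) - tp   # actually the class but predicted otherwise
    tn = cm.sum() - tp - fp - fn
    return tp, fp, fn, tn

# Illustrative 3-class matrix standing in for AML / ALL-B / ALL-T.
cm = np.array([[30, 2, 1],
               [1, 20, 2],
               [0, 1, 15]])
tp, fp, fn, tn = one_vs_all_counts(cm)

# Standard reliability sub-criteria derived from the four parameters
# (macro 'M' averages per-class scores; micro 'mu' pools the counts).
ave_accuracy = np.mean((tp + tn) / (tp + fp + fn + tn))
error_rate = 1 - ave_accuracy
precision_M = np.mean(tp / (tp + fp))
recall_M = np.mean(tp / (tp + fn))
precision_mu = tp.sum() / (tp.sum() + fp.sum())
```

Each of the 22 models yields one such set of values, which fills one row of the decision matrix.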

Development of the evaluation and benchmarking framework

The proposed evaluation and benchmarking framework is developed based on MCDM techniques, specifically the integration of BWM and VIKOR for weighting the criteria, ranking the alternatives in the proposed decision matrix and selecting the best one. The subsequent steps are presented below.

Development of evaluation and benchmarking/selection integrated methods of BWM and VIKOR using MCDM

The suitable methods for benchmarking and ranking multiclass classification models are BWM and VIKOR. The VIKOR method is a mathematical model recommended for ranking and for solving the specific issues of (1) trade-off and conflict and (2) multiple evaluation criteria encountered by the proposed decision matrix. BWM is used for weighting the criteria to solve (3) the importance of criteria in relation to the proposed decision matrix.

Accordingly, the combination of BWM and VIKOR methods is justified for benchmarking and ranking the multiclass classification models.
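The VIKOR ranking used here aggregates 'closeness to the ideal' via group utility S, individual regret R and a compromise index Q. A minimal numpy sketch under the standard VIKOR formulas (with the usual v = 0.5) follows; the matrix values, weights and criterion types are illustrative assumptions, not the paper's data.

```python
import numpy as np

def vikor_rank(X, weights, benefit, v=0.5):
    """Minimal VIKOR sketch. Rows of X are alternatives (models),
    columns are criteria; benefit[j] is True when larger values of
    criterion j are better. Returns Q (lower = better) and the ranking."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(weights, dtype=float)
    benefit = np.asarray(benefit, dtype=bool)
    f_best = np.where(benefit, X.max(axis=0), X.min(axis=0))    # ideal f*
    f_worst = np.where(benefit, X.min(axis=0), X.max(axis=0))   # anti-ideal f-
    d = w * (f_best - X) / (f_best - f_worst)  # assumes each column varies
    S = d.sum(axis=1)                          # group utility
    R = d.max(axis=1)                          # individual regret
    Q = (v * (S - S.min()) / (S.max() - S.min())
         + (1 - v) * (R - R.min()) / (R.max() - R.min()))
    return Q, np.argsort(Q)

# Illustrative matrix: 3 models evaluated on 2 criteria
# (reliability: benefit; time complexity in seconds: cost).
X = [[0.98, 4.0],
     [0.90, 2.5],
     [0.85, 1.0]]
Q, ranking = vikor_rank(X, weights=[0.7, 0.3], benefit=[True, False])
```

In the full framework, the weights passed in would come from BWM rather than being fixed by hand.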

Calculation of the weights of criteria based on BWM method

Assigning proper weights to multiple criteria using BWM requires several steps. The BWM procedure includes the following steps [112, 147]:

Step 1. Determine a set of decision criteria

In BWM, the first step is to determine the set of criteria, C1, C2, …, Cn, that the decision maker should consider when selecting the best alternative. In the present study, the set of criteria is obtained from the analysis conducted in the literature.

Step 2. Determine the best and worst criteria

The best criterion is the most desirable or most important decision criterion, whereas the worst criterion is the least desirable or important one. This step involves the identification of the best and worst criteria from the perspective of the three decision makers/evaluators. Appendix 1 Section 2 presents the BWM comparison questions and the list of experts.

Step 3. Conduct the pairwise comparison between the best criterion and the other criteria

The pairwise comparison process occurs between the identified best criterion and the other criteria. The aim of this step is to determine the preference of the best criterion over all the other criteria. The value must be determined by an evaluator/expert on a scale from 1 to 9 to represent the importance of the best criterion over each of the other criteria. This step results in a vector identified as ‘Best-to-Others’, which is

$$ AB=\left({a}_{B1},{a}_{B2},\dots, {a}_{Bn}\right), $$

where aBj indicates the importance of the best criterion B over criterion j, and aBB = 1.

  1. Step 4.

    Pairwise comparison process between the other criteria and the worst criterion

The aim of this comparison is to identify the preference of all the criteria over the least important criterion. An evaluator/expert determines the importance of each criterion over the worst criterion, using numbers from 1 to 9 to indicate the importance. The result of this step is a vector recognised as ‘Others-to-Worst’, represented as Aw = (a1w, a2w, …, anw), where ajw represents the preference of criterion j over the worst criterion W. Clearly, aww = 1. The two types of reference comparisons, namely, Best-to-Others and Others-to-Worst, are illustrated in Fig. 3.

Fig. 3
figure 3

Reference comparisons in the BWM method

  1. Step 5.

    Elicit the optimal weights (W*1, W*2, …, W*n)

The optimal weights for the criteria are those for which, for each pair WB/Wj and Wj/Ww, WB/Wj = aBj and Wj/Ww = ajw.

To fulfil these conditions for all j, a solution where the maximum absolute differences for all j are minimised must be obtained:

$$ \left|\frac{W_B}{W_j}-{a}_{Bj}\right|\kern0.5em \mathrm{and}\kern0.5em \left|\frac{W_j}{W_w}-{a}_{jw}\right| $$
(4)

Considering the non-negativity and sum condition for the weights, the following problem is created:

$$ \min {\max}_j\left\{\left|\frac{W_B}{W_j}-{a}_{Bj}\right|,\left|\frac{W_j}{W_w}-{a}_{jw}\right|\right\} $$
(5)
$$ {\displaystyle \begin{array}{l}{W}_{j}\ge 0,\mathrm{for}\ \mathrm{all}\ j\\ {}\sum \limits_j{W}_{j}=1\end{array}} $$

The aforementioned problem can be transferred to the following problem:

$$ {\displaystyle \begin{array}{l}\mathrm{min}\upxi\ \\ {}\mathrm{s}.\mathrm{t}.\end{array}} $$
$$ \left|\frac{W_B}{W_j}-{a}_{Bj}\right|\le \upxi, for\ all\ j $$
(6)
$$ \left|\frac{W_j}{W_w}-{a}_{jw}\right|\le \upxi, for\ all\ j $$
(7)
$$ {\displaystyle \begin{array}{c}{\sum}_j{W}_{j}=1\\ {}{W}_j\ge 0,\mathrm{for}\ \mathrm{all}\ j\end{array}} $$

By solving the last problem, the optimal weights (w*1, w*2, …, w*n) and ξ* are obtained. The value of ξ* reflects the reliability of the outcomes, depending on the consistency of the comparisons: a value close to zero represents high consistency and, thus, high reliability [112, 126, 127, 148]. The consistency ratio is then calculated by using ξ* and the corresponding consistency index as follows (Table 1):

$$ \mathrm{Consistency}\ \mathrm{Ratio}=\frac{\xi^{\ast }\ }{Consistency\ Index} $$
(8)
Table 1 Index of consistency

As proposed by [112], the closer ξ* is to zero, the more consistent the comparison vectors are.
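The weight-elicitation steps above can be sketched numerically. The following is a minimal sketch, assuming scipy is available, of the linearised variant of the min–max problem in Eqs. (6) and (7) (constraints of the form |wB − aBj·wj| ≤ ξ instead of the ratio form); the comparison vectors in the example are illustrative, not the experts’ actual preferences.

```python
# Minimal BWM weight elicitation via linear programming (linearised model).
# The comparison vectors below are illustrative, not the study's data.
import numpy as np
from scipy.optimize import linprog

def bwm_weights(a_bo, a_ow, best, worst):
    """a_bo: Best-to-Others vector, a_ow: Others-to-Worst vector,
    best/worst: indices of the best and worst criteria.
    Decision variables are [w_1, ..., w_n, xi]; the objective is min xi."""
    n = len(a_bo)
    A_ub, b_ub = [], []
    for j in range(n):
        # |w_best - a_bo[j] * w_j| <= xi  (two one-sided constraints)
        row = np.zeros(n + 1)
        row[best] += 1.0
        row[j] -= a_bo[j]
        row[-1] = -1.0
        neg = -row
        neg[-1] = -1.0
        A_ub += [row, neg]
        b_ub += [0.0, 0.0]
        # |w_j - a_ow[j] * w_worst| <= xi
        row2 = np.zeros(n + 1)
        row2[j] += 1.0
        row2[worst] -= a_ow[j]
        row2[-1] = -1.0
        neg2 = -row2
        neg2[-1] = -1.0
        A_ub += [row2, neg2]
        b_ub += [0.0, 0.0]
    c = np.zeros(n + 1)
    c[-1] = 1.0                              # objective: minimise xi
    A_eq = [np.append(np.ones(n), 0.0)]      # sum of weights = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (n + 1))
    return res.x[:n], res.x[-1]              # optimal weights and xi*

# Fully consistent toy example: 3 criteria, best index 0, worst index 2
w, xi = bwm_weights(a_bo=[1, 2, 8], a_ow=[8, 4, 1], best=0, worst=2)
```

For a fully consistent set of comparisons, as in this toy example, ξ* is zero and the consistency ratio of Eq. (8) vanishes; in practice, ξ* is divided by the consistency index from Table 1.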

Ranking the multiclass classification models based on VIKOR method

Owing to its suitability for decision cases with many alternatives and multiple conflicting criteria, VIKOR is used to rank the multiclass classification models. VIKOR provides rapid results and determines the most suitable option at the same time. The weights for all the criteria are gathered from BWM and utilised in VIKOR. The decision alternatives are ranked in ascending order; that is, the multiclass classification models are ranked on the basis of the weighted criteria values by using the VIKOR method. The VIKOR steps are presented below [149, 150].

  • Step 1: Identify the best f∗i and worst f−i values of all criterion functions, i = 1, 2, …, n. If the ith function represents a benefit, then

$$ {f}_i^{\ast }=\underset{j}{\max }{f}_{ij},\kern0.5em {f}_i^{-}=\underset{j}{\min }{f}_{ij}. $$
(9)
  • Step 2:

Based on the BWM method, the weights of the criteria are computed. A set of weights w = (w1, w2, w3, ⋯, wj, ⋯, wn) from the decision maker is accommodated in the decision matrix; the sum of this set is equal to 1. The resulting matrix can then be computed as demonstrated in the following equation.

$$ W{M}_{ij}={w}_i\frac{f_i^{\ast }-{f}_{ij}}{f_i^{\ast }-{f}_i^{-}} $$
(10)

This process will produce a weighted matrix as follows:

$$ \left[\begin{array}{cccc}{w}_1\left({f}_1^{\ast }-{f}_{11}\right)/\left({f}_1^{\ast }-{f}_1^{-}\right)& {w}_2\left({f}_2^{\ast }-{f}_{21}\right)/\left({f}_2^{\ast }-{f}_2^{-}\right)& \cdots & {w}_n\left({f}_n^{\ast }-{f}_{n1}\right)/\left({f}_n^{\ast }-{f}_n^{-}\right)\\ {}{w}_1\left({f}_1^{\ast }-{f}_{12}\right)/\left({f}_1^{\ast }-{f}_1^{-}\right)& {w}_2\left({f}_2^{\ast }-{f}_{22}\right)/\left({f}_2^{\ast }-{f}_2^{-}\right)& \cdots & {w}_n\left({f}_n^{\ast }-{f}_{n2}\right)/\left({f}_n^{\ast }-{f}_n^{-}\right)\\ {}\vdots & \vdots & \ddots & \vdots \\ {}{w}_1\left({f}_1^{\ast }-{f}_{1J}\right)/\left({f}_1^{\ast }-{f}_1^{-}\right)& {w}_2\left({f}_2^{\ast }-{f}_{2J}\right)/\left({f}_2^{\ast }-{f}_2^{-}\right)& \cdots & {w}_n\left({f}_n^{\ast }-{f}_{nJ}\right)/\left({f}_n^{\ast }-{f}_n^{-}\right)\end{array}\right] $$
(11)
  • Step 3:

Compute the values of Sj and Rj, j = 1, 2, 3, …, J; i = 1, 2, 3, …, n, by using the following equations:

$$ {S}_j=\sum \limits_{i=1}^n{w}_i\frac{f_i^{\ast }-{f}_{ij}}{f_i^{\ast }-{f}_i^{-}} $$
(12)
$$ {R}_j=\underset{i}{\max }{w}_i\frac{f_i^{\ast }-{f}_{ij}}{f_i^{\ast }-{f}_i^{-}} $$
(13)

where wi indicates the criterion weights expressing their relative importance.

  • Step 4:

Compute the values of Qj, j = 1, 2, ⋯, J, by the following relation:

$$ {Q}_{\mathrm{j}}=\frac{\mathrm{v}\left({S}_{\mathrm{j}}-{S}^{\ast}\right)}{S^{-}-{S}^{\ast }}+\frac{\left(1-\mathrm{v}\right)\left({R}_{\mathrm{j}}-{R}^{\ast}\right)}{R^{-}-{R}^{\ast }} $$
(14)

where

$$ {\displaystyle \begin{array}{cc}{S}^{\ast }=\underset{j}{\min }{S}_j,& {S}^{-}=\underset{j}{\max }{S}_j\\ {}\kern0.5em {R}^{\ast }=\underset{j}{\min }{R}_j& {R}^{-}=\underset{j}{\max }{R}_j\end{array}} $$

v is introduced as the weight of the strategy of ‘the majority of criteria’ (or ‘the maximum group utility’); here, v = 0.5.

  • Step 5:

The alternatives can now be ranked by sorting the values of S, R and Q in ascending order. Optimal performance is indicated by the lowest value.
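Steps 1–5 can be condensed into a short numerical sketch. The function below assumes that every criterion is a benefit criterion (Eq. (9)) and that numpy is available; the decision matrix and weights are made-up illustrative values, not those of Tables 2 and 4.

```python
# VIKOR Steps 1-5 for benefit criteria; rows = alternatives, cols = criteria.
import numpy as np

def vikor(F, w, v=0.5):
    """F: (J x n) decision matrix, w: criterion weights summing to 1."""
    f_star = F.max(axis=0)                      # best values, Eq. (9)
    f_minus = F.min(axis=0)                     # worst values, Eq. (9)
    weighted = w * (f_star - F) / (f_star - f_minus)   # Eqs. (10)-(11)
    S = weighted.sum(axis=1)                    # group utility, Eq. (12)
    R = weighted.max(axis=1)                    # individual regret, Eq. (13)
    Q = (v * (S - S.min()) / (S.max() - S.min())
         + (1 - v) * (R - R.min()) / (R.max() - R.min()))  # Eq. (14)
    return S, R, Q, np.argsort(Q)               # ascending Q: first = best

# Illustrative 3-alternative, 2-criterion example
F = np.array([[0.9, 0.8],
              [0.6, 0.7],
              [0.3, 0.2]])
S, R, Q, order = vikor(F, w=np.array([0.6, 0.4]))
```

The alternative listed first in `order` has the lowest Q value and would be ranked best in Step 5.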

  • Step 6:

Propose as a compromise solution the alternative (a′), which is ranked best by the measure Q (minimum), if the following two conditions are satisfied:

  1. C1.

    ‘Acceptable advantage’:

$$ \mathrm{Q}\left({a}^{{\prime\prime}}\right)-\mathrm{Q}\left({a}^{\prime}\right)\ge \mathrm{DQ} $$
(15)

where (a′′) is the alternative in the second position of the ranking list by Q, DQ = 1/(J − 1) and J is the number of alternatives.

  1. C2.

    ‘Stability’ is acceptable in the decision-making context: alternative a′ must also be ranked best by S and/or R. This compromise solution is stable within the decision-making process, which can be ‘voting by majority rule’ (v > 0.5), ‘by consensus’ (v ≅ 0.5) or ‘with veto’ (v < 0.5). Here, v is the weight of the decision-making strategy of ‘the majority of criteria’ (or ‘the maximum group utility’). The Q value indicates which multiclass classification model has higher values of the evaluation criteria than the others; according to this technique, the multiclass classification models with high values of the evaluation criteria obtain the lowest Q values. Two main decision-making contexts are applied: individual decision making and GDM. In the former, decision making is based on a single decision maker, whereas GDM is based on multiple decision makers/experts. GDM is performed in two ways: internal aggregation and external aggregation. Figure 4 illustrates the procedures followed to apply the two types of aggregation.
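The two acceptance conditions C1 and C2 can be checked with a small helper. The function below is an illustrative sketch, not part of the original study; it assumes the alternatives have already been sorted in ascending order of Q, and it falls back on the usual VIKOR compromise sets when a condition fails.

```python
# Check VIKOR's 'acceptable advantage' (C1) and 'stability' (C2) conditions.
def compromise_solution(names, q_values, best_by_s, best_by_r):
    """names/q_values: alternatives sorted in ascending order of Q.
    best_by_s / best_by_r: names of the alternatives ranked first by S and R."""
    J = len(names)
    dq = 1.0 / (J - 1)                              # DQ from Eq. (15)
    c1 = (q_values[1] - q_values[0]) >= dq          # C1: acceptable advantage
    c2 = names[0] in (best_by_s, best_by_r)         # C2: stability
    if c1 and c2:
        return [names[0]]                # a single compromise solution a'
    if c1 and not c2:
        return names[:2]                 # a' and a'' are both proposed
    # C1 violated: all alternatives within DQ of the best are proposed
    return [n for n, q in zip(names, q_values) if q - q_values[0] < dq]
```

For example, with Q values (0.0, 0.6, 1.0) for three alternatives, DQ = 0.5, so C1 holds; if the first alternative is also best by S or R, it alone is the compromise solution.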

Fig. 4
figure 4

Internal and external aggregation

Figure 4 shows that internal GDM is calculated by taking the arithmetic mean of the final weights of the three experts’ preferences to eliminate possible variation among them; VIKOR is then applied based on the final weights obtained from this arithmetic mean. By contrast, external aggregation is calculated by taking the arithmetic mean of the Q values of each expert’s ranking, such that the final Q values depend on the external group ranking.
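The two aggregation routes can be sketched as follows, where `vikor_q` is a stand-in for a full VIKOR run that maps a criterion-weight vector to a Q-value vector; the expert weights used in any example are illustrative values.

```python
# Internal vs. external aggregation for group VIKOR (sketch).
import numpy as np

def internal_aggregation(expert_weights, vikor_q):
    """Average the experts' criterion weights first, then run VIKOR once."""
    mean_w = np.mean(expert_weights, axis=0)
    return vikor_q(mean_w)

def external_aggregation(expert_weights, vikor_q):
    """Run VIKOR per expert, then average the resulting Q vectors."""
    return np.mean([vikor_q(w) for w in expert_weights], axis=0)
```

The two routes coincide when the weight-to-Q mapping is linear, but a real VIKOR run is nonlinear, which is why the internal and external rankings reported later differ in the middle positions.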

Results and discussion

This section presents the results of the proposed framework for evaluating and benchmarking the multiclass classification models of acute leukaemia. Section 5.1 presents the data in the decision matrix. Section 5.2 presents the results of the developed benchmarking framework: subsection 5.2.1 shows the BWM weights for the main criteria and sub-criteria, and subsection 5.2.2 presents the results of the VIKOR method. Section 5.3 presents the validation processes and results.

Data presentation in decision matrix

The results obtained from the evaluation of the 22 multiclass classification models are presented in this section. The implementation of these 22 models generated four parameters (tp, tn, fp, fn), which are the fundamental values for calculating the rest of the reliability criteria group. The values of the time complexity criterion were calculated according to its respective framework. The values of the reliability group of criteria and the time complexity criterion were used as input to fill the decision matrix. Table 2 illustrates the completed decision matrix.
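The derivation of the reliability sub-criteria from the four parameters can be sketched as below. This sketch assumes the standard multiclass definitions of the averaged measures (micro/macro averaging); the exact formulas used in the study may differ, and the per-class counts in the example are illustrative.

```python
# Reliability sub-criteria from per-class confusion counts (sketch).
import numpy as np

def reliability_criteria(tp, tn, fp, fn):
    """tp/tn/fp/fn: per-class counts from one model's confusion matrix."""
    tp, tn, fp, fn = map(np.asarray, (tp, tn, fp, fn))
    total = tp + tn + fp + fn
    p_macro = float(np.mean(tp / (tp + fp)))          # precisionM (macro)
    r_macro = float(np.mean(tp / (tp + fn)))          # recallM (macro)
    return {
        "ave_accuracy": float(np.mean((tp + tn) / total)),
        "error_rate": float(np.mean((fp + fn) / total)),
        "precision_mu": float(tp.sum() / (tp.sum() + fp.sum())),  # micro
        "precision_M": p_macro,
        "recall_M": r_macro,
        "fscore": 2 * p_macro * r_macro / (p_macro + r_macro),
    }

# Illustrative two-class counts for a perfect classifier
m = reliability_criteria(tp=[5, 5], tn=[5, 5], fp=[0, 0], fn=[0, 0])
```

One row of the decision matrix would then consist of these derived values plus the time complexity measurement for that model.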

Table 2 The decision matrix

Table 2 shows that each multiclass classification model has been evaluated based on 11 evaluation criteria. The next section will discuss in detail the results of integration between the BWM and VIKOR.

Results of the framework of evaluation and benchmarking multiclass classification models

The results of the proposed benchmarking framework are presented in two subsections. The first presents the weight results obtained by using BWM, whereas the second presents the results of using VIKOR. The VIKOR subsection is divided into the individual context and the group context; the group context includes the results of internal and external aggregation, which are described in detail in the subsequent sections.

Results for weight using BWM method

In this section, the BWM results are presented and explained. Three experts were asked to provide their evaluation and benchmarking preferences on the criteria of the multiclass classification models via the BWM comparison questions. Table 3 presents the results of the first expert’s process for the main criteria and their sub-criteria. Appendix 2 (Tables 21 and 22) shows the detailed results of the other two experts.

Table 3 Results of the BWM method for weight preferences of the criteria of evaluation and benchmarking the multiclass classification (first expert)

R: Reliability, TM: Time Complexity, MOP: Matrix of parameter, ROP: Relationship of parameter, BOP: Behaviour of parameter, ER: Error Rate, TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative. Table 3 and Appendix 2 (Tables 21 and 22) present the weighted results of the three experts’ processes based on BWM. For the evaluation and benchmarking criteria, the best and worst criteria are identified, the best criterion is compared with the other criteria, and the other criteria are compared with the worst criterion. Lastly, the linear model of BWM is solved according to Eqs. (6) and (7) in Sect. 4.2.1.1 to obtain the weights, and Eq. (8) is used to calculate the consistency ratio of each expert’s preferences. To calculate the global weights of each criterion for the three experts, the BWM method derives the local weights for each criteria group at each level, as shown in Table 3 and Appendix 2 (Tables 21 and 22), which explain the importance of each criterion with respect to its parent. Consequently, the global weights for each criterion are obtained; each global weight explains the criterion’s importance with respect to the goal for each expert. Firstly, the weight of each criterion was determined by comparing the criteria based on BWM; these weights are called ‘local weights’. To find the global weights with respect to the goal, each criterion’s parent-group weight and its associated local weight were multiplied, as presented in Table 4.

Table 4 BWM local and global weights for the three experts

Table 4 presents the overall local and global weights of the 11 evaluation and benchmarking criteria for the three experts. The overall CR for each of the three experts is an acceptable ratio of less than 0.1. These global weights are used in our benchmarking framework because they represent the importance of the criteria with respect to the goal. Table 4 shows that the global weight results of the first expert assign the maximum weight of 0.201 to true positive; the minimum weights, obtained by precisionM and recallM, are 0.035 each. The second expert assigns the maximum weight of 0.500 to the time complexity criterion, and the minimum weight of 0.011 is obtained by ave-accuracy. The third expert assigns the maximum weight of 0.200 to time complexity, and the minimum weight of 0.015 is obtained by true negative. The final weight results are used in applying the VIKOR method in the next section.

Ranking results of the VIKOR method

The results after the ranking of the multiclass classification models based on weighted evaluation criteria are presented in this section. Individual decision making and GDM contexts are explained. The results of the individual and group VIKOR decision-making contexts are presented in the following subsections.

  • VIKOR Results of Individual Context for Different Experts’ Weights

VIKOR is utilised to rank the alternatives based on the decision matrix results presented in Table 2 and the weight results presented in Table 4. The ranking reflects the importance of the evaluation criteria from the viewpoint of each expert. The VIKOR technique depends on the Q value in ranking the alternatives: the alternative with a lower Q value is considered better, whereas the alternative with a higher Q value is considered worse. Table 5 shows the VIKOR ranking results according to the weights that reflect the viewpoint of the first expert. Tables 23 and 24 in Appendix 3 show the VIKOR results of the two other experts.

Table 5 Ranking results based on the first expert’s weights

Table 5 and Appendix 3 (Tables 23 and 24) present the three VIKOR ranking results derived from the experts’ weights. In the first ranking, ‘Bayes.NaiveByesUpdateable’ had the lowest Q value of 0.0358 and was thus the best multiclass classification model, whereas ‘RandomTree’ had the highest Q value of 1 and was thus the worst. In the second ranking, ‘Byes.NaiveBayes’ had the lowest Q value of 0 and was thus the best model, whereas ‘Rule.Decision Table’ had the highest Q value of 1 and was thus the worst. In the third ranking, the lowest Q value was 0 for ‘Bayes.NaiveByesUpdateable’, which was considered the best model, whereas the highest Q value was 0.9956 for ‘Rules.part’, which was considered the worst. Differences in the weights provided by the experts affected the ranking scores. Figure 5 shows the variance among the VIKOR results.

Fig. 5
figure 5

Ranking results based on the three experts’ weights. (A) First expert’s ranking, (B) second expert’s ranking, (C) third expert’s ranking

Figure 5 demonstrates the final VIKOR rankings for the three experts. Ten classification models were selected from each ranking result [2]: the classification models with the best scores received the highest ranking (first five classification models), whereas those with the worst scores received the lowest ranking (last five classification models).

The first five classification models with the highest ranking vary with the weights provided by the experts. According to the weights provided by expert one (A) and expert three (C), the Bayes.NaiveByesUpdateable and BayesNet models appeared in the first and second indices, respectively. By contrast, the first and second indices based on the weights provided by expert two (B) were Byes.NaiveBayes and RandomTree. Random Forest and Decision Stump appeared in the third and fourth indices based on the weights provided by experts (A) and (C), whereas these two models did not appear in the first five indices according to the second expert. Rules.part and Rule.zero were in the third and fourth indices based on the weights provided by expert (B). Meta.AdaboostM1 was in the fifth index according to the weights given by expert (A), whereas Rule.zero appeared in the fifth index based on the weights obtained from experts (B) and (C).

The last five classification models, which have the lowest ranking, also vary with the weights provided by the experts. RandomTree is the worst model with index 22 according to expert (A), whereas the same model was the third worst classification model based on expert (C). The worst model according to expert (B) is Rule.Decision Table; in addition, the same model was the fifth worst model according to experts (A) and (C). Rules.part appeared as the worst classification model based on expert (C) and the second worst according to expert (A). Trees.LMT was the second worst classification model according to expert (B) and the fourth worst according to experts (A) and (C). Tree.j48 is the third worst model according to expert (A) and the second worst according to expert (C). Lastly, Meta.AdaboostM1 and Meta.logitboost were the fourth and fifth worst classification models, respectively, based on expert (B).

The results of the individual context clearly show variance among the rankings of the three experts. Therefore, the group VIKOR decision-making context, which provides a ranking of alternatives that considers all decision makers, is necessary. The following sections present the results of the group VIKOR decision-making context.

  • Group VIKOR with Internal and External Aggregation

To extend VIKOR into a group decision environment, two approaches were used: (1) internal and (2) external aggregation, both of which depend on multiple decision makers. Internal GDM results are calculated by taking the arithmetic mean of the final weights of the three experts’ preferences to eliminate the variance between them; VIKOR is then applied based on the resulting mean weights. By contrast, external aggregation results are calculated by taking the arithmetic mean of the Q values of each expert’s ranking results; the final Q values then depend on the external group ranking. Table 6 illustrates the overall ranking results of VIKOR with internal and external group decision making for the 22 multiclass classification models.

Table 6 Overall ranking results of VIKOR with internal and external group decision making

As shown in Table 6, the best/first three classification models, in order, are Bayes.NaiveByesUpdateable, BayesNet and Decision Stump. The last/worst two classification models based on the results of internal and external GDM are Trees.LMT and Rule.Decision Table. The classification models with the same order in both internal and external decision making are Meta.RandomCommittee, Lazy.IBK, Meta.logitboost and Byes.NaiveBayes, in positions 8, 13, 14 and 15, respectively. By contrast, some classification models are ranked differently between the internal and external group decision making. Based on the internal ranking, REPTree, Rule.zero, RandomForest, Kstar, Meta.Bagging, Meta.AdaboostM1, Functions.SIMPLE.logistic, Functions.Smo, Meta.filteredclassifier, Treed.HoeffdingTree, RandomTree, Rules.part and Tree.j48 are in positions 4, 5, 6, 7, 9, 10, 11, 12, 16, 17, 18 and 19, respectively. Based on the external ranking, the same classification models are in positions 5, 7, 4, 9, 10, 6, 12, 11, 17, 16, 20, 18 and 19, respectively. Therefore, the first three classification models are equal in both internal and external GDM, as are the last two; some models in the middle positions were ranked equally, whereas the rest showed different score indices. From this point forward, the internal and external aggregation ranks are considered the final ranking results and are used in the validation processes. The next section describes the validation results in detail.

Validation processes and results

The selection of a multiclass classification model is considered a difficult task because it relies on multiple conflicting criteria; differences in accuracy, performance and other features add to the difficulty. The results of the proposed benchmarking framework are validated by utilising objective validation.

Objective validation

Statistical measures of mean and standard deviation (SD) were used in this study to ensure that the multiclass classification models were ranked systematically according to the proposed benchmarking framework. Towards this goal, three groups were created and separated on the basis of the ranking results for the multiclass classification models [2, 82]. Each group’s results are expressed as mean ± SD. The mean is the average of the results; it is calculated by dividing the sum of the observed results by their number, as in the following equation:

$$ \overline{x}=\frac{1}{n}{\sum}_{i=1}^n{x}_i $$
(16)

SD is used to determine the dispersion or variation amount in the set of values and is calculated by the following equation:

$$ s=\sqrt{\frac{1}{N-1}{\sum}_{i=1}^N{\left({x}_i-\overline{x}\right)}^2} $$
(17)

The utilisation of mean ± SD ensures that the three sets of multiclass classification models are subject to systematic ordering. To validate the ranking results by using the above test, the multiclass classification model scores were divided into three groups. The division took place based on the ranking results obtained from the proposed benchmarking framework: an equal number (seven) of multiclass classification models is included in each of the first and second groups, and eight classification models are included in the third group, depending on the scores from the ranking results. For this process to take place, the two statistical measures must show that the first group achieves the lowest scoring values: the mean and SD of the first group are assumed to be lower than those of the other two groups. The mean and SD of the second group must be lower than or equal to those of the third group while being higher than those of the first group, and the mean and SD of the third group must be higher than those of the first group and higher than or equal to those of the second group. The results of the first group must be statistically proven to be the lowest among the three groups, in accordance with the systematic ranking results.
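The grouped mean ± SD check of Eqs. (16) and (17) can be sketched as follows; the group scores below are illustrative values rather than the study’s actual results, and Python’s statistics.stdev matches the N − 1 denominator of Eq. (17).

```python
# Mean +/- SD validation of a ranked list split into ordered groups (sketch).
import statistics

def group_mean_sd(scores):
    """Eqs. (16) and (17) for one group of ranked model scores."""
    return statistics.mean(scores), statistics.stdev(scores)

def systematic_ranking(groups):
    """True if the group means are non-decreasing (group 1 lowest)."""
    means = [statistics.mean(g) for g in groups]
    return all(a <= b for a, b in zip(means, means[1:]))

# Illustrative Q-style scores for three ordered groups
groups = [[0.05, 0.10, 0.15], [0.30, 0.35, 0.40], [0.70, 0.80, 0.90]]
```

Applied to the real ranking, the first group would hold the seven best-ranked models, the second the next seven and the third the remaining eight.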

Validation results

This section presents the validation processes of the internal and external GDM rankings. In this research, objective validation processes are used. The validation of the multiclass classification model ranking results was performed by dividing the ranking into three groups: the first two groups each contain 7 models, and the third contains 8 models. The mean ± SD was calculated for each group to ensure that the ranked multiclass classification models undergo a systematic ranking. After the normalisation and weighting process for the raw data of the first, second and third groups of multiclass classification models, the validation results for internal and external GDM are presented in Table 7.

Table 7 Validation results of internal and external group decision making rank

Table 7 shows the validation results for internal aggregation group decision making. The first group has a lower mean ± SD than the second group for all criteria except error rate (M = 0.0951 ± 0.0319 in the first group; M = 0.0721 ± 0.0327 in the second group). The mean ± SD of the second group is lower than that of the third group for all criteria except error rate (M = 0.0721 ± 0.0327 in the second group; M = 0.0450 ± 0.0231 in the third group). Accordingly, the first group has a lower value than the second group, and the second group has a lower value than the third group. Regarding the validation results for external aggregation GDM, the mean ± SD of the first group is lower than that of the second group except for error rate (M = 0.1010 ± 0.0272 in the first group; M = 0.0662 ± 0.0309 in the second group), and the mean ± SD of the second group is lower than that of the third group for all criteria except error rate (M = 0.0662 ± 0.0309 in the second group; M = 0.0450 ± 0.0231 in the third group). Accordingly, the first group has a lower value than the second group, and the second group has a lower value than the third group. Therefore, the internal and external GDM ranks are valid and follow a systematic ranking.

Research limitation and future study

The proposed evaluation and benchmarking framework can address the evaluation and benchmarking issues for multiclass classification models. However, it cannot deal with classification models that work under multi-labelled or hierarchical cases, because the evaluation criteria used for those cases, and the procedures to calculate them, are different. The future study directions are as follows:

  • The proposed framework can evaluate and benchmark the multiclass classification models that classify other types of leukaemia.

  • The new framework can be applied for classification models with applications that involve the use of multi-labelled or hierarchical classification models through proposing new decision matrices that include related evaluation criteria for multi-labelled classification models or hierarchical classification models.

Conclusion

Studies related to the automated detection and classification of acute leukaemia have been notably increasing. Nevertheless, studies relevant to the evaluation and benchmarking of automated detection and classification tasks are scarce and leave limitations unaddressed. Several aspects associated with the evaluation and benchmarking of automated detection and classification warrant further analysis and investigation. Towards this end, a comprehensive review of research on the automated classification of acute leukaemia was conducted while considering its evaluation and benchmarking aspects, with the aim of identifying the open challenges, research issues and gaps linked to the evaluation and benchmarking process. After a thorough review of the studies, a serious gap was identified: previous studies failed to perform the evaluation and benchmarking process for all major detection and classification requirements. Evaluation and benchmarking were performed only partially, which rendered the results incomplete because they failed to reflect the overall performance of detection and classification. Such weakness raises a challenge in comparing numerous detection and classification systems or models to determine which is the best, because the evaluation criteria vary and are incomplete. Moreover, all the major criteria and sub-criteria for benchmarking multiclass detection and classification were reviewed. Towards addressing the challenges, resolving the issues and filling the research gap, we proposed an evaluation and benchmarking framework based on MCDM techniques whose goal is to evaluate and benchmark acute leukaemia multiclass classification models. The procedures and steps of the proposed framework were described, and the decision matrix was constructed based on the crossover between the evaluation criteria and the 22 multiclass classification models.
The proposed framework for evaluation and benchmarking is developed based on the integration of BWM and VIKOR. The ranking of the classification models is based on three experts’ opinions on criterion preference. Firstly, VIKOR was applied in the individual context to provide a ranking for each expert, although the results showed variance among the three experts’ rankings. Therefore, VIKOR with GDM was applied, including the internal and external aggregation methods, and the internal and external aggregations showed almost similar performance. Lastly, the results were validated objectively in this research. The statistical results indicate that the multiclass classification model ranking results based on internal and external aggregation GDM undergo a systematic ranking.