1 Introduction

Over the past decades, much attention has been paid to dehazing, as evidenced by the increasing number of studies that propose dehazing approaches or investigate the quality of dehazed images [1]. However, determining how to objectively assess the performance of these algorithms remains an open problem that can hinder the development of advanced image restoration methods. Ideally, such algorithms should be evaluated over a wide range of conditions and be reliably comparable with one another [2]. The most reliable approach is subjective quality assessment by human observers. However, this approach is often time-consuming and cannot be integrated into real-time image processing systems. Therefore, an alternative objective quality assessment approach needs to be developed [3].

Various foggy scenes have been made available to test the utility of image dehazing algorithms [4, 5]. Most assessments are performed on several foggy scenes [6,7,8,9,10,11,12,13,14]. For example, in [6], the authors considered a variety of evaluation scenes, including inhomogeneous, homogeneous and dark foggy scenes, to test the efficiency of algorithms. However, the efficiency of an algorithm should be tested in consideration of various scene characteristics and foggy perspectives. Therefore, the advantages and demerits of each algorithm should be considered within each context. Several algorithms, such as those proposed in [15,16,17], can work properly under different hazy scenes. Therefore, comparing these algorithms from only one perspective is unreasonable.

The efficiency of image dehazing algorithms also needs to be evaluated by using trustworthy approaches [8, 18]. In this case, how several algorithms can be evaluated and how the best algorithm is selected through an effective approach warrant further investigation. From the findings of state-of-the-art image quality assessment (IQA), two concerns need to be addressed. Firstly, determining the best enhanced image and validating the best dehazing results are difficult [4]. For instance, evaluators may not always have the same response regarding the quality of an improved image when using a subjective approach. At the same time, the conventional objective approach cannot effectively solve these problems. Secondly, [6, 14, 19] have reported that no single defogging algorithm shows excellent performance across different foggy scenes. Therefore, selecting a defogging algorithm is difficult. The selection and benchmarking of the best image dehazing algorithm based on multiple foggy scenes are therefore identified as the major problems in this research.

Most objective evaluation methods, such as those introduced in [4, 8], use different metrics or criteria to measure the quality of an enhanced image [20]. The diversity of image dehazing criteria enables us to evaluate the performance of image dehazing algorithms from several perspectives. For example, the image visibility (IV) criterion identifies the distortion of hazy images based on edge, contrast and texture information [6], whereas the colour restoration (CR) criterion identifies the degree of colour distortion in a hazy image [21]. One requirement for the evaluation and benchmarking of image dehazing algorithms is a criterion that can indicate the degree of enhancement achieved by a certain algorithm for a specific type of distortion [4]. However, an effective selection process requires the ability to determine the best alternative under all conditions of uncertainty. The set of criteria and their importance can also influence the selection process [22]. As stated in [22,23,24], the first step in evaluating and selecting the best alternative is to determine the appropriate criteria. For objective assessment, several metrics have been proposed in the literature, but the use of these metrics varies from one study to another. At the same time, a model for classifying and recommending the most appropriate measurements is yet to be developed. Therefore, providing a standard set of image dehazing criteria is crucial to the evaluation and selection of image dehazing algorithms. The standardisation of these criteria is the challenge addressed in this study.

The objectives of this study are to (1) standardise image dehazing criteria based on the fuzzy Delphi method (FDM), (2) develop a new framework for benchmarking image dehazing algorithms based on hybrid multi-criteria decision analysis methods and (3) validate the proposed framework by using statistical validation methods. The rest of this paper is arranged as follows: ‘Introduction’ defines image dehazing evaluation and benchmarking. ‘Literature Review’ reviews the related studies. ‘Methodology’ reports the methodological standardisation and decision-making steps. ‘Results and Discussion’ illustrates and discusses the results. ‘Validation’ validates the results of the proposed framework, whereas ‘Limitations’ presents the restrictions of this framework. ‘Recommendations for Future Work’ provides several recommendations for future study, and ‘Conclusion’ concludes the paper.

2 Background and related works

Any new algorithm should be compared according to its perceived quality and time complexity (TC), which are considered the main indicators for any comparison scenario. In particular, image dehazing evaluation is based on a group of quality criteria on one side and on time complexity criteria on the other. In the image dehazing domain, quality refers to the capability of a process to remove unwanted effects from a degraded image and restore it to its original state [2]. In practice, image dehazing algorithms focus on enhancing the visual quality of a foggy image to make it recognisable by the human eye [25,26,27]; quality in this domain therefore focuses on image visual features instead of signal features. Under poor weather conditions, the acquired image shows low visibility and colour distortion, both of which can influence analysis and recognition processes [28]. An efficient image defogging algorithm needs to not only enhance the visibility, edge and texture information of an image but also preserve its structure and colour [6, 29, 30]. Based on the above scenario and in line with [6], evaluating image quality based on dehazing algorithms depends on three sub-criteria, namely IV, CR and image structure similarity (SSIM).

IV is evaluated by measuring the obvious edges, texture information, image contrast and image gradient [6]. Hazy images analysed according to these aspects can be recognised by measuring their degree of visibility. Several metrics have been employed to indicate the level of enhancement in terms of IV, including the blind assessment indicators (e and r) [21], the visual contrast measure (VCM) [31] and contrast gain [32] (see “Appendix”). Meanwhile, CR is relevant in retrieving the true colour (the amount of lost information) of a certain image that is usually distorted by haze or fog effects. CR measures the rate of saturated pixels after the image defogging process or the similarity of the histogram distributions between the foggy and enhanced images [6]. To assess CR performance in an enhanced image, several parameters need to be considered, including the rate of saturated pixels (σ) [21], the histogram correlation coefficient (HCC) [33] and the colour colourfulness index (CCI) [34] (see “Appendix”). However, an objective assessment of haze removal should consider both the dehazing effect and the distortions introduced during the haze removal process [5]. SSIM is an evaluation indicator that measures only the degree of distortion caused by the image dehazing process [8]. Although a clear definition of image structure is yet to be formulated, the measures of image structure are strongly correlated with subjective quality ratings, thereby suggesting that high-quality images are closely linked to their original forms in terms of their structure contents (object boundaries) [35]. In general, dehazing algorithms do not change the structural information of an image unless they lead to serious distortion and edge effects, although the removal of fog from an image will itself change the image structure [6]. Several metrics have been used to measure the similarity in the structure of hazy and enhanced images, such as SSIM [36] and the universal quality index (UQI) [37] (see “Appendix”).
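The cited metrics have exact definitions in the corresponding references; the short Python sketch below only illustrates the general idea behind two of them, assuming 8-bit grayscale NumPy arrays. The contrast-gain approximation (change in mean local standard deviation) and the function names are illustrative assumptions rather than the reference implementations, and SSIM is computed with scikit-image.

```python
# Illustrative sketch only: approximate contrast gain and SSIM for a hazy/dehazed pair.
import numpy as np
from scipy.ndimage import uniform_filter
from skimage.metrics import structural_similarity


def local_contrast(img, size=7):
    """Approximate local contrast as the local standard deviation (assumption)."""
    img = img.astype(np.float64)
    mean = uniform_filter(img, size)
    mean_sq = uniform_filter(img ** 2, size)
    return np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))


def contrast_gain(hazy, dehazed, size=7):
    """Mean local contrast of the dehazed image minus that of the hazy image."""
    return local_contrast(dehazed, size).mean() - local_contrast(hazy, size).mean()


def structure_score(reference, dehazed):
    """SSIM between a reference image and the dehazed result (8-bit grayscale)."""
    return structural_similarity(reference, dehazed, data_range=255)
```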

The existing methods for image dehazing quality assessment can be classified into visible-edges aware, modulated artificial scenes, comprehensive appraisal models and running speed evaluations [4]. Execution time or TC is an important measure for evaluating the computational complexity of an image dehazing algorithm. The computational cost is determined by calculating the average time spent on a single image in several experiments. By comparing execution time and quality feedback, one can easily determine whether an algorithm can be used as an automatic visual system in real time [4, 10, 38].

Several metrics have also been employed for IQA [39]. According to [6], only 11 metrics have been linked to the 3 previously mentioned criteria. As an extension, this study includes additional metrics and statistics that are related to the criteria previously mentioned in the literature. A total of 74 studies were reviewed to determine how many times these metrics have been used. Some of these studies have focused on the evaluation scenario, whereas others have combined the review and evaluation scenarios. However, most of these studies have developed new algorithms intended for real-time conditions. Other metrics, including PSNR and MSE, have been excluded because they either fail to specify which type of distortion can be measured or are unable to quantify visual distortion. Metrics that are only used in underwater IQA are also excluded. Following the aforementioned definitions, these metrics are classified into three groups, and their frequency of usage is specified (see “Appendix”).

As shown in “Appendix,” most of the employed metrics are classified under the visibility criterion given that the main distortion caused by haze is decreasing the visibility of an image. As mentioned above, visibility is evaluated from multiple perspectives, thereby explaining the number of metrics classified under this criterion. Meanwhile, only a few metrics have been classified under the other two criteria. In addition, given that a variety of metrics have been adopted in the literature and that most of these metrics lack any clear justification, selecting the most appropriate metric for image dehazing evaluation presents a challenge. In addition, only a few studies have considered IV, CR, SSIM and TC altogether as evaluation criteria [6], and some studies have shown differences in how they employ these criteria. Firstly, some studies, such as [7, 40], have only used quality criteria. Secondly, other studies have only measured TC [28, 41, 42]. Thirdly, previous studies have shown differences in how they use quality sub-criteria. For instance, [43] only used SSIM, [44] only used CR and SSIM, [33] used both visibility and CR, and [45, 46] only used visibility. No previous study has used a unified set of criteria in their evaluation process. Moreover, most researchers have selected evaluation metrics based on their subjective views, which creates significant conflict in highlighting the most influential evaluation criteria. “Appendix” highlights the importance of considering IV, CR, SSIM and TC as criteria in image dehazing evaluation. Although TC has no sub-criteria, which can create conflict amongst scholars in their criteria selection process, this study argues that this criterion is important in evaluating image dehazing algorithms. Meanwhile, using any of the other sub-criteria remains a significant challenge despite their frequent usage in the literature, given that no previous study has defined the maximum or minimum level of importance of using any metric. Therefore, these image dehazing criteria need to be standardised.

Ishikawa [47] proposed FDM, which combines the Delphi method with fuzzy set theory. FDM is generally employed when making decisions on objective issues and yields appropriate results even when the parameters are unclear. FDM also provides a flexible framework that overcomes many barriers associated with a lack of accuracy and clarity. Making decisions with incomplete or inaccurate information creates many problems. Moreover, the decisions made by experts are highly subjective and uncertain. Given that uncertainty in this situation is possible and that this type of uncertainty is well suited to fuzzy sets, the data should be expressed as fuzzy numbers instead of absolute ones, and fuzzy sets should be used for analysing expert opinions [48]. The strength of FDM lies in its reduction of the length of the study period by reducing the number of Delphi rounds [49]. Dealing with a fuzzy context involves imprecise descriptions and human linguistic expressions, and employing fuzzy numbers is therefore an appropriate approach for decision making [50]. Furthermore, FDM is suitable for assessing the importance of the criteria affecting a phenomenon on a highly flexible scale [50, 51], and no useful information is lost because the membership degree effectively considers all opinions [52]. FDM has been widely used for assessment, standardisation and criteria selection in different domains [24, 47, 52,53,54,55]. In this case, this study employs FDM to standardise image dehazing criteria based on expert opinions.

Numerous criteria have been applied in evaluating algorithms, but selecting the best algorithm remains difficult [4, 6, 19]. These obstacles create multi-criteria analysis problems [56,57,58,59]. Four practical problems should be considered when selecting the best algorithm, namely the multiple evaluation criteria, criterion importance, criteria trade-off and data variation problems [56, 60,61,62,63]. However, when developing an image dehazing algorithm, considering only the colour of an image is not enough. Other features, including structure and texture [10], should also be considered in complex scenarios. According to [9] and [11], the evaluation results greatly depend on the selection of metrics or criteria. An effective evaluation and benchmarking scenario is therefore required, and multiple criteria should be considered to capture the complexity of hazy scenes because an efficient image dehazing algorithm needs to deal with the characteristics of such scenes.

Multi-criteria decision making (MCDM) is a popular decision-making method and operational research area that addresses the decision criteria problem [64,65,66,67,68,69]. MCDM is also used for structuring, planning and implementing decisions [70,71,72,73,74,75,76]. Given its ability to improve the quality of decisions through more reliable and reasonable decision making than standard procedures, MCDM has been increasingly employed in the literature [77,78,79,80,81]. MCDM has three goals, namely: (1) to assist in the selection of the best possible alternative [82,83,84], (2) to identify the practicable alternative amongst a number of alternatives [85,86,87] and (3) to rank the alternatives in a descending order based on their performance [88,89,90,91,92]. The suitable alternative(s) are given a score [93,94,95]. The basic terms of each MCDM ranking, including the decision matrix (DM) and its criteria, should also be defined [96, 97].

MCDM employs objective and subjective weighting methods [98]. The entropy weighting method is an objective method for criteria weight determination [99]. In the image dehazing domain, a comprehensive evaluation of multiple criteria is carried out by using the entropy weighting method to obtain reliable results [13]. The entropy test calculates the weight of the criterion based on the degree of variation in its values and presents a basis for an exhaustive evaluation [99]. Entropy is a purely monotonous and uncertain function where a lower uncertainty leads to a smaller entropy, and vice versa. Therefore, by measuring entropy values, one can determine the degree of dispersal for each criterion. Increasing the dispersal degree of the index will affect the entire assessment, thereby increasing the weight of the index and leading to highly accurate evaluation results [13].

The VlseKriterijumska Optimizacija I Kompromisno Resenje (VIKOR) method is commonly employed in ranking various alternatives that conform to different criteria. In contrast, the technique for order of preference by similarity to ideal solution (TOPSIS) does not take into account the relative importance of the distances from the ideal and negative-ideal solutions [100]. Therefore, VIKOR is considered the most reasonable approach for addressing real-life issues and ranking alternatives. The VIKOR method uses a compromise ranking procedure for multi-response optimisation [101, 102]. The alternatives are initially ranked based on their closeness to the ideal solution, and then the best alternative is determined [101]. Recent studies have outlined multiple integrations of VIKOR and entropy to obtain reliable and consistent objective weights [103]. The advantages of both approaches are combined to overcome the uncertainties of a problem [103,104,105,106,107]. In the evaluation and benchmarking of image dehazing algorithms, the integration of entropy and VIKOR is essential: entropy allocates the weights to the various sub-criteria and thus establishes the foundation for the integration, whereas VIKOR is used to rank the image dehazing algorithms.

3 Methodology

The adopted methodology is divided into three phases. In the first phase, the image dehazing criteria are standardised and determined by using FDM. In the second phase, the data are presented by performing an evaluation experiment. In the third phase, the weights for the standardised criteria are determined by using the entropy method, and the alternatives are ranked by using VIKOR.

3.1 FDM

FDM is used to standardise the image dehazing criteria in the following steps:

  • S1 All criteria relevant to image dehazing evaluation are described in the previous section. The selected criteria for FDM are the 26 sub-criteria for IV, CR and SSIM.

  • S2 The number of experts included in FDM is defined as shown in Table 3. These experts are interviewed to determine the importance of the evaluation criteria and to collect their opinions regarding these criteria by disseminating expert opinion forms. Linguistic variables are used in designing these forms.

  • S3 The input data collected from the previous step are transferred to a new form (data fuzzification) and used for further fuzzy data analysis as follows [49]:

    1.

      All linguistic variables (Table 1) are converted into triangular fuzzy numbers (TFN) with values of \(m_{1}\), \(m_{2}\) and \(m_{3}\), where \(m_{1}\) represents the smallest value, \(m_{2}\) represents the most plausible value and \(m_{3}\) represents the maximum value.

      Table 1 Variables for the importance weight of criteria [111]
    2.

      The average value is calculated by summing the fuzzy numbers for each item and dividing by the number of experts. Let \(r_{kij}\) denote the fuzzy number given by expert \(k\) for criterion \(j\) of item \(i\), where \(i = 1, \ldots , m\), \(j = 1, \ldots , n\) and \(k = 1, \ldots , K\); the average fuzzy number is then \(r_{ij} = \frac{1}{K}\left( {r_{1ij} + r_{2ij} + \cdots + r_{Kij} } \right)\).

      For every expert, the vertex method is used to calculate the distance between the expert's evaluation and the average \(r_{ij}\). The distance between two fuzzy numbers \(\tilde{m} = \left( {m_{1} ,m_{2} ,m_{3} } \right)\) and \(\tilde{n} = \left( {n_{1} ,n_{2} ,n_{3} } \right)\) is calculated as

      $$d(\tilde{m}\tilde{n}) = \sqrt {\frac{1}{3} \left[ {(m_{1} - n_{1} )^{2} + (m_{2} - n_{2} )^{2} + (m_{3} - n_{3} )^{2} } \right]}$$
      (1)

      where \(d\left( {\tilde{m},\tilde{n}} \right)\) denotes the distance between the two fuzzy numbers \(\tilde{m}\) and \(\tilde{n}\), which is compared against the consensus threshold.

    3.

      According to [108], the first precondition for criteria acceptance is that if the value is less than 0.2, then a consensus has been reached amongst the experts. Meanwhile, the second precondition is that the ratio of expert consensus should be greater than or equal to 75% [109, 110].

  • S4 Average fuzzy numbers are used during the defuzzification process to obtain the fuzzy score (A). The value of the fuzzy score (A) must be greater than or equal to the α-cut value of 0.5 to satisfy the third precondition [112, 113]. The following equations can be used to obtain the fuzzy score (A) [49] (see the sketch after this list for a minimal worked implementation):

    $$A = \frac{1}{3}*\left( {m_{1} + m_{2} + m_{3} } \right)$$
    (2)
    $$A = \frac{1}{4}*\left( {m_{1} + 2m_{2} + m_{3} } \right)$$
    (3)
    $$A = \frac{1}{6}*\left( {m_{1} + 4m_{2} + m_{3} } \right)$$
    (4)
  • S5 The value of fuzzy score (A) can be used as a determinant and priority for an element according to expert opinions. The elements are ranked according to the average fuzzy score. The ranking can help decide whether certain objects should be preserved or discarded [51].
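To make S3 and S4 concrete, the following Python sketch applies the three preconditions to the ratings of a single sub-criterion. The linguistic-to-TFN mapping, the interpretation of the 75% agreement ratio as the share of experts within the distance threshold, and the example scores are illustrative assumptions; they do not reproduce Table 1 or the collected expert data.

```python
# Minimal sketch of the FDM computations in S3-S5 under assumed TFN values.
import numpy as np

# Assumed 5-point linguistic scale mapped to triangular fuzzy numbers (m1, m2, m3).
TFN = {
    1: (0.0, 0.0, 0.2),   # strongly disagree
    2: (0.0, 0.2, 0.4),   # disagree
    3: (0.2, 0.4, 0.6),   # moderately agree
    4: (0.4, 0.6, 0.8),   # agree
    5: (0.6, 0.8, 1.0),   # strongly agree
}


def fdm_evaluate(scores, d_threshold=0.2, agreement_threshold=0.75, alpha_cut=0.5):
    """Apply the three FDM preconditions to one criterion.

    scores: list of Likert scores (1-5), one per expert.
    Returns (accepted, mean_d, agreement_ratio, fuzzy_score_A).
    """
    r = np.array([TFN[s] for s in scores], dtype=float)   # K x 3 fuzzy numbers
    r_avg = r.mean(axis=0)                                # average TFN over K experts

    # Eq. (1): vertex distance between each expert's TFN and the average TFN.
    d = np.sqrt(((r - r_avg) ** 2).mean(axis=1))

    # Precondition 1: mean distance below the threshold (consensus).
    mean_d = d.mean()
    # Precondition 2 (assumed interpretation): share of experts within the threshold >= 75%.
    agreement = (d <= d_threshold).mean()
    # Precondition 3: defuzzified score A = (m1 + m2 + m3) / 3 >= alpha-cut of 0.5, Eq. (2).
    A = r_avg.sum() / 3.0

    accepted = (mean_d < d_threshold) and (agreement >= agreement_threshold) and (A >= alpha_cut)
    return accepted, mean_d, agreement, A


# Example: 16 hypothetical expert ratings for one sub-criterion.
print(fdm_evaluate([5, 5, 4, 5, 4, 5, 5, 4, 5, 5, 4, 5, 5, 4, 5, 5]))
```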

3.2 Multi-perspective DM

The multi-perspective DM is an essential part of the proposed framework for the standardisation and selection of image dehazing algorithms. This matrix comprises decision alternatives and evaluation criteria based on multiple foggy perspectives. In addition to the scenario, the evaluation criteria in the DM include the main and sub-criteria, which are used to measure the quality and TC of image dehazing algorithms from three perspectives, namely inhomogeneous, homogeneous and dark foggy scenes. In other words, a user can benchmark real-time image dehazing algorithms based on these perspectives simultaneously through the proposed DM to determine the best algorithm. The DM uses the experiment data extracted from the LIVE Image Defogging Database [114] to evaluate nine algorithms, namely Dehazenet [115], MSCNN [116], Colores et al. [117], Zhu [118], multi-band [119], CO-DHWT [120], Meng et al. [121], Liu et al. [122] and Berman et al. [123]. The data are generated from the crossover between the nine algorithms and the image dehazing sub-criteria that are defined based on the literature (TC) and the standardisation process (quality sub-criteria) by FDM. These data, along with the identified sub-criteria, are used to evaluate each image dehazing algorithm. The complete data of this matrix are presented in Sect. 4.2. For the evaluation, the algorithms were executed in MATLAB 2018a on a personal computer with a Windows 10 operating system, an Intel Core (TM) i7 processor and 8 GB of RAM.

Table 2 illustrates the multi-perspective DM, whose values are obtained from the evaluation of the quality and TC of the nine image dehazing algorithms.

Table 2 Multi-perspective DM

3.3 Hybrid entropy–VIKOR

To develop a procedure for selecting the best real-time image dehazing algorithm based on multi-perspective DM, a hybrid entropy–VIKOR method is introduced, in which the weight of the criterion from the entropy method is amalgamated with the other steps of VIKOR (Fig. 1). VIKOR is applied to address the practical problems related to the (1) multiple evaluation criteria for each perspective, and the (2) trade-off and conflicting issues experienced by the proposed DM. Meanwhile, entropy is utilised to find the weights of the criteria and to determine (3) the importance of the criteria used by the proposed DM. The steps are summarised as follows:

Fig. 1

Standardisation and selection framework for real-time image dehazing algorithms

3.3.1 Entropy weights for standardised criteria

Based on the evaluation data in Sect. 3.2, the weights of various standardised criteria are determined as follows by using the entropy method [124]:

  • S1 Normalise the evaluation criteria as

    $$r_{ij} = \frac{{x_{ij} }}{{\mathop \sum \nolimits_{i = 1}^{m} x_{ij} }}$$
    (5)

    A DM of the multi-criteria problem has m alternatives and n criteria, where \(x_{ij}\) \(\left( {i = 1, 2, \ldots , m;\; j = 1, 2, \ldots , n} \right)\) denotes the performance value of the \(i\)th alternative with respect to the \(j\)th standardised criterion.

  • S2 For each standardised criterion, the entropy values \(e_{j}\) are calculated as

    $$e_{j} = - h\mathop \sum \limits_{i = 1}^{m} r_{ij} \ln r_{ij} ,\quad j = 1, 2, \ldots , n$$
    (6)

    where \(h\) is the entropy constant and is equal to \(\left( {\ln m} \right)^{ - 1}\), and \(r_{ij} .\ln r_{ij}\) is equal to 0 if \(r_{ij} = 0\) [125].

  • S3 Define the divergence of each criterion as

    $$d_{j} = 1 - e_{j} ,\quad j = 1, 2, \ldots , n$$
    (7)

    A higher \(d_{j }\) indicates the greater importance of the \(j\)th criterion.

  • S4 Determine the weight of each standardised criterion as

    $$w_{j} = \frac{{d_{j} }}{{\mathop \sum \nolimits_{j = 1}^{n} d_{j} }},\quad j = 1, 2, \ldots , n$$
    (8)

    where \(w_{j}\) is the degree of importance of criteria \(j\).

A lower entropy value corresponds to a greater entropy weight, which in turn suggests that a specific criterion provides more information and is more significant than other decision-making criteria [125].
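The following Python sketch implements Eqs. (5)–(8) for a small placeholder decision matrix; the matrix values are illustrative assumptions and do not correspond to the DM reported in Sect. 4.2.

```python
# Minimal sketch of the entropy weighting in Eqs. (5)-(8).
import numpy as np


def entropy_weights(X):
    """X: (m alternatives) x (n criteria) matrix of non-negative performance values."""
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    # Eq. (5): column-wise normalisation.
    R = X / X.sum(axis=0, keepdims=True)
    # Eq. (6): entropy per criterion, with r*ln(r) taken as 0 when r = 0.
    h = 1.0 / np.log(m)
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(R > 0, R * np.log(R), 0.0)
    e = -h * term.sum(axis=0)
    # Eq. (7): divergence, and Eq. (8): normalised weights.
    d = 1.0 - e
    return d / d.sum()


# Placeholder 4-alternative x 3-criterion matrix (illustrative values only).
print(entropy_weights([[2.1, 0.6, 9.3],
                       [1.8, 0.7, 8.1],
                       [2.5, 0.5, 7.7],
                       [2.0, 0.9, 8.8]]))
```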

3.3.2 VIKOR for ranking real-time algorithms

In the decision-making process, the weighted matrix is used as the basis for ranking the available alternatives. In Sect. 3.3.1, standardised criteria weights are obtained and applied for each criterion in the DM to obtain a weighted DM. Based on this weighted DM, the real-time image dehazing algorithms are assessed and ranked. The ranking process is described as follows [126]:

  • Step 1 Identify the best \(f_{i}^{*}\) and worst \(f_{i}^{ - }\) values of all criterion functions, \(i = 1, 2, \ldots , n\). If the \(i\)th function represents a benefit, then

    $$f_{i}^{*} = \mathop {\hbox{max} }\limits_{j} f_{ij} ,f_{i}^{ - } = \mathop {\hbox{min} }\limits_{j} f_{ij} ,$$
    (9)

    where \(f_{ij}\) is the value of the \(i\)th criterion function for alternative \(x_{j}\). The ideal solution maximises the benefit criteria and minimises the cost criteria, whereas the negative ideal solution maximises the cost criteria and minimises the benefit criteria. The so-called benefit criteria are the maximisation criteria, whereas the cost criteria are the minimisation criteria [127].

  • Step 2 Calculate the criteria weights based on entropy. A set of weights \(w = \left( {w_{1} , w_{2} , w_{3} , \ldots , w_{j} , \ldots , w_{n} } \right)\), which sums to 1, is accommodated in the DM. The corresponding matrix can be determined as

    $$WM = w_{i} *\frac{{f_{i}^{*} - f_{ij} }}{{f_{i}^{*} - f_{i}^{ - } }}$$
    (10)

    which produces the following weighted matrix:

    $$\left[ {\begin{array}{*{20}l} {w_{1} (f_{1}^{*} - f_{11} )/(f_{1}^{*} - f_{1}^{ - } )} \hfill & {w_{2} (f_{2}^{*} - f_{12} )/(f_{2}^{*} - f_{2}^{ - } )} \hfill & \ldots \hfill & {w_{n} (f_{n}^{*} - f_{1n} )/(f_{n}^{*} - f_{n}^{ - } )} \hfill \\ {w_{1} (f_{1}^{*} - f_{21} )/(f_{1}^{*} - f_{1}^{ - } )} \hfill & {w_{2} (f_{2}^{*} - f_{22} )/(f_{2}^{*} - f_{2}^{ - } )} \hfill & \ldots \hfill & {w_{n} (f_{n}^{*} - f_{2n} )/(f_{n}^{*} - f_{n}^{ - } )} \hfill \\ \vdots \hfill & \vdots \hfill & \vdots \hfill & \vdots \hfill \\ {w_{1} (f_{1}^{*} - f_{m1} )/(f_{1}^{*} - f_{1}^{ - } )} \hfill & {w_{2} (f_{2}^{*} - f_{m2} )/(f_{2}^{*} - f_{2}^{ - } )} \hfill & \ldots \hfill & {w_{n} (f_{n}^{*} - f_{mn} )/(f_{n}^{*} - f_{n}^{ - } )} \hfill \\ \end{array} } \right]$$
  • Step 3 Calculate the \(S_{j}\) and \(R_{j}\) values for \(j = 1, 2, \ldots , m\) and \(i = 1, 2, \ldots , n\) as

    $$S_{j} = \mathop \sum \limits_{i = 1}^{n} w_{i} *\frac{{f_{i}^{*} - f_{ij} }}{{f_{i}^{*} - f_{i}^{ - } }}$$
    (11)
    $$R_{j} = \mathop {\hbox{max} }\limits_{i} w_{i} *\frac{{f_{i}^{*} - f_{ij} }}{{f_{i}^{*} - f_{i}^{ - } }}$$
    (12)

    where \(S_{j}\) and \(R_{j}\) denote the utility and regret measures for alternative \(j\), respectively, and \(w_{i}\) specifies the relative weight of criterion \(i\).

  • Step 4 Compute \(Q_{j}\) for \(j = 1, 2, \ldots , m\) by using the following relation:

    $$Q_{\text{j}} = \frac{{{\text{v}}\left( {S_{\text{j}} - S^{*} } \right)}}{{S^{ - } - S^{*} }} + \frac{{\left( {1 - {\text{v}}} \right)\left( {R_{\text{j}} - R^{*} } \right)}}{{R^{ - } - R^{*} }}$$
    (13)

    where

    $$\begin{aligned} S^{*} = \mathop {\hbox{min} }\limits_{j} S_{j } ,S^{ - } = \mathop {\hbox{max} }\limits_{j} S_{j } \hfill \\ R^{*} = \mathop {\hbox{min} }\limits_{j} R_{j } ,R^{ - } = \mathop {\hbox{max} }\limits_{j} R_{j } \hfill \\ \end{aligned}$$

    v is presented as the strategy weight of ‘the majority of the criteria’ (or ‘the maximum group utility’), where \(v = 0.5\).

  • Step 5 Rank the alternatives based on \(Q_{\text{j}}\). A lower \(Q_{\text{j}}\) indicates a better alternative. In other words, the alternative (\(a^{\prime}\)) is identified as the best by the measure \(Q\) (minimum) if the following rules are satisfied:

    R1. ‘Acceptable advantage’

    $$Q(a^{\prime\prime}) - Q(a^{\prime}) \ge DQ$$
    (14)

    where (\(a^{\prime\prime}\)) is the alternative ranked second according to \(Q\), \(DQ = 1/\left( {j - 1} \right)\) and \(j\) is the number of alternatives.

    R2. ‘Stability’ is acceptable within the decision-making context. Alternative \(a^{\prime}\) should also be determined as the best by \(S\) and/or \(R\). This compromise solution is stable within the decision-making process and can be treated as ‘voting by majority rule’ \((v > 0.5)\), ‘by consensus’ \(\left( {v \cong 0.5} \right)\) or ‘with veto’ (\(v < 0.5\)), where v represents the decision-making strategy weight of ‘the majority of criteria’ (or ‘the maximum group utility’). A lower \(Q\) value indicates that a certain algorithm achieves better overall evaluation criteria values than the other algorithms.
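To make the procedure above concrete, the following Python sketch implements Eqs. (9)–(13) for a small placeholder decision matrix; the weights, the benefit/cost flags and the data are illustrative assumptions and do not correspond to Table 7, and v = 0.5 follows the setting stated in Step 4.

```python
# Minimal sketch of the VIKOR steps in Eqs. (9)-(13); weights would normally come
# from the entropy method above.
import numpy as np


def vikor_rank(X, weights, benefit, v=0.5):
    """X: (m alternatives) x (n criteria); benefit: True for maximisation criteria."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Eq. (9): best f* and worst f- per criterion, depending on its direction.
    f_star = np.where(benefit, X.max(axis=0), X.min(axis=0))
    f_minus = np.where(benefit, X.min(axis=0), X.max(axis=0))
    # Eq. (10): weighted normalised distance from the best value.
    D = w * (f_star - X) / (f_star - f_minus)
    # Eqs. (11)-(12): utility S and regret R per alternative.
    S, R = D.sum(axis=1), D.max(axis=1)
    # Eq. (13): compromise index Q (lower is better).
    Q = v * (S - S.min()) / (S.max() - S.min()) + (1 - v) * (R - R.min()) / (R.max() - R.min())
    return S, R, Q, np.argsort(Q)


# Illustrative 3-alternative x 3-criterion example (placeholder values only).
S, R, Q, order = vikor_rank([[0.9, 12.0, 0.85],
                             [0.7, 15.0, 0.80],
                             [0.8, 10.0, 0.90]],
                            weights=[0.4, 0.3, 0.3],
                            benefit=np.array([True, False, True]))
print(Q, order)  # order[0] is the top-ranked alternative
```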

4 Results and discussion

This section presents the findings of the proposed standardisation and selection framework. Section 4.1 presents the standardisation results based on FDM, Sect. 4.2 presents the results of multi-perspective DM, Sect. 4.3 presents the entropy results, and Sect. 4.4 presents the VIKOR results.

4.1 FDM results

Different decision makers have varying objectives and expectations, and their judgment is influenced by the criteria for evaluating image dehazing algorithms from different perspectives. According to [128], there is no limit to the number of experts who can participate in FDM. Meanwhile, [129] suggested that 8 to 12 experts are enough only if they have homogeneous backgrounds. Nevertheless, previous studies typically employ 3 to 15 experts [130], whereas others have employed 16 [131], 20 [132] and 17 experts [50]. Therefore, a panel of 16 experts, which is within the recommended range, can be considered adequate for this study. According to [108], if the value of d is less than 0.2, then a consensus is achieved amongst the experts. Specifically, if the percentage of agreement amongst experts m × n is greater than 75% [109], then one can proceed with the other FDM steps. Otherwise, a second round of FDM should be conducted. Data were collected from these experts after two rounds. The first round involved 20 experts from different organisations and with different backgrounds, but their percentage of agreement on item agreeability was only 68% (less than 75%); therefore, a second round of FDM was performed. The experts' responses were gathered by interviewing them and by scaling their responses on hard copies of the expert opinion forms. Afterwards, by using information obtained from Google Scholar, ResearchGate and official university websites, links leading to the online version of the form were sent to those experts based overseas or those who preferred to answer the form online instead of receiving a hard copy.

Based on their backgrounds, the experts were categorised into three domains, namely image processing, image dehazing and IQA. As shown in Table 3, most of the feedback was obtained from decision makers with experience in image dehazing and processing and from some experts in the image quality domain. Most of these experts have work experience of over 15 years, with some having only 6–10 years of experience. Apart from the experts based in Iraq, the experts from the USA, the UK and China were affiliated with universities or organisations based in Malaysia.

Table 3 Experts panel

To apply fuzzy operations on the input data, the data collected from the 16 target experts were converted from linguistic forms into crisp and fuzzy numbers (Table 4). The average minimum value (m1), most plausible value (m2) and maximum value (m3) should be considered in each reported answer. TFN aims to capture the fuzziness or vagueness in an expert's opinion: each opinion carries a degree of uncertainty that cannot be measured by a Likert scale given its fixed scores. For example, suppose an expert gives the item ‘CNR’ a score of 5 (‘strongly agree’). This score is translated into a minimum, most plausible and maximum rating of 0.6, 0.8 and 1.0, respectively, indicating that the expert's agreement with the element is 60%, 80% and 100%.
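As a short worked check under the TFN mapping above (an illustrative calculation, not a value taken from Table 4), the centroid defuzzification of Eq. (2) applied to this rating gives

$$A = \frac{1}{3}\left( {0.6 + 0.8 + 1.0} \right) = 0.8 \ge 0.5,$$

so an item rated ‘strongly agree’ by every expert would satisfy the α-cut precondition in S4.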

Table 4 D value condition results

To check whether a certain criterion is suitable, three preconditions should be satisfied. The d-value specifies the acceptability of a criterion based on the consensus amongst experts. By finding the difference between the average fuzzy number and the fuzzy number given by each expert, the d-value for each criterion can be determined. The post-data analysis results in Table 5 show that all sub-criteria have satisfied the first precondition of acceptability (d-value ≤ 0.2), except for EBCM, which has obtained a d-value of 0.212. Therefore, this sub-criterion was discarded given the lack of consensus amongst experts. A total of 25 sub-criteria remained.

Table 5 Expert agreement and average fuzzy score

The second precondition for item acceptability is to achieve an agreement percentage of no lower than 75%. Meanwhile, the third precondition is the defuzzification process, where each fuzzy number is converted into a crisp number by obtaining the score from the average fuzzy number of each sub-criterion. The same sub-criteria were also included in the calculation of the percentage of agreement. Table 6 shows the results for these preconditions.

Table 6 Summary of findings

As shown in Table 5, the percentages of agreement on the image dehazing sub-criteria are 63%, 69%, 88% and 94%, respectively. In other words, the majority of the sub-criteria, including IVM, VCM, STD, AG, IE, GCF, AMPL, Loss, CNR, VSNR, WSNR, Sharpness, VIF, CIEDE2000s, CEF, CCI, MS-SSIM and IW-SSIM, did not reach enough agreement amongst the experts (less than 75%). In this case, these criteria were discarded along with EBCM. By contrast, e, \(\bar{r}\), CG, ∑, HCC, SSIM and UQI received adequate agreement amongst the experts (more than 75%). Most of these sub-criteria scored 94%, with only HCC receiving an agreement percentage of 88%. However, unlike the first round of FDM, the second round achieved an agreement percentage of 75%, thereby confirming that the sub-criteria were accepted by all experts. Only seven sub-criteria successfully satisfied the second precondition of item acceptability.

A defuzzification analysis was performed to check whether the sub-criteria satisfied the third precondition. As shown in Table 5, most of these sub-criteria failed to score 0.5 or above. Only e, \(\bar{r}\), CG, ∑, HCC, SSIM and UQI successfully satisfied this precondition with scores of greater than 0.5, except for HCC, which scored exactly 0.5. This result again confirms the consistency of the results for these seven sub-criteria. In the last step of FDM, the target experts ranked the sub-criteria according to their average fuzzy numbers. As shown in Table 7, each of the sub-criteria that satisfied all the aforementioned preconditions was given a high ranking by the experts. UQI ranked the highest, followed by SSIM, ∑, e, \(\bar{r}\), CG and HCC. Meanwhile, the 19 sub-criteria that failed to meet the preconditions were given low rankings. On the whole, the experts gave higher priority to SSIM than to IV and CR.

Table 7 DM data

Table 6 summarises the FDM results and the achievement of the three preconditions.

Only seven sub-criteria, namely e, \(\bar{r}\), CG, ∑, HCC, SSIM and UQI, were accepted by the panel of experts for evaluating image dehazing algorithms, whereas the other 19 sub-criteria were rejected.

4.2 Multi-perspective DM data

Based on the evaluation data mentioned in Sect. 3.2 and the standardised sub-criteria in Sect. 4.1, the data were generated from the crossover between algorithms and standardised sub-criteria. Table 7 presents the completed DM, wherein nine real-time algorithms are evaluated based on eight evaluation criteria from three evaluation perspectives.

4.3 Entropy weighting results

Based on the DM data in Sect. 4.2, weighting was performed objectively for eight criteria for each foggy scene as shown in Table 8.

Table 8 Entropy values and weights for standardised criteria

The entropy values and weights in Table 8 are obtained by using Eqs. (5)–(8). In the three foggy scenes, HCC and UQI achieved the maximum and minimum entropy weights, respectively. The highest entropy weight criteria were considered the key criteria, whereas the lowest entropy weight criteria were deemed unimportant.

4.4 VIKOR ranking

The real-time image dehazing algorithms were ranked based on the multi-perspective weighted DM, which was obtained by using Eqs. (9) and (10). e, r, CG and HCC were selected as the benefit criteria, whereas Σ, SSIM, UQI and TC were selected as the cost criteria. By using Eqs. (11) and (12), the distances of the alternatives from the positive and negative ideal solutions were determined. Equation (13) was used to calculate the Qi values of nine real-time image dehazing algorithms. These algorithms were then ranked, and the optimal one was selected based on VIKOR. The results are shown in Table 9.

Table 9 VIKOR ranking results

MSCNN outranks the other algorithms, having obtained the minimum values of Si, Ri and Qi. Therefore, MSCNN was selected as the optimal real-time image dehazing algorithm. The ranking shown in the above table can be considered the final ranking result and serves as the basis of the validation.

5 Validation procedure

Ranking real-time image dehazing algorithms is difficult because the ranking depends on multiple conflicting criteria. The results of the framework were validated by using the objective approach proposed in [60, 61, 126]. Following [133], the real-time image dehazing algorithms were classified into several groups, and the results for each group were expressed as mean values to validate the rankings provided by the proposed framework. The mean of each group was calculated by dividing the sum of the obtained results by the number of results as follows:

$$\bar{x} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} x_{i}$$
(15)

where \(x_{i}\) denotes the individual values and \(n\) is the number of items.

The first group of algorithms obtained a lower mean compared with the two other groups. The second group obtained a mean that was lower than or equal to that of the third group yet higher than that of the first group. In turn, the mean of the third group was higher than that of the first group and greater than or equal to that of the second group. The variance in the mean values of the selected image dehazing algorithms ensures that the results are consistently (systematically) ranked [60, 61, 126]. Table 10 presents the results after the normalisation and weighting of the raw data of the three groups.

Table 10 Validation results

Table 10 presents the validation results. Obviously, the first group had a lower mean than the second group, whilst the mean of the second group was lower than that of the third group. Therefore, the ranking of the real-time image dehazing algorithms is validated, and the algorithms are systematically ranked.
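A minimal sketch of this grouping check is given below; it assumes the algorithms' normalised and weighted results are already listed in the final ranking order, and the three-way split and the placeholder scores are illustrative assumptions rather than the values in Table 10.

```python
# Minimal sketch of the group-mean validation: split the ranked results into groups
# and check that the group means (Eq. 15) are monotonically ordered.
import numpy as np


def validate_ranking(ranked_scores, n_groups=3):
    """ranked_scores: results listed in the final ranking order (best first)."""
    groups = np.array_split(np.asarray(ranked_scores, dtype=float), n_groups)
    means = [g.mean() for g in groups]   # Eq. (15) applied per group
    # Systematic ranking: each group's mean should not exceed the next group's mean.
    ok = all(means[i] <= means[i + 1] for i in range(len(means) - 1))
    return means, ok


# Placeholder values for nine ranked algorithms (illustrative only).
print(validate_ranking([0.00, 0.12, 0.18, 0.35, 0.41, 0.47, 0.66, 0.81, 1.00]))
```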

6 Conclusion

The main contribution of this paper lies in its development of a framework for standardising evaluation criteria based on FDM and its selection of an optimal real-time image dehazing algorithm from multi-foggy scenes based on hybrid MCDM methods. A total of 6 main criteria and 26 sub-criteria were identified based on the literature review. The classification and usage frequency of these criteria were reviewed, and the sub-criteria were categorised into three groups. The majority of the 26 sub-criteria were classified under IV. The e, \(\bar{r}\), ∑ and SSIM sub-criteria have been widely used in the literature. FDM was employed to standardise the sub-criteria according to expert opinions. All 26 sub-criteria were required to satisfy the three preconditions of FDM. e, \(\bar{r}\), CG, ∑, HCC, SSIM and UQI were used as criteria in the multi-perspective DM. The optimal real-time image dehazing algorithm was selected based on three foggy scenes. The processes and steps of the proposed framework were also outlined. The multi-perspective DM was constructed based on the crossover between the standardised criteria and the nine real-time image dehazing algorithms. The selection procedure was formulated according to the proposed hybrid entropy–VIKOR method. The final weights obtained from the entropy method highlighted the importance of each image dehazing criterion based on three foggy scenes. VIKOR was adopted to rank and select the best real-time image dehazing algorithm according to the quantitative information of the measured criteria. The results reveal that (1) FDM effectively solves the challenges in the standardisation of image dehazing criteria, (2) the hybridisation of the entropy and VIKOR methods effectively solves the challenges in the selection of the optimal image dehazing algorithm, (3) 19 sub-criteria failed to satisfy the fuzzy Delphi preconditions, whereas e, \(\bar{r}\), CG, ∑, HCC, SSIM and UQI satisfied the acceptability preconditions, (4) the rankings of real-time image dehazing algorithms obtained from VIKOR identify MSCNN as the best algorithm, and (5) the ranking results are valid. Several technical points need to be addressed in future studies. Specifically, these studies should confirm the contributions of this work by conducting objective experiments. Additional criteria also warrant further examination and need to be included in the multi-perspective DM. Nevertheless, the standardised criteria can be used in any evaluation and benchmarking scenario in the image dehazing domain. The observations presented in this work may also be considered when designing a new image dehazing metric dedicated to evaluating the performance of an algorithm.