1 Introduction

The Internet of Things (IoT) is a rapidly developing technology that substantially affects the way humans live [1]. IoT is ‘a global infrastructure for the information society, enabling advanced services by interconnecting (physical and virtual) things based on existing and evolving interoperable information and communication technologies’ [2,3,4,5,6]. IoT provides capabilities that allow devices to communicate with one another and exchange information automatically in real time [7]. IoT is widely used in several fields for different applications [8,9,10,11]. Image processing is an IoT application that has matured considerably and become a major part of people’s lives. IoT applications have created many trends in the image processing field [12]. IoT and image processing have also been used independently in different applications [5]. IoT makes it easy to monitor and obtain image data through the Internet [13]. The need for high-quality image data is increasing to the extent that real-time skin detection is required [9, 14]. Real-time skin detection within IoT is implemented on the basis of artificial intelligence (AI) algorithms, which can detect human skin by using cameras in real time [15]. Thus, real-time skin detection within IoT based on AI algorithms requires functional and efficient segmentation mechanisms, and many studies have applied various adaptability strategies for soft computing techniques [16, 17]. Real-time skin detection within IoT is important in various applications, such as gesture analysis, facial analysis, human–machine interfaces, image content filtering, video surveillance, annotation and colour-balancing applications [16, 18]. With the rapid development of skin detection approaches in different applications and their recent integration with IoT, finding a reliable, effective and comprehensive evaluation and benchmarking methodology has become essential.
Evaluating the performance of an application based on IoT services is a challenge because many considerations and factors are involved. The acquired data in IoT may not meet the requirements of systems because of the following factors. Firstly, data in practice are commonly noisy because of environmental noise and the sensing devices’ deviation and limited sensing accuracy; deviation and sensing accuracy may differ from device to device (D2D). Secondly, data may be corrupted by malicious data. Thirdly, data can be incomplete; as such, the numerical limit of sensor devices and constrained sensing cost must be considered. Fourthly, even accurate and complete data can be outdated for the demand [14, 19]. Many studies have dealt with different criteria commonly used in IoT-based skin detection. For example, reliability and time complexity have been utilised [19, 20], whereas other works have considered only reliability criteria to evaluate results in an IoT-based system [21, 22]. According to [23], the design of any skin detection approach should adapt to the criteria (reliability/time complexity/dataset). Early studies emphasised the problems of skin detection evaluation and benchmarking and mentioned three general requirements, namely (1) reliability (i.e. obtained skin detection rate and false positives [FPs]), (2) time complexity and (3) datasets (i.e. obtained equal error rate from a histogram classifier). By contrast, evaluating the performance of histogram models has been observed to be time-consuming. In another study [24], the dataset criterion is highlighted by comparing two algorithms despite the significance of the remaining criteria. For skin segmentation images of nonskin and skin pixels, the dataset is divided into training and testing sets. The classifier output images are then compared pixelwise with the ground truth of skin segmentation.
Consistent with [25], our study reports a skin detection algorithm examined with images from independent databases. In general, image size must be considered in the evaluation of time complexity. This research shows that increasing image accuracy enhances the training data, thereby increasing the time complexity of the experiment. Reliability is a prerequisite for the evaluation of skin detection [26]. Studies have emphasised that the reliability criterion primarily depends on the accuracy, precision and recall of image colour despite the significance of the remaining criteria. Nonetheless, further high-quality assessment of skin detection is needed. Two parameters, namely reliability and dataset, have been reported [27]. A relationship between the reliability and dataset criteria has also been described. In particular, the computation of reliability depends on a large dataset with a manually described ground truth by using an ROC curve based on the testing methodology. By contrast, time complexity and reliability have been studied, and their effects on each other have been emphasised [28]. Some studies have argued that their relationship depends on an adopted algorithm’s complex background, whereas other studies have highlighted that obtaining high accuracy and excellent quality throughout the duration of skin detection depends on processor computation features. A previous study [29] proposed three individually evaluated criteria for skin detection and discussed their effects. In another study [24], the basic criteria for reliability computation (i.e. the reliability group) include training and testing (i.e. the dataset group) and time complexity, which are used to evaluate and benchmark skin detection approaches. Consequently, the major challenge in creating a skin detection approach is the conflict amongst the skin detection evaluation criteria. Therefore, these requirements must be considered in evaluation and benchmarking.
In other studies [23,24,25,26], all the proposed criteria have been adopted regardless of the trade-off amongst them. These studies have also evaluated the reliability criterion for a given time complexity dependent on diverse datasets and shown that the percentage of reliability varies with the adopted algorithm. Therefore, the reliability criterion no longer has a constant level. The variation in time complexity amongst algorithms depends on the CPU time [30]. For any image within the scope of the present research, processing time is an important factor in the evaluation process. Therefore, calculations must achieve the highest percentage of reliability together with the lowest time complexity for the output image. A dataset can be classified into two classes, namely training and validation data, to find a minimum detection error [31]. Thus, these studies confirmed the evaluation process for each criterion based on independent guidelines.

Previous studies also suggested the reliability criterion but did not refer to a specific level during the comparison between different criteria. Therefore, conflicting criteria or trade-off problems amongst reliability, time complexity and error rate within a dataset of a skin detector approach are clearly reported in prior studies [27]. Notably, evaluation problems in skin detection are defined as multi-criterion problems with conflicting criteria. However, the following must be ascertained: (1) whether all the available criteria are used in the evaluation and (2) whether all the criteria in the evaluation are used as multidimensional measurements. Thus, the performance of the criteria is highlighted by finding the relationship between criteria, calculating the correlation coefficient and identifying the behaviour of different criteria based on different thresholds with various colour spaces, and these processes are necessary to design a new methodology for the testing, evaluating and benchmarking of real-time IoT skin detection approaches. Moreover, the new methodology should be flexible and capable of handling the conflicting-criterion problems after the current criterion needs are maintained [32,33,34]. Therefore, using structured and explicit approaches in decisions that involve multiple attributes can improve the quality of decision-making [35,36,37], and a set of techniques, classified under the collective heading multiple criteria decision analysis (MCDA), is useful for this purpose. MCDA is a sub-discipline of operational research and explicitly considers multiple criteria in decision-making conditions that occur in various actual situations in different domains [2]. Several useful techniques can be utilised to address multi-attribute decision-making or multi-criterion decision-making (MADM/MCDM) problems in the real world [38,39,40,41,42].
These methods not only help decision-makers (DMs) organise problems but also analyse, rank and score alternatives [2, 3, 43, 44]. Accordingly, the scoring of a suitable alternative should be performed. MADM/MCDM methods can solve benchmarking problems for real-time IoT skin detectors. In any MADM/MCDM benchmarking, fundamental terms should be defined in a decision matrix form that includes m alternatives and n criteria. A crossover between all criteria and alternatives is represented by \(x_{ij}\). Ultimately, we obtain the \((x_{ij})_{m \times n}\) matrix, which is expressed as follows:
$$\left( {x_{ij} } \right)_{m \times n} = \left[ {\begin{array}{cccc} {x_{11} } & {x_{12} } & \cdots & {x_{1n} } \\ {x_{21} } & {x_{22} } & \cdots & {x_{2n} } \\ \vdots & \vdots & \ddots & \vdots \\ {x_{m1} } & {x_{m2} } & \cdots & {x_{mn} } \\ \end{array} } \right],$$
where \(A_{1} , A_{2} , \ldots , A_{m}\) are the possible alternatives that decision-makers must score (i.e. skin detection approaches); \(C_{1} , C_{2} , \ldots , C_{n}\) are the criteria against which the overall performance of all alternatives is measured (i.e. reliability, time complexity and error rate within the dataset); \(W_{1} , W_{2} , \ldots , W_{n}\) are the weights of the criteria; \(x_{ij}\) is the rating of alternative \(A_{i}\) with respect to criterion \(C_{j}\); and \(W_{j}\) is the weight of criterion \(C_{j}\). Certain techniques, such as normalisation, the use of maximisation indicators and the addition of weights, need to be applied to rank each alternative [45, 46].
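As a minimal sketch of these steps (normalisation, cost/benefit handling, weighting and scoring), the following ranks three hypothetical skin-detector alternatives with a simple weighted sum; all matrix values and weights are invented for illustration and do not come from the study:

```python
import numpy as np

# Hypothetical decision matrix: m = 3 alternatives (rows), n = 3 criteria
# (columns): reliability (benefit), time complexity (cost), error rate
# within the dataset (cost).  Values are illustrative placeholders.
X = np.array([
    [0.90, 2.5, 0.08],
    [0.85, 1.2, 0.10],
    [0.95, 4.0, 0.05],
])
benefit = np.array([True, False, False])   # maximise vs. minimise
W = np.array([0.5, 0.3, 0.2])              # criterion weights, sum to 1

# Vector normalisation, then flip cost criteria so "larger is better".
R = X / np.sqrt((X ** 2).sum(axis=0))
R[:, ~benefit] = R[:, ~benefit].max(axis=0) - R[:, ~benefit]

scores = (R * W).sum(axis=1)               # simple weighted-sum score
ranking = np.argsort(-scores)              # best alternative first
```

TOPSIS, used later in this paper, replaces the weighted sum with distances to ideal and anti-ideal solutions, but the matrix set-up is the same.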

In this study, we present a new methodology to aid the decision-making process for evaluating and benchmarking IoT real-time skin detectors based on a multi-agent learning neural network and a Bayesian model through multi-criterion analysis. The novelty of this study is in the utilisation of an evaluation matrix for the performance evaluation of real-time skin detectors that are based on IoT. The remaining sections of this paper are organised as follows. In Sect. 2, we describe the methodology. In Sect. 3, we present the results and discussion. In Sect. 4, we discuss the validation of the results of the proposed methodology. In Sect. 5, we draw our conclusions.

2 Methodology

The proposed methodology comprises three phases (Fig. 1). In the first phase, data are collected from a real-time camera within cloud IoT to gather different images. In the second phase, the best previous case of the skin detector is adapted using multi-agent learning, depending on different colour spaces, to create a dataset of different colour space samples for benchmarking and to perform a crossover between the adopted skin detector and the multi-evaluation criteria groups (i.e. reliability, time complexity and error rate within the dataset) gathered for decision-making. The evaluation and testing of the adopted skin detector depend on these three groups of criteria. This phase also analyses the performance of the different criteria on the basis of their data, with the aim of finding the relationship between them. This operation is implemented to verify whether statistical differences exist between the criteria; otherwise, only one is used. The Pearson formula is adopted to calculate the correlation coefficient between the various criteria, to investigate the relationship between them and determine their degree of correlation. A performance analysis is conducted to evaluate and compare the criteria and identify the factors that affect their behaviour. In the third phase, a new MCDM approach is developed and subsequently used with the integrated TOPSIS (technique for order of preference by similarity to ideal solution) and ML-AHP (multi-layer analytic hierarchy process) as a basis for benchmarking several skin detection approaches by using the decision matrix produced in the second phase.
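The correlation step of the second phase can be sketched as follows; the per-image measurements are invented placeholders, and `numpy.corrcoef` computes the Pearson coefficient mentioned above:

```python
import numpy as np

# Hypothetical per-image measurements for two evaluation criteria,
# e.g. reliability (accuracy) and time complexity (seconds per frame).
reliability = np.array([0.91, 0.88, 0.95, 0.83, 0.90])
time_cost   = np.array([1.40, 1.10, 2.00, 0.90, 1.30])

# Pearson correlation coefficient between the two criteria; a value far
# from zero suggests the criteria are statistically related, a value
# near zero suggests they measure independent dimensions.
r = np.corrcoef(reliability, time_cost)[0, 1]
```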

Fig. 1
figure 1

Methodology of the experimental phases for evaluating and benchmarking the real-time IoT skin detector

2.1 Data collection phase

This phase includes a range of devices and sensors used to monitor and capture images within the IoT system. This system monitors and captures images across a range of cameras when motion is detected. Data are collected in the form of images by cameras scattered in specific locations. These cameras are connected to the Internet and send the captured images to the main centre, which is represented by a central server [47,48,49].

The server collects the data received from the surveillance camera to be processed and configured and then sends the data to terminals via access points deployed in certain locations [50,51,52]. These routers transfer data through Wi-Fi networks [2, 52, 53]. Computers receive the data to be stored and then perform the required processing. Subsequently, a computer vision module is applied to the captured images to reveal the skin, and skin detector approaches are evaluated and benchmarked in the succeeding phase.

2.2 Identification and performance phase

This phase includes two main stages, namely identification of a decision matrix and performance of the decision matrix. These stages are discussed in detail in the following subsections.

2.2.1 Identification of the decision matrix

Ascertaining skin detector approaches is an important stage in the creation of the decision matrix and comprises three main steps: developing the skin detector by using different colour spaces, conducting a crossover between different criteria with developed skin detector engines and evaluating the developed skin detector and testing it against the criteria of the three groups. This stage establishes the decision matrix from a practical aspect, which is explored in detail in the following subsections.

2.2.1.1 Development of skin detector by using multi-agent learning dependent on different colour spaces

This section highlights a case study adopted in the current work. The case study is developed on the basis of the selected colour spaces. This step is important in completing the identification and performance phase. The development of the multi-agent learning technique for the skin detector is discussed in detail.


2.2.1.1.1 Multi-agent learning technique In a new method adopted in a previous study [54], multi-agent learning is used to resolve the three key issues in skin detection. That study solves the skin-like colour problem by utilising the most appropriate parametric skin modelling method (an NN model) with a segment adjacent-nested (SAN) technique. By contrast, the present work involves the most appropriate nonparametric skin modelling method (a Bayesian model) with a grouping histogram (GH) technique to resolve lighting condition problems. Subsequently, the two models are combined to solve the reflection problem with water and glass. The NN method resolves skin-like problems, whereas the Bayesian model alone fails to resolve the lighting condition entirely. Therefore, this experiment is conducted using different colour spaces with the Bayesian model to improve the results of that particular model.


2.2.1.1.2 Adapted colour space Fourteen samples of different colour spaces obtained from previous studies are used in the present study [55]. The procedure applied to the colour groups provides a solution to the lighting condition problem, which depends on the removal of the illumination component from these samples by using the Bayesian model of the proposed multi-agent learning technique. Colour spaces are widely used in studies on skin detection because they address many problems in this field [32, 54, 56,57,58,59,60,61,62,63,64]. In our study, each colour space is built on the separation of the illumination element from the chroma element, and this process is a key step in the development phase. The luminance element is therefore ignored, whereas the chroma element is retained because it is necessary for determining skin colour during skin detection. Eliminating the luminance component is an important step because it reduces the size of skin clusters in a colour space [55, 65].

Different colour spaces are discussed in detail in the following section.

  • Normalised RGB

The RGB colour space can easily be normalised. The components of RGB represent both colour and luminance and are subsequently used to represent skin colour in their chromatic colour space. Luminance can be isolated from the colour space through normalisation. Chromatic colours, also known as ‘pure’ colours in the absence of luminance, are defined as follows:

$${\text{r}} = {\text{R}}/\left( {{\text{R}} + {\text{G}} + {\text{B}}} \right),\quad {\text{g}} = {\text{G}}/\left( {{\text{R}} + {\text{G}} + {\text{B}}} \right).$$

The normalisation removes the luminance information, and the third component b = B/(R + G + B) is omitted because it is redundant (r + g + b = 1) [66,67,68].
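A minimal sketch of this normalisation, assuming plain scalar channel values (the zero-sum guard for pure black is our addition):

```python
def normalise_rgb(R, G, B):
    """Map an RGB triple to chromatic ('pure') colours r, g.

    The third component b = B / (R + G + B) is redundant because
    r + g + b = 1, so only r and g are kept for skin modelling.
    """
    s = R + G + B
    if s == 0:          # guard: avoid division by zero for pure black
        return 0.0, 0.0
    return R / s, G / s

r, g = normalise_rgb(200, 120, 80)
```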

  • YCbCr

YCbCr is ‘an encoded nonlinear RGB signal generally used by European television studios and utilised for image compression patterns’. This colour space is a clear option for skin detection because it efficiently separates luminance and easily converts from RGB and vice versa.

$${\text{Y}} = 0.299{\text{R}} + 0.587{\text{G}} + 0.114{\text{B}},\quad {\text{Cb}} = {\text{B}} - {\text{Y}},\quad {\text{Cr}} = {\text{R}} - {\text{Y}}$$

Y represents the luminance component, which is excluded; only the chrominance components Cb and Cr are used [69, 70].
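The conversion can be transcribed directly from the quoted equations; Y is computed only so that the chrominance differences can be formed:

```python
def rgb_to_ycbcr(R, G, B):
    """RGB -> YCbCr using the equations above.  For skin detection, Y is
    then discarded and only the chrominance pair (Cb, Cr) is kept."""
    Y = 0.299 * R + 0.587 * G + 0.114 * B
    return Y, B - Y, R - Y

Y, Cb, Cr = rgb_to_ycbcr(200, 120, 80)
```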

  • YCgCr

YCgCr is ‘a colour space derived from YCbCr and has a Y channel that provides the luminous component, that is, light intensity; meanwhile, Cg and Cr channels represent the green and red differences in chromaticity components, respectively’. This colour space is used for digital video encoding. Y is discarded, and only the chrominance fraction is adopted in the proposed integrated approach [71, 72].

The YCgCr colour space is generated through the transformation of the RGB values by using the following equations:

$$\begin{aligned} & {\text{Y}} = 16 + 65.481{\text{R}} + 128.553{\text{G}} + 24.966{\text{B}}, \\ & {\text{Cg}} = 128 - 81.085{\text{R}} + 112{\text{G}} - 30.915{\text{B}}, \\ & {\text{Cr}} = 128 + 112{\text{R}} - 93.768{\text{G}} - 18.214{\text{B}}. \\ \end{aligned}$$
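A direct transcription of these equations, assuming R, G and B are normalised to [0, 1] (as the offsets 16 and 128 suggest); only the chrominance pair (Cg, Cr) would be retained:

```python
def rgb_to_ycgcr(R, G, B):
    """RGB (channels in [0, 1]) -> YCgCr with the coefficients above."""
    Y  = 16 + 65.481 * R + 128.553 * G + 24.966 * B
    Cg = 128 - 81.085 * R + 112 * G - 30.915 * B
    Cr = 128 + 112 * R - 93.768 * G - 18.214 * B
    return Y, Cg, Cr

Y, Cg, Cr = rgb_to_ycgcr(1.0, 0.0, 0.0)   # pure red
```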
  • YCgCb

YCgCb is ‘another colour space derived from YCbCr, and the RGB image determines the fitting skin region for each Y as the luminance component and the two chrominances as Cg and Cb’. The luminescent factor is excluded, whereas the chrominance element is retained. If the colour components of a pixel are within the boundaries of a fitting skin region, then this pixel is classified as a skin pixel. The Cg–Cb colour space for skin tone detection is represented by a circular model [73].

Therefore, the circular model for the skin tone in the transformed Cg–Cb space is portrayed as follows:

$$\frac{{\left( {x - {\text{Cg}}} \right)^{2} + \left( {y - {\text{Cb}}} \right)^{2} }}{{12.25^{2} }} \le 1,\quad \left( {\begin{array}{*{20}c} {x = 12.25\cos \theta + 107} \\ {y = 12.25\sin \theta - 110} \\ \end{array} } \right).$$
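The circular test can be sketched as follows; the centre coordinates and radius are read off the parametric form above with the signs treated as magnitudes, so they are illustrative rather than authoritative:

```python
def is_skin_cgcb(cg, cb, centre=(107, 110), radius=12.25):
    """Circular skin-tone test in the Cg-Cb plane: a pixel is classified
    as skin when its chrominance pair falls inside the circle.  Centre
    and radius are illustrative values taken from the model above."""
    dx, dy = cg - centre[0], cb - centre[1]
    return dx * dx + dy * dy <= radius * radius
```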
  • YUV

In this colour space, Y is ‘the luminance component, and UV is represented as the chrominance component’. In YUV, the luminance-related component Y is removed to enhance the performance of the skin detection process. As such, the separation of the luminance component in this colour space is a good step towards obtaining invariance to luminance. Human vision is sensitive to the luminance and chrominance factors in images. The colour space accommodates this sensitivity by assigning bandwidth to luminance in a way that is close to human perception. YUV is derived from the original RGB source. Thus, this colour space can be converted to RGB through linear transformations [72, 74].

The YUV colour space is generated by the transformation of the RGB values by using the following equations:

$$\begin{aligned} & {\text{Y}} = + 0.299{\text{R}} + 0.587{\text{G}} + 0.114{\text{B}}, \\ & {\text{U}} = - 0.14713{\text{R}} - 0.28886{\text{G}} + 0.436{\text{B}}, \\ & {\text{V}} = + 0.615{\text{R}} - 0.51499{\text{G}} - 0.10001{\text{B}}. \\ \end{aligned}$$
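Because the transform is linear, it can be written as a single matrix product, after which the Y row is discarded:

```python
import numpy as np

# RGB -> YUV as one matrix product, using the coefficients above.
M = np.array([
    [ 0.299,    0.587,    0.114  ],
    [-0.14713, -0.28886,  0.436  ],
    [ 0.615,   -0.51499, -0.10001],
])

def rgb_to_uv(rgb):
    """Apply the YUV transform and keep only the chrominance pair."""
    y, u, v = M @ np.asarray(rgb, dtype=float)
    return u, v

u, v = rgb_to_uv([1.0, 0.0, 0.0])   # pure red
```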
  • YIQ

YIQ is nearly identical to the YUV colour space and is composed of luminance Y and chrominance components I and Q. These components are represented as the second pair located on the axes of the diagram (Fig. 2). As such, I and Q denote a different coordinate system on the same plane. Thus, these components can be represented in the RGB values, where I is matched to range B, and Q is matched to range G. This colour space can also be converted to the RGB format through linear transformations and is represented by the following expressions [72, 75], which are used to transform RGB to the YIQ model:

$$\begin{aligned} & {\text{Y}} = \, 0.299{\text{R }} + \, 0.587{\text{G }} + \, 0.114{\text{B}}, \\ & {\text{I}} = \, 0.595716{\text{R}} - 0.274453{\text{G}} - 0.321263{\text{B}}, \\ & {\text{Q}} = \, 0.211456{\text{R}} - 0.522591{\text{G }} + \, 0.311134{\text{B}}. \\ \end{aligned}$$
Fig. 2
figure 2

Proposed skin detectors based on ANN methods involving SAN and a Bayesian model using GH with 12 different colour space classes

  • HSV, HSI and HSL

‘Perceptual colour spaces are popular samples in skin detection’. In these colour spaces, I, V and L represent luminance components, and H and S denote chrominance components. These three colour spaces separate the saturation S and hue H components from the luminance components I, V and L. The colour spaces are distortions of the RGB colour cube and can be mapped from the RGB space by nonlinear transformation. Moreover, these colour spaces allow users to identify the boundary of the skin colour class intuitively in terms of hue and saturation, and this capability is considered the most important feature of these samples. ‘As I, V, or L provides the information of brightness, these components are usually dropped to mitigate the illumination dependency of skin colour’ [75,76,77].
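Dropping the brightness component can be sketched with the standard library, here for HSV (HSI and HSL are handled analogously):

```python
import colorsys

def rgb_to_hs(r, g, b):
    """Convert RGB (each channel in [0, 1]) to HSV via the standard
    library, then drop the brightness component V, keeping only hue
    and saturation for skin modelling."""
    h, s, _v = colorsys.rgb_to_hsv(r, g, b)
    return h, s

h, s = rgb_to_hs(0.8, 0.5, 0.4)   # a skin-like sample colour
```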

  • IHLS

IHLS is also known as the ‘improved hue, luminance and saturation’ colour space. It is enhanced with respect to similar colour spaces, such as HLS, HSI and HSV, through normalisation to remove the luminance component. Therefore, this colour space overcomes the difficulties associated with the colour components, thereby providing a good distribution of the features of the space [78, 79].

  • CIEXYZ

CIEXYZ is ‘one of the perceptual uniformity systems that show that a small perturbation to a component value is approximately equally perceptible across the range of a value’. The Commission Internationale de l’Éclairage (CIE) colour system has been based on the CIE primaries since 1931. The CIEXYZ colour space forms a cone-shaped space with Y as the luminance component and X and Z as chrominance components. The luminance component is dropped to form a 2D colour space. The values of each component are then adapted to the range of 0–255 and quantised in 256 levels [72, 80].

  • CIELAB

CIELAB is ‘a reasonably perceptually uniform colour space proposed by the CIE’. This colour space has two components, which are presented as a and b values according to the 1976 CIELAB colour space. Thus, a and b are the chrominance components in the colour space, and L is the luminance component. In general, CIELAB does not use the luminance component of the colour because it varies frequently across the human skin. Typically, chrominance is used to separate the skin from surrounding nonskin regions [81].

  • CIELUV

CIELUV is ‘another colour space derived from a perceptually uniform colour space proposed by the CIE’. V and U are the chrominance components, and L is the luminance component. In general, the nonlinear transformation of CIELUV and CIELAB is applied to correct the problem that the RGB colour space is not perceptually uniform. Thus, the CIELUV colour space is a suitable sample when the luminance component is disregarded [69, 82].

  • CIELCH

CIELCH is ‘a colour space that is also derived from perceptual uniformity systems created by CIE in which L represents luminance, and c and h denote the chrominance components’. Dropping the illumination components for the first time usually results in multiple errors in the light cluster: lighting pixels depend on the illumination components, whereas dark clusters (absence of light) perform well even when the illumination components are omitted. The minimal components needed cannot always be achieved, possibly mitigating the features, training and testing on the network [83]. Thereafter, different colour spaces developed using AI models based on the literature are adopted in the present work. The total number of colour spaces is 14, divided into 12 classes, namely normalised RGB, YCbCr, YCgCr, YCgCb, YUV, YIQ, (HSI, HSV, HSL), IHLS, CIEXYZ, CIELAB, CIELUV and CIELCH. HSI, HSL and HSV form a single class because the luminance element (I, L or V) is deleted from each of them and the shared chroma element (HS) is retained; thus, the processing is applied only once. This process aims to acquire different alternatives (Fig. 2).


2.2.1.1.3 Training operation of a neural network model In a previous study [54], the case study applies a neural network model in the proposed multi-agent learning of the skin detector that comprises three main layers, namely input, hidden and output layers. The architecture of this model comprises nine neurons in the input layer, four neurons in the hidden layer and a single neuron in the output layer. The model aims to conduct segmentation for different samples and is implemented using the RGB colour space, which is considered an essential vector for an image sample. The dataset is distributed into three parts, namely 1200 samples for training, 300 samples for validation and 300 samples for testing. A validation process is implemented in the set-up of the training process to calculate the mean square error (MSE), which represents the performance function of the ANN model.
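The dataset split and the MSE performance function described above can be sketched as follows; the sample indices are placeholders, not the actual dataset:

```python
import numpy as np

# Split 1800 sample indices into 1200 training, 300 validation and
# 300 test samples, as described above.
rng = np.random.default_rng(1)
idx = rng.permutation(1200 + 300 + 300)
train, val, test = np.split(idx, [1200, 1500])

def mse(targets, outputs):
    """Mean square error, the performance function of the ANN model."""
    t, o = np.asarray(targets, float), np.asarray(outputs, float)
    return float(np.mean((t - o) ** 2))

err = mse([1, -1, 1], [0.9, -0.8, 1.0])   # toy targets/outputs
```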

The training operation for the ANN adopts the back-propagation pattern, which is a suitable way to train the feed-forward neural network model. The training operation is performed on 331,282,971 pixels from 1200 images of the dataset of skin and nonskin pixels. The default training function adopts a Levenberg–Marquardt approach in the back-propagation function (trainlm) [84]. The feed-forward network training aims to create a network object and requires five steps to generate it. In the first step, an array of input vectors is created to implement the segment adjacent-nested (SAN) technique. In the second step, the probability elements are identified as skin and nonskin indexes by creating an array that includes an output sample considered to be a target vector. In the third step, an array is created to identify the hidden layer size in the network. In the fourth step, the cell array, including the names of the transfer functions used in the two layers, is determined.

The default transfer functions are used for the hidden and output layers so that the network has only three layers. The hyperbolic tangent sigmoid (tansig) is utilised for the hidden layer, whilst the linear transfer function (purelin) is employed for the output layer. In the fifth step, the name of the training function to be used is included [84]. As for activation functions, the input layer has none, whilst the hidden and output layers use nonlinear and linear activation functions, respectively. In the current study, the back-propagation feed-forward neural network is utilised on the basis of the reasonable results obtained by this method in previous studies [26, 85,86,87]. The RGB colour space is used only during the training of the ANN model. Table 1 describes the flow and characteristics of the neural part of the skin detector.
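A minimal sketch of the resulting 9-4-1 forward pass, with tansig in the hidden layer and purelin in the output layer; the weights here are random stand-ins for the trained values, so the output is only structurally meaningful:

```python
import numpy as np

rng = np.random.default_rng(0)

# 9-4-1 architecture: 9 input neurons, 4 hidden neurons, 1 output neuron.
W1, b1 = rng.standard_normal((4, 9)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal(1)

def forward(x):
    hidden = np.tanh(W1 @ x + b1)   # tansig (hyperbolic tangent sigmoid)
    return W2 @ hidden + b2         # purelin (linear) output

score = forward(rng.standard_normal(9))   # scalar skin/nonskin score
```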

Table 1 List of parameters that were used for the three layers of the ANN-based skin detector

The training stage often takes a long time because of the very large amount of calculation required on the data. The maximum training time is 8 h, whilst the minimum is 01:13:58 h, based on trial and error.

In this process, the weight matrices between the input and the hidden and output layers are initialised with random values. After repeatedly presenting the features of the input samples and the desired targets, the network compares its output with the desired outcome, followed by error measurement and weight adjustment until the correct output for every input is attained. Furthermore, the hidden layer neurons are estimated using an activation function that features the hyperbolic tangent sigmoid transfer function, whereas the output layer neuron is estimated using an activation function that features the linear transfer function. The training algorithm used is Levenberg–Marquardt back-propagation. To train the neural network for skin detection, a segment is extracted and entered as training input data into the ANN. The quality of the training sets that enter the network determines how well the detector performs. The training phase of the skin detector is illustrated in Fig. 3.

Fig. 3
figure 3

Training phase of the proposed skin detector in ANN

Experimentally, the ANNs with a single hidden layer and with the parameters given in Table 1 have yielded high-accuracy results. This conclusion about the design was drawn from the experimental results shown in Table 2, which are graphically depicted in Fig. 4. This figure shows the identification of the aligned trained data on the target, where the imaginary diagonal represents the target of the ideal data regression. The trained data regression is 0.96574, whilst the ideal state is 1. In other words, the results show regression analysis for a reducible ANN between two targets, namely 1 for skin and − 1 for nonskin. The results show that the relationship between the network outputs (Y) and the targets (T) is close and almost perfect; the correlation coefficient R is equal to 0.96574, which is almost an ideal fit. Moreover, from training these pixels, it can be concluded that the training pixels for both the skin and the nonskin, using the segment adjacent-nested technique applied to the back-propagation neural network method, do not have any overlap.

Table 2 List of all of the training experiments used in the ANN-based skin detector
Fig. 4
figure 4

Regression analysis for reducible ANN-based skin detector between two targets, i.e. − 1 for nonskin pixels and 1 for skin pixels

Figure 5 shows the error characteristic, which progressively decreases until it reaches a stable stage. The error rate is calculated using the MSE function, and the error decreases to \(10^{-4}\), as shown in Table 1, over 1200 iterations.

Fig. 5
figure 5

Training error rates using the ANN-based skin detector

The final decision was to use the ANN as part of the proposed multi-agent learning of the skin detection, as mentioned in Table 1. Several separate training experiments and the results of their empirical tuning are given in Table 2. In other words, Table 2 lists all of the trial experiments and the equivalent design for each experiment. These experiments were performed with different parameters, after which the best results were taken. This table shows the performance error minimisation, the linear regression of the targets relative to the outputs and other arguments that highlight the most efficient combination.

Table 2 concludes the following:

  • Increasing the number of iterations is not crucial: the number of iterations can sometimes be increased without any distinct decrease in the performance (error).

  • The three-layer architecture shows a significant improvement over the equivalent two- and four-layer architectures with the same parameters.

  • The training function ‘trainlm’ gives the best results compared with ‘trainrp’ and ‘learngdm’ under the same parameter selection.


2.2.1.1.4 Training operation of the Bayesian model The Bayesian model is an important machine learning algorithm derived from Bayes’ rule and the normalisation of the lookup table (LUT). Its procedure is based on clustering histograms, using 1200 images to calculate the LUT that identifies skin and nonskin pixels. Histogram computation is typically implemented after training and is followed by the normalisation of the LUT, which yields the separate probability distributions. The training process of the Bayesian model is illustrated in Fig. 6.

Fig. 6
figure 6

Training process of the Bayesian model [57]

The probability that a pixel of the adopted colour space (AdaptColorSpace) is a skin pixel is first computed. This probability is denoted by Pskin (AdaptColorSpace) and given by Eq. 1.

$$P_{skin} \left( {AdaptColorSpace} \right) = \frac{{Skin\left[ {AdaptColorSpace} \right]}}{Norm},$$
(1)

where Skin[AdaptColorSpace] refers to the histogram value that matches the colour vector (AdaptColorSpace). The normalisation coefficient Norm is the summation of all grouping histogram values. The normalised LUT value thus gives the probability that a colour matches skin. Skin detection can then be expressed as P(skin|AdaptColorSpace) by following the Bayesian rule, which is given in Eq. 2 [70, 88]:

$$P\left( {Skin\backslash AdaptColorSpace} \right) = \frac{{P\left( {AdaptColorSpace\backslash Skin} \right) \cdot P\left( {Skin} \right)}}{{P\left( {AdaptColorSpace\backslash Skin} \right) \cdot P\left( {Skin} \right) + P\left( {AdaptColorSpace\backslash \sim\,Skin} \right) \cdot P\left( {\sim\,Skin} \right) }},$$
(2)

The likelihoods \(P\left( {AdaptColorSpace\backslash Skin} \right)\) and \(P\left( {AdaptColorSpace\backslash \sim\,Skin} \right)\) are directly calculated for skin and nonskin pixels by using the grouping histogram process. Conversely, the prior probabilities P(skin) and P(~ skin) can be easily computed from the total numbers of skin and nonskin pixels at the training step (Eqs. 3, 4, respectively) [70]:

$${\text{P}}\,\left( {\text{Skin}} \right) \, = \frac{Ts}{Ts + Tn},$$
(3)
$${\text{P}}\,\left( {\sim\,Skin} \right) = \frac{Tn}{Tn + Ts},$$
(4)

where Ts and Tn are the values of skin and nonskin pixels, respectively.

The resulting probability values should be saved in two files after the training process of the dataset is finished. These values can be represented as

$$\begin{aligned} & {\text{Ps}} = {\text{FUN}}\left( {{\text{GH}}\left( {\text{X}} \right)} \right)\;\,{\text{and}} \\ & {\text{Pns}} = {\text{FUN}}\left( {{\text{GH}}\left( {\text{Y}} \right)} \right). \\ \end{aligned}$$
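Eqs. 1, 3 and 4 and the two saved probability tables (Ps, Pns) can be sketched as follows; the quantised colour bins and the toy skin/nonskin pixel clusters are assumptions for illustration, not the 1200-image training set:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative training pixels (assumption): rows are colour vectors
# quantised into a small number of histogram bins per channel.
BINS = 8
skin_px = rng.integers(4, 8, size=(5000, 3))     # toy "skin" cluster
nonskin_px = rng.integers(0, 5, size=(8000, 3))  # toy "nonskin" cluster

def group_histogram(pixels):
    """Grouping histogram over the quantised colour space (the LUT)."""
    hist = np.zeros((BINS, BINS, BINS))
    for r, g, b in pixels:
        hist[r, g, b] += 1
    return hist

skin_hist = group_histogram(skin_px)
nonskin_hist = group_histogram(nonskin_px)

# Eq. 1: divide each histogram by its total count (Norm) so that every
# LUT cell holds the class-conditional colour probability.
Ps = skin_hist / skin_hist.sum()      # saved file of skin probabilities
Pns = nonskin_hist / nonskin_hist.sum()

# Eqs. 3-4: priors from the total numbers of skin (Ts) and nonskin (Tn)
# training pixels.
Ts, Tn = len(skin_px), len(nonskin_px)
P_skin = Ts / (Ts + Tn)
P_nonskin = Tn / (Tn + Ts)
```

Each normalised LUT sums to 1 over the colour space, and the two priors sum to 1 by construction.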

Thus, a multi-agent learning technique has been implemented on the basis of the desired goal, using specific functions to achieve the best results.


2.2.1.1.5 Detection of the skin detector The detection phase begins after the training phase is completed to gather the required data from various image samples by using the proposed technique [54].

This phase evaluates the performance of the multi-agent learning technique adapted using different colour spaces. Figure 7 illustrates the process of segmentation and skin detection.

Fig. 7
figure 7

Segmentation and detection of skin pixels

The detection process depends on the parameters obtained from the ANN and Bayesian training processes. These parameters determine the final stage of the proposed system, which operates on the original image. The system generates a new image that retains only the skin pixels, with a white background corresponding to the nonskin pixels. The original image pixels correspond to the inputs, whose probabilities are stored in the two files (Ps and Pns) produced in the pre-detection phase of the Bayesian model. Thus, these pixels are saved in two variables:

$$\begin{aligned} & nn = {\text{P}}\left( {{\text{AdaptColorSpace}}{\backslash} {\text{skin}}} \right) \to {\text{skin}}\;{\text{pixels}} \\ & ww = {\text{P}}\left( {{\text{AdaptColorSpace}}{\backslash} \sim{\text{skin}}} \right) \to {\text{nonskin}}\;{\text{pixels}} \\ \end{aligned}$$

Given the priors P(ww) and the likelihoods P(AdaptColorSpace\ww) for each class ww ∈ {skin, nonskin}, the posterior P(ww\AdaptColorSpace) is determined by applying the rule of the Bayesian model:

$${\text{IF}}\;\frac{{P\left( {AdaptColorSpace\backslash skin} \right)}}{{P\left( {AdaptColorSpace\backslash \sim\,skin} \right)}} > \frac{{P\left( {\sim\,skin} \right)}}{{P\left( {skin} \right)}},$$
(5)

in which case (AdaptColorSpace) is classified as a skin pixel; otherwise, it is a nonskin pixel.

Thus, the Bayesian rule can be computed at minimum cost as given in [70]:

$$\frac{{P\left( {AdaptColorSpace\backslash skin} \right)}}{{P\left( {AdaptColorSpace\backslash \sim\,skin} \right)}} > \theta \to {\text{AdaptColorSpace}} \in {\text{Skin}},$$
(6)
$$\frac{{P\left( {AdaptColorSpace\backslash skin} \right)}}{{P\left( {AdaptColorSpace\backslash \sim\,skin} \right)}} < \theta \to {\text{AdaptColorSpace}} \in \,\sim\,{\text{Skin}} .$$
(7)

According to Eqs. 6 and 7, the calculation is such that Bayesian (\(pix_{x,y}\)) = nn/ww; therefore, if nn is equal to 0, then nn automatically resets to nn = 0.00000000001, and if ww = 0, then ww automatically resets to ww = 0.00000000001, to avoid division by zero.
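A minimal sketch of the decision rule of Eqs. 6–8, including the 0.00000000001 reset used to guard against zero-valued likelihoods (the function and variable names are illustrative):

```python
EPS = 1e-11  # reset value from the text, guarding against division by zero

def bayes_decide(nn, ww, p_skin, p_nonskin):
    """Likelihood-ratio test of Eqs. 6-8.

    nn = P(AdaptColorSpace | skin), ww = P(AdaptColorSpace | ~skin);
    a pixel is skin when nn / ww exceeds theta = P(~skin) / P(skin).
    """
    nn = nn if nn > 0 else EPS
    ww = ww if ww > 0 else EPS
    theta = p_nonskin / p_skin          # Eq. 8
    return "skin" if nn / ww > theta else "nonskin"

# With equal priors, theta = 1: the larger likelihood wins.
print(bayes_decide(0.004, 0.001, 0.5, 0.5))   # nn/ww = 4 > 1
print(bayes_decide(0.0, 0.002, 0.5, 0.5))     # nn reset to EPS
```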

The threshold variable (θ) is defined in Eq. 8 as follows:

$$\theta = \frac{{P\left( {\sim\,skin} \right)}}{{P\left( {skin} \right)}}$$
(8)

The outcome of the multi-agent technique is applied to the pixels of image I with the threshold values given in Eq. 9; the same conditions are applied in both models.

For the neural network:

$$N1 \, (pix_{x,y} ) \, = \left\{ {\begin{array}{*{20}l} {\sim \,skin} \hfill & { {\text{if}}\;N1\left( {pix_{x,y} } \right) < 0} \hfill \\ {skin} \hfill & {\text{elsewhere}} \hfill \\ \end{array} } \right.$$

For Bayesian:

$$\begin{aligned} & {\text{Bayesian}}\;(pix_{x,y} ) = \left\{ {\begin{array}{*{20}l} {\sim \,skin} \hfill & { {\text{if}}\;{\text{Bayesian}} \left( {pix_{x,y} } \right) < \theta } \hfill \\ {skin} \hfill & {\text{elsewhere}} \hfill \\ \end{array} } \right. \\ & \quad \forall Pix_{x,y , } N1\left( {Pix_{x,y} } \right) < 0 \; {\text{OR}} \; {\text{Bayesian}} \left( {Pix_{x,y} } \right) < \theta \to Pix_{x,y } \in \,\sim\,skin \\ \end{aligned}$$
(9)

where N1(\(Pix_{x,y}\)) is the second parameter for function FUN (I, NI, Ps, Pns). Bayesian (\(Pix_{x,y}\)) is collected from the previous steps. The thresholds are considered on the basis of two aspects. Firstly, (N1 (\(Pix_{x,y}\)) < 0) represents the neural part, where 0 is selected based on Fig. 4. It was shown that the range of training for nonskin pixels was lower than zero, and the range of training for skin pixels was greater than zero. From these points, the boundary between the skin and nonskin pixels is considered 0. Secondly, the procedure of the Bayesian model is represented in Bayesian (\(pix_{x,y}\)) < θ. Nine threshold values (θ), namely 0.5, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9 and 0.95, are selected based on the study by Zaidan et al. [54].

Accordingly, nine experiments for each colour space (as mentioned in Sect. 2.2.1.1.2) are implemented, giving 108 algorithms in total. In summary, the conditions are applied to the pixels; if a condition is satisfied, the pixel is nonskin and [255 255 255] is assigned to denote a white pixel. Otherwise, [R (y, x), G (y, x), B (y, x)] is returned from the original image I. The pixels collected under the set condition are skin pixels; thus, the algorithms collect skin pixels in accordance with the conditions.
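The masking step can be sketched as follows, assuming a Boolean skin mask has already been produced by the conditions above (the image and mask here are toy values):

```python
import numpy as np

def apply_skin_mask(img, skin_mask):
    """Keep skin pixels from the original image I and paint nonskin
    pixels white ([255, 255, 255]), as in the detection step."""
    out = img.copy()
    out[~skin_mask] = [255, 255, 255]
    return out

# Toy 2x2 RGB image; only the top-left pixel is flagged as skin.
img = np.array([[[200, 120, 100], [10, 200, 30]],
                [[0, 0, 255], [40, 40, 40]]], dtype=np.uint8)
mask = np.array([[True, False], [False, False]])
result = apply_skin_mask(img, mask)
```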

2.2.1.2 Crossing between the developed skin detector and three main groups of criteria

Table 3 shows the decision matrix established based on the crossover between the 108 algorithms and the three main groups of criteria collected from the literature. The procedure highlights the effect of the different criteria on the various colour spaces. The multiple criteria thus obtained constitute the basic elements of the established decision matrix.

Table 3 Establishment of the decision matrix

In summary, the crossover between multiple criteria and adopted colour spaces creates the decision matrix, which is used to generate the final results in the next phase.

Table 3 shows the three main groups of criteria with 108 algorithms used as alternatives in the proposed decision matrix. The procedures for each criterion are discussed in detail as follows.


2.2.1.2.1 Computation of reliability group elements The reliability group involves three basic sections, namely parameter matrix, relationship and behaviour. These sections underpin the evaluation of the skin detectors in our study [32, 56].

The procedure for each sub-criterion within a reliability group is explained in detail.

The parameter matrix comprises four key parameters, namely TP, TN, FN and FP, which are the backbone for computing the remaining criteria in the reliability group [89,90,91].

These parameters introduce the results obtained by a previously implemented multi-agent learning technique. A certain procedure is performed to calculate the basic parameters and their integral values based on the matching process. Figure 8 presents additional details about the matching process for a sample of the predicted parameters and the actual parameters used to create the confusion matrix.

Fig. 8
figure 8

Matching process for different objects

The object locations of each image are matched to calculate the image’s skin pixels and to compute the locations of pairs of images according to their object location. For example, objects 1 to 6 in Fig. 8 represent the base of both images. The number of pixels in the first object is compared with that of the second object by a pointer, which matches each pixel of the actual parameters with every pixel of the predicted parameters. This pointer computes the skin pixels of the actual parameters by using only the predicted parameters, which serve as a standard measure for the number of pixels of the predicted parameters (computed as TP). The differences between the standard and calculated pixels are counted as FN. The remaining pixels are computed for the other objects; for example, the average skin-pixel value is computed to yield the final TP for the entire image, and the average FN is taken as the definitive FN. Furthermore, the TN, which represents the background of the image, is computed from the nonskin pixels, whereas FP is the complement of TN. Thus, TN is computed from the values of the nonskin pixel objects of the predicted parameters, whereas FP is computed from the differences between the standard and calculated values.
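A simplified pixelwise version of the matching process, computing TP, TN, FP and FN from a predicted mask and a ground-truth mask (the object-by-object averaging described above is omitted for brevity):

```python
import numpy as np

def confusion_parameters(predicted, actual):
    """TP/TN/FP/FN from pixelwise matching of a predicted skin mask
    against the ground-truth mask (True = skin)."""
    tp = int(np.sum(predicted & actual))      # skin predicted as skin
    tn = int(np.sum(~predicted & ~actual))    # background kept as background
    fp = int(np.sum(predicted & ~actual))     # background flagged as skin
    fn = int(np.sum(~predicted & actual))     # skin missed
    return tp, tn, fp, fn

pred = np.array([[True, True], [False, False]])
gt   = np.array([[True, False], [False, True]])
print(confusion_parameters(pred, gt))  # (1, 1, 1, 1)
```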

These parameters are then calculated with the different threshold values for each colour space to attain the final result of the decision matrix. As such, the values of the 108 algorithms are calculated separately by conducting individual experiments to generate the final parameter values for the decision matrix.


2.2.1.2.2 Computation of time complexity criterion The time complexity criterion is important in this study. The procedure and methodology for calculating time depend on the time consumption of the output and input sample images. The flowchart (Fig. 9) shows the process of computing time complexity [20, 92].

Fig. 9
figure 9

Procedure of time complexity

The image calculation processes rely on the number and size of the image samples.

$$T_{\text{process}} = T_{\text{o}} - T_{\text{i}} ,$$
(10)

where To is the output time image process and Ti is the input time image process.

$$T_{\text{total}} = \frac{{T_{\text{process}} }}{{T_{\text{average}} }},$$
(11)

where \(T_{\text{process}}\) represents the difference between the output and input image samples and \(T_{\text{average}}\) denotes the average processing time for all samples. These equations cover the computation time of different image sizes and various skin objects in the self-same image size.
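Eqs. 10 and 11 can be sketched as follows; the workload and sample inputs are placeholders for the actual skin-detection pass:

```python
import time

def timed_process(process, images):
    """Eq. 10: T_process = T_o - T_i for each sample; Eq. 11 then
    normalises by the average processing time over all samples."""
    t_process = []
    for img in images:
        t_i = time.perf_counter()          # input time
        process(img)
        t_o = time.perf_counter()          # output time
        t_process.append(t_o - t_i)
    t_average = sum(t_process) / len(t_process)
    t_total = [t / t_average for t in t_process]
    return t_process, t_total

# Dummy workload standing in for a skin-detection pass over three samples.
t_proc, t_tot = timed_process(lambda img: sum(img),
                              [list(range(n)) for n in (1000, 2000, 3000)])
```

Because each entry of `t_total` is a processing time divided by the mean processing time, the entries average to exactly 1.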


2.2.1.2.3 Error rate computation within dataset elements According to the evaluation in several studies, the error rate within the dataset is a major property of machine learning algorithms. Most AI models use special mechanisms in these algorithms to produce results through the training and testing stages. Training is a key step; therefore, the dataset is trained several times to obtain a minimum error rate by selecting a particular dataset split, including training and validation. The reliability values on the testing data are then used to obtain the final outcomes. The training process is performed individually for the 108 algorithms [90, 92].

2.2.1.3 Evaluation and testing of developed skin detector according to three criteria groups

The evaluation and testing of the decision matrix depend on calculating the criteria procedure for the 108 algorithms. This calculation is performed over nine experiments according to the threshold values for data collection, thereby providing the final results of the decision matrix.

2.2.2 Performance of decision matrix

This step is the second part of the identification and performance phase and is conducted along two lines. Firstly, the relationship between the criteria is investigated by calculating their correlation, in order to determine whether all of the criteria need to be used in the decision matrix. Secondly, performance analysis is conducted to evaluate and compare the criteria and identify the factors that affect their behaviour, in order to determine whether all of the criteria should be used in the evaluation as multidimensional measurements.

2.2.2.1 Correlation between criteria

This step mainly aims to investigate the relationship between the criteria and determine their degree of correlation. The case study includes three main groups of criteria that have interconnected physical characteristics. Therefore, the relationship between these criteria must be proven. Software and techniques based on mathematical and statistical methods exist to prove the relationships between variables. The Pearson method is adopted to find the correlation between the various criteria used in our study [59, 60, 65].

Thus, Pearson’s r is computed as follows:

$$r = \frac{{\mathop \sum \nolimits_{i = 1}^{n} (X_{i} - \bar{X})(Y_{i} - \bar{Y})}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{n} (X_{i} - \bar{X})^{2} \mathop \sum \nolimits_{i = 1}^{n} (Y_{i} - \bar{Y})^{2} } }},$$
(12)
$$\bar{X} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} X_{i} ,$$
(13)
$$\bar{Y} = \frac{1}{n} \mathop \sum \limits_{i = 1}^{n} Y_{i} ,$$
(14)

where n is the number of input values, X and Y are the two criteria, \(\bar{X}\) and \(\bar{Y}\) are their mean values and r is the correlation coefficient.

r ranges from −1 to 1. An r near 0.00 indicates uncorrelated criteria, whilst an |r| near or equal to 1 indicates a high correlation level [66, 67].
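Eqs. 12–14 can be implemented directly; the sample series below are illustrative:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of Eqs. 12-14."""
    n = len(x)
    x_bar = sum(x) / n                 # Eq. 13
    y_bar = sum(y) / n                 # Eq. 14
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) *
                    sum((yi - y_bar) ** 2 for yi in y))
    return num / den                   # Eq. 12

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))    # perfectly correlated
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))    # perfectly anti-correlated
```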

Therefore, Pearson’s formula is applied to compute the correlation coefficient between different criteria. Multiple criteria are evaluated and tested on the basis of their features. In this study, various criteria that influence one another are independently collected. Therefore, determining the relationship between criteria and verifying the degree of correlation between them are imperative. Figure 10 shows the taxonomy of the criterion distribution in the three main layers.

Fig. 10
figure 10

Taxonomy of criteria distribution in the three layers

Figure 10 illustrates the taxonomy constructed from the three layers, which comprise the three major sets of criteria in our study. The first layer includes the reliability criterion (R), time complexity criterion (Tc) and error rate within the dataset criterion (ER) groups. The second layer consists of three key sections, namely parameter matrix, relationship and behaviour, which are derived from the reliability criterion; the validation and training criteria are derived from the error rate criterion. The third layer comprises 10 criteria. Four of them constitute the confusion matrix, namely true positive (TP), true negative (TN), false positive (FP) and false negative (FN), which are derived from the parameter matrix. Four others are accuracy (ACC), precision (PR), recall (RE) and specificity (SP), which are derived from the relationship of the parameters. The final two are F-measure (F) and G-measure (G), which are derived from the behaviour of the parameters.
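The third-layer criteria can be sketched from the four confusion-matrix parameters; the G-measure is taken here as the geometric mean of precision and recall, one common definition, and the counts are illustrative:

```python
import math

def reliability_metrics(tp, tn, fp, fn):
    """Third-layer criteria derived from the confusion matrix: accuracy,
    precision, recall, specificity, F-measure and G-measure."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    pr = tp / (tp + fp)
    re = tp / (tp + fn)
    sp = tn / (tn + fp)
    f = 2 * pr * re / (pr + re)        # harmonic mean of PR and RE
    g = math.sqrt(pr * re)             # geometric mean of PR and RE
    return {"ACC": acc, "PR": pr, "RE": re, "SP": sp, "F": f, "G": g}

m = reliability_metrics(tp=80, tn=90, fp=10, fn=20)
```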

2.2.2.2 Performance analysis of criteria

To determine whether all of the criteria should be used in the evaluation process as multidimensional measurements, performance analysis is conducted to compare the criteria and identify the factors that affect their behaviour. In practice, all of the criteria resulting from Sect. 2.2.2.1 should be evaluated to determine their behaviour on the basis of the 108 algorithms. To show the differences in criteria behaviour under different scenarios, the threshold values for each criterion are distributed over the 12 colour spaces used. Thus, the performance analysis of the criteria is implemented for the three main groups of criteria used in this study: the reliability group, the time complexity group and the error rate within the dataset group.

2.3 Development phase

The development phase involves designing a new methodology based on MCDM techniques. Many MCDM methods that use different concepts, such as simple additive weighting (SAW), weighted sum model (WSM), weighted product method (WPM), multiplicative exponential weighting (MEW), multi-objective optimisation methods, hierarchical adaptive weighting (HAW), analytic network process (ANP), AHP and TOPSIS [93,94,95,96,97,98,99,100,101,102], have been investigated. None of these methods has been used to evaluate and benchmark real-time IoT skin detection approaches. Figure 11 summarises the techniques, drawbacks and recommendations of popular MCDM techniques in accordance with previous studies [32, 37, 56, 88, 95, 103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120].

Fig. 11
figure 11

Advantages and limitations of MCDM methods

Ultimately, a new research direction in the application of MCDM techniques has emerged: combining two or more techniques to avoid the deficiencies of relying on a single one [121,122,123]. The integration of TOPSIS and AHP has become widely accepted within MCDM techniques for the following reasons: calculation of the relative distance by using weights and objective data; presentation of results as a full ranking; provision of appropriate random analysis; and prevention of trade-offs through the use of nonlinear relations, which enables conversion into a programmatic procedure [124,125,126]. Many studies have ranked alternatives and arranged problems in this way [21, 125, 127,128,129]. According to the literature, two or more techniques are commonly integrated to address problems in many fields, but not in skin detection applications. Therefore, a methodological approach must be identified to resolve this gap.

In summary, integrating the AHP to identify each weight for all criteria based on experts’ preferences with TOPSIS is recommended to present the total ranking of alternatives for real-time IoT skin detection approaches. Accordingly, the proposed methodology relies on the integration between TOPSIS and AHP in ranking and selecting the best alternatives. This method is based on the values of the decision matrix created in the previous phase [98]. The implementation of this phase is discussed in the next paragraphs. Figure 12 describes the new methodology for evaluating and benchmarking real-time IoT skin detection approaches.

Fig. 12
figure 12

New methodology for evaluating and benchmarking real-time IoT skin detector

2.3.1 Developing a decision-making solution for the skin detection approach based on integrated ML-AHP and TOPSIS

The integration of ML-AHP and TOPSIS is widely accepted by many researchers for the following reasons. Firstly, these methods can present complete ranking results and calculate the relative distance based on weights and objective data. Secondly, their results are satisfactory for random analyses, and they prevent trade-offs by using nonlinear relationships, thereby allowing easy conversion into a programmable format. Thus, in our study, ML-AHP and TOPSIS are integrated to rank the numerous colour space algorithms for real-time IoT skin detection approaches.

A total of 13 weight settings that represent the three key groups of criteria under different circumstances are used in the first part. In this step, the weights are assigned according to external evaluator preferences. Thus, the weights can be measured through AHP in a pairwise form, and the outcome of this technique is subsequently used in the TOPSIS method. Different colour spaces are developed as alternatives in the decision matrix. These alternatives must be ranked to configure the selection of the best alternative in the second part. The TOPSIS method adopts the decision matrix to provide the final results.

2.3.2 Adoption of ML-AHP to investigate the weights of different evaluators

Skin detection approaches are developed to achieve certain objectives (e.g. skin and nonskin detection in images). For each evaluation criterion, developers can assign a weight according to these objectives. The definition of weights depends on the priority preference in MCDM with interval numerals [119, 123, 130]. In skin detection evaluation and benchmarking, assigning weights is difficult, whereas in everyday decision problems, experts can help assign them. Differences in expert opinions on the importance of skin detection criteria increase this difficulty and may conflict with the designer’s objective. Weights can be assigned in multiple ways; here, the ML-AHP algorithm is used to conduct pairwise comparisons amongst the criteria and derive the weights [131]. Because the solutions to the aforementioned problems depend on experts, a conflict can arise between the preferences of experts and designers. A total of 13 different weight settings are therefore selected and assigned in the decision-making process to solve this problem. Thus, our research uses a pairwise technique (i.e. the ML-AHP method). Figure 13 shows the AHP method based on multiple layers.

Fig. 13
figure 13

Weights of criteria in a multi-layer structure

2.3.2.1 Pairwise comparisons for each criterion

AHP has been proposed as a technique to obtain ratio scales from paired comparisons and is now a widely known MCDM method [117, 132].

ML-AHP allows a few inconsistencies in judgement because of human imprecision. The ratio scales are derived from principal eigenvectors, and the consistency index is derived from the principal eigenvalue. The following equation represents the required number of pairwise comparisons:

$$n*\left( {n - 1} \right)/2,$$
(15)

where n represents the number of attributes used in the evaluation. In comparing a set of n criteria in pairs, the comparison is based on their relative weights. The criteria and weights can be represented as \(C_{1} , \ldots , C_{n}\) and \(w_{1} , \ldots , w_{n}\), respectively. This comparison can be represented in matrix form as follows:
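The pairwise comparison matrix referred to here is, in the fully consistent case, the standard reciprocal matrix of weight ratios, which can be written as:

```latex
A = \begin{bmatrix}
w_{1}/w_{1} & w_{1}/w_{2} & \cdots & w_{1}/w_{n} \\
w_{2}/w_{1} & w_{2}/w_{2} & \cdots & w_{2}/w_{n} \\
\vdots      & \vdots      & \ddots & \vdots      \\
w_{n}/w_{1} & w_{n}/w_{2} & \cdots & w_{n}/w_{n}
\end{bmatrix}
```

Each entry \(a_{ij} = w_i/w_j\) satisfies \(a_{ji} = 1/a_{ij}\), which is why only \(n(n-1)/2\) comparisons are needed.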

The matrix uses pairwise ratios: each row provides the ratio of the weight of one element with respect to every other element. This method focuses on extracting the weights of various activities based on importance. Typically, importance is a judgement based on different criteria; occasionally, these criteria correspond to objectives selected by the activities under investigation [132].

Table 4 presents a comparison matrix for priority rating, which comprises three pairwise elements in the decision matrix.

Table 4 Sample pairwise comparison matrix

In this study, six experts who are from various universities in Malaysia and have sufficient background in image processing involving AI methods are invited to participate as evaluators. They are asked to compare different criteria based on the priorities identified in accordance with their perspectives. Their responses are gathered, adopted and attached with the degree of consistency according to the rules of hierarchy theory to calculate the weights. Figure 14 shows the formula of the questionnaire presented to the evaluators.

Fig. 14
figure 14

Pairwise answer from evaluators

Reference [133] proposed a new scale to calculate the degree of importance amongst different criteria. The scale uses the difference between successive scale values to compare the criteria within the scale values that range from 1 to 9. Table 5 shows an initial step towards the construction of the intensity scale of importance for activities [43].

Table 5 Intensity scale of criteria
2.3.2.2 Design of ML-AHP measurement structure

In this step, our work includes multiple layers to distribute the criteria. The ML-AHP measurement matrix is implemented to attain the weights based on the preferences of the evaluator. ML-AHP measurements use mathematical calculations that are based on pairwise comparisons to convert the experts’ judgements into weights for each criterion. A CR must also be calculated for judgements that represent the internal consistency values entered. Thereafter, the answers of the evaluators with pairwise comparisons are collected, and the reciprocal matrix is created. The reciprocal matrix provides the sub-criteria values for each level’s main criteria and identifies the importance of each feature compared with its parent. Thus, the obtained features of the main criteria represent the importance of each feature in relation to the goal. Figure 15 shows the weights computed through ML-AHP measurement based on different evaluators.

Fig. 15
figure 15

Steps of ML-AHP used to compute the multi-layer matrix

2.3.2.3 Calculating the weights of criteria and checking the consistency value

Various responses gathered from different evaluators must be converted into numerical values in the decision matrix, to which procedures such as normalisation and aggregation are applied. In the next step, the weights of the criteria are determined and ranked. The ML-AHP measurement also relies on the priority vector to conduct a consistency test, which is normally required after the criteria weights are fully calculated. Inconsistency is often observed in individual evaluators’ answers to the ML-AHP questionnaire, thereby affecting the overall consistency of the test. Hence, the CR must be tested before all responses are collected from the evaluators [134].

Finally, CR can be measured to determine the consistency of the pairwise comparison. This procedure is called a consistency index. A CR larger than 0.10 indicates inconsistency in the pairwise comparison, whereas a CR equal to or less than 0.10 indicates a reasonable comparison [98, 135].

The following formula calculates the CR:

$${\text{CR}} = {\text{CI}}/{\text{RI}}$$
(16)

CI represents the consistency index, which is derived from the following:

$${\text{CI}} = \left( {\lambda_{\hbox{max} } {-}n} \right)/\left( {n{-}1} \right)$$
(17)

where \(\lambda_{\hbox{max} }\) is the principal eigenvalue of the pairwise comparison matrix and n is the number of criteria. RI can be obtained from Table 6 as follows.

Table 6 Random index (RI) [117]
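The weight derivation and consistency check of Eqs. 15–17 can be sketched as follows; the judgement matrix is illustrative, and the RI values are the standard random-index table assumed to match Table 6:

```python
import numpy as np

# Standard random index (RI) values for n = 1..10 (assumed to match
# Table 6 in the text).
RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
      6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45, 10: 1.49}

def ahp_weights(A):
    """Principal-eigenvector weights and consistency ratio (Eqs. 16-17)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    eigvals, eigvecs = np.linalg.eig(A)
    k = np.argmax(eigvals.real)            # Perron (principal) eigenvalue
    lam_max = eigvals[k].real
    w = np.abs(eigvecs[:, k].real)
    w = w / w.sum()                        # normalised priority vector
    ci = (lam_max - n) / (n - 1)           # Eq. 17
    cr = ci / RI[n] if RI[n] > 0 else 0.0  # Eq. 16
    return w, cr

# Illustrative judgement matrix over the three criteria groups.
A = [[1,     3,     5],
     [1 / 3, 1,     3],
     [1 / 5, 1 / 3, 1]]
w, cr = ahp_weights(A)
```

For this matrix the CR is well below the 0.10 threshold, so the comparison would be accepted as consistent.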

2.3.3 Utilisation of TOPSIS for evaluation and benchmarking of real-time IoT skin detection approaches

TOPSIS, which is favoured amongst MCDM techniques, is utilised in this procedure, which involves several steps [136, 137]. TOPSIS is applied to all alternatives based on the geometric distance from the positive and negative ideal solutions [138]. Thus, the most suitable alternative is ‘the alternative with the longest geometric distance to the negative ideal solution and the shortest geometric distance to the positive ideal solution’ [97, 139].

According to [126], TOPSIS consists of six steps: (1) constructing the normalised decision matrix, (2) constructing the weighted normalised decision matrix, (3) identifying the ideal and negative ideal solutions, (4) separating measurement calculations based on the Euclidean distance, (5) identifying closeness to the ideal solution calculation and (6) ranking the alternatives based on the closeness to the ideal solution. Each step is explained in detail as follows.

Step 1

In this step, the different criterion dimensions are transformed into nondimensional criteria, permitting comparison across attributes. The decision matrix \(X = \left( {x_{ij} } \right)_{m*n}\) is then normalised to \(R = \left( {r_{ij} } \right)_{m*n}\) by the following normalisation method.

$$r_{ij} = x_{ij} /\sqrt {\mathop \sum \limits_{i = 1}^{m} x_{ij}^{2} }$$
(18)

This step results in a new matrix R as follows:

$$R = \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {r_{11} } & {r_{12} } \\ {r_{21} } & {r_{22} } \\ \end{array} } & {\begin{array}{*{20}c} \ldots & {r_{1n} } \\ \ldots & {r_{2n} } \\ \end{array} } \\ {\begin{array}{*{20}c} \vdots & \vdots \\ {r_{m1} } & {r_{m2} } \\ \end{array} } & {\begin{array}{*{20}c} \vdots & \vdots \\ \ldots & {r_{mn} } \\ \end{array} } \\ \end{array} } \right].$$
(19)

Step 2

Let \(w = (w_{1} , w_{2} , w_{3 } , \ldots ,w_{j} , \ldots , w_{n} )\) represent the group of weights from the decision-maker, which is applied to the normalised decision matrix. The new matrix is calculated by multiplying every column of the normalised matrix (R) by the related weight \(w_{j}\). Notably, the summation of the weights should be equal to 1, which is expressed as follows:

$$\mathop \sum \limits_{j = 1}^{n} w_{j} = 1.$$
(20)

This step results in a new matrix V, which is expressed as follows:

$$V = \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {v_{11} } & {v_{12} } \\ {v_{21} } & {v_{22} } \\ \end{array} } & {\begin{array}{*{20}c} \ldots & {v_{1n} } \\ \ldots & {v_{2n} } \\ \end{array} } \\ {\begin{array}{*{20}c} \vdots & \vdots \\ {v_{m1} } & {v_{m2} } \\ \end{array} } & {\begin{array}{*{20}c} \vdots & \vdots \\ \ldots & {v_{mn} } \\ \end{array} } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {w_{1} r_{11} } & {w_{2} r_{12} } \\ {w_{1} r_{21} } & {w_{2} r_{22} } \\ \end{array} } & {\begin{array}{*{20}c} \ldots & {w_{n} r_{1n} } \\ \ldots & {w_{n} r_{2n} } \\ \end{array} } \\ {\begin{array}{*{20}c} \vdots & \vdots \\ {w_{1} r_{m1} } & {w_{2} r_{m2} } \\ \end{array} } & {\begin{array}{*{20}c} \vdots & \vdots \\ \ldots & {w_{n} r_{mn} } \\ \end{array} } \\ \end{array} } \right].$$
(21)

Step 3

The definition of the two artificial alternatives, namely \(A^{*}\) and \(A^{ - }\), is explained in detail. \(A^{*}\) represents the ‘ideal alternative’, and \(A^{ - }\) represents the ‘negative ideal alternative’.

$$A^{*} = \left\{ {\left( {\left( {\mathop {\hbox{max} }\limits_{i} v_{ij} |j \in J} \right), \left( {\mathop {\hbox{min} }\limits_{i} v_{ij} |j \in J^{ - } } \right)|i = 1,2, \ldots ,m} \right)} \right\}$$
(22)
$$\quad = \left\{ {v_{1}^{*} , v_{2}^{*} , \ldots , v_{j}^{*} , \ldots , v_{n}^{*} } \right\}$$
(23)
$$A^{ - } = \left\{ {\left( {\left( {\mathop {\hbox{min} }\limits_{i} v_{ij} |j \in J} \right), \left( {\mathop {\hbox{max} }\limits_{i} v_{ij} |j \in J^{ - } } \right)|i = 1,2, \ldots ,m} \right)} \right\}$$
(24)
$$\quad = \left\{ {v_{1}^{ - } , v_{2}^{ - } , \ldots , v_{j}^{ - } , \ldots v_{n}^{ - } } \right\}$$
(25)

where \(J\) is the set of benefit attributes and a subset of \(\left\{ {j = 1,2, \ldots ,n} \right\}\), whereas \(J^{ - }\) is the set of cost attributes, the complement set of \(J\) \(\left( {J^{c} } \right)\).
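
Step 3 reduces to a column-wise max/min selection. In the sketch below, the weighted matrix V and the choice of which criteria are benefit (J) and which are cost (J⁻) are assumptions for illustration:

```python
# Step 3 of TOPSIS: ideal (A*) and negative ideal (A-) alternatives.
V = [
    [0.230, 0.183, 0.104],
    [0.285, 0.111, 0.128],
    [0.340, 0.210, 0.114],
]
benefit = {0, 1}   # J: benefit criteria (higher is better); the rest form J-

A_star, A_neg = [], []
for j in range(len(V[0])):
    column = [row[j] for row in V]
    if j in benefit:
        A_star.append(max(column))   # Eq. (22): max over benefit criteria
        A_neg.append(min(column))    # Eq. (24): min over benefit criteria
    else:
        A_star.append(min(column))   # Eq. (22): min over cost criteria
        A_neg.append(max(column))    # Eq. (24): max over cost criteria
```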

Step 4

In this step, the separation measures are calculated by computing the distance between every alternative in V and the ideal vector \(A^{*}\) on the basis of the Euclidean distance as follows:

$$S_{{i^{*} }} = \sqrt {\mathop \sum \limits_{j = 1}^{n} \left( {v_{ij} - v_{j}^{*} } \right)^{2} } , \quad i = \left( {1,2, \ldots m} \right).$$
(26)

The separation of every alternative in V from the negative ideal \(A^{ - }\) is computed as follows:

$$S_{{i^{ - } }} = \sqrt {\mathop \sum \limits_{j = 1}^{n} \left( {v_{ij} - v_{j}^{ - } } \right)^{2} } , \quad i = \left( {1,2, \ldots m} \right).$$
(27)

This step yields two separation measures (\(S_{{i^{*} }}\) and \(S_{{i^{ - } }}\)) that represent the distance of every alternative from the ideal and negative ideal alternatives, respectively.

Step 5

In this process, the closeness of \(A_{i}\) to the ideal solution \(A^{*}\) can be calculated as follows:

$$C_{{i^{*} }} = S_{{i^{ - } }} /\left( {S_{{i^{ - } }} + S_{{i^{*} }} } \right),\quad 0 \le C_{{i^{*} }} \le 1, i = \left( {1,2, \ldots m} \right).$$
(28)

\(C_{{i^{*} }} = 1\) if and only if (\(A_{i} = A^{*}\)). Similarly, \(C_{{i^{*} }} = 0\) if and only if (\(A_{i} = A^{ - }\)).

Step 6

Finally, the set of alternatives \(A_{i}\) can be prioritised on the basis of \(C_{{i^{*} }}\), where the lowest value indicates the worst performance and high values indicate remarkable performance.
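
Steps 4–6 can be sketched together in plain Python; the weighted matrix and the ideal/negative ideal vectors below are illustrative values, not the study's results:

```python
import math

# Steps 4-6 of TOPSIS: separation measures, closeness coefficients, ranking.
V = [
    [0.230, 0.183, 0.104],
    [0.285, 0.111, 0.128],
    [0.340, 0.210, 0.114],
]
A_star = [0.340, 0.210, 0.104]   # ideal alternative
A_neg  = [0.230, 0.111, 0.128]   # negative ideal alternative

def euclidean(a, b):
    # Eqs. (26)-(27): Euclidean distance between an alternative and a reference
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

S_star = [euclidean(row, A_star) for row in V]
S_neg  = [euclidean(row, A_neg) for row in V]

# Eq. (28): closeness coefficient, 0 <= C_i <= 1
C = [sn / (sn + ss) for sn, ss in zip(S_neg, S_star)]

# Step 6: rank alternatives by descending closeness coefficient
ranking = sorted(range(len(C)), key=lambda i: C[i], reverse=True)
```

An alternative equal to \(A^{*}\) obtains C = 1 and ranks first; one equal to \(A^{-}\) obtains C = 0 and ranks last.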

2.3.3.1 Decision-making context

Two fundamental decision-making contexts are distinguished on the basis of individual and group decision-makers (GDMs). Group decision-making arises when individuals collectively select amongst given alternatives. The resulting decision can no longer be attributed to any individual member of the group, because all individuals, through social processes such as social influence, contribute to the outcome. GDM techniques systematically gather and integrate the judgements of experts from various areas [112, 140,141,142]. Thus, groups of experts or evaluators subjectively provide their judgements and weights for the criteria. Figure 16 shows the group decision-making process, where Ex1, Ex2, …, Exn are the experts who provide their judgements and weights for the criteria.

Fig. 16

GDM process

For example, when the problems of real-time IoT skin detection evaluation and benchmarking are considered in the context of group decision-making, reliability (C1), time complexity (C2) and error rate within the dataset (C3) are subjectively measured by the evaluators (Fig. 17).

Fig. 17

Individual decision-maker process

Two different settings in the original context of decision-making, which fundamentally depend on the problem itself, are considered. The first setting is individual decision-making, where a single decision-maker provides subjective judgements and weights for all criteria. The second is group decision-making, where decision-makers, as a group, provide their subjective judgements and their own weights for every criterion. In this study, all the data for the evaluation and benchmarking of real-time IoT skin detection approaches are obtained objectively (as numerical values). Thus, the context of the decision-making process must be considered.

3 Results

This section presents the results in two stages, namely the analysis and comparison of multi-criteria and the benchmarking of real-time IoT skin detection approaches (Fig. 18). In the first stage, the results of the proposed decision matrix, the correlation coefficients and the performance analysis of the criteria are presented. In the second stage, the results of the multi-layer weight measurement, TOPSIS performance based on different evaluators’ weights, and group TOPSIS with internal and external aggregations are presented.

Fig. 18

Structure of the result

3.1 Multi-criteria analysis and comparison

In this section, the results of the proposed decision matrix, the correlation between criteria and the performance analysis of the criteria are presented in the following subsections.

3.1.1 Results of the proposed decision matrix

In this section, the results obtained from the development of the case study are presented. A total of 14 colour spaces were developed on the basis of a multi-agent technique using two artificial intelligence models. The development process generated the parameters whose values are fundamental for calculating the reliability group values. The other basic criteria values were calculated according to their respective methodologies. Thus, the multiple criteria values, together with the skin detection engine values, produced the decision matrix that represents the dataset in our study.

The decision matrix was constructed using the values of the 13 criteria and the 14 colour spaces. The values for each colour space were calculated individually for each of the nine threshold values, so each colour space yields nine sets of criteria values, and new values are obtained for every additional colour space implemented. Although 14 colour spaces were adapted, the final number appearing in the decision matrix is 12: the HSI, HSL and HSV colour spaces share the same chroma element (HS) once the luminance element (I, L or V) is removed, so this common representation is processed only once. Eventually, 108 samples (12 colour spaces × 9 thresholds) were obtained. Table 7 illustrates the evaluation results of the 108 samples tested against the 13 criteria.

Table 7 Implemented decision matrix

The relationships amongst the criteria are determined statistically using the Pearson correlation formula, and a performance analysis of each criterion is conducted to determine its behaviour under various threshold values. Both are discussed in detail in the next sections.

3.1.2 Correlation between criteria results

The details of the correlation tests conducted amongst these criteria are presented in the next subsections.

3.1.2.1 Correlation analysis in layer 1

This layer includes three independent criteria, namely the reliability, time complexity and error rate criteria. The Pearson method is implemented to determine the degree of correlation amongst the criteria. After conducting the test, we obtain the results shown in Table 8.

Table 8 Comparison of reliability, time complexity and error rate criteria

Table 8 illustrates the relationships and degrees of correlation amongst the criteria according to the rules of the Pearson correlation coefficient. The correlation coefficient between the reliability and time complexity groups was − 0.239, indicating a reverse correlation (r < 0) at a significance level of 0.013 for 108 samples; thus, a statistically significant negative correlation exists. The correlation between the reliability and error rate groups was − 0.260, again a reverse correlation (r < 0), at a significance level of 0.007 for 108 samples. Furthermore, the correlation between the time complexity and error rate groups was 0.892, indicating a positive correlation (r > 0) at a significance level of 0.000 for 108 samples; thus, a high positive correlation exists. Overall, the correlations amongst the criteria are confirmed by the Pearson test results.
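
As a minimal sketch of how such coefficients are obtained, the plain-Python Pearson formula below is applied to illustrative reliability and time-complexity samples (not the study's 108-sample data):

```python
import math

def pearson_r(x, y):
    # Pearson product-moment correlation coefficient
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative samples only: higher reliability tends to coincide with
# lower time complexity, giving a reverse (negative) correlation.
reliability     = [0.91, 0.88, 0.95, 0.79, 0.85]
time_complexity = [10.2, 11.0, 9.1, 12.4, 11.5]

r = pearson_r(reliability, time_complexity)   # r < 0: reverse correlation
```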

3.1.2.2 Correlation analysis in layer 2

The second layer includes two sub-criteria groups. The first group comprises three basic sub-criteria generated by the reliability group, and the second comprises two sub-criteria derived from the error rate group. The Pearson test is used to calculate the correlation coefficients within each group (Table 9).

Table 9 Comparison amongst sub-criteria as parameter matrix, relationship and behaviour

Table 9 highlights the correlation amongst three sub-criteria, namely the matrix, relationship and behaviour of parameters, all of which are derived from the reliability group.

Table 10 highlights the correlation between the validation and training sub-criteria, which are derived from the error rate within the dataset group. Overall, the correlations amongst the criteria in Tables 9 and 10 are confirmed by the Pearson test results.

Table 10 Comparison of training and validation sub-criteria
3.1.2.3 Correlation analysis in layer 3

Layer 3 comprises three groups of sub–sub-criteria. The first group includes four parameters generated from the parameter matrix, namely TP, FP, TN and FN. The second group includes four parameters derived from the parameter relationship, namely accuracy, precision, recall and specificity. The third group includes two parameters, namely the F-measure and G-measure, which are generated from the parameter behaviour. The Pearson test is implemented to calculate the correlation coefficient for each group, beginning with Table 11.

Table 11 Comparison amongst TP, FP, TN and FN sub–sub-criteria

Table 11 highlights the correlation amongst four sub–sub-criteria, namely TP, FP, TN and FN, which are derived from the parameter matrix.

Table 12 highlights the correlation between four sub–sub-criteria, namely accuracy, recall, precision and specificity, which are derived from the relationship of parameters.

Table 12 Comparison of accuracy, precision, recall and specificity sub–sub-criteria

Table 13 highlights the correlation between two sub–sub-criteria, namely the F-measure and the G-measure, which are derived from the behaviour of parameters. Overall, the correlations amongst the sub–sub-criteria in Tables 11, 12 and 13 are confirmed by the Pearson test results.

Table 13 Comparison between F-measure and G-measure sub–sub-criteria

Overall, the Pearson test results show strong correlations amongst the criteria. The results in Tables 8, 9, 10, 11, 12 and 13 demonstrate statistically significant relationships amongst the criteria within each layer, emphasising the need to include all the criteria and sub-criteria in the proposed decision matrix, which supports the proposed methodology.

3.1.3 Performance analysis of criteria results

Details of the performance analysis are presented below:

3.1.3.1 Reliability group

In this step, we highlight the performance of reliability group based on nine threshold values for each colour space used. Reliability group includes three key sections: (1) the matrix of parameters [true negative (TN), true positive (TP), false positive (FP) and false negative (FN)], (2) the relationship of parameters (accuracy, recall, precision and specificity) and (3) the behaviour of parameters (F-measure and G-measure).
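
All ten reliability measures listed above derive from the four confusion-matrix counts. A minimal sketch with hypothetical pixel counts (the counts are illustrative only, not the study's data):

```python
import math

# Hypothetical confusion-matrix counts for one skin-segmentation output.
TP, TN, FP, FN = 820, 9105, 95, 180

# Relationship of parameters
accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)     # true positive rate
specificity = TN / (TN + FP)     # true negative rate

# Behaviour of parameters: harmonic and geometric means of precision/recall
f_measure = 2 * precision * recall / (precision + recall)
g_measure = math.sqrt(precision * recall)
```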

A. Matrix of parameters

The matrix of parameters, which is the first and most important section in the reliability group, includes four key parameters, namely TN, TP, FP and FN. The matrix of parameters is also one of the evaluation techniques for skin detection approaches [69].

A.1 True negative criterion

In this section, the behaviour of the first criterion within the matrix of parameters is discussed and analysed. Figure 19 illustrates the behaviour of TN criterion based on nine threshold values with different colour spaces.

Fig. 19

Behaviour of the true negative criterion with different colour spaces

Figure 19 illustrates the behaviour of the TN criterion at different threshold values with the various colour spaces used as alternatives in the study. The graph shows that the behaviour of this criterion appears similar at each threshold value. The lowest value of the criterion is 79.5%, whereas the highest value is 99.04%. The path of each threshold is nearly similar across the colour spaces, except for thresholds 1 and 6. Threshold 1 slightly increases from Norm-RGB to YIQ, slightly drops at HSI, HSV and HSL, slightly increases until CIELUV and drops again at CIELCH. By contrast, thresholds 2 and 3 exhibit the same behaviour: they rise from Norm-RGB to YCbCr, stabilise until HSI, HSV and HSL, slightly decline, slightly increase until CIEXYZ and gradually increase up to CIELCH. Thresholds 4, 5, 7 and 9 slightly increase from Norm-RGB to YCbCr and then alternately drop and rise up to CIELCH. Meanwhile, threshold 6 shows a reverse behaviour: it slightly increases from Norm-RGB to YCgCr, falls, rises again to YIQ, slightly increases and stabilises until CIELUV and then falls until CIELCH. Finally, threshold 8 slightly increases from Norm-RGB to YCbCr, rises further, drops dramatically and sharply at HSI, HSV and HSL, and then slightly increases at CIEXYZ, where its track settles down to the end.

A.2 True positive criterion

Figure 20 shows the behaviour of the TP criterion according to the changes in the threshold values and colour spaces. Its behaviour differs from that of the previous criterion, as shown below.

Fig. 20

Behaviour of the true positive criterion with different colour spaces

Figure 20 illustrates the behaviour of the TP criterion at different threshold values with the various colour spaces used as alternatives in this research. The lowest value of the criterion is 80.06%, whereas the highest value is 99.84%. Nearly all threshold values start at one point at Norm-RGB; thereafter, the behaviour of this criterion varies with the threshold track. Threshold 1 slightly increases until YCgCr, stabilises until YIQ, slightly increases until IHLS and stabilises until the end. Thresholds 2 and 3 slightly increase until YCbCr, stabilise until YIQ, slightly drop at HSI, HSV and HSL and then slightly increase to the end of their tracks. Thresholds 4, 5 and 7 have nearly similar tracks from start to end: they increase to YCbCr, alternate between high and low until YIQ, drop to a minimum at HSI, HSV and HSL, increase again until CIEXYZ, stabilise until CIELAB and then slightly increase and decrease to the end. Threshold 9 follows a similar track but sharply drops at HSI, HSV and HSL, sharply increases to CIEXYZ, settles at CIELAB and then increases to the end. Threshold 6 takes a different track: it evenly increases to YCbCr, settles at YCgCr, alternates between high and low until YIQ, drops to its lowest value at HSI, HSV and HSL, increases to CIELAB and slightly decreases until the end. Finally, threshold 8 records the highest value: it increases until YCbCr, stabilises until it sharply drops to its lowest value at HSI, HSV and HSL, increases again until CIEXYZ and then stabilises until the end.

A.3 False positive criterion

Figure 21 shows the behaviour of the FP criterion based on different threshold values with various colour spaces.

Fig. 21

Behaviour of the false positive criterion with different colour spaces

This criterion is complementary to the true negative criterion within the probabilistic parameters of the reliability group. The lowest value of the criterion is 0.07%, whereas the highest value is 20.5%. The track of each threshold is nearly similar across the colour spaces, except for thresholds 1 and 6.

Threshold 1 has the highest value; it slightly declines until YIQ, slightly increases up to HSI, HSV and HSL, gradually drops until CIELAB and then slightly increases again to the end. Thresholds 2 and 3 follow a similar track: they slightly decline up to YCbCr, increase until YUV, slightly increase up to HSI, HSV and HSL, slightly drop again and straighten their track until CIELUV, where they drop to the end. Thresholds 4 and 5 share the same track: they slightly drop up to YCbCr, fluctuate until YIQ, slightly increase up to HSI, HSV and HSL and then slightly decline; their tracks settle down to CIEXYZ and then slightly increase and decrease again until the end. Threshold 6 behaves differently: its track drops until YCgCr, gradually increases and decreases, slightly increases at YIQ and slightly drops until it stabilises to the end. By contrast, thresholds 7 and 9 have tracks similar to that of threshold 5. Finally, threshold 8 records the lowest value: its track slightly drops at YCbCr, gradually increases and decreases until YIQ, sharply increases and then drops up to CIEXYZ, after which it remains stable to the end. The general characteristic of this criterion is that its behaviour mirrors that of the TN criterion in the opposite direction.

A.4 False negative criterion

Figure 22 illustrates the behaviour of the FN criterion using nine thresholds with various colour spaces.

Fig. 22

Behaviour of the false negative criterion with different colour spaces

This criterion is complementary to the TP criterion within the probabilistic parameters of the reliability group. The lowest value of the criterion is 0.16%, whereas the highest value is 24.63%. Figure 22 shows that nearly all threshold values start at one point at Norm-RGB; thereafter, the behaviour of this criterion varies with the threshold track. Threshold 1 slightly declines until YCgCr, slightly increases and stabilises up to YIQ and then drops to IHLS; it slightly rises, declines at CIELUV and rises again to the end. Thresholds 2 and 3 exhibit nearly the same trend as threshold 1 but sharply rise at HSI, HSV and HSL, gradually drop until CIELUV and then slightly rise to the end. Thresholds 4 and 5 start similarly to the previous thresholds: they sharply rise up to HSI, HSV and HSL, gradually decline until CIELUV and increase again to the end. Threshold 6 behaves differently: it starts from the same point, gradually declines until YCbCr, moderates its track at YCgCr, slightly drops to YCgCb, rises and declines again; finally, it sharply rises to the top at HSI, HSV and HSL and then gradually drops to the end. Thresholds 7 and 9 behave similarly: they decline from the starting point until YCgCr, change their tracks up and down and rise to the top at HSI, HSV and HSL; threshold 7 then gradually drops to the end, whilst threshold 9 continues to decline. Finally, threshold 8 records the lowest value: it starts at the same point, declines up to YCbCr, stabilises until YUV, sharply rises to the top until HSI, HSV and HSL, sharply declines until CIEXYZ and gradually continues to the end. We conclude that the behaviour of this criterion is similar to that of the TP criterion but in the opposite direction.

B. Relationship of parameters

In this step, the second group of sub-criteria within the reliability group, namely the relationship of parameters (accuracy, recall, precision and specificity), is discussed and analysed.

B.1 Accuracy criterion

Accuracy refers to the exactness of the analytical method in correctly identifying outliers and inliers from total data [66].

Figure 23 shows that this criterion behaves similarly to the TP criterion owing to the convergence of the values between the two criteria. The accuracy criterion is an important measure within the relationship of parameters of the reliability group. The lowest value of the criterion is 79.90%, whereas the highest value is 100.40%. The track of each threshold is nearly similar across the colour spaces, except for thresholds 1 and 6.

Fig. 23

Behaviour of the accuracy criterion with different colour spaces

Threshold 1 records the lowest value: it slightly rises from Norm-RGB to YIQ and then slightly drops at HSI, HSV and HSL; it slightly rises until CIELUV and drops again at CIELCH. Thresholds 2 and 3 exhibit the same behaviour: they rise from Norm-RGB to YCbCr, stabilise until HSI, HSV and HSL, slightly decline, gradually rise until CIEXYZ and continue rising up to CIELCH. Thresholds 4, 5, 7 and 9 start slightly high from Norm-RGB to YCbCr and then alternately drop and rise up to CIELCH. Threshold 6 demonstrates a reverse behaviour: it slightly rises from Norm-RGB to YCgCr, gradually falls, rises to YIQ, slightly rises and stabilises until CIELUV and then falls until CIELCH. Finally, threshold 8 starts at the highest value: it slightly rises from Norm-RGB to YCbCr, rises further, falls dramatically and drops sharply at HSI, HSV and HSL; it then slightly rises to CIEXYZ, and its track settles down to the end.

B.2 Recall criterion

Recall is an important criterion that measures completeness or quantity. It refers to the TP rate, i.e. the probability of complete retrieval of the parameter values [71]. Figure 24 shows all threshold values starting at approximately the same point at Norm-RGB. This criterion behaves fairly similarly to the TP criterion, as shown by the matching threshold tracks in the graph. The reason is that recall is the ratio of TP components to the elements inherently ranked as positive; that is, recall represents the number of correctly classified positive examples divided by the number of positive examples in the data. Accordingly, the contribution of recall, which focuses only on the positive examples and predictions, is almost identical to that of the TP criterion.

Fig. 24

Behaviour of the recall criterion with different colour spaces


Figure 24 shows that the lowest value of the recall criterion is 0.791%, whereas the highest value is 0.997%. The figure illustrates the behaviour of the recall criterion at different threshold values with the various colour spaces used as alternatives in the study. Nearly all threshold values start at one point at Norm-RGB; thereafter, the behaviour of this criterion varies with the threshold track. Threshold 1 slightly increases until YCgCr, stabilises up to YIQ, slightly increases until IHLS and finally stabilises to the end. Thresholds 2 and 3 slightly increase until YCbCr, stabilise until YIQ, slightly drop at HSI, HSV and HSL and then slightly increase to the end. Thresholds 4, 5 and 7 have nearly similar tracks from start to end: they increase to YCbCr, alternate up and down until YIQ, drop to a minimum at HSI, HSV and HSL, increase again until CIEXYZ, stabilise until CIELAB and then slightly increase and decrease to the end. Threshold 9 follows a similar track but sharply drops at HSI, HSV and HSL and sharply increases to CIEXYZ; its track remains stable until CIELAB and reaches its maximum at the end. Threshold 6 takes a different track: it increases up to YCbCr, settles until YCgCr, alternates up and down until YIQ, drops to its lowest value at HSI, HSV and HSL, increases to CIELAB and then slightly decreases until the end. Finally, threshold 8 records the highest value: it increases until YCbCr and stabilises until it sharply drops to its lowest value at HSI, HSV and HSL; its value increases again until CIEXYZ, and its track stabilises until the end.

B.3 Precision criterion

Precision is also an important measure; it refers to the ratio of relevant information to all information gathered through networks or different sensors and services [72].

Figure 25 depicts that precision behaves similarly to accuracy owing to the convergence of their respective values. The precision criterion is an important measure within the relationship of parameters of the reliability group. The lowest value of precision is 0.796%, whereas the highest value is 0.996%. The track of each threshold is nearly similar across the colour spaces, except for thresholds 1 and 6. Threshold 1 records the lowest value: it slightly increases from Norm-RGB to YIQ, slightly drops at HSI, HSV and HSL, slightly increases until CIELUV and decreases again at CIELCH. By contrast, thresholds 2 and 3 share the same behaviour: their values increase from Norm-RGB to YCbCr and stabilise until HSI, HSV and HSL; their tracks then slightly decline, slightly increase until CIEXYZ and gradually increase up to CIELCH. Thresholds 4, 5, 7 and 9 slightly increase from Norm-RGB to YCbCr and then alternately drop and rise up to CIELCH. Threshold 6 has a reverse behaviour: it slightly increases from Norm-RGB to YCgCr, decreases and then increases to YIQ; it slightly increases and stabilises until CIELUV and decreases until CIELCH. Finally, threshold 8 starts at the highest value: it slightly increases from Norm-RGB to YCbCr, rises further and then drops dramatically and sharply at HSI, HSV and HSL; it then slightly increases to CIEXYZ, where its track settles down to the end.

Fig. 25

Behaviour of the precision criterion with different colour spaces

B.4 Specificity criterion

Specificity is an important measure within the relationship of parameters; it represents the capability of a classifier to distinguish the negative class, with values ranging from 0 to 1 [70]. The behaviour of the specificity criterion is shown in Fig. 26.

Fig. 26

Behaviour of the specificity criterion with different colour spaces

Figure 26 illustrates that the behaviour of specificity is similar to that of precision at the threshold values according to different colour spaces.

C. Behaviour of parameters

The last part of the reliability group is the behaviour of parameters, which includes two key parameters, namely the F-measure and G-measure. This group measures and tests the behaviour of parameters that are closely related to precision and recall.

C.1 F-measure criterion

The F-measure combines precision and recall into a single value and is considered one of the most popular criteria for evaluating classification quality [73].

Figure 27 shows the behaviour of the F-measure criterion based on nine threshold values with different colour spaces. The lowest value of the criterion is 0.798%, whereas the highest value is 0.991%. Threshold 1 starts at the lowest value and slightly increases from Norm-RGB until YCgCr; it then stabilises its track, slightly increases and stabilises until CIELUV and slightly drops to the end of its track. Thresholds 2 and 3 start at the same point and slightly increase up to YCbCr; their tracks stabilise until YIQ, descend at HSI, HSV and HSL, slightly increase up to CIEXYZ and then stabilise again until the end. Thresholds 4 and 5 slightly increase from YCbCr, stabilise until YIQ, decline at HSI, HSV and HSL, increase until CIELUV and slightly drop to the end. Thresholds 6, 7, 8 and 9 begin to increase from the same point. Threshold 6 increases until YCbCr and then settles down to YIQ; it sharply falls at HSI, HSV and HSL, increases to CIEXYZ and settles down until the end. Thresholds 7 and 9 follow the same track as the previous threshold: they sharply decrease at HSI, HSV and HSL and sharply increase at CIEXYZ; the track of threshold 7 then stabilises to the end, whilst threshold 9 increases from CIELUV. Finally, threshold 8 records the highest value: it increases to YCbCr, stabilises until YCgCb, slightly drops at YUV, continues to drop sharply at HSI, HSV and HSL, sharply increases at CIEXYZ and stabilises its track to the end.

Fig. 27

Behaviour of the F-measure criterion with different colour spaces

C.2 G-measure criterion

The second criterion within the behaviour of parameters is the G-measure, which refers to the geometric mean of the precision and recall values, or a general measure for classification algorithms based on the performance and accuracy of the measured sample classification.

Figure 28 shows that the G-measure exhibits a behaviour similar to that of the F-measure owing to the symmetry in their final values, which shapes the behaviour of the G-measure according to the threshold values at each colour space used.

Fig. 28

Behaviour of the G-measure criterion with different colour spaces

3.1.3.2 Time complexity criterion

The time complexity criterion is the processing time of the algorithm for all image samples during the image segmentation process, which depends on the image size [74]. In this section, this key criterion is discussed according to the distribution of threshold values with different colour spaces.

Figure 29 shows that the criterion has the lowest threshold value at 8.01% and the highest at 12.42%, indicating that the threshold values for this criterion are nearly identical. Consequently, this close matching of values clearly shapes the behaviour of the criterion: the threshold tracks appear identical from start to end. The threshold values start at their lowest point, sharply increase from Norm-RGB until YCgCr and then continue to slightly increase until YIQ; they sharply drop until HSI, HSV and HSL, slightly increase until CIEXYZ, stabilise until CIELAB and slightly decrease until the end of the track.

Fig. 29

Behaviour of the time complexity criterion with different colour spaces

3.1.3.3 Error rate within dataset

The error rate is the minimum error rate measured during the training process, based on the dataset used by the classifier that implements the training and validation process [69].

A. Error rate of validation criterion

A cross-validation process is widely used in this study to set the error rate for the training data. The dataset is divided into three sections: the majority of the data for training, with the remainder split between validation and testing.
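The three-way split described above can be sketched as follows; the 60/20/20 proportions, the fixed seed and the function name are illustrative assumptions, since the exact partition sizes used in the study are not restated here.

```python
import random

def split_dataset(samples, train=0.6, val=0.2, seed=0):
    """Shuffle and split samples into training (majority), validation
    and test sets; the test set receives the remainder."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train)
    n_val = int(n * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```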

Figure 30 illustrates the measurement of the error rate of validation criterion based on nine thresholds according to different colour spaces. The threshold values are identical for each colour space. Thus, the track of the threshold values starts at the lowest value from Norm-RGB, sharply increases until YCgCr, increases slightly to YIQ and sharply declines until HIS, HSV and HSL. It then increases sharply to CIEXYZ, rises slightly until CIELAB and slowly drops to the end of the track.

Fig. 30

Behaviour of the error rate of validation criterion with different colour spaces

B. Error rate of training criterion

Calculating the error rate on the training set is also an important stage in the dataset-division process. Figure 31 shows the behaviour of the training criterion.

Fig. 31

Behaviour of the error rate of training criterion with different colour spaces

Figure 31 shows the behaviour of the error rate of training criterion according to the distribution of threshold values at each colour space. This figure exhibits a pattern similar to that of the validation criterion, with matching threshold values for each colour space. The threshold values start from Norm-RGB, rise sharply until YCgCb and drop slightly until YUV. They rise slightly to YIQ, decline sharply to HIS, HSV and HSL and rise sharply again to CIEXYZ. They continue to rise slightly to CIELAB, decline slightly to CIELUV and rise slightly until the end of their track.

In conclusion, the behaviour of the criteria in all scenarios is affected by the distribution of threshold values for each criterion according to the different colour spaces used. The reliability group has three main sections. The first section represents the matrix of parameters, which includes probabilistic parameters with nearly identical behaviour, as shown in the charts. The TN and FP criteria behave similarly but in opposite directions because the values of the FP criterion are complementary to those of the TN criterion; the TP and FN criteria behave similarly for the same reason. The second section represents the relationship of parameters, which includes the four main criteria (accuracy, recall, precision and specificity); their behaviour is affected by the values of the matrix of parameters from which they are computed. The third section includes two criteria, namely F-measure and G-measure, which behave similarly due to the convergence of their values. By contrast, time complexity has a specific behaviour, as shown in the diagram, clearly affected by the strong convergence of its values distributed over the threshold values and the different colour spaces. Finally, the error rate within the dataset has two basic criteria, namely validation and training, whose behaviour is clearly affected by the matching of threshold values at each colour space. Therefore, the results show differences between the behaviours of the criteria, thereby emphasising the need to use all the criteria as multidimensional measurements in the proposed methodology.

3.2 Benchmarking of real-time IoT skin detectors

In this stage, the integration between the selected MCDM techniques is implemented according to the proposed methodology. AHP is performed to generate different weights for the criteria used in this study. This method is based on the preferences of six evaluators from different universities in Malaysia, who answer pairwise comparison questions to provide the final weights. The TOPSIS method is then employed to generate the final results based on the weights obtained from AHP and the constructed decision matrix. The results are based on two main contexts, namely the individual and group contexts. Thus, the selection of a suitable context is recommended based on the experiments performed and the different aggregation processes conducted to achieve the selection procedure. These contexts correspond to individual decision-making for single decision-makers and group decision-making for multiple decision-makers. The results are obtained through the integration process for selecting the best alternatives, which is the main objective of this section. Consequently, the results for the six evaluators are presented and discussed in detail.

3.2.1 Multi-layer weight measurement using AHP

Table 14 presents the results based on the preferences of the six evaluators from the Malaysian universities. Questions are presented according to the rules of the hierarchical analysis process. ML-AHP is applied to generate standard weights according to the preferences of the evaluators. The first evaluator weights the reliability group at 57.3%, the time complexity group at 35.3% and the error rate group at 7.4%. The second evaluator weights the reliability group at 28.1%, the time complexity group at approximately 8.1% and the error rate group at 63.8%. The third evaluator weights the reliability group at 33.3%, the time complexity group at 33.3% and the error rate group at 33.4%. The fourth evaluator weights the reliability group at 62.2%, the time complexity group at 30.2% and the error rate group at 7.6%. The fifth evaluator weights the reliability group at 23.9%, the time complexity group at 13.8% and the error rate group at 62.3%. The last evaluator weights the reliability group at 21.1%, the time complexity group at 68.6% and the error rate group at 10.2%. The criteria weights gathered from the six evaluators are utilised to complete the decision matrix that will be used in decision-making.
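For illustration, the group-level priority weights can be approximated from a pairwise comparison matrix using the row geometric-mean method, a common approximation to AHP's principal eigenvector. The matrix below is a hypothetical example, not any evaluator's actual judgements.

```python
import math

def ahp_weights(pairwise):
    """Approximate AHP priority weights via the row geometric-mean method."""
    gm = [math.prod(row) ** (1.0 / len(row)) for row in pairwise]
    total = sum(gm)
    return [g / total for g in gm]

# Hypothetical judgements: reliability vs. time complexity vs. error rate,
# on Saaty's 1-9 scale (reciprocals below the diagonal).
matrix = [
    [1,     2,     7],   # reliability
    [1 / 2, 1,     5],   # time complexity
    [1 / 7, 1 / 5, 1],   # error rate
]
w = ahp_weights(matrix)
```

With these illustrative judgements, the weights come out in the same order as those of the first evaluator (reliability highest, error rate lowest), though the exact values depend entirely on the assumed matrix.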

Table 14 ML-AHP measurement for weight preferences

3.2.2 TOPSIS performance based on different evaluators’ weights

In this section, TOPSIS is used to evaluate and benchmark the skin detector approaches on the basis of 108 colour space algorithms from the perspective of the six evaluators. Thus, this procedure is applied to select the best method. Table 14 provides the preference weights of the features with the appropriate colour space algorithm from the perspective of the evaluators. Six experts conduct pairwise comparisons to evaluate the degree of importance of each evaluation criterion. TOPSIS is implemented to distinguish the worst and best performances of the skin detection approach for each experiment. The results from these experiments are compared with the ideal and worst performances: \(S^{ + }\) and \(S^{ - }\) denote each approach's separation from the ideal and worst performances, respectively. Based on the TOPSIS rules, the approach closest to the best performance and farthest from the worst performance is selected as the ideal approach. The preferences of the six evaluators are as follows.
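A compact sketch of the TOPSIS steps described above (vector normalisation, weighting, ideal and anti-ideal points, separations \(S^{ + }\) and \(S^{ - }\), and the closeness score) might look as follows; the decision matrix and weights in the test are placeholders, not the study's 108-algorithm matrix.

```python
import math

def topsis(matrix, weights, benefit):
    """TOPSIS closeness scores.

    matrix  -- rows are alternatives, columns are criteria
    weights -- criterion weights summing to 1
    benefit -- True if higher is better for that criterion, else False
    """
    n_alt, n_crit = len(matrix), len(matrix[0])
    # 1. Vector-normalise each column and apply the weights.
    norms = [math.sqrt(sum(matrix[i][j] ** 2 for i in range(n_alt)))
             for j in range(n_crit)]
    v = [[weights[j] * matrix[i][j] / norms[j] for j in range(n_crit)]
         for i in range(n_alt)]
    # 2. Ideal (best) and anti-ideal (worst) value per criterion.
    ideal = [max(col) if benefit[j] else min(col)
             for j, col in enumerate(zip(*v))]
    worst = [min(col) if benefit[j] else max(col)
             for j, col in enumerate(zip(*v))]
    # 3. Separations S+ and S-, then closeness S- / (S+ + S-).
    scores = []
    for row in v:
        s_pos = math.sqrt(sum((x - b) ** 2 for x, b in zip(row, ideal)))
        s_neg = math.sqrt(sum((x - w) ** 2 for x, w in zip(row, worst)))
        scores.append(s_neg / (s_pos + s_neg))
    return scores
```

Alternatives are then ranked by descending closeness, so the approach nearest the ideal and farthest from the worst performance ranks first, as stated above.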

According to the results of ML-AHP, the weights given by the first evaluator for the reliability group are 41.5% for TP, FP, TN and FN; 12.8% for accuracy, recall, precision and specificity; and 3% for F-measure and G-measure. The time complexity group is given a weight of 35.3%, whereas the error rate group, represented by training and validation, is given weights of 6.2% and 1.2%, respectively. Each colour space algorithm is evaluated using these attributes. The results of the TOPSIS ranking indicate that the first evaluator gives an average of 0.49763 ± 0.13776. The highest rank value is 0.8080, whereas the lowest is 0.1527. Table 17 in the Appendix presents the complete results from the first evaluator.

The weights given by the second evaluator for the reliability group are 2.5% for TP, FP, TN and FN; 8.5% for accuracy, recall, precision and specificity; and 17.1% for F-measure and G-measure. A weight of 8.1% is given to the time complexity group, whereas the error rate group is given the same weight of 31.9% each for training and validation. The results from the second evaluator yield an average of 0.38137 ± 0.28451. The highest rank value is 0.955, whereas the lowest is 0.117. Table 18 in the Appendix presents the complete results from the second evaluator.

For the reliability group, the third evaluator gives a weight of 11.1% to each parameter subgroup. The time complexity group is given a weight of 33.3%, whereas the error rate group, represented by training and validation, is given a weight of 16.7% for each. The results from the third evaluator yield an average of 0.42512 ± 0.22686. The highest rank value is 0.8588, whereas the lowest is 0.0453. Table 19 in the Appendix presents the complete results from the third evaluator.

The weights given by the fourth evaluator for the reliability group are 33.7% for TP, FP, TN and FN; 9.9% for accuracy, recall, precision and specificity; and 18.6% for F-measure and G-measure. The time complexity group is given a weight of 30.2%, whereas the error rate group, represented by training and validation, is given weights of 6.8% and 0.8%, respectively. The results from the fourth evaluator yield an average of 0.49409 ± 0.13362. The highest rank value is 0.8009, whereas the lowest is 0.1475. Table 20 in the Appendix presents the complete results from the fourth evaluator.

The weights provided by the fifth evaluator for the reliability group are 3.3% for TP, FP, TN and FN; 13.7% for accuracy, recall, precision and specificity; and 6.8% for F-measure and G-measure. A weight of 13.8% is given to the time complexity group. The training and validation parameters representing the error rate group are given weights of 15.6% and 46.8%, respectively. The results from the fifth evaluator attain an average of 0.37469 ± 0.28847. The highest rank value is 0.9678, whereas the lowest is 0.0093. Table 21 in the Appendix presents the complete results from the fifth evaluator.

The weights given by the sixth evaluator for the reliability group are 14.5% for TP, FP, TN and FN; 1.9% for accuracy, recall, precision and specificity; and 4.8% for F-measure and G-measure. The time complexity group is given a weight of 68.6%, whereas the error rate group, represented by training and validation, is given weights of 2.6% and 7.7%, respectively. The results of the colour space algorithms achieve an average of 0.49286 ± 0.22406. The highest rank value is 0.8373, whereas the lowest is 0.0552. Table 22 in the Appendix presents the complete results from the sixth evaluator.

Further details are shown in Fig. 32, which presents the evaluators' ranking preferences based on internal and external aggregation values.

Fig. 32

Visualised ranking for the six evaluators

According to the tables generated by applying the TOPSIS method for selecting the ideal alternatives, the final results of internal aggregation vary. Therefore, these results are compared across evaluators on the basis of the similarity of their values, as shown in Fig. 32. The first and fourth evaluators show a similar preference for the YCbCr colour space. The second and fifth evaluators, as well as the third and sixth, yield similar results by selecting the normalised RGB colour space. Therefore, most of the evaluators' results confirm the selection of the normalised RGB colour space.

In summary, the results from the evaluators indicate a lack of consensus for making a joint decision because of the difficulty and complexity of individual decision-making. Thus, a group decision-making context is used to discuss and compare the individual decisions based on these results. The next section shows the results for the internal and external aggregations within group decision-making.

3.2.3 Group TOPSIS with internal and external aggregation

Many decision-making issues are resolved through cooperative efforts within organisations. According to the two methods mentioned in the literature, TOPSIS extends to the group decision environment through either internal or external aggregation. Internal aggregation adopts the aggregation process in the separation phase: the separations of each alternative from the positive and negative ideals are aggregated across the evaluators, and the group closeness is calculated as the summation of the negative separation values divided by the sum of the negative and positive separation values over all evaluators (\(S^{ - }\)/(\(S^{ - }\) + \(S^{ + }\))). External aggregation, by contrast, is determined by calculating the average of the ranked values of all evaluators. Table 15 shows the results of group TOPSIS with the internal and external aggregations applied. The results of external aggregation show that high values are obtained in the normalised RGB colour space, and the internal and external aggregation values are relatively identical.
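The two aggregation rules described above can be sketched as follows; the argument layout (per-alternative lists of evaluator values) is an assumption of this illustration, not the study's data structures.

```python
def internal_aggregation(s_neg, s_pos):
    """Group closeness per alternative: the sum of S- over all evaluators
    divided by the sum of S- plus the sum of S+ over all evaluators.
    s_neg[i] / s_pos[i] hold alternative i's separations per evaluator."""
    return [sum(neg) / (sum(neg) + sum(pos))
            for neg, pos in zip(s_neg, s_pos)]

def external_aggregation(closeness):
    """Group score per alternative: the average of each evaluator's
    closeness score. closeness[i] holds alternative i's scores."""
    return [sum(c) / len(c) for c in closeness]
```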

Table 15 Group decision-maker of TOPSIS method with external and internal aggregations

Findings show that the ideal values are 0.8160529 and 0.7867003 for the internal and external aggregations, respectively. The results of the colour space algorithms according to the external and internal aggregations reach averages of 0.4442933 ± 0.4827987 and 0.432439 ± 0.4776001, respectively. The outcome of aggregation indicates similarities between the internal and external rankings of the various colour space algorithms from the perspective of the six evaluators. Figure 33 depicts the post-aggregation performance of the tested colour space algorithms based on the externally aggregated results of the six evaluators.

Fig. 33

Internal and external aggregation ranking

4 Validity process

This section provides insight into the validation and comparison of the different colour spaces obtained in this research. Validation is an important step in empirical research for confirming the accuracy of the results. Therefore, the results obtained must be validated according to the MCDM techniques. For this procedure, the initial step is the comparison of the different colour spaces based on the results. The statistical measurement method is then used to calculate the mean and standard deviation of each threshold value used. Figure 34 shows an overview of the design and implementation of the validation process.

Fig. 34

Overview of the design and implementation of a validation process

4.1 Colour space measurement

The selection of the appropriate alternative based on the different criteria in skin detection can be implemented using the TOPSIS method. Thus, two categories of results, namely external and internal aggregations, are collected. This study focuses on the results of external aggregation because it includes all comparison values between the criteria and alternatives, calculated as the average of the different ranking values. Figure 35 shows the behaviour of the different colour spaces according to the criteria at specific threshold values.

Fig. 35

Colour space measurement

Figure 35 shows that the behaviour of all the colour spaces is relatively identical based on their original values. The behaviour of YIQ starts at the lowest value at threshold 1 and increases gradually until threshold 2. It stabilises to threshold 3, gradually rises to threshold 5 and stabilises until threshold 6. It then gradually rises to threshold 8 and stops at threshold 9. The behaviour of YUV starts with a gradual rise from threshold 1 to threshold 4. It continues to rise to thresholds 6 and 7 and stabilises until threshold 8 before declining at threshold 9. The behaviours of YCgCb and CIELAB start from thresholds 1 and 2 and gradually rise to threshold 5. They stabilise at threshold 6, rise to threshold 8 and decline to threshold 9. YCgCr and CIEXYZ show identical behaviours: they rise gradually from threshold 1 to threshold 5, stabilise slightly at threshold 6, gradually rise to threshold 8 and finally drop slightly to threshold 9. By contrast, CIELUV shows a distinct behaviour, exhibiting a gradual rise from threshold 1 to threshold 8 and declining slightly to threshold 9. CIELCH rises from threshold 1 to threshold 2 and slightly rises to threshold 5. It stabilises slightly to threshold 6, gradually rises to threshold 8 and drops to threshold 9. IHLS shows a distinct behaviour as it gradually rises from threshold 1 until threshold 9 at the same rate. YCbCr starts from threshold 1 to threshold 2 and slightly rises to threshold 5. It stabilises slightly at threshold 6, gradually rises to threshold 8 and declines to threshold 9. The behaviours of HIS, HSV and HSL start from threshold 1 and rise gradually until threshold 5. Afterwards, they slightly rise to threshold 6, stabilise at threshold 7 and decline slightly until threshold 9. Finally, normalised RGB starts from threshold 1 and gradually rises until threshold 5. It slightly rises to threshold 6 before stabilising until threshold 9.

Therefore, the behaviour of each colour space varies according to the values obtained from external aggregation. The YIQ colour space yields the lowest value, whereas normalised RGB records the highest value amongst the colour spaces. The remaining colour spaces fall in between, as shown in Fig. 35.

4.2 Threshold measurements

This section presents the calculation of the mean and standard deviation of the threshold values for all colour spaces. Nine threshold values are adopted based on the case study in this research: 0.5, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9 and 0.95 for each colour space (see Table 16).

Table 16 Mean and standard deviation for threshold values

Table 16 highlights the results obtained for each threshold value. The results show that the mean value is lowest at threshold 0.5, increases gradually with the threshold until it reaches its highest value at 0.9 and declines at threshold 0.95. The findings thus indicate that the best result is obtained at the threshold of 0.9.

5 Conclusion

In this work, an IoT system is adopted to collect the data required for analysis, evaluation and comparison, and MCDM techniques are utilised to evaluate and benchmark skin detection approaches. Firstly, different images are collected from a real-time camera in an IoT environment. The developed skin detection approaches, which use multi-agent learning in different colour spaces, are then applied to create the decision matrix. The decision matrix is evaluated on the basis of different criteria with various skin detection engines. The performance of the multidimensional criteria for the skin detection engines is verified by identifying the correlation between the criteria through the Pearson method and by determining their behaviour based on nine thresholds at each colour space. The results confirm statistically significant differences and differentiated behaviours between the criteria, which indicates that all the available criteria are needed in the evaluation because multidimensional measurements are necessary. Moreover, the integration of MCDM techniques for the evaluation and benchmarking of skin detection approaches is presented. This research utilises the ML-AHP method based on the pairwise comparison technique, which relies on the differences in evaluator preferences, to identify the weights of the criteria. The TOPSIS method is then used as the MCDM technique for selecting the best alternative based on group decision-making over the skin detection approaches. Accordingly, the overall comparison of external and internal aggregation values shows that normalised RGB at the sixth threshold has the highest value and is the most recommended of all colour spaces, whereas the YIQ colour space has the lowest value and is the worst case. Moreover, the lowest threshold value was obtained at 0.5, whereas the best was at 0.9.
Further studies should focus on the time complexity of the algorithms by considering the criteria for the evaluation and benchmarking of real-time IoT skin detectors and by defining their procedure objectively.