1 Introduction

Wilson’s disease (WD) is due to excessive copper accumulation in the liver and brain [1]. The National Organization for Rare Disorders (NORD) reported that 1 in every 30,000 to 40,000 people worldwide is affected by WD [1]. It is estimated that there were nearly 600 cases of WD in the USA in 2007 and that 9000 people in the USA will be affected by WD by the end of 2020.

WD causes severe disability and death if not treated early. The present diagnosis of WD uses anatomical tests, but they are not reliable [2, 3]. MRI has shown promise for diagnosing WD since it shows white matter hyperintensity (WMH) in the brain [4,5,6,7]. However, due to the volumetric nature of MRI and the subtle differences in hyperintensity between WD and controls, human bias and interobserver variability may complicate diagnosis. To overcome these challenges, computer-aided diagnosis (CADx) methods can play a vital role in improving the classification and characterization of WD [8, 9].

Artificial intelligence (AI) is a branch of computer science that can handle classification effectively, as it can map the nonlinearity between input variations and disease severity [10]. AI methods can be broadly divided into two supervised learning categories, namely machine learning (ML) and deep learning (DL). Machine learning methods [11,12,13,14,15,16] such as decision tree (DT), k-nearest neighbor (k-NN), support vector machine (SVM), and random forest (RF) can be applied for classification, but they rely on manually identified features and can yield low performance. In contrast, DL methods [17,18,19,20] are more reliable because they can automatically learn features from a dataset using hidden layers. One such example is the deep convolutional neural network (DCNN), which has been widely adopted by industry for image classification [11, 19].

Furthermore, DL systems can augment the size of the input data to ensure a balance between classes leading to stronger learning protocols.

DL systems have several parameters, in particular the number of layers in the DCNN architecture. These layers are typically (i) convolution layers, (ii) max-pooling layers, and (iii) a combination of dense layers that constitutes the neural network, along with the softmax layer. Since the low-level and high-level MRI features of WD are extracted by the sets of convolution and max-pooling layers, it is important to determine how many such sets are needed to optimize the DCNN system [21]. Furthermore, since a DCNN is sensitive to the training data size, it is necessary to understand how large the input training data should be. This training data size can be altered by the “augmentation procedure”, which is well developed in the AI industry [22]. We therefore need to optimize the DL system over the combination of “augmentation folds” and “number of layers” of the DCNN for the best classification and characterization of WD. Thus, WD classification depends directly on (a) the number of DCNN layers and (b) the augmentation folds of the training data. Such an optimization paradigm is being attempted for the first time for the WD application. Furthermore, because the conventional DCNN (cDCNN) uses the rectified linear unit (ReLU) as an activation function, which is not differentiable at the origin, we have designed an improved DCNN (iDCNN) whose activation is smooth near the origin, thereby improving its performance. Finally, we validate the hypothesis that there is WMH in WD MRI images. The system uses two novel characterization strategies that take advantage of the CNN layers and the Bispectrum signal processing framework [12, 19, 23]. This is another unique feature of our paradigm. Furthermore, we benchmark our two DCNN systems against the transfer learning–based “Inception V3” [24] framework and four types of machine learning systems. Overall, we designed, applied, and compared seven kinds of AI approaches for the diagnosis of WD.

As part of the performance evaluation, we conducted several new experiments: (i) we computed the diagnostic odds ratio (DOR) and correlated it with classification accuracy, which had not been attempted previously; (ii) we conducted a power study to estimate the optimum dataset size; (iii) since all implementations were run on a supercomputer with 8 GPUs, we performed a timing analysis to demonstrate the computational power of our design; and (iv) to find the best operating point of the DL system in terms of training data size, we optimized the CNN models by computing classification accuracy while varying the percentage of training data.

This study is the first of its kind and offers the following novel approaches:

  (i) Design of 3D optimization of two deep learning systems by classification accuracy with (a) changing DL layers and (b) folds of augmentation. The DL layers were varied between 5, 7, 9, and 11, and each design was tested with different dataset sizes created using the augmentation protocol. Augmentation was utilized to increase the dataset size by 2×, 3×, 4×, and 5× folds to create 5 sets of data including the original size. Three-dimensional optimization was carried out amongst these 4 DL designs and 5 augmented dataset sizes, creating 20 combinations, which could then be used for choosing the best DCNN design for optimized classification. The final estimate is the number of layers and the augmentation fold that give the highest accuracy.

  (ii) Design of inter-comparison benchmarking between the three kinds of DL systems and four types of ML classification systems.

  (iii) Hypothesis validation by characterization of WD vs. controls using a combination of AI-based feature map strength (FMS) and the Bispectrum signal processing approach.

  (iv) Performance evaluation by computing (a) DOR, (b) generalization of DL paradigms, (c) reliability, (d) stability, and (e) time analysis of supercomputing.

This study is subdivided into the following sections: Section 2 details the ongoing research in the field of classification in MR neuroimaging. Section 3 explains the methods and materials used in this study. Section 4 presents all classification results, and Section 5 discusses the characterization of WD. Section 6 shows the performance evaluation, and Section 7 presents the discussion on the novel techniques used in this paper. The paper concludes with Section 8.

2 Background literature

The role of AI in WD classification has not yet taken center stage. Few studies use AI on this topic compared with other diseases such as Alzheimer’s, Parkinson’s, or cancer, or with neuroimaging in general. One therefore cannot ignore studies in the WD area that are not AI-based. Our recent work is the culmination of two decades of research, where biomarkers like serum or urine were used for diagnosis or segregation of patients having WD, primarily based on the threshold ranges of these biomarkers [1, 2, 25]. These methods consisted of 24-h urinary and serum laboratory tests for the identification of WD. Another class of methods for diagnosis of WD consisted of eye examination for Kayser-Fleischer rings and testing for gene mutations [25]. A more recent method used all four types of biomarkers, namely serum, eye, urine, and brain imaging, to confirm WD [1].

Other frameworks for diagnosis of WD have been studied, such as laboratory-based (blood) tests or genetic mechanisms for WD classification. Vrabelova et al. [26] described the utilization of blood tests involving DNA analysis for the ATP7B gene mutation study. Rosencrantz and Schilsky [3] used ATP7B mutation analysis along with the Kayser-Fleischer ring in the eyes and elevated copper levels in urine. These tests were better at diagnosing the disease than the serum/urine biomarker tests.

With the advent of MRI, several studies turned towards neuroimaging-based classification approaches; however, they remained manual in nature. WMH has recently been explored in many diseases [27,28,29]. Kim et al. [30] analyzed hyperintensity in T1-weighted (T1W) and T2-weighted (T2W) MRI scans of suspected patients and found WMH in different parts of the brain such as the globus pallidus, thalamus, midbrain, and pons.

Recently, the AI community has started implementing this technology for characterization and classification of WD. In 2011, imaging took a leap towards fMRI for WD. Hu et al. [31] studied changes in the amplitude of low-frequency fluctuations (ALFF) while conducting fMRI on WD patients. Resting-state functional magnetic resonance imaging (fMRI) has shown promise and was employed to measure ALFF in different parts of the brain [32, 33]. Furthermore, ML started to penetrate the imaging domain for WD classification [34, 35]. Kaden et al. [34] demonstrated the use of support vector machine (SVM) and parameterized generalized learning vector quantization (PGLVQ) for WD classification with accuracies of 87.5% and 90.1%, respectively. Jing et al. [35] used independent component analysis (ICA) with functional networks and SVM for WD classification and obtained an AUC of 0.94 and an accuracy of 89.4% (specificity: 90.0%, sensitivity: 89.3%) with aberrant functional networks (FN). None of the above studies demonstrated automated approaches for WD classification and characterization. Our study uses a 3D optimized deep learning–based paradigm for classification of WD against controls and further extends the DL models, combined with signal processing, to tissue characterization. The CADx system comprises three kinds of DL and four kinds of ML for WD classification and offers a novel approach to the diagnosis of WD.

3 Methodology

3.1 Patient demographics, acquisition, and data augmentation

T2W-TSE MRI scans from a cohort of 46 subjects (average age: 40.73 ± 11.3 years, equal M/F ratio), acquired between 2011 and 2015, were analyzed (approval was obtained from the Institutional Ethics Committee, Azienda Ospedaliero Universitaria (A.O.U.), Cagliari, Italy).

Imaging examinations were performed using a 1.5-T superconducting magnet (Philips, Best, The Netherlands) with a head coil according to a standardized protocol. In each subject, the conventional diffusion-weighted imaging (DWI) was performed with single-shot spin-echo with 2 diffusion-sensitivity values of 0 and 1000 s/mm² along the transverse axis. As part of our general brain protocol, axial and sagittal 2D FLAIR images (10000/140/2200 ms for TR/TE/TI; matrix: 512 × 512; FOV: 240 × 240 mm²; section thickness: 5 mm) were acquired. In addition to FLAIR and DWI sequences, axial spin-echo T1-weighted images (500–600/15/2 for TR/TE/excitations) and fast spin-echo T2-weighted images (2200–3200/80–120/1,2 for TR/TE/excitations; turbo factor, 2) were also obtained with the same section thickness.

3.1.1 Data augmentation

The initial MRI data were manually classified by our radiological team and then prepared for further processing. Because the cohort consisted of 37 controls and 9 WD patients, the two classes had an unequal number of images. As each patient MRI study had 12–13 slices, this resulted in 458 control images and 115 WD images. To handle the unbalanced dataset, the augmentation protocol using the Python “Augmentor” API was applied to the WD class, resulting in 343 more WD images. Because a deep CNN (DCNN) needs a large number of images for proper training and performance, we increased the number of images from 458 by 2×, 3×, 4×, and 5× folds in both classes, and the system was then trained and tested to find which augmented set yields optimal results. To avoid unrealistic brain MRI scans during augmentation, we followed the accepted protocol of rotating each image randomly by −10° to 10°. This excluded transformations such as horizontal or vertical flipping or rotation by larger angles.
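The class-balancing and fold arithmetic described above can be sketched as follows. This is a minimal illustration in plain Python and not the authors' Augmentor pipeline: the function names are hypothetical, no images are actually rotated, and only the image counts, the fold multipliers, and the ±10° rotation range are taken from the text.

```python
import random

CONTROL_IMAGES = 458    # control images reported in the text
WD_IMAGES = 115         # WD images reported in the text
MAX_ROTATION_DEG = 10   # rotations are restricted to [-10, 10] degrees


def images_needed_to_balance(majority, minority):
    """Number of augmented minority-class images needed to match the majority class."""
    return majority - minority


def images_per_class(base, folds):
    """Per-class image count for each augmentation fold (1x is the balanced original)."""
    return {fold: base * fold for fold in folds}


def sample_rotation_angles(n, max_deg=MAX_ROTATION_DEG, seed=0):
    """Draw n random rotation angles (in degrees) inside the allowed range."""
    rng = random.Random(seed)
    return [rng.uniform(-max_deg, max_deg) for _ in range(n)]


extra_wd = images_needed_to_balance(CONTROL_IMAGES, WD_IMAGES)
fold_sizes = images_per_class(CONTROL_IMAGES, folds=(1, 2, 3, 4, 5))
angles = sample_rotation_angles(extra_wd)

print("extra WD images:", extra_wd)
print("images per class by fold:", fold_sizes)
```

Each sampled angle would be applied to one source WD image by whatever image library performs the rotation; that step is left out here on purpose.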

3.1.2 Preprocessing: skull and background removal

Preprocessing is an essential component of the classification process. It helps extract the region-of-interest (ROI) from the MRI images. There are two important steps: (i) removal of the skull region and (ii) removal of the black background to prepare the segmented ROI. As well-accepted, published packages are available, we used BrainSuite [36] combined with volBrain [37] to segment the brain and remove the background. BrainSuite was used to read the DICOM images, convert them to nii files (the nii file type is primarily associated with the NIfTI-1 data format of the Neuroimaging Informatics Technology Initiative), and obtain the grayscale images of the brain with the skull. volBrain was used to create a mask of the brain, which can be used to remove the skull from the original MRI grayscale images. These were then further segmented and morphologically cleaned to remove the background. A sample pair of images from a patient with WD and a control is shown in Fig. 1. The WD segmented brain images had brighter regions (higher WMH) in the convoluted zones of the brain (shown inside the yellow dotted rectangles in Fig. 1) compared to control images.

Fig. 1 Control (left) and WD (right). Both images show the skull and background removed

3.2 Local architecture: deep CNN configurations

Our group has developed several CNN architectures covering a wide range of applications, namely, radiological imaging [38], stroke [19, 21, 39,40,41], liver [42], and cancer [43]. We have extended this work to Wilson disease for the first time, and this is the first study of its kind to use deep learning. The DCNN architecture used is shown in Fig. 2. It is composed of three convolution layers, each followed by a max-pooling layer, for a total of six layers. A flattening layer after these six layers converts the 2D signal to 1D. It is followed by a hidden dense layer consisting of 128 nodes. As usual, the final output is a softmax layer with two outputs corresponding to WD or control. This design with fewer layers was chosen because the number of classes was only two and the DCNN was able to reach the desired accuracy. Thus, the current configuration needs less storage space and inference time than pre-trained CNN models. We adopted the ReLU function for the convolution and dense layers since it helps with faster convergence than the sigmoid or tanh activation functions [44]. Because the DCNN was trained with augmentation, we consider several layer options corresponding to different DCNN configurations, shown in Table 1. It lists 5 types of DCNN combinations consisting of different convolution, max-pooling, and dense layers. Thus, across all experiments, it is necessary to perform a 3D optimization between the accuracy, the CNN layers, and the folds of augmentation. The block diagram of the DL-based classification and characterization system is shown in Fig. 3. As seen in Fig. 3, the MRI scans are preprocessed and split into training and testing sets. Training images are used along with gold-standard labels to train the deep learning model, generating the training weights. These weights are then applied to the test patients to predict their class labels. The Bispectrum, the DL model’s mean feature strength, and the histogram were used in the characterization process to yield mean feature strength (MFS) and Bispectrum (B) values.

Fig. 2 A 3D view of DCNN used for training and testing of the WD/control dataset

Table 1 Five types of cDCNN models consisting of different CNN layers
Fig. 3 Block diagram of DL-based classification and characterization system

The conventional ReLU is defined as σ = max(0, x), where σ is the activation value and x is the input to the ReLU function. This equation was modified to σ = (max(0, x))^1.00001. Note that the derivative of this modified function is 0 at x = 0, whereas the conventional ReLU is not differentiable at the origin. Since loss minimization in the DCNN uses gradient descent, which needs the derivatives of various variables, it is necessary to make ReLU differentiable at x = 0. The equation for the loss is given as formula 13 in the Appendix. In the improved deep CNN (iDCNN), this change of activation function was implemented for all the convolution layers and the dense layer for better performance.
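The two activation functions can be compared with a short sketch. It is plain Python for illustration only, with hypothetical function names, and it is not the authors' network code. The gradient of the modified activation is written analytically as p·x^(p−1) for x > 0 and 0 otherwise, which gives a well-defined value of 0 at the origin while staying very close to 1 for small positive inputs.

```python
POWER = 1.00001  # exponent quoted in the text


def relu(x):
    """Conventional ReLU: max(0, x)."""
    return max(0.0, x)


def modified_relu(x, power=POWER):
    """Modified activation from the text: (max(0, x)) ** 1.00001."""
    return max(0.0, x) ** power


def modified_relu_grad(x, power=POWER):
    """Analytic derivative: power * x**(power - 1) for x > 0, and 0 for x <= 0."""
    if x <= 0.0:
        return 0.0
    return power * x ** (power - 1.0)


for x in (-1.0, 0.0, 1e-6, 1.0, 2.0):
    print(x, relu(x), modified_relu(x), modified_relu_grad(x))
```

For inputs well above zero the two activations differ only in the fifth decimal place, so the forward pass is almost unchanged.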

3.2.1 Transfer learning

As part of the overall CADx system, we benchmark our DL systems against the “transfer learning–based Inception V3” [24] pre-trained CNN and four machine learning paradigms, namely k-NN, DT, SVM, and RF. Inception V3 is a 42-layered deep model consisting of 11 inception modules (each comprising multiple convolution layers and max-pooling filters), followed by three fully connected layers and a softmax activation layer. It was originally designed for the 1000-class ImageNet dataset used in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC); for this study, the model was customized to a two-class problem and trained further after loading the weights pre-trained on ImageNet. Inception V3 was designed to reduce the overall number of parameters and thereby the network size and inference time. The reduction in parameters is achieved by factorizing convolutions. For example, a 5 × 5 filter convolution can be replaced by two 3 × 3 filter convolutions. The number of parameters then reduces from 5 × 5 = 25 to 3 × 3 + 3 × 3 = 18, a 28% reduction. With fewer parameters, the model is less prone to overfitting, which can also increase accuracy.
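The parameter arithmetic for the factorized convolution can be checked with a few lines of Python. This is a minimal sketch that counts only the spatial weights of a single-channel filter, as in the example above; the helper name is hypothetical, and biases and channel counts are ignored.

```python
def spatial_weights(kernel_sizes):
    """Total number of spatial weights for a stack of square k x k filters."""
    return sum(k * k for k in kernel_sizes)


single_5x5 = spatial_weights([5])       # one 5 x 5 filter
stacked_3x3 = spatial_weights([3, 3])   # two stacked 3 x 3 filters
reduction_percent = 100.0 * (single_5x5 - stacked_3x3) / single_5x5

print(single_5x5, stacked_3x3, reduction_percent)
```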

3.2.2 Machine learning

Our group has been very active in machine learning (ML) in several tissue characterization and classification applications, namely, diabetes [45], plaque [46,47,48,49,50], thyroid cancer [51, 52], ovarian cancer [53, 54], prostate cancer [23], liver cancer [42, 55, 56], lung cancer [57], skin cancer [58], bladder cancer [59], heart [60], cardiovascular disease risk [61,62,63,64], coronary artery disease [65], stroke [50, 66, 67], arrhythmia [68], and gene expression characterization [69]. We adapted a similar paradigm in our current setting for the Wilson disease application. Different feature sets were used, consisting of Haralick, Hu moments, and LBP feature extraction frameworks. In Table 2, ML4 (random forest) shows the highest accuracy, corresponding to feature combination FC3, which consisted of Haralick features and Hu moments. A brief description of these features is given below:

  • Haralick features: These are based on the texture of the image and are generated from the Gray Level Co-occurrence Matrix (GLCM) using the energy, entropy, or homogeneity of its elements.

  • Hu moments: These are features of the object in the image and are generated using central moments.

  • LBP features: LBP (Local Binary Pattern) is also a powerful texture-based feature calculated by comparing a pixel with 8 neighboring pixels.
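The LBP comparison described in the last bullet can be sketched as below. This is an illustrative basic 3 × 3 LBP in plain Python, not the feature extractor used in the study; the neighbour ordering, the `>=` thresholding rule, and the sample patch are assumptions made for the example.

```python
# Clockwise neighbour offsets (row, col), starting at the top-left pixel.
# The ordering is a convention chosen for this sketch.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                     (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, row, col):
    """Basic LBP code of the pixel at (row, col) from its 8 neighbours."""
    center = image[row][col]
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBOUR_OFFSETS):
        # A neighbour at least as bright as the centre sets its bit.
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


# Hypothetical 3 x 3 grayscale patch.
patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]

print(lbp_code(patch, 1, 1))
```

A histogram of these codes over all interior pixels is what is typically used as the texture feature vector.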

Table 2 Combination of feature types for four types of ML systems (bold cell indicates maximum accuracy obtained with FC3 feature combination and random forest ML technique)

The equations for energy, entropy, and homogeneity used in Haralick features are given in Eqs. 1–3.

$$ \mathrm{Energy}={\sum}_i\ {\sum}_j{P}_d{\left(i,j\right)}^2 $$
(1)
$$ \mathrm{Entropy}=-{\sum}_i\ {\sum}_j{P}_d\left(i,j\right)\ \log \left({P}_d\left(i,j\right)\right) $$
(2)
$$ \mathrm{Homogeneity}={\sum}_i\ {\sum}_j\frac{1}{1+{\left(i-j\right)}^2}\ {P}_d\left(i,j\right) $$
(3)

where Pd(i, j) is the (i, j)th element of the normalized GLCM computed for displacement d.
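The three Haralick measures can be computed from a small co-occurrence matrix as sketched below. This is a plain-Python illustration with hypothetical helper names, not the study's feature pipeline. It assumes a single horizontal displacement d = (0, 1), a normalized GLCM, the natural logarithm, and the standard (i − j) difference in the homogeneity weight.

```python
import math


def glcm(image, levels, dr=0, dc=1):
    """Normalized gray-level co-occurrence matrix for displacement (dr, dc)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(sum(row) for row in counts)
    return [[v / total for v in row] for row in counts]


def energy(p):
    return sum(v * v for row in p for v in row)


def entropy(p):
    # Zero entries contribute nothing (0 * log 0 is taken as 0).
    return -sum(v * math.log(v) for row in p for v in row if v > 0)


def homogeneity(p):
    return sum(v / (1 + (i - j) ** 2)
               for i, row in enumerate(p) for j, v in enumerate(row))


# Hypothetical 2-level image: top row all 0, bottom row all 1.
tiny = [[0, 0],
        [1, 1]]
p = glcm(tiny, levels=2)
print(energy(p), entropy(p), homogeneity(p))
```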

The central moments from which the Hu moments are derived are given by Eq. 4.

$$ {\mu}_{pq}=\sum \limits_x\ \sum \limits_y{\left(x-\overline{x}\right)}^p{\left(y-\overline{y}\right)}^qf\left(x,y\right) $$
(4)

where μpq are the central moments; x and y are pixel coordinates; x̄ and ȳ are the coordinates of the image centroid; and f(x, y) is the pixel intensity at (x, y). Here p = 0, 1, 2, 3 and q = 0, 1, 2, 3.
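Eq. 4 can be evaluated directly on a small image, as in the following sketch. It is plain Python with a hypothetical function name; x is taken as the column index and y as the row index, which is an assumption of this example, and the centroid is computed from the raw first-order moments.

```python
def central_moment(image, p, q):
    """Central moment mu_pq of a 2D intensity image (x = column, y = row)."""
    m00 = sum(sum(row) for row in image)
    x_bar = sum(x * v for row in image for x, v in enumerate(row)) / m00
    y_bar = sum(y * v for y, row in enumerate(image) for v in row) / m00
    return sum((x - x_bar) ** p * (y - y_bar) ** q * v
               for y, row in enumerate(image)
               for x, v in enumerate(row))


# Hypothetical plus-shaped binary object centred in a 3 x 3 image.
plus = [[0, 1, 0],
        [1, 1, 1],
        [0, 1, 0]]

for p_, q_ in [(0, 0), (1, 0), (0, 1), (2, 0), (0, 2), (1, 1)]:
    print((p_, q_), central_moment(plus, p_, q_))
```

The seven Hu invariants are then built from normalized combinations of these moments up to order three.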

The block diagram of the ML-based classification system is shown in Fig. 4. As seen in Fig. 4, the preprocessing block processes the acquired MRI scans to yield the segmented brain region. This was implemented using the BrainSuite and volBrain software, which give a clean mask for segmenting the grayscale images. The hand-crafted features were extracted using a combination of Haralick, Hu moments, and LBP feature–based methods. ML-based models (k-NN, DT, SVM, or RF) already trained on labeled segmented images are then used in the prediction process to classify the test MRI input. The final output is a binary class label, either WD or control.

Fig. 4 Block diagram of the ML-based classification system

3.3 Performance evaluation protocol

To evaluate the model performance of the DCNN systems, different tests were adopted such as (a) Wilson Disease Segregation Index (WDSI) to estimate the feature strength of different AI techniques between two classes; (b) diagnostics odds ratio (DOR) for DCNN and ML methods; (c) power analysis to find the optimum dataset size; (d) timing analysis of supercomputer vs. local computer; (e) for best operating point characteristics, optimization of the CNN model with a percentage of the training datasets; and (f) finally, the validation of DL models against the well-accepted and published biometric facial dataset.

3.3.1 Wilson Disease Segregation Index of WD against control

We compute the WDSI, which is an indicator for the class separation between the controls and Wilson disease, expressed as percentage, and is mathematically given by Eq. 5:

$$ \mathrm{WDSI}=\left(\frac{\mid {\mu}_{WD}-{\mu}_C\mid }{\mu_C}\right)\times 100 $$
(5)

where μC is the mean feature strength of control class and μWD is the mean feature strength of WD class.
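Eq. 5 translates into a one-line function. The sketch below is illustrative only; the function name and the two mean feature strengths passed to it are hypothetical values chosen to make the arithmetic easy to follow, not results from the study.

```python
def wdsi(mean_wd, mean_control):
    """Wilson Disease Segregation Index (percent), following Eq. 5."""
    return abs(mean_wd - mean_control) / mean_control * 100.0


# Hypothetical mean feature strengths for the two classes.
example_index = wdsi(mean_wd=120.0, mean_control=100.0)
print(example_index)
```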

4 Results

This section primarily demonstrates three optimization experiments on DCNN and four comparative experiments between DCNN and ML systems. The first three experiments show the 3D optimization of DCNN layers during the augmentation process (DCNN9*), the effect of training on DCNN performance, and the optimal sample size selection for the DCNN to be generalized. The second batch is focused on benchmarking of the DL systems against four ML systems, the AUCs of DL vs. ML systems, and the segregation index of WD vs. controls.

4.1 Three-dimensional optimization of DCNN layers during augmentation process (DCNN9*)

The objective of this protocol is to find the best combination of CNN layers and augmentation. There are 20 “DCNN + Augm” combinations (5 types of CNN layers and 4 types of augmentations), and each combination is run under a K10 protocol (10 combinations in each K10 protocol), giving a total of 200 different runs (or jobs). We therefore take advantage of supercomputer power to run the five types of DCNN designs over the four kinds of augmentations. Since DCNN accuracy varies with the number of hidden convolution layers, it is vital to carry out this optimization run (see Table 1). The results can be seen in the 3D surface plot in Fig. 5 and the 3D bar graphs in Fig. 14 in the Appendix, and the corresponding values for cDCNN are shown in Table 3. As seen in the 3D surface plot, with an increase in the CNN layers (down the rows R1 to R5), the accuracy increases and then gradually falls. Similarly, with an increase in augmentation, the accuracy increases initially (from C1 to C3) and then falls. The best CNN layer-augmentation combination was DCNN9*-Augm4*. All subsequent experiments were conducted at this combination point, DCNN9*-Augm4*, or “DL9A4” for short. The equation for accuracy is given as formula 8 in the Appendix, and the equation for standard deviation is given as formula 14 in the Appendix.

Fig. 5 Three-dimensional optimization for best DCNN layers. (a) cDCNN showing optimization point at cDCNN = 9 layers and (b) iDCNN showing optimization point at iDCNN = 9 layers

Table 3 Accuracy of different cDCNN layers vs. augmentation (bold cell indicates that maximum accuracy obtained with 4× fold augmentation using 9 layered cDCNN)

4.2 Effect of training on DCNN performance using “DL9A4”

K-fold cross-validation protocols were executed on the DCNN9*-Augm4* combination dataset. Different train and test splits (K2, K3, K4, K5, K10, and TT) were used, as required.
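How the K-fold protocols change the amount of training data can be sketched as follows. This is a generic, unshuffled K-fold splitter in plain Python with hypothetical function names, shown for illustration only; the study's actual splitting code and any shuffling or stratification it used are not described in the text.

```python
def training_percent(k):
    """Share of the data used for training in each round of a K-fold protocol."""
    return 100.0 * (k - 1) / k


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous, non-overlapping test folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


for k in (2, 3, 4, 5, 10):
    print(f"K{k}: {training_percent(k):.1f}% training data per round")

folds = list(k_fold_indices(n_samples=10, k=5))
```

Under this reading, K2 trains on 50% of the data and K10 on 90%, which is why accuracy is reported as a function of increasing K.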

Table 4 shows the effect of training data size (with increasing K-fold) on the three types of DCNN systems. For convenience, we have added the transfer learning system (tDCNN) here as well. The comparisons of cDCNN and iDCNN results are given in Fig. 15 in the Appendix and in Fig. 6.

Table 4 Results on the “effect of training data size” using DCNN9*-Augm4* (DL9A4) combination
Fig. 6 Comparison of cDCNN9* and iDCNN9* for different augmentation size

As seen, accuracy slowly rises from K2 to K10 and is best for TT (training is the same as testing) protocol, which was used for validation. Note the order of performance was iDCNN > cDCNN > tDCNN.

4.3 Optimal sample selection for generalization of the DL system

The DL9A4 system was tested with different percentages of training data, and the K10 accuracy was computed for each data size, as shown in Fig. 7. The curve shows that accuracy increases until the point of inflection, which is at 60% of the training dataset. This shows that the system is able to generalize once 60% of the dataset is used.

Fig. 7 Generalization of the cDCNN9* using K10 protocol

The comparison of cDCNN and iDCNN performance for different percent of training data is shown in Fig. 8.

Fig. 8 Comparison of cDCNN9* and iDCNN9* for different percent of training data

4.4 Benchmarking of three DL systems against four ML systems

The benchmarking was conducted for DCNN9*-Augm4* (DL9A4) combination against the transfer learning–based Inception V3 pre-trained model (tDCNN) and four types of ML systems (k-NN, DT, SVM, and RF) using K10 protocol as discussed in Section 3. The comparative results of benchmarking can be seen in Table 5.

Table 5 Comparison of 7 AI systems for WD classification (in the increasing order of AUC) (bold cell indicates the maximum AUC value obtained with iDCNN)

The best performance was achieved for iDCNN9* using a modified ReLU function.

4.5 Receiver operating characteristics curves (DL vs. ML)

The receiver operating characteristics (ROC) curve is a plot of the true-positive rate (TPR, y-axis) against the false-positive rate (FPR, x-axis). The AUC validates our hypothesis. The ROC curves for the four ML classifiers and three DCNNs are given in Fig. 9, while the AUC values are shown in Table 5 (column C2). The equations for TPR and FPR are given as formulas 11 and 12 in the Appendix.
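The construction of an ROC curve and its AUC can be sketched in a few lines. This is a generic illustration in plain Python, with hypothetical function names and made-up labels and scores; it sweeps a threshold over the classifier scores and integrates with the trapezoidal rule, and it is not the evaluation code used for Table 5.

```python
def roc_points(labels, scores):
    """(FPR, TPR) points obtained by sweeping the decision threshold downward."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical labels (1 = WD, 0 = control) and classifier scores.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6]
curve = roc_points(labels, scores)
print(curve, auc(curve))
```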

Fig. 9 ROC curves for 7 AI systems: 3 DCNN systems (cDCNN, iDCNN, tDCNN) and 4 ML (k-NN, DT, SVM, RF). The AUC values are shown from lowest to highest, where iDCNN does the best

4.6 Wilson Disease Segregation Index

The Wilson Disease Segregation Index (WDSI) was calculated using the mean feature strengths of the DCNN and ML features. A larger WDSI value indicates larger segregation between WD and controls, which supports the ability of the AI method, as seen in Table 6. The order of the WDSI is iDCNN > cDCNN > tDCNN > ML.

Table 6 WDSI between DCNN and ML systems

5 WD characterization

Characterization [12, 19, 23, 70] is vital for validating our hypothesis that WD has a higher WMH than controls and that this accounts for the increase in feature strength in the hidden layers of the DCNN. We thus evaluate the FMS for the layers and compare its value for WD against controls.

5.1 Hypothesis validation 1: mean feature map strength using DCNN9*-Augm4* (DL9A4)

Feature map strength (FMS) is the mean of activation values over all images in a class. FMS for a trained DCNN9*-Augm4* model at 8 hidden layers is shown in Fig. 10a.

Fig. 10 (a) Comparison of FMS values for control (green color) and WD class (red color). (b) Comparison of Bispectrum (B) values for control (green color) and WD class (red color)

As seen from Fig. 10a, the FMS values at the layer outputs are consistently higher for the WD class. The mean FMS values for the control and WD classes are 500.14 ± 46.09 and 529.68 ± 47.23, respectively, showing an increase of 5.77% (C.I. 4.5463 to 54.5426, p value < 0.0001). This supports our first hypothesis that the WMH of the WD class is higher than that of the controls.

5.2 Camel hunch phenomenon

The WMH could be better visualized by understanding the histogram distribution of the brain region. The histogram is computed by considering the bin size of 4, leading to 64 bins (256 values, divided by 4). This is repeated for both classes. As seen in Fig. 11c, d, the histograms show a camel hunch-like shape in WD class from 25th to 35th bin corresponding to intensity range 100–140. This phenomenon occurs due to regions of WMH [5,6,7] in WD MRI scans, mainly at the convoluted edge of the folds of the brain.
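The binning described above can be written out as a short sketch. It is plain Python with hypothetical names and a made-up pixel list; bins are indexed from 0, which is an assumption of this example, so that bin 25 starts at intensity 100 and bin 35 starts at intensity 140.

```python
BIN_WIDTH = 4
N_BINS = 256 // BIN_WIDTH  # 64 bins for 8-bit intensities


def histogram(pixels):
    """Count 8-bit pixel intensities into bins of width BIN_WIDTH."""
    counts = [0] * N_BINS
    for value in pixels:
        counts[value // BIN_WIDTH] += 1
    return counts


def bin_range(bin_index):
    """Inclusive intensity range covered by a (0-based) bin index."""
    low = bin_index * BIN_WIDTH
    return low, low + BIN_WIDTH - 1


# Hypothetical pixel intensities from a segmented brain slice.
sample_pixels = [0, 3, 4, 100, 140, 255]
counts = histogram(sample_pixels)
print(N_BINS, bin_range(25), bin_range(35))
```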

Fig. 11 Pixels in range 100–140 marked red in (a) control and (b) WD. (c) and (d) show the comparison of the histogram for control and WD class. The camel hunch is seen in the WD class representing the WMH

5.3 Hypothesis validation 2: bispectrum strength computation

Bispectrum (B) falls in the category of higher order spectra (HOS) [23]. To calculate HOS, the Radon transform of the images was calculated at angles from 0° to 180° in steps of 15°. Here, the Radon transform was applied to the images in which pixels of the MRI scans in the range 100–140 are segregated (shown in red in Fig. 11a, b). On calculating mean B-values over all images of the two classes, the B-value for WD was found to be consistently higher than for control (see Fig. 10b), with a mean of 20.87 for WD and 13.47 for the control class, a rise of 54.71% (C.I. 5.5490 to 9.2526, p value < 0.0001). Figure 12 shows the 2D representation of the Bispectrum strength for WD against the control, and Fig. 13 shows the higher Bispectrum strength for WD in its 3D plot. The Bispectrum is given by \( B\left({f}_1,{f}_2\right)=E\left[\mathcal{F}\left({f}_1\right)\times \mathcal{F}\left({f}_2\right)\times {\mathcal{F}}^{\ast}\left({f}_1+{f}_2\right)\right] \), where B is the Bispectrum value, \( \mathcal{F} \) is the Fourier transform, \( {\mathcal{F}}^{\ast } \) is its complex conjugate, and E is the expectation operator. The region Ω of computation of the bispectrum and bispectral features of a real signal is uniquely given by the triangle \( 0\le {f}_2\le {f}_1\le {f}_1+{f}_2\le 1 \).
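A single-record bispectrum estimate on a 1D signal can be sketched as below. This is an illustration in plain Python with hypothetical names, not the HOS pipeline used in the study: the Radon transform itself is not implemented, the test signal merely stands in for one Radon projection, the expectation is replaced by a single record, and the standard definition with the complex conjugate on the third term is assumed.

```python
import cmath
import math

# Projection angles named in the text: 0 to 180 degrees in steps of 15.
RADON_ANGLES_DEG = list(range(0, 181, 15))


def dft(signal):
    """Discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal))
            for k in range(n)]


def bispectrum(signal, k1, k2):
    """Single-record estimate B(k1, k2) = X(k1) * X(k2) * conj(X(k1 + k2))."""
    spectrum = dft(signal)
    n = len(signal)
    return spectrum[k1] * spectrum[k2] * spectrum[(k1 + k2) % n].conjugate()


# Hypothetical stand-in for one projection: three phase-coupled cosines
# at frequency bins 1, 2, and 3 (= 1 + 2).
N = 16
projection = [math.cos(2 * math.pi * 1 * t / N)
              + math.cos(2 * math.pi * 2 * t / N)
              + math.cos(2 * math.pi * 3 * t / N)
              for t in range(N)]

coupled = bispectrum(projection, 1, 2)
uncoupled = bispectrum(projection, 1, 4)
print(abs(coupled), abs(uncoupled))
```

The coupled frequency pair gives a large magnitude while the uncoupled pair is numerically zero, which is the property that makes the bispectrum sensitive to nonlinear structure.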

Fig. 12 Bispectrum 2D plot of (a) control and (b) WD. (White arrows show higher strengths of the B-values in the WD class in 2D Bispectrum plots)

Fig. 13 Bispectrum 3D plot of (a) control class and (b) WD class. (Black arrows show more B-value for the WD class indicated by higher peaks in 3D Bispectrum plots)

6 Performance evaluation

6.1 Diagnostic odds ratio

The diagnostic odds ratio (DOR) is used to discriminate subjects with a target disorder from subjects without it. DOR is calculated according to Eq. 6 [71]. DOR can take any value from 0 to infinity. A higher value means better test performance. A value of 1 means the test gives no information about the disease, and a value less than 1 means the test points in the wrong direction and predicts the opposite outcomes.

$$ \mathrm{DOR}=\frac{\mathrm{TP}/\mathrm{FN}}{\mathrm{FP}/\mathrm{TN}}=\frac{\mathrm{sens}/\left(1-\mathrm{sens}\right)}{\left(1-\mathrm{spec}\right)/\mathrm{spec}} $$
(6)

where TP, FP, TN, and FN represent true positive, false positive, true negative, and false negative, respectively. Sens and spec stand for the sensitivity and specificity, respectively; their equations are given as formulas 9 and 10 in the Appendix. The DOR values for all ML and DL methods are shown in Table 7.
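Both forms of Eq. 6 can be checked against each other with a short sketch. The function names and the confusion-matrix counts below are hypothetical and chosen only to illustrate the arithmetic. Note that the ratio is undefined when FN or FP is zero, a case this sketch does not handle.

```python
def dor_from_counts(tp, fp, tn, fn):
    """Diagnostic odds ratio from confusion-matrix counts: (TP/FN) / (FP/TN)."""
    return (tp / fn) / (fp / tn)


def dor_from_rates(sens, spec):
    """Diagnostic odds ratio from sensitivity and specificity."""
    return (sens / (1.0 - sens)) / ((1.0 - spec) / spec)


# Hypothetical counts: 90 TP, 10 FN, 20 FP, 80 TN.
tp, fn, fp, tn = 90, 10, 20, 80
sens = tp / (tp + fn)   # 0.9
spec = tn / (tn + fp)   # 0.8

print(dor_from_counts(tp, fp, tn, fn), dor_from_rates(sens, spec))
```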

Table 7 Sensitivity and specificity with increasing order of DOR for the 7 AI systems (the bold rows indicate the maximum values using cDCNN and iDCNN)

6.2 Power analysis

The sample size was calculated according to Eq. 7 [72] using the mean difference between K10 mean accuracy of DCNN9* and DCNN11 while keeping the augmentation 4× (see Table 3, cell number (C3, R3), and cell number (C3, R4)).

$$ \mathrm{Sample}\ \mathrm{Size}=\frac{2\times {\left({Z}_{\alpha }+{Z}_{1-\beta}\right)}^2\times {\sigma}^2}{\varDelta^2} $$
(7)

Here, Zα = 3.2905 for a type I error of 1%, and Z1-β = 1.6449 for a type II error of 1%. The standard deviation is σ = 2.53 and the mean difference is Δ = 0.627. Substituting these values in Eq. 7 gives a required sample size of 793. The database we adopted has 1832 samples for WD or controls using 4× augmentation. Thus, our database is 2.31 times the required size, exceeding it by 1039 samples.
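The arithmetic behind Eq. 7 can be checked with a short sketch. The function name `required_sample_size` is hypothetical, and the Z values, σ, Δ, and the database size are used exactly as quoted in the text.

```python
def required_sample_size(z_alpha, z_beta, sigma, delta):
    """Eq. 7: n = 2 * (Z_alpha + Z_{1-beta})^2 * sigma^2 / delta^2."""
    if delta == 0:
        raise ValueError("delta (mean difference) must be non-zero")
    return 2.0 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2


# Values as quoted in the text.
n_required = required_sample_size(z_alpha=3.2905, z_beta=1.6449,
                                  sigma=2.53, delta=0.627)
n_available = 1832

n_rounded = round(n_required)        # the text reports the rounded value
ratio = n_available / n_rounded      # how many times larger the database is
surplus = n_available - n_rounded    # samples above the requirement
```

The unrounded result is slightly above 793, so a conservative convention that rounds up would require one more sample; the conclusion that the database comfortably exceeds the requirement is unchanged.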

6.3 Timing analysis

A supercomputer was used to train the DCNN for optimal performance of the CADx system. We therefore calculated the gain as the ratio of the time taken by the local computer (LC) (an HP Desktop 2010) to the time taken by the supercomputer (SC) (an NVIDIA system). The gain values are shown in Table 8.

Table 8 Timing analysis and gains in time for the supercomputer against the local computer

As seen, a local computer using a CPU takes around 7–9 times longer than the supercomputer. Thus, a job that takes 72 h (3 days) on a local computer takes only 8 h on a supercomputer. The C.I. of the timings is 113.2824 to 215.3842, with a p value of 0.0052.

6.4 Reliability analysis

Reliability is calculated using the formula \( \mathrm{Reliability}\ \left(\%\right)=\left(1-\frac{\sigma_{\mathrm{N}}}{\mu_{\mathrm{N}}}\right)\times 100 \), where μN and σN are the mean and standard deviation of the classification accuracy. The variation of reliability with data size is given in Table 9. For the system to be reliable and stable, it must meet three criteria [70, 73]: (i) if the reliability of the DCNN is > 95%, the system is reliable; (ii) if the SD is < 5%, the DCNN is considered stable; (iii) furthermore, if the variation in accuracy is not more than 5%, the system is considered stable. In our case, all three criteria are met. For data size above 20% (row R3), reliability (column C3) is above 95%, SD is < 5% (row R3, column C4), and the variation in accuracy is < 5% for rows R4 to R10. We therefore conclude that our DCNN system is stable and reliable.
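The reliability criterion can be sketched as follows, assuming reliability is defined as one minus the ratio of the standard deviation to the mean of the accuracy (the coefficient of variation), expressed as a percentage. The function names are hypothetical, the use of the sample standard deviation is an assumption, and the example uses the best accuracy reported in this paper (98.28 ± 1.55%) rather than any row of Table 9.

```python
import statistics


def reliability_percent(mean_acc, sd_acc):
    """Reliability (%) = (1 - SD / mean) * 100, with accuracy given in percent."""
    if mean_acc <= 0:
        raise ValueError("mean accuracy must be positive")
    return (1.0 - sd_acc / mean_acc) * 100.0


def reliability_from_runs(accuracies):
    """Reliability from per-run accuracies (sample SD is an assumption)."""
    return reliability_percent(statistics.mean(accuracies),
                               statistics.stdev(accuracies))


def is_reliable_and_stable(mean_acc, sd_acc, min_reliability=95.0, max_sd=5.0):
    """Criteria (i) and (ii): reliability > 95% and SD < 5%."""
    return reliability_percent(mean_acc, sd_acc) > min_reliability and sd_acc < max_sd


# Best accuracy reported in this paper: 98.28 +/- 1.55 %.
r_best = reliability_percent(98.28, 1.55)
ok_best = is_reliable_and_stable(98.28, 1.55)

# Hypothetical per-run accuracies, for illustration only.
r_runs = reliability_from_runs([97.0, 98.0, 99.0])
```

Criterion (iii), the variation in accuracy across data sizes, depends on the per-row values in Table 9 and is therefore not reproduced in this sketch.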

Table 9 Reliability analysis for different percent of training data

7 Discussion

The objective of this study was to classify images from patients with WD against controls in unbalanced and weak brain MRI training datasets. The system design consisted of 3D optimization of the best DCNN model under the best augmentation conditions. Our optimization uses the best combination of DCNN9*-Augm4*. The design of iDCNN was comparable but superior to cDCNN. We also showed that iDCNN outperformed tDCNN by 11.92% and four types of “conventional machine learning–based systems,” namely k-NN, decision tree, support vector machine, and random forest, by 55.13%, 28.36%, 15.35%, and 14.11%, respectively. The performance of the DCNN system was evaluated using DOR and WDSI, and both showed consistent results. We also showed the effect of training data size on the system accuracy at this optimal point. The hypothesis was validated using two novel strategies for WD characterization: FMS and Bispectrum analysis.

7.1 Benchmarking

We benchmarked our DCNN systems against existing systems as shown in Table 10. Our literature review found no prior classification work on WD. The benchmarking table compares the previous studies with the current proposed study. Overall, this was the first paper using state-of-the-art technology to optimize several AI methods in the diagnosis of WD. The recent papers used for benchmarking are listed by author and year in the first column, C1. Column C2 gives the type of brain disease, such as Alzheimer (ALZ), brain tumor (BT), and mild cognitive impairment (MCI). Column C3 gives the techniques used, such as SVM, CNN, and DBM. Column C4 gives the imaging modality, such as spatial MRI or functional MRI (fMRI). Column C5 states whether the authors used an ML- or DL-based AI technique. Columns C6 and C7 give the accuracy and AUC (p value) of the referred papers. The majority of the studies use ML or DL models. The accuracy of the existing systems has an average value of 87%, while our system showed about a 10% improvement over this average. Rows R10 and R11 show the proposed cDCNN and iDCNN methods, which had accuracies of ~ 97.2% and ~ 98.3%, respectively. The AUCs for our proposed methods were 0.98 (p < 0.0001) and 0.99 (p < 0.0001).

Table 10 Benchmarking of our proposed DCNN strategy against the previously published literature (the bold rows indicate that best accuracy/AUC values obtained using cDCNN and iDCNN)

7.2 A short note on WD characterization

Even though we are able to characterize WD using the Bispectrum values and AI models, it is important to note that the Bispectrum values are computed in the pixel zone corresponding to the camel hunch region. This was valid for our datasets, but it needs a wider set of clinical validations. Furthermore, the number of spatial slices considered was 12 per patient on average. In addition, the number of control patients was four times the number of WD patients, which shows a slight imbalance. Thus, a larger pool of data is required to further validate this paradigm.

7.3 A short note on the role of the gold standard for the design of the Wilson disease system

The gold standard plays a crucial role in the design of deep learning systems. It defines binary events such as “high risk vs. low risk,” “benign cancer vs. malignant cancer,” “cardiovascular event vs. no-cardiovascular event,” or “cerebrovascular event vs. no-cerebrovascular event.” Such binary events can be detected well if the deep learning model is trained on them. Recently, our group developed a method for classifying symptomatic patients likely to have a stroke vs. asymptomatic patients not likely to have a stroke [21]. The deep learning solution was very successful when the deep CNN model was trained using the gold standard designed by the neurologist. Such a process is also called characterization of the disease, since the deep learning system is able to use the features of the disease to classify into binary events such as Wilson disease vs. normal. Examples of several kinds of well-defined characterization systems can be seen in the machine learning section. Although characterization is typically used for binary classification, multiclass scenarios can also be developed [61, 79]. These simply provide several levels of risk rather than two.

7.4 Strength, weakness, and extensions

This is the first study of its kind to consider a DCNN architecture for classification and characterization of WD. The architecture was optimized by changing the number of layers of the DCNN and the augmentation protocol. The AI system showed high performance and the results were validated. The WD characterization was conducted using two different models: first, the AI framework, and second, a signal processing framework based on higher order spectra, in which the Bispectrum values were computed. The system showed consistent results and the hypothesis was validated.

Even though the pilot study showed powerful results, the manual segmentation step could be replaced by automated methods [9, 80, 81]. Furthermore, more ML alternatives and more features can be used in the future [59]. More validations need to be conducted in the future, such as cross-modality fusion using registration methods [82, 83]. More neurological model-based techniques can also be designed [84, 85].

The system can be extended to transfer learning–based approaches, which use pre-trained weights and thereby avoid the heavy supercomputer processing time during training [79]. Our pilot study, which showed a set of seven successful AI models, can be taken as a launching pad for multicenter data collection for bigger trials. The scanners used for imaging can also play a role in acquiring the MRI data, just as in other modalities [86].

8 Conclusion

This is the first study of its kind to use an advanced CADx system based on seven kinds of AI combinations to classify images from patients with WD vs. controls. It identified the best possible DCNN architecture and attained the best accuracy of 98.28 ± 1.55%. The three DCNN methods were compared with four ML methods, showing the benefit of deep learning. The study also characterized WD using two hypotheses, showing that the feature map strength and Bispectrum strength are higher for WD because of regions of WMH in MRI scans. A detailed performance evaluation was also implemented using the diagnostic odds ratio, power analysis, supercomputer timing analysis, and generalization analysis of DCNN performance.