The authors would like to thank Naganawa and colleagues for their interest [1]. Their primary concern is that our algorithm detects the type of MR sequence instead of Meniere’s disease (MD) itself. This was hypothesized to be the result of a difference in the distribution of fast spin-echo-based and gradient-echo MR sequences. Indeed, extracted radiomic image features depend on the sequence type and acquisition parameters, and therefore, machine learning techniques are susceptible to such forms of bias [2, 3], especially when population inequalities are present (Fig. 1).

Fig. 1 The distribution of the MR sequences per center in the test cohort for the Meniere’s disease and control group

Our pragmatic trial was retrospective; sampling was based on data availability and reflected clinical practice. As concluded in our article, prospective studies are needed to fully verify our findings and to ensure that no covert bias explains the results. Confounding factors other than imaging parameters exist and could also be relevant, such as disease duration, the clinical setup used to diagnose MD, or the choice of control group. Such factors should be taken into account in the next clinical validation phase. Nevertheless, we aimed to prevent bias in our study design as much as possible, among other things by obtaining a sufficiently large sample and by sampling four centers that had a similar clinical setup in terms of diagnostic procedures for MD and asymmetric hearing loss. All images underwent pre-processing before features were extracted [1,2,3] to minimize the influence of heterogeneities in the multiparametric dataset.

A new post hoc analysis was performed to answer the questions of Naganawa et al. regarding the distribution and accuracy of MR sequences. In total, 55 (21.2%) gradient-echo sequences and 205 (78.8%) fast spin-echo sequences were included in our study. Gradient-echo sequences were only included in centers B and C, where they accounted for 19.3% (n = 11) and 40.7% (n = 44) of the sequences, respectively. The proportion of gradient-echo sequences was 22.5% (n = 27) in the MD group and 20% (n = 28) in the control group. The proportion of gradient-echo sequences was 21.9% (n = 42) in the training group and 19.1% (n = 13) in the test group. The proportion of gradient-echo sequences did not differ between patients and controls, χ²(1, N = 260) = 0.115, p = 0.743, nor between the training and test groups, χ²(1, N = 260) = 0.093, p = 0.760. Within the training cohort, 74 (49%) fast spin-echo sequences belonged to the MD group and 76 (51%) to the control group. The distribution in the test cohort was somewhat unequal, with 19 (35%) fast spin-echo sequences in the MD group and 36 (65%) in the control group. This difference did not reach statistical significance, χ²(1, N = 66) = 0.003, p = 0.955.
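As an illustration of the kind of comparison reported above, the following is a minimal sketch of a chi-square test on a 2 × 2 table of sequence type by group. It is not the code used for the analysis. The use of a Yates continuity correction is an assumption, since the text does not state which variant was applied. The group sizes (120 MD, 140 controls, 192 training, 68 test) are derived here from the reported counts and percentages, and the function name is hypothetical.

```python
import math


def chi2_yates_2x2(a, b, c, d):
    """Chi-square test (1 degree of freedom) for the 2x2 table [[a, b], [c, d]].

    Assumption: a Yates continuity correction is applied.
    Returns the test statistic and the p value.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Yates-corrected statistic: N * (|ad - bc| - N/2)^2 / product of the marginals
    diff = max(abs(a * d - b * c) - n / 2, 0.0)
    stat = n * diff ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: MD, control. Columns: gradient-echo, fast spin-echo.
# Derived counts: 27 of 120 MD and 28 of 140 controls were gradient-echo.
stat_group, p_group = chi2_yates_2x2(27, 93, 28, 112)

# Rows: training, test. Columns: gradient-echo, fast spin-echo.
# Derived counts: 42 of 192 training and 13 of 68 test sequences were gradient-echo.
stat_split, p_split = chi2_yates_2x2(42, 150, 13, 55)

print(f"MD vs. control:    chi2 = {stat_group:.3f}, p = {p_group:.3f}")
print(f"training vs. test: chi2 = {stat_split:.3f}, p = {p_split:.3f}")
```

The closed-form tail via `math.erfc` holds only for one degree of freedom, which is the case for any 2 × 2 table; larger tables would need a general chi-square distribution function.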

Most importantly, we investigated whether the diagnostic accuracy was above chance level for each of the two types of MR sequence. In the training set (n = 192), the accuracy was 76% for fast spin-echo (prevalence MD 49%) and 60% for gradient-echo sequences (prevalence MD 52%). In the test set (n = 68), the accuracy was 84% for fast spin-echo (prevalence MD 35%) and 77% for gradient-echo sequences (prevalence MD 38%). An exact binomial test was employed to determine whether the accuracy was statistically significantly higher than the prior probability (prevalence). In the training set, this was the case for the fast spin-echo sequences (p < 0.0001), but not for the gradient-echo sequences (p = 0.206). In the test set, this was again the case for the fast spin-echo sequences (p = 0.002), but not for the gradient-echo sequences (p = 0.208). This marked finding could indicate that gradient-echo MRI is less suitable for inner ear radiomic evaluation, or that it requires more training and/or more samples.
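The exact binomial test described above can be sketched as follows. This is an illustrative sketch, not the original analysis code, and it rests on several assumptions that the text does not spell out: the test is taken to be one-sided, the chance level is taken to be the proportion of the larger class (for example, 62% when the MD prevalence is 38%), and the numbers of correct predictions (10 of 13 gradient-echo and 46 of 55 fast spin-echo scans in the test set) are reconstructed from the rounded percentages. The function name is hypothetical.

```python
from math import comb


def binom_upper_tail(k, n, p0):
    """Exact one-sided binomial p value: P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0 ** i * (1 - p0) ** (n - i) for i in range(k, n + 1))


# Test set, gradient-echo: assumed 10 of 13 correct (about 77%),
# assumed chance level 0.62 (majority class when MD prevalence is 38%).
p_gradient_echo = binom_upper_tail(10, 13, 0.62)

# Test set, fast spin-echo: assumed 46 of 55 correct (about 84%),
# assumed chance level 0.65 (majority class when MD prevalence is 35%).
p_fast_spin_echo = binom_upper_tail(46, 55, 0.65)

print(f"gradient-echo:  p = {p_gradient_echo:.3f}")
print(f"fast spin-echo: p = {p_fast_spin_echo:.4f}")
```

Using the majority-class proportion as the null value is the stricter choice, because a classifier that always predicts the larger class already reaches that accuracy without using any image information.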

In conclusion, sampling based on data availability did not seem to result in an unbalanced distribution of MR sequences across the patient, control, training, and test cohorts. The accuracy of the radiomics algorithm with only fast spin-echo MR (84%) is similar to that presented in the original manuscript (82%) and is well above chance level (p = 0.002). The MR sequence did matter, as the algorithm seemed to perform worse on gradient-echo MRI (Fig. 2), where its accuracy was not significantly above chance level (p = 0.21).

Fig. 2 The number of labels incorrectly predicted by the model for fast spin-echo and gradient-echo MRIs in the complete dataset (training and test)

Although our study setup is not suitable to fully exclude the possibility, it is unlikely that the proposed classification model distinguishes between imaging types instead of MD vs. control. The results of the cross-validation analysis of our study [1] with various train-test iterations also support this hypothesis. Our study did not assess the effect of pre-processing; however, it might have prevented a large effect of distributional shift introduced by multiparametric images [2, 4].

Prospective and controlled studies with predefined image acquisition protocols are needed to further validate and develop the classification model, allowing for more detailed factor analyses. Another important goal for future studies would be, as noted by Naganawa et al., to compare radiomics results (on conventional MRI) in patients who also received delayed contrast-enhanced MR (hydrops) imaging, which is currently considered the gold standard.