Introduction

Carbonic anhydrase isozyme XII, a newly identified member of the carbonic anhydrase gene family, has been linked to von Hippel-Lindau gene-mediated carcinogenesis (Kivelä et al. 2000). These enzymes participate in buffering the pH of intra- and extracellular spaces by catalyzing the reversible hydration of carbon dioxide to bicarbonate and a proton (Wang et al. 2013). Carbonic anhydrase XII (CA XII) is the most strongly expressed gene in response to hypoxia in human cancer cells (carcinomas: colorectal, breast, lung, etc.) (Hilvo et al. 2008). CA enzymes promote biosynthetic reactions in the body, as well as the production or consumption of carbon dioxide and bicarbonate, according to the following reaction (Yorulmaz and Eroğlu 2020):

$${\mathrm{CO}}_2+\;{\mathrm H}_2\mathrm O\longleftrightarrow\mathrm{HCO}_3^-+\mathrm H^+$$

Acetazolamide, methazolamide, ethoxzolamide, dichlorphenamide, dorzolamide, and brinzolamide are drugs that treat glaucoma by inhibiting the cytosolic isoform CA II. CA XII is considered a target for the treatment of certain cancerous tumors (Jaakkola et al. 2001). Research indicates that loss of CA XII function should be considered in individuals without CFTR mutations who have CF-like features in the sweat glands and lungs (Lee et al. 2016). Recent research has focused on inhibitors of tumor-associated CA XII, and some attempts have been made to synthesize selective inhibitors for use as drugs. CA XII has been validated as a marker for many hypoxic tumors, and its inhibition reduces the growth of primary tumors and metastases (Matysiak et al. 2017). Zinc-binding compounds (depending on their binding mode) are effective for drug design; they can act like CA I inhibitors (Mishra et al. 2020). Many recent QSAR studies have shown a good connection between the ligands and CA XII through the relationship between the activity of the compounds and the descriptors (Eroğlu 2019). More than 1250 molecular descriptors were calculated using established programs. Multiple linear regression equations were developed and validated using validation techniques (Alafeefy et al. 2015). QSAR models were built to explore the correlations between the molecular descriptors calculated on 16 compounds and their experimental inhibitory activities on CA XII (Kumar and Roy 2020). The results revealed the characteristics of mercapto quinazolinone benzene sulfonamide derivatives against hCA XII (Gopinath and Kathiravan 2022). In this work, we develop QSAR models that correlate the CA XII inhibitory activity of a series of pyrazoline derivatives with their molecular descriptors.

Computational methods

The thermodynamic and topological descriptors were calculated using the Chemdraw3D (Mendelsohn 2004) and Chemsketch (Li et al. 2004) software, and the quantum descriptors were calculated using the Gaussian 09 software (Gaussian_09_ReferenceManual.Pdf 2022). The quantum computations were carried out with DFT (Rivero et al. 2015) using the B3LYP functional (Zhang and Xu 2021) and the 6-31G(d) basis set. Principal component analysis (PCA) (Josse et al. 2009) and multiple linear regression (MLR) (Zhou et al. 2017) were performed using ChemOffice 2015 and XLSTAT (Vidal et al. 2020). The relationship between pIC50 for 34 compounds and the descriptors was studied by the MLR statistical technique. The applicability domain of the chosen model was determined using Matlab 2015 software. The target enzyme is located outside the cell, has a very high catalytic activity, and is a multidomain protein that can be inhibited by pyrazoline sulfamate derivatives.

  • Dataset and generation of molecular descriptors

Data on the activities against carbonic anhydrase of 34 pyrazoline derivatives were collected from the literature (Moi et al. 2019). IC50 is the inhibitory activity measure from the bioassay: the concentration of an inhibitor required to achieve 50% inhibition of carbonic anhydrase activity. Table 1 shows the compounds studied with their activity. The experimental IC50 values were converted to the negative logarithm of IC50 (pIC50). Eleven quantum chemistry descriptors were calculated from the DFT (B3LYP/6-31G(d)) computations (Table S1). Another 34 descriptors were calculated using the ChemOffice 3D and Chemsketch programs (Tables S2 and S3, respectively).
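The IC50 to pIC50 conversion described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `pic50_from_ic50` and the unit table are hypothetical, and the concentration unit of the source data is not stated in the text, so nanomolar is only an assumed default.

```python
import math

# Conversion factors to mol/L. The unit of the literature IC50 values is
# an assumption here; the text does not state it.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}


def pic50_from_ic50(ic50, unit="nM"):
    """Return pIC50 = -log10(IC50 expressed in mol/L)."""
    if ic50 <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50 * UNIT_TO_MOLAR[unit])
```

A lower IC50 gives a higher pIC50, which is why the modelling is done on pIC50: more potent inhibitors receive larger values on a roughly linear scale.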

  • Principal component analysis (PCA)

Table 1 pIC50 of compounds

Descriptors with a low correlation coefficient (r ≤ 0.15) with the dependent variable IC50 were excluded, which is the purpose of the principal component analysis (PCA) here: it allows us to select the input data for the multiple linear regression study. In addition, descriptors with a correlation coefficient (between descriptors) higher than 0.95 were taken into account to reduce the uncertainty present in our data matrix. Thanks to PCA, we were thus able to select 20 of the 40 descriptors for use in the development of the MLR models.
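The correlation-based screening described above can be sketched in plain Python. This is an illustrative simplification using Pearson correlations rather than the authors' PCA workflow in XLSTAT. The function names are hypothetical, and reading the 0.95 threshold as a descriptor-to-descriptor correlation is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def screen_descriptors(descriptors, activity, low=0.15, high=0.95):
    """Drop descriptors with |r| <= low against the activity, then list
    the remaining pairs whose mutual |r| exceeds high (assumed reading)."""
    kept = {name: values for name, values in descriptors.items()
            if abs(pearson(values, activity)) > low}
    names = list(kept)
    redundant_pairs = [(a, b)
                       for i, a in enumerate(names)
                       for b in names[i + 1:]
                       if abs(pearson(kept[a], kept[b])) > high]
    return kept, redundant_pairs
```

The function only reports highly inter-correlated pairs instead of deleting one member automatically, because the text does not say which descriptor of such a pair was retained.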

  • Data splits and model development

We divided the dataset randomly (80% for the training set and 20% for the test set).
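A random split of this kind can be sketched as below. The seed, the function name, and the choice to round the test-set size down are assumptions made for illustration; rounding down is chosen only because it gives 6 test molecules out of 34.

```python
import random


def split_dataset(compound_ids, test_fraction=0.2, seed=42):
    """Randomly split compound identifiers into (train, test) lists.

    The test-set size is rounded down; the seed is arbitrary and only
    makes the split reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(compound_ids)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return sorted(shuffled[n_test:]), sorted(shuffled[:n_test])


# 34 compounds numbered 1..34 (numbering assumed for illustration)
train_ids, test_ids = split_dataset(range(1, 35))
```

Fixing the seed keeps the split reproducible, so that the reported validation statistics can be regenerated from the same training and test molecules.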

The multiple linear regression (MLR) method was used to fit the training set. A significant benefit of linear regression is its transparency: the resulting equation is easy to interpret and can be used directly to make predictions.

Model validation

The QSAR model’s predictive power and fit were assessed through internal and external validation measures. The quality validation parameters are the coefficient of determination \({R}^{2}\), Fisher’s value (F-test), the mean square error of the model (MSE), the variance inflation factor (VIF), the leave-one-out cross-validation coefficient of determination \({Q}_{cv}^{2}\), the external test coefficient of determination \({R}_{test}^{2}\), and the Y-randomization parameters (\({R}_{Rand}^{2}\) and \({R}_{cv (Rand)}^{2}\)). For a model to be valid under the OECD principles, the proposed new molecules must fall within its applicability domain.
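Two of the measures listed above, \({R}^{2}\) and the leave-one-out \({Q}_{cv}^{2}\), can be sketched with a small ordinary least squares routine. This is a dependency-free illustration, not the XLSTAT procedure used in the study. All function names are hypothetical, and the solver assumes a well-conditioned, non-singular descriptor matrix.

```python
def solve(a, b):
    """Solve a*z = b by Gaussian elimination with partial pivoting
    (assumes a is square and non-singular)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - s) / m[i][i]
    return z


def fit_mlr(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    return solve(xtx, xty)


def predict(coef, row):
    return coef[0] + sum(c * x for c, x in zip(coef[1:], row))


def r_squared(rows, y):
    """R^2 = 1 - SS_res / SS_tot on the fitted data."""
    coef = fit_mlr(rows, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - predict(coef, r)) ** 2 for r, yi in zip(rows, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def q2_loo(rows, y):
    """Q^2 = 1 - PRESS / SS_tot, refitting the model with each sample left out."""
    mean_y = sum(y) / len(y)
    press = 0.0
    for i in range(len(y)):
        coef = fit_mlr(rows[:i] + rows[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coef, rows[i])) ** 2
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / ss_tot
```

Because each left-out prediction comes from a model that never saw that sample, \({Q}_{cv}^{2}\) cannot exceed \({R}^{2}\) for a least-squares fit; a large drop between the two is a common warning sign of overfitting.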

Results and discussion

Model development

The dataset was divided into two sets (28 molecules for the training set and 6 molecules for the test set).

The models shown in Table 2, with the usual interpretation of the statistical symbols, are statistically sound and predictive: the statistical parameters used for internal and external validation are at acceptable levels. High \({R}^{2}\), \({R}_{adj}^{2}\), \({Q}_{cv}^{2}\), and \({R}_{test}^{2}\) values and low MSE values indicate that all of these models are statistically sound.

The cross-validation parameter \({Q}_{cv}^{2}\) has high values, indicating that the models are statistically robust. The \({R}_{test}^{2}\) values indicate a strong capacity of the models to predict outcomes beyond the training data (Fig. 1).

The values of \({R}^{2}\) and \({R}_{adj}^{2}\) are close to each other according to the results displayed in Table 2, which means that the models do not contain an excessive number of descriptors. This is confirmed by the low MSE values. The external predictive capacity is high because the values of \({R}_{test}^{2}\) are greater than 0.5. To test the robustness of the models, we calculated \({Q}_{cv}^{2}\); as Table 2 shows, its values are well above 0.5.

Table 2 Training and test set statistical parameters and model equations

Fig. 1 Observed vs predicted activities by model 3 (training samples in blue, test samples in orange)

Applicability domain (AD)

The evaluation of the applicability domains of the four models shows that model 3 gives good values in the Williams diagram.

To ensure that the model is not used to predict the activity of a molecule that lies outside the chemical space of the dataset, we must define its applicability domain. This allows us to identify the molecules that fall outside the domain of the QSAR model.

Molecule 32, shown in Fig. 2, has a leverage value hi higher than the warning leverage h*, which means that this molecule is chemically different from the rest of the studied set. All other molecules, including those in the test set, have hi lower than h*.
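The leverage comparison behind the Williams plot can be sketched as follows. The text does not give the formulas, so two common conventions are assumed: \(h_i = x_i^{T}(X^{T}X)^{-1}x_i\), with X built from the training descriptors plus an intercept column, and \(h^{*} = 3(p+1)/n\) for p descriptors and n training molecules. The function names are hypothetical, and this is not the Matlab code used in the study.

```python
def solve(a, b):
    """Solve a*z = b by Gaussian elimination with partial pivoting
    (assumes a is square and non-singular)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - s) / m[i][i]
    return z


def leverages(train_rows, query_rows):
    """h = x^T (X^T X)^-1 x for each query row, with X taken from the training set."""
    X = [[1.0] + list(r) for r in train_rows]  # intercept column (assumed convention)
    k = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    result = []
    for q in query_rows:
        x = [1.0] + list(q)
        z = solve(xtx, x)  # z = (X^T X)^-1 x
        result.append(sum(xi * zi for xi, zi in zip(x, z)))
    return result


def warning_leverage(n_train, n_descriptors):
    """h* = 3 (p + 1) / n, the commonly used threshold (assumed here)."""
    return 3.0 * (n_descriptors + 1) / n_train
```

Under these assumptions, a model with five descriptors trained on 28 molecules would have h* = 18/28 ≈ 0.643; a molecule whose leverage exceeds that value lies far from the centroid of the training descriptors, so its prediction is an extrapolation.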

Fig. 2 Williams plot of standardized residuals versus leverage for the MLR model (training samples in black, test samples in red)

Y-randomization test for model X

To further evaluate the constructed model, the calculations were repeated one hundred times with randomized activities for the same training set. For this, we have used the QSAR-Tools server available online at http://teqip.jdvu.ac.in/QSAR_Tools.

The average values of \({Q}_{cv(Rand)}^{2}\), \({R}_{Rand}^{2}\), and \({cR}_{P}^{2}\) are − 1.608, 0.185, and 0.702, respectively. All of these values are lower than those of the original model, so we can say that this model was not found by chance.
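The idea of the Y-randomization test can be sketched as below. This is a simplified illustration and not the QSAR-Tools implementation: it shuffles the activities against a single descriptor (so \(R^2\) reduces to the squared Pearson correlation), and it assumes the usual definition \({cR}_{P}^{2} = R\sqrt{R^{2} - R_{Rand}^{2}}\). The function names are hypothetical.

```python
import math
import random


def r2_single(x, y):
    """R^2 of a one-descriptor linear fit (the squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def mean_randomized_r2(x, y, n_trials=100, seed=0):
    """Average R^2 over n_trials refits with randomly shuffled activities."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        shuffled = list(y)
        rng.shuffle(shuffled)
        total += r2_single(x, shuffled)
    return total / n_trials


def c_rp2(r2, r2_rand):
    """cRp^2 = R * sqrt(R^2 - R^2_rand), with the difference clipped at zero."""
    return math.sqrt(r2) * math.sqrt(max(r2 - r2_rand, 0.0))
```

As a purely hypothetical check, a model with \(R^2 = 0.80\) (a value not taken from the text) combined with the reported average \({R}_{Rand}^{2} = 0.185\) would give \({cR}_{P}^{2}\) of about 0.70 under this definition.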

The VIF values for the five descriptors (EHOMO-1, Gap, Na, LogP, Ra) of the chosen model are respectively 2.057, 1.956, 2.485, 1.410, and 1.865. All of these values are less than 5, which indicates no problematic multicollinearity among the descriptors and supports the robustness of the model. The Golbraikh and Tropsha criteria are also important for checking the reliability of the result.
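For reference, the variance inflation factor is conventionally defined from the coefficient of determination \(R_j^{2}\) obtained when descriptor j is regressed on the remaining descriptors of the model (this standard definition is assumed here; the text does not spell it out):

$$\mathrm{VIF}_j=\frac{1}{1-R_j^{2}}$$

Under that definition, the largest reported value, 2.485 for Na, corresponds to \(R_j^{2}=1-1/2.485\approx 0.598\), and the threshold of 5 corresponds to \(R_j^{2}=0.8\).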

According to the results in Table 3, all of the parameters satisfy the corresponding thresholds.

Table 3 Model 3 statistical metrics compared to Golbraikh and Tropsha’s criteria

Design of new compounds

The model found performs very well, so we can use it to predict new molecules with good inhibitory activity by adjusting the descriptors of the model. The importance of each descriptor in the model can be judged from the absolute value of its t-test statistic. These values for the five descriptors (EHOMO-1, Gap, Na, LogP, Ra) are, respectively, − 2.628, − 1.298, − 3.640, − 3.758, and 5.735. The influence on the activity differs from one descriptor to another according to the coefficient of each descriptor in the model. In the model equation, a positive sign means that the activity increases with the descriptor, whereas a negative sign means the opposite. To increase the inhibitory activity against carbonic anhydrase, we must choose compounds with weak electronic effects to decrease the values of Gap and EHOMO-1, and the number of atoms must be low. In addition, the water solubility must be high so that the LogP value decreases. The suggested molecules must be large so that the radius Ra of the molecule is large. In the following paragraphs, we explain the influence of each descriptor on the activity of the compounds and justify the choice of the molecules proposed to have good inhibitory activity.

EHOMO-1

The HOMO-1 energy is related to the ionization energy and characterizes the molecule’s susceptibility to electrophilic attack. The variance of this descriptor is 24%, which means that it has a notable influence on the IC50 activity. The HOMO-1 orbital can participate in the formation of a ligand–protein interaction. If this energy is high, the HOMO-1 orbital loses an electron easily. Because this descriptor has a negative coefficient in the model equation, we must modify our compounds by adding substituents that decrease their nucleophilic character.

Gap

The gap is the difference between the energies of the HOMO and LUMO orbitals. This descriptor makes an important contribution to the IC50 activity considering its variance, which is 18.27%. Because a negative sign appears in the model equation for this descriptor, good inhibitory activity requires a low gap value; the new compounds suggested for inhibition must therefore contain groups that help to decrease the gap.

Na

This descriptor represents the number of atoms in the molecule; it has a very low variance value (0.18%), so the influence of this descriptor on the inhibitory activity of the molecule is very low.

LogP

This descriptor has a negative sign in the model equation, which means that decreasing its value will increase the inhibitory activity. LogP is the octanol–water partition coefficient, a measure of lipophilicity; by decreasing LogP, the molecule becomes more soluble in water. The variance (8.93%) for this descriptor is modest, so its impact on the IC50 activity is small. We can therefore slightly improve the water solubility of the molecule by adding hydrophilic substituents.

Ra

It is the radius of the molecule; a positive sign in the model means that the activity increases with this descriptor. We must therefore add bulky substituents to increase the radius of the molecule. The variance (32.88%) of this descriptor is very large, and its influence on the IC50 activity is notable. These results point to decreasing the molecule size and substituting the pyrazoline derivatives with stronger electron-accepting groups (such as -NH2 and CXn). All the results obtained with MLR model 3 indicate that the model performs reliably, so we can subsequently design new compounds with better activity values than the studied compounds.

We made the relevant substitutions in light of the above results and estimated activities using the suggested model equation.

$$\mathrm{pKi}=6.786-0.613\,E_{\mathrm{HOMO-1}}-0.101\,\mathrm{Gap}-0.132\,\mathrm{Na}-0.354\,\mathrm{LogP}+0.557\,\mathrm{Ra}$$
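The model equation above can be evaluated for a candidate structure with a few lines of code. The coefficients are those of the equation; the function name and the example descriptor values in the usage note are hypothetical, and the descriptor units are assumed to match those used when the model was fitted (they are not stated in the text).

```python
# Coefficients taken from the model equation; the descriptor order follows the text.
INTERCEPT = 6.786
COEFFICIENTS = {
    "EHOMO-1": -0.613,
    "Gap": -0.101,
    "Na": -0.132,
    "LogP": -0.354,
    "Ra": 0.557,
}


def predict_activity(descriptors):
    """Return the predicted activity for a dict holding the five descriptor values."""
    missing = set(COEFFICIENTS) - set(descriptors)
    if missing:
        raise KeyError(f"missing descriptors: {sorted(missing)}")
    return INTERCEPT + sum(coef * descriptors[name]
                           for name, coef in COEFFICIENTS.items())
```

For example, `predict_activity({"EHOMO-1": -1.0, "Gap": 1.0, "Na": 10, "LogP": 2.0, "Ra": 5.0})` evaluates the equation for one made-up set of descriptor values. Such a prediction is only meaningful when the leverage of the candidate is below h*.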

Therefore, the suggested approach will accelerate the process of synthesis of pyrazoline derivatives and the determination of their anti-carbonic anhydrase activity.

In the next part, we modified some structures (1, 28, and 33) with high pIC50 values and recalculated the theoretical pIC50 values with the chosen model, as well as the leverage value h. The theoretical pIC50 values for the newly proposed structures and their h values are given in Table 4.

Table 4 Calculated anti-carbonic anhydrase activity pIC50 and leverage values (h) for the newly designed pyrazoline derivatives

According to Table 4, the suggested structures give very good results: the majority of their theoretical pIC50 values are higher than those found experimentally, and their h values are all lower than h*.

Taking structure 1, all the predicted values are higher than the experimental pIC50 value except for molecule 1c; for example, molecule 1a has a very good pIC50 value (9.17). For structure 28, all the predicted values are close to or higher than the experimental value. For structure 33, the results are also good; for example, the pIC50 predicted for structure 33b is 8.06, which is higher than the experimental value.

Conclusion

To interpret the relationship between the inhibitory activity of 34 pyrazoline derivatives acting as anti-carbonic anhydrase agents and their structural descriptors, obtained by density functional theory calculations with Becke’s three-parameter hybrid exchange and the Lee–Yang–Parr correlation functional (B3LYP) and the 6-31G(d) basis set, the MLR approach was used as a linear QSAR method. The model found in this work is reliable because all the validation parameters meet the accepted thresholds, and from the descriptors of the model, we could suggest some molecules to be synthesized as carbonic anhydrase inhibitors.