Introduction

Benzenoid hydrocarbons are ubiquitous and are found in all classes of natural products, pharmaceuticals, and materials. Benzenoid hydrocarbons containing two or more fused benzene rings are a class of organic pollutants produced during the incomplete burning of coal, oil, gas, wood, garbage or other organic substances as a result of human activities. These compounds are widely found in the environment and in foods such as vegetables [1–4]. Since some common benzenoid hydrocarbons are known to be potent carcinogens, this contaminant class is generally regarded as having high priority for environmental pollution regulation and in ecological risk assessment of industrial effluent discharges. In water, most hydrophobic benzenoids typically adsorb strongly to particles and generally become more resistant to bacterial degradation [5]. The greatest concern about benzenoid hydrocarbons is that they have been shown to be highly carcinogenic in laboratory animals and are also implicated in different types of human cancer, mainly breast, lung, and colon cancer. The metabolic activation of these compounds in mammalian cells to diol epoxides causes errors in DNA replication and mutations, which initiate the carcinogenic process [6]. Some benzenoids combine chemical stability with spermatogenetic and mutagenic effects [3]. One of the most successful approaches to the prediction of physico-chemical properties and biological activities of compounds is quantitative structure–property/activity relationship (QSPR/QSAR) modeling [7]. QSPRs/QSARs are mathematical models that attempt to correlate the molecular structure of compounds with their biological, chemical, and physical properties. The main steps of this method are data collection, molecular geometry optimization, molecular descriptor generation, descriptor selection, model development and, finally, model performance evaluation [8, 9].
The main problems encountered in this kind of research are still the description of the molecular structure using appropriate molecular descriptors and the selection of suitable modeling methods [10]. Currently, many types of molecular descriptors, such as topological indices and quantum chemical parameters, have been proposed to describe the structural features of molecules [11–13]. Ferreira modeled seven properties and two biological activities of polycyclic aromatic hydrocarbons (PAHs) using partial least squares (PLS) regression with electronic descriptors such as the energies of the HOMO and LUMO orbitals, topological descriptors, geometric descriptors such as molecular volume and surface area, and the heat of formation [14]. Lu et al. predicted the photolysis half-lives of polycyclic aromatic hydrocarbons using PLS and quantum chemical descriptors computed with Gaussian03 [15]. In another study, Nikolić established a linear relationship between the polarographic half-wave reduction potentials of benzenoid hydrocarbons and the electron affinities and \( E_{\text{LUMO}} \) as descriptors [16]. In addition, the aqueous solubility of PAHs has been predicted using GA-SVM and GAs-RBFNs with molecular connectivity indices [3].

It has long been recognized that non-covalent interactions are predominantly electrostatic in nature. Politzer et al. have shown that a variety of condensed-phase macroscopic properties that depend on non-covalent interactions can be expressed analytically in terms of statistically defined quantities that characterize the molecular surface electrostatic potential and the average local ionization energy, and named their model the general interaction properties function (GIPF) [17, 18]. Predicting properties such as pKa, which involve charge transfer, requires molecular surface average local ionization energy descriptors [19]. Molecular volumes account for polarizability effects because a linear relationship exists between the two [20]. Some properties, such as solubility in water and boiling point, depend on non-covalent interactions; therefore, these descriptors can be used to predict them. In this research, we attempted to develop QSPR/QSAR models for predicting the physico-chemical properties and biological activities of benzenoid hydrocarbons using MLR and GIPF family descriptors.

Materials and methods

Data for benzenoids

All experimentally determined physico-chemical properties and biological activities of the benzenoid hydrocarbons were taken from the literature [14–16, 21–26]. They comprise the boiling point at normal pressure (BP, in °C), the retention index (RI) in reversed-phase liquid chromatography on a polymeric phase, aqueous solubility (\( \log S_{\text{w,l}} \)), lipophilicity or n-octanol/water partition coefficient (\( \log K_{\text{OW}} \)), n-octanol/air partition coefficient (\( \log K_{\text{OA}} \)), soil sorption (\( \log K_{\text{OC}} \)), Henry’s law constant (log H), bioconcentration factor (log BCF), photo-induced toxicity (\( \log 1/{\text{LT}}_{50} \)), polarographic half-wave reduction potential (\( E_{1/2} \)), heat (enthalpy) of formation (\( \Delta H_{\text{f}} \)), photolysis half-life (\( \log t_{1/2} \)) and the molecular resonance energy (RE). The experimental values of these properties and activities are listed in Table S1 of the supplementary information.

Computer hardware and software

All calculations were performed on a 2.5 GHz Intel® Core™2 Quad Q8300 CPU with 2 GB of RAM, using all four available cores under the Windows XP operating system. ISIS/Draw version 2.3 was used for drawing the molecular structures [27]. Molecular modeling and geometry optimization were performed with HyperChem (version 7.1, HyperCube, Inc.) [28]. The Gaussian98 program [29] was used to re-optimize the molecular structures. SPSS (version 16.0, SPSS, Inc.; http://www.spss.com/) was used for the elimination stepwise selection MLR analysis, and other computations were performed in the MATLAB (version 7.0, MathWorks, Inc.) environment.

Molecular descriptors generation and calculation

First, 48 benzenoid hydrocarbon molecules were built and optimized in HyperChem 7.1 using the AM1 method. The structures were then re-optimized in Gaussian98 at the B3LYP/STO-3G level. Next, the optimized geometries were used to compute the electrostatic potential \( V(r) \) on the molecular surface, defined as the 0.001 au contour of the electron density \( \rho (r) \). Molecular surface electrostatic potentials were computed at the B3LYP/6-31G* level with Gaussian98. The grid control option was set to "cube = 100"; thus, for each molecule, the potential was evaluated at approximately \( 100^{3} \) points. The WFA (wave function analysis) program was then used to compute the molecular surface electrostatic potential and average local ionization energy descriptors from the CUBE files produced by Gaussian98 [29–32]. The electrostatic potential \( V_{\text{s}} \left( r \right) \) that the nuclei and electrons of a molecule create in the surrounding space is defined by Eq. (1):

$$ V_{\text{s}} \left( r \right) = \mathop \sum \limits_{A} \frac{{Z_{A} }}{{\left| {R_{A} - r} \right|}} - \mathop \int \nolimits \frac{{\rho \left( {r^{\prime}} \right){\text{d}}r^{\prime}}}{{\left| {r^{\prime} - r} \right|}} $$
(1)

where \( Z_{A} \) is the charge on nucleus A, located at \( R_{A} \). The first term on the right side of Eq. (1) is the nuclear contribution to \( V(r) \), which is positive; the second term is due to the electrons and is negative [33, 34]. The average local ionization energy, \( \overline{I} (r) \), is defined by Eq. (2):

$$ \overline{I} (r) = \frac{{\mathop \sum \nolimits_{i} \rho_{i } (r)\left| {\varepsilon_{i} } \right|}}{\rho (r)} $$
(2)

where \( \rho_{i} (r) \) is the electronic density of the i-th molecular orbital at the point \( r \), \( \varepsilon_{i} \) is its orbital energy and \( \rho (r) \) is the total electronic density function. We interpret \( \overline{I} (r) \) as the energy required, on average, to remove an electron from a point \( r \) in the space of an atom or a molecule [31, 35]. \( V_{\text{s}} \left( r \right) \) is effective for describing non-covalent interactions, which are largely electrostatic in nature, while \( \overline{I}_{\text{s}} (r) \) is more suitable when charge transfer occurs (electron pair donor–electron pair acceptor interaction), which is one of the forces responsible for the separation of compounds in chromatography [31, 33, 36]. It might seem that \( V_{\text{s}} \left( r \right) \) could also predict sites for electrophilic and nucleophilic bond-forming attack by means of its most negative and most positive regions. However, \( V_{\text{s}} \left( r \right) \) is not consistently reliable in this respect, because the regions of most negative \( V_{\text{s}} \left( r \right) \) do not always correspond to the sites where the most reactive electrons are located. For example, the most negative \( V_{\text{s}} \left( r \right) \) values in benzene derivatives such as aniline, phenol, fluoro- and chlorobenzene, and nitrobenzene are associated with the substituents, whereas electrophilic reaction occurs on the rings. In contrast, \( \overline{I}_{\text{s}} (r) \) correctly predicts the ortho/para- or meta-directing effects of the substituents, as well as their activation or deactivation of the ring [31].
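Eq. (2) is a density-weighted average of orbital energy magnitudes, which a few lines of code make concrete. The following is only a minimal sketch: the function name and the numerical inputs are hypothetical, and it assumes that the total density at the point is the sum of the supplied occupied-orbital densities. It is not the WFA implementation.

```python
def average_local_ionization_energy(orbital_densities, orbital_energies):
    """Eq. (2) at one point r: sum_i rho_i(r) * |eps_i| / rho(r).

    orbital_densities: rho_i(r) of each occupied orbital (illustrative values)
    orbital_energies:  eps_i in the same order (same energy unit as the result)
    Assumption: rho(r) is the sum of the supplied orbital densities.
    """
    if len(orbital_densities) != len(orbital_energies):
        raise ValueError("one energy per orbital density is required")
    total_density = sum(orbital_densities)
    if total_density <= 0.0:
        raise ValueError("total electron density must be positive")
    weighted = sum(d * abs(e) for d, e in zip(orbital_densities, orbital_energies))
    return weighted / total_density


# Made-up numbers: two occupied orbitals contributing at one surface point.
i_bar = average_local_ionization_energy([0.0006, 0.0004], [-12.0, -9.0])
print(round(i_bar, 6))
```

Because the weights are the orbital densities, the result always lies between the smallest and largest \( \left| {\varepsilon_{i} } \right| \) supplied, which is consistent with \( \overline{I} (r) \) taking only positive values.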

Politzer et al. developed an approach, which they named the general interaction properties function (GIPF), that can be summarized as Eq. (3) [17, 31]:

$$ {\text{Property}} = f\left( {V_{\text{mv}} , A_{\text{s}}^{\text{tot}} , A_{\text{s}}^{ + } , A_{\text{s}}^{ - } , V_{\text{s,max}} , V_{\text{s,min}} , \overline{V}_{\text{s}} ,\overline{V}_{\text{s}}^{ + } , \,\overline{V}_{\text{s}}^{ - } , \pi^{\text{tot}} , \delta_{\text{tot}}^{2} , \delta_{ + }^{2} , \delta_{ - }^{2} , \nu ,\overline{I}_{\text{s,max}} ,\overline{I}_{\text{s,min}} ,\;\overline{\overline{I}}_{s} ,\delta_{{\overline{I}_{\text{s}} }}^{2} ,\pi_{{\overline{I}_{\text{s}} }} } \right) $$
(3)

In this equation, \( V_{\text{mv}} \) is the molecular volume, and \( A_{\text{s}}^{\text{tot}} \), \( A_{\text{s}}^{ + } \) and \( A_{\text{s}}^{ - } \) are the total surface area and the surface areas over which \( V_{\text{s}} \left( r \right) \) is positive and negative, respectively. \( V_{\text{s,max}} \) and \( V_{\text{s,min}} \) are the maximum and minimum of the electrostatic potential on the molecular surface, and \( \overline{V}_{\text{s}} \), \( \overline{V}_{\text{s}}^{ + } \) and \( \overline{V}_{\text{s}}^{ - } \) are, respectively, the overall average potential and the averages of the positive and negative potentials, computed as:

$$ \overline{V}_{s} = \frac{1}{t}\mathop \sum \limits_{i = 1}^{t} V_{s} \left( {r_{i} } \right),\quad \overline{V}_{s}^{ + } = \frac{1}{m}\mathop \sum \limits_{j = 1}^{m} V_{s}^{ + } \left( {r_{j} } \right),\quad \overline{V}_{s}^{ - } = \frac{1}{n}\mathop \sum \limits_{k = 1}^{n} V_{s}^{ - } \left( {r_{k} } \right) $$
(4)

\( \pi^{\text{tot}} \), written simply as \( \pi \) in Eq. (5), is the average deviation of the overall potential and is computed as:

$$ \pi = \frac{1}{t}\mathop \sum \limits_{i = 1}^{t} \left| {V_{s} \left( {r_{i} } \right) - \overline{V}_{s} } \right| $$
(5)

π is interpreted as an indicator of internal charge separation, which is present even in molecules having zero dipole moment due to symmetry, e.g. para-dinitrobenzene and boron trifluoride. \( \delta_{\text{tot}}^{2} ,\delta_{ + }^{2} ,\delta_{ - }^{2} \) are respectively total, positive and negative variances and are computed as:

$$ \delta_{tot}^{2} = \delta_{ + }^{2} + \delta_{ - }^{2} = \frac{1}{m}\mathop \sum \limits_{j = 1}^{m} \left[ {V_{s}^{ + } \left( {r_{j} } \right) - \overline{V}_{s}^{ + } } \right]^{2} + \frac{1}{n}\mathop \sum \limits_{k = 1}^{n} \left[ {V_{s}^{ - } \left( {r_{k} } \right) - \overline{V}_{s}^{ - } } \right]^{2} $$
(6)

\( \nu \) is the electrostatic balance parameter and is computed as [31, 33]:

$$ \nu = \frac{{\delta_{ + }^{2} \delta_{ - }^{2} }}{{\left[ {\delta_{ + }^{2} + \delta_{ - }^{2} } \right]^{2} }} $$
(7)

In the summations above, t is the total number of points on the surface grid, and m and n are the numbers of points at which \( V(r) \) is positive and negative, respectively. The features of \( \overline{I} (r) \) can be characterized analogously to those of \( V\left( r \right) \), namely by its extrema \( \overline{I}_{\text{s,max}} ,\overline{I}_{\text{s,min}} \), its average magnitude \( \overline{\overline{I}}_{\text{s}} \), its average deviation \( (\pi_{{\overline{I}_{\text{s}} }} ) \) and its variance \( (\delta_{{_{{\overline{I}_{\text{s}} }} }}^{2} ) \), keeping in mind that \( \overline{I} (r) \) only takes positive values [7, 31].
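The statistical quantities of Eqs. (4)–(7) can be sketched in a few lines once the surface potential values are available as a plain list. This is an illustration under stated assumptions, not the WFA program: the function name, dictionary keys and sample values are hypothetical, and points with a potential of exactly zero are assumed to belong to neither the positive nor the negative subset.

```python
def surface_potential_statistics(v_surface):
    """Eqs. (4)-(7) for a list of surface potential values V_s(r_i)."""
    positive = [v for v in v_surface if v > 0.0]  # m points
    negative = [v for v in v_surface if v < 0.0]  # n points
    if not positive or not negative:
        raise ValueError("need both positive and negative surface points")

    def mean(values):
        return sum(values) / len(values)

    def variance(values):
        centre = mean(values)
        return sum((v - centre) ** 2 for v in values) / len(values)

    v_mean = mean(v_surface)                       # overall average, Eq. (4)
    pi = sum(abs(v - v_mean) for v in v_surface) / len(v_surface)  # Eq. (5)
    var_pos, var_neg = variance(positive), variance(negative)      # Eq. (6)
    var_tot = var_pos + var_neg
    if var_tot == 0.0:
        raise ValueError("balance parameter undefined for zero total variance")
    nu = var_pos * var_neg / var_tot ** 2          # Eq. (7)
    return {
        "V_mean": v_mean, "V_mean_pos": mean(positive), "V_mean_neg": mean(negative),
        "pi": pi, "var_pos": var_pos, "var_neg": var_neg, "var_tot": var_tot, "nu": nu,
    }


# Made-up potentials (kcal/mol) at four surface points.
stats = surface_potential_statistics([4.0, 2.0, -1.0, -5.0])
print(stats["pi"], stats["var_tot"], stats["nu"])
```

In practice the list would hold one value per surface grid point, so t is the length of the input and m and n are the lengths of the two filtered lists.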

Descriptors selection

The GIPF family descriptors, consisting of 14 surface electrostatic potential descriptors, five average local ionization energy descriptors and the Bader molecular volume, were calculated and are listed in Table S2. Sixteen combinations of the original GIPF descriptors were then calculated and are listed in Table S3. In total, 35 descriptors were therefore available for building the QSAR/QSPR models. To minimize information overlap among the descriptors and to reduce the number of descriptors required in the regression equation, the concept of non-redundant descriptors (NRD) was used [37]. That is, when two descriptors are correlated with each other with a linear correlation coefficient greater than 0.85 and both are correlated with the dependent variable, the descriptor that correlates better with the dependent variable is retained for the analysis and the other is discarded. This objective feature selection left a reduced set of predictive descriptors for the studied compounds. Using these criteria, for each physico-chemical property or biological activity, z of the 35 original descriptors were eliminated and 35 − z descriptors remained. In the GIPF approach, the properties/activities of molecules are related to a small number of descriptors; therefore a variable reduction technique is needed. In this study, the most important variables were selected by the elimination stepwise selection procedure, which combines the forward selection and backward elimination approaches. Initially, the descriptor that has the highest correlation with the response is considered. If including this variable significantly improves the regression model, it is retained and the selection continues. In the next step, the variable that gives the most significant decrease in the residual sum of squares is added to the model. After each forward selection step a backward elimination step is performed. In this step, a partial F test is carried out for the variables already present in the equation. If a variable does not contribute significantly to the regression model, it is removed. The procedure stops when no variable fulfills the requirements to be entered or removed. After this selection procedure, classical MLR can be applied to the retained variables to build a predictive model [7, 38, 39].
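The non-redundant descriptor filter described above can be sketched as follows. This is one possible greedy reading of the rule, not the procedure actually run in SPSS: descriptors are ranked by their absolute correlation with the response, and a descriptor is kept only if its absolute correlation with every descriptor already kept does not exceed 0.85. The function names, descriptor names and data are hypothetical, and constant (zero-variance) columns are not handled.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def non_redundant_descriptors(descriptors, response, threshold=0.85):
    """Keep, from each highly inter-correlated group, the descriptor that
    correlates best with the response (greedy variant, an assumption)."""
    ranked = sorted(
        descriptors,
        key=lambda name: abs(pearson(descriptors[name], response)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if all(abs(pearson(descriptors[name], descriptors[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept


# Made-up data: "area" and "volume" are inter-correlated, "balance" is not.
response = [1.0, 2.0, 3.0, 4.0, 5.0]
descriptors = {
    "area": [1.0, 2.0, 3.0, 4.0, 5.1],
    "volume": [1.0, 2.5, 2.5, 4.5, 4.5],
    "balance": [0.2, -0.1, 0.0, -0.2, 0.1],
}
kept = non_redundant_descriptors(descriptors, response)
print(kept)
```

The descriptors that survive this filter would then be passed to the stepwise selection step, which is not reproduced here.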

Multiple linear regression (MLR)

An MLR model assumes that there is a linear relationship between the molecular descriptors of a compound, which is usually expressed as a feature vector X (where each entry indicates a descriptor), and its target property, y. An MLR model can be described using the following equation:

$$ y_{i} = \beta_{0} + \beta_{1} X_{i1} + \beta_{2} X_{i2} + \beta_{3} X_{i3} + \cdots + \beta_{k} X_{ik} + \varepsilon_{i} ,\quad i = 1,2, \ldots ,n $$
(8)

where \( \{ X_{i1} , \ldots ,X_{ik} \} \) are the molecular descriptors, \( \beta_{0} \) is the regression model constant, \( \beta_{1} \) to \( \beta_{k} \) are the coefficients corresponding to the descriptors \( X_{i1} \) to \( X_{ik} \) and y is the dependent variable (\( y_{i} \) for compound i) [39]. The values of \( \beta_{0} \) to \( \beta_{k} \) are chosen by minimizing the sum of squared vertical distances of the points from the hyperplane so as to give the best prediction of y from X. The molecular descriptors should be mathematically independent of (orthogonal to) each other, and the number of compounds in the training set should exceed the number of molecular descriptors by at least a factor of 4 [38, 40]. In this research, the following statistical parameters were calculated for each model: the squared correlation coefficient \( R^{2} \), the adjusted squared correlation coefficient \( R_{\text{adj}}^{2} \), the root mean squared error (RMSE), the relative error of prediction (REP) and the F-test value F (Fisher’s value):

$$ R^{2} = 1 - \frac{\text{SSE}}{\text{SST}} $$
(9)

where SSE and SST are the sum of squared errors and the total sum of squares, respectively; and calculated as:

$$ {\text{SSE}} = \mathop \sum \limits_{i = 1}^{n} (y_{i} - \widehat{y}_{i} )^{2} = \mathop \sum \limits_{i = 1}^{n} (y_{i} - \widehat{\beta }_{0} - \widehat{\beta }_{1} x_{i1} - \cdots - \widehat{\beta }_{k} x_{ik} )^{2} $$
(10)
$$ {\text{SST}} = \mathop \sum \limits_{i = 1}^{n} (y_{i} - \overline{y} )^{2} $$
(11)

and other parameters were calculated as:

$$ R_{\text{adj}}^{2} = 1 - \left( {1 - R^{2} } \right)\frac{{\left( {n - 1} \right)}}{{\left( {n - k - 1} \right)}} $$
(12)
$$ {\text{RMSE}} = \sqrt {\frac{1}{n}\mathop \sum \limits_{i = 1}^{n} (y_{i} - \widehat{y}_{i} )^{2} } $$
(13)
$$ {\text{REP}}\,\left( \% \right) = \frac{100}{{\overline{y} }}\sqrt {\frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left( {y_{i} - \widehat{y}_{i} } \right)^{2} } $$
(14)
$$ F = \frac{{(n - k - 1)R^{2} }}{{k(1 - R^{2} )}} $$
(15)

In the equations above, n is the number of compounds, k is the number of variables, \( y_{i} \) is the experimental property/activity, \( \overline{y} \) is the average of the experimental property/activity and \( \widehat{y}_{i} \) is the property/activity calculated from the QSPR/QSAR model [41, 42]. These parameters are listed in Table 1 for all models.
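The statistics of Eqs. (9)–(15) follow directly from the experimental and calculated values. The sketch below is illustrative only: the function names and data are hypothetical, the fit is restricted to a single descriptor so that the least-squares solution has a closed form, and REP is assumed to be normalized by the mean experimental value. A perfect fit (\( R^{2} = 1 \)) would make F undefined and is not handled.

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares intercept and slope for one descriptor (k = 1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def regression_statistics(y_obs, y_calc, k):
    """R2, adjusted R2, RMSE, REP (%) and F for a model with k descriptors."""
    n = len(y_obs)
    y_mean = sum(y_obs) / n
    sse = sum((o - c) ** 2 for o, c in zip(y_obs, y_calc))  # Eq. (10)
    sst = sum((o - y_mean) ** 2 for o in y_obs)             # Eq. (11)
    r2 = 1.0 - sse / sst                                    # Eq. (9)
    rmse = sqrt(sse / n)                                    # Eq. (13)
    return {
        "R2": r2,
        "R2_adj": 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1),  # Eq. (12)
        "RMSE": rmse,
        "REP": 100.0 * rmse / y_mean,  # Eq. (14), mean-normalized (assumption)
        "F": (n - k - 1) * r2 / (k * (1.0 - r2)),            # Eq. (15)
    }


# Made-up descriptor values and property values for five compounds.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
b0, b1 = fit_line(x, y)
model_stats = regression_statistics(y, [b0 + b1 * v for v in x], k=1)
print(round(b0, 3), round(b1, 3), round(model_stats["R2"], 4))
```

For the bi-parametric models the same statistics apply with k = 2; only the fitting step would need a general least-squares solver.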

Table 1 Best models for the studied properties/activities and statistical parameters of benzenoids

Model prediction-validation

Model validation is a critical component of QSPR/QSAR development. A number of procedures have been established to determine the quality of models. In this research, leave-one-out cross-validation (LOO-CV) and a Y-randomization test were used to validate the predictive ability and check the statistical significance of the developed models.

Leave-one-out cross-validation (LOO-CV)

In LOO-CV, one molecule was removed at a time, variable selection was performed on the remaining molecules and a new model was built. This model was used to predict the property (or activity) of the removed molecule, and the process was repeated N times, where N is the number of molecules used in model building. Finally, the cross-validated squared correlation coefficient, \( R_{\text{CV}}^{2} \), and the root mean square error of cross-validation, \( {\text{RMSE}}_{\text{CV}} \), were calculated for each model as:

$$ R_{\text{CV}}^{2} = 1 - \frac{{\mathop \sum \nolimits_{i = 1}^{N} \left( {y_{\text{pred,i}} - y_{\text{obs,i}} } \right)^{2} }}{{\mathop \sum \nolimits_{i = 1}^{N} \left( {y_{\text{obs,i}} - \overline{y}_{\text{obs}} } \right)^{2} }} = 1 - \frac{PRESS}{{\mathop \sum \nolimits_{i = 1}^{N} \left( {y_{\text{obs,i}} - \overline{y}_{\text{obs}} } \right)^{2} }} $$
(16)
$$ {\text{RMSE}}_{\text{CV}} = \sqrt {\frac{{\mathop \sum \nolimits_{i = 1}^{N} (y_{\text{pred,i}} - y_{\text{obs,i}} )^{2} }}{N}} $$
(17)

where N is the number of training patterns, \( y_{\text{obs,i}} \) and \( y_{\text{pred,i}} \) are the experimental and predicted property/activity of the left-out benzenoid hydrocarbon i, respectively, and \( \overline{y}_{\text{obs}} \) is the average of the experimental property/activity of the molecules [43–46].
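Eqs. (16) and (17) can be illustrated with a short leave-one-out loop. This is a simplified sketch with hypothetical names and data: it refits a single, fixed descriptor in every fold, whereas the procedure described above repeats the variable selection each time a molecule is left out.

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares intercept and slope for one descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def leave_one_out(x, y):
    """Return (R2_CV, RMSE_CV) following Eqs. (16) and (17)."""
    n = len(y)
    press = 0.0  # predictive residual sum of squares
    for i in range(n):
        # Refit the model without compound i, then predict compound i.
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (b0 + b1 * x[i] - y[i]) ** 2
    y_mean = sum(y) / n
    sst = sum((v - y_mean) ** 2 for v in y)
    return 1.0 - press / sst, sqrt(press / n)


# Made-up descriptor and property values for five compounds.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
r2_cv, rmse_cv = leave_one_out(x, y)
print(round(r2_cv, 4), round(rmse_cv, 4))
```

Since each prediction is made for a compound that the fitted model has not seen, \( R_{\text{CV}}^{2} \) cannot exceed the fitted \( R^{2} \) of the same least-squares model.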

Y-randomization test

Y-randomization of the response is another important validation approach that is widely used to establish model robustness. In this test, the dependent variable is randomly reordered and a new model is built. The procedure was repeated 100 times, and the best randomized model, i.e. the one with the maximum \( R^{2} \) (\( R_{max}^{2} \)) and maximum cross-validated \( R^{2} \) (\( R_{cv,max}^{2} \)), was selected. Small values of \( R_{max}^{2} \) and \( R_{cv,max}^{2} \) demonstrate that the QSAR/QSPR model has not been obtained by chance [47–50]. These parameters are listed in Table 1 for all models.
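The Y-randomization test can be sketched as a shuffle-and-refit loop. The code below is an assumption-laden illustration, not the authors' MATLAB or SPSS workflow: names and data are hypothetical, only a one-descriptor model is considered (for which \( R^{2} \) equals the squared Pearson correlation), and only \( R_{max}^{2} \) is tracked, not its cross-validated counterpart.

```python
import random


def r_squared(x, y):
    """R2 of a one-descriptor least-squares line (squared Pearson r)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def y_randomization(x, y, repeats=100, seed=0):
    """Largest R2 obtained after randomly reordering the response."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    best = 0.0
    for _ in range(repeats):
        shuffled = list(y)
        rng.shuffle(shuffled)
        best = max(best, r_squared(x, shuffled))
    return best


# Made-up data: a nearly linear response for twelve compounds.
x = [float(i) for i in range(1, 13)]
y = [2.0 * v + 1.0 + (0.3 if i % 2 else -0.3) for i, v in enumerate(x)]
r2_real = r_squared(x, y)
r2_max_random = y_randomization(x, y)
print(round(r2_real, 4), round(r2_max_random, 4))
```

A large gap between the real \( R^{2} \) and the best randomized value is the outcome the test looks for; with very small data sets a random reordering can occasionally reproduce the original order, so the gap is more informative for the larger data sets.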

Results and discussion

The best models were obtained by the elimination stepwise selection regression algorithm, and the statistical parameters of the models and their cross-validation tests are summarized in Table 1. It is interesting to note that, for these data sets, the combination descriptors \( {\text{A}}_{\text{s}}^{ - } \overline{V}_{s}^{ + } \) and \( A_{s}^{ + } \overline{V}_{s}^{ + } \) give superior predictive power in the QSPR/QSAR models for several of the studied properties/activities, which indicates that the positive and negative electrostatic potential regions of benzenoid hydrocarbons interact with each other or with solvent molecules. Also, of the thirteen models obtained, ten are mono-parametric and the rest are bi-parametric. Finally, we predicted the values of all properties/activities for all benzenoids using the best models selected in this paper; the predictions are listed in Table S4.

Boiling point (BP, °C)

The observed data for 23 benzenoids, listed in Table S1, were used to construct the QSPR model. After the descriptor selection step, the following equation with a combinatorial descriptor was obtained:

$$ {\text{BP}} = - 61.104\left( { \pm 14.197} \right) + 0.433\left( { \pm 0.011} \right){\text{A}}_{\text{s}}^{ - } \overline{V}_{s}^{ + } $$
(18)

According to this equation, a benzenoid that has more surface points with negative electrostatic potential and a more positive average potential on its surface experiences stronger electrostatic attraction between its molecules, and its boiling point increases. For benzenoids with more than two rings, the balance parameter (\( \nu \)) is close to its maximum value of 0.25. This means that these benzenoids can interact to a similar extent (whether strongly or weakly) through both their positive and negative electrostatic potential regions [51]. Although \( A_{s}^{ + } \cong A_{s}^{ - } \) for benzenoids, the positive electrostatic points on the molecular surface are not centralized. Thus regions of negative and positive electrostatic potential cannot coincide, as can be seen in Figs. 1 and 2, so \( \overline{V}_{s}^{ + } \) rather than \( A_{s}^{ + } \) was selected in the model. The resulting plot for the mono-parametric model is shown in Fig. SA of the supplementary information.
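The maximum value of 0.25 for the balance parameter follows directly from Eq. (7). As a short derivation, write \( x = \delta_{ + }^{2} /\delta_{\text{tot}}^{2} \) (a symbol introduced here only for convenience), so that \( \delta_{ - }^{2} /\delta_{\text{tot}}^{2} = 1 - x \) with \( 0 \le x \le 1 \):

$$ \nu = \frac{{\delta_{ + }^{2} \delta_{ - }^{2} }}{{\left[ {\delta_{ + }^{2} + \delta_{ - }^{2} } \right]^{2} }} = x\left( {1 - x} \right) = \frac{1}{4} - \left( {x - \frac{1}{2}} \right)^{2} \le \frac{1}{4} $$

Equality holds only for \( x = 1/2 \), i.e. when the positive and negative variances are equal, which is why a value near 0.25 indicates a balanced ability to interact through positive and negative regions.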

Fig. 1

(a) Calculated B3LYP/STO-3G ionization energy on the molecular surface of benzene. Ionization ranges in eV/mol: red more than 12.4640, yellow between 12.4640 and 11.2937, green between 11.2937 and 10.1234, blue smaller than 10.1234. (b) Calculated B3LYP/6-31G* electrostatic potential on the molecular surface of benzene. Electrostatic potential ranges in kcal/mol: red more than 4.6413, yellow between 4.6413 and −2.9282, green between −2.9282 and −10.4977, blue more negative than −10.4977

Fig. 2

(a) Calculated B3LYP/STO-3G ionization energy on the molecular surface of dibenz[a,n]triphenylene. Ionization ranges in eV/mol: red more than 12.5615, yellow between 12.5615 and 11.2963, green between 11.2963 and 10.0310, blue smaller than 10.0310. (b) Calculated B3LYP/6-31G* electrostatic potential on the molecular surface of dibenz[a,n]triphenylene. Electrostatic potential ranges in kcal/mol: red more than 10.8232, yellow between 10.8232 and 1.9934, green between 1.9934 and −6.8364, blue more negative than −6.8364

Retention index (RI)

Retention index data were available for 33 benzenoids and are listed in Table 1; after the descriptor selection steps, the following model was obtained:

$$ \begin{aligned} RI =\, & 77.769\left( { \pm 14.438} \right) + 0.004323\left( { \pm 0.000216} \right)A_{s}^{ - } \bar{V}_{s}^{ + } \\ &- 5.7459\left( { \pm 1.0638} \right)\bar{I}_{{s,max}} \\ \end{aligned} $$
(19)

These descriptors are not collinear (\( R^{2} = 0.3972 \)). The descriptor \( A_{\text{s}}^{ - } \overline{V}_{\text{s}}^{ + } \) demonstrates that the positive and negative electrostatic potential regions of the benzenoids and the stationary phase attract each other, and this attraction is responsible for the separation of benzenoids. The negative coefficient of the other descriptor, \( \overline{I}_{\text{s,max}} \), shows that it has an opposing effect in the separation mechanism. The \( R_{\text{adj}}^{2} \) of the model changed from 0.882 to 0.938 when \( \overline{I}_{\text{s,max}} \) was added to the model. For dibenzo[a,h]pyrene, the residual is more than twice the standard deviation of the retention index residuals, so this molecule is identified as an outlier. For the retention index of benzenoids, good agreement is observed for the bi-parametric regression in Fig. SB of the supplementary information, and the resulting model also suggests a mechanism for the separation of benzenoids.

Water solubility (log S w,l)

For the water solubility of benzenoids (\( \log S_{\text{w,l}} \) range −3.85 to 3.28), the few available data allow only moderate agreement between experimental and calculated values. The best model obtained is:

$$ { \log }S_{\text{w,l}} = 5.5908\left( { \pm 0.56275} \right) - 0.00664\left( { \pm 0.000499} \right) A_{\text{s}}^{ - } \overline{V}_{\text{s}}^{ + } $$
(20)

In this equation, the negative coefficient of the descriptor is due to the repulsive forces between regions of the water and benzenoid molecules that have electrostatic potentials of the same sign. In water and benzenoids, regions with negative electrostatic potential exist on the oxygen atom and the benzene rings (see Figs. 1, 2, 3), which causes repulsion. In addition, the positive electrostatic potential on the hydrogen atoms creates repulsive forces. Since the positive electrostatic points on the molecular surface of benzenoids are not centralized (Figs. 1, 2), \( \overline{V}_{\text{s}}^{ + } \) was selected in the model instead of \( A_{\text{s}}^{ + } \). For benzenoids, charge separation is lower than in water (see Table S1; Fig. 3), so the charge centers are not separated and the repulsive forces between them overcome the attractive forces. For Eq. (20), naphthacene is an outlier, and when this molecule is removed \( R^{2} \) increases to 0.952. Fig. SC of the supplementary information presents the resulting mono-parametric plot for \( \log S_{\text{w,l}} \) of benzenoids. Compared with the boiling point, the few data available for water solubility result in weaker agreement between experimental and calculated values, as can be seen in Fig. SC.

Fig. 3

Calculated B3LYP/STO-3G electrostatic potential on the molecular surface of water. Electrostatic potential ranges in kcal/mol: red more than 22.3931, yellow between 22.3931 and 2.1572, green between 2.1572 and −18.0787, blue more negative than −18.0787

n-Octanol/water partition coefficient (log K OW)

The data set for lipophilicity or n-octanol/water partition coefficient (\( \log K_{\text{OW}} \) range 2.23–7.19) included 18 benzenoids, which were modeled by the mono-parametric equation:

$$ { \log }\;K_{\text{OW}} = 0.27607\left( { \pm 0.1475} \right) + 0.004805\left( { \pm 0.000126} \right) A_{\text{s}}^{ - } \overline{V}_{\text{s}}^{ + } $$
(21)

The plot of the electrostatic potential on the surface of n-octanol shows that the alkyl section has a positive electrostatic potential between 20.9836 and 1.8806 kcal/mol, which establishes an attractive force with the negative regions of benzenoids (Figs. 1, 2, 4). Similar interactions exist between the hydrogen atoms of benzenoids and the oxygen atom of n-octanol. The selection of \( A_{\text{s}}^{ - } \) demonstrates that there are many points with positive electrostatic potential on the surface of the receptor (here n-octanol) that interact with the negative points of benzenoids (see Fig. 4). This equation shows that interactions between many points of positive and negative electrostatic potential dissolve benzenoids in n-octanol. On removing triphenylene and dibenz[a,h]anthracene, which are outliers, \( R^{2} \) increases from 0.990 to 0.996. As seen from Fig. SD of the supplementary information, there is very good agreement between experimental and calculated values.

Fig. 4

Calculated B3LYP/STO-3G electrostatic potential on the molecular surface of octanol. Electrostatic potential ranges in kcal/mol: red more than 20.9836, yellow between 20.9836 and 1.8806, green between 1.8806 and −17.2223, blue more negative than −17.2223

Correlation between log K OW and log S w,l

As can be seen by comparing Eqs. (20) and (21) for predicting \( \log S_{\text{w,l}} \) and \( \log K_{\text{OW}} \), the selected combination descriptor is common to both models but takes opposite signs. This means that the lipophilicity and solubility of benzenoids act against each other [52]. According to Table S1, thirteen compounds are common to these two properties, and the general relationship between \( \log S_{\text{w,l}} \) and \( \log K_{\text{OW}} \) was analyzed by regression analysis as follows:

$$ { \log }S_{\text{w,l}} = 6.258 \, ( \pm 0.550){-}1.451 \, ( \pm 0.100){ \log }K_{\text{OW}} $$
(22)

\( n = 13 \), \( R^{2} = 0.950 \), \( R_{\text{adj}}^{2} = 0.945 \), RMSE = 0.441, F = 208.7, \( R_{\text{CV}}^{2} = 0.937 \), \( {\text{RMSE}}_{\text{CV}} = 0.494 \)

The values of \( R_{\text{adj}}^{2} \) for this relationship and for Eq. (20) are 0.945 and 0.926, respectively. These values do not differ significantly and demonstrate that the predictive power of \( {\text{A}}_{\text{s}}^{ - } \overline{V}_{\text{s}}^{ + } \) is almost the same as that of lipophilicity (\( \log K_{\text{OW}} \)), which is an experimental descriptor. This also supports the view that the solubility and the lipophilicity of compounds are related to interactions between solute and solvent molecules that are electrostatic in nature.
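The statistics reported for Eq. (22) can be cross-checked against Eqs. (12) and (15) using only the rounded values n = 13 and \( R^{2} = 0.950 \), with k = 1 because the relationship has a single independent variable. The snippet below is merely an arithmetic check with variable names chosen for illustration.

```python
# Reported values for Eq. (22); k = 1 because log K_OW is the only variable.
n, k, r2 = 13, 1, 0.950

r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)  # Eq. (12)
f_value = (n - k - 1) * r2 / (k * (1.0 - r2))      # Eq. (15)

print(round(r2_adj, 3), round(f_value, 1))
```

The adjusted value reproduces the reported 0.945, and F comes out at about 209, close to the reported 208.7; the small difference is consistent with \( R^{2} \) having been rounded to three decimals.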

n-Octanol/air partition coefficient (log K OA)

The octanol–air partition coefficient is a key descriptor of chemical partitioning between the atmosphere and other environmental organic phases such as soil and vegetation [53, 54]. \( \log K_{\text{OA}} \) values, ranging from 5.13 to 13.91 log units, were available for nine compounds and were used in this section; the best model obtained is:

$$ { \log }\;K_{\text{OA}} = - 1.9464\left( { \pm 0.49588} \right) + 0.005708\left( { \pm 0.000241} \right) A_{\text{s}}^{ + } \delta_{ + }^{2} $$
(23)

The squared correlation coefficient between \( A_{\text{s}}^{ + } \delta_{ + }^{2} \) and \( {\text{A}}_{\text{s}}^{ - } \overline{V}_{\text{s}}^{ + } \) is 0.9731, so these descriptors are collinear. Also, as the number of rings in benzenoids increases, \( A_{\text{s}}^{ + } \), \( \delta_{ + }^{2} \) and thus \( A_{\text{s}}^{ + } \delta_{ + }^{2} \) increase, which indicates that the molecules become more lipophilic and more soluble in n-octanol. The interaction mechanism between benzenoids and n-octanol was discussed in the n-octanol/water partition coefficient (log K OW) section. Fig. SE of the supplementary information presents the mono-parametric regression according to Eq. (23), and there is very good agreement between calculated and experimental values.

Soil sorption (log K OC)

Nine benzenoids had soil sorption data, and the selected descriptor was the total surface area, which resulted in the following equation:

$$ { \log }\;K_{\text{oc}} = - 2.6829\left( { \pm 0.22493} \right) + 0.032824\left( { \pm 0.000957} \right)A_{\text{s}}^{\text{tot}} $$
(24)

The \( A_{\text{s}}^{\text{tot}} \) and \( A_{\text{s}}^{ + } \delta_{ + }^{2} \) descriptors are collinear (\( R^{2} = 0.9123 \)), and \( R^{2} \) between \( A_{\text{s}}^{\text{tot}} \) and \( {\text{A}}_{\text{s}}^{ - } \overline{V}_{\text{s}}^{ + } \) is 0.9919, which demonstrates that the interaction mechanism described above for benzenoids and n-octanol (see “ n-octanol/water partition coefficient (log K OW)” section) also operates here. Fig. SF of the supplementary information shows good agreement between calculated and experimental values. These results demonstrate that \( A_{\text{s}}^{\text{tot}} \) can completely describe the change in \( { \log }\;K_{\text{oc}} \).

Henry’s law constant (log H)

Henry’s law constant data were available for eight benzenoids; a combinatorial descriptor was selected and the following mono-parametric equation was obtained:

$$ \begin{aligned} \log \;H & = 2.7611\left( { \pm 0.2724} \right) - 1.9777 \\ & \quad\times 10^{{ - 6}} \left( { \pm 1.7706 \times 10^{{ - 7}} } \right)\left( {{\text{A}}_{{\text{s}}}^{ - } \bar{V}_{{\text{s}}}^{ - } } \right)^{2} \\ \end{aligned} $$
(25)

The negative coefficient of the descriptor indicates that when a benzenoid has more surface points with negative electrostatic potential and a low average value, the electrostatic attractions between positive and negative regions are stronger and fewer molecules can escape to the gas phase. Fig. SG of the supplementary information shows good agreement between calculated and experimental values.

Bioconcentration factor (log BCF)

Bioconcentration factor is the ratio of a substance’s concentration in an organism to its concentration in the ambient water [55]:

$$ {\text{BCF}} = \frac{{C_{\text{org}} }}{{C_{\text{w}} }} $$
(26)

where C org is the concentration in the target organism (µg/kg) and C w is the concentration in pure water (µg/l). Bioconcentration factor data were available for 11 benzenoids, and after descriptor selection the following equation was obtained:

$$ { \log }\;{\text{BCF}} = - 15.319\left( { \pm 2.2417} \right) + 2.2452\left( { \pm 0.26667} \right)\overline{V}_{s}^{ + } $$
(27)

The correlation between \( { \log }\;{\text{BCF}} \) and \( \overline{V}_{\text{s}}^{ + } \) suggests that organisms contain molecules with regions of negative electrostatic potential, which attract benzenoid molecules into the organism. In Eq. (27) phenanthrene is an outlier, and removing it increases R 2 to 0.926. Fig. SH of the supplementary information presents the results of the mono-parametric regression.
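The definition in Eq. (26) can be turned into the modeled quantity log BCF with a few lines of Python. The sketch below assumes a base-10 logarithm and uses invented concentrations; neither the function name nor the numbers come from the study.

```python
import math

def log_bcf(c_org_ug_per_kg, c_water_ug_per_l):
    """Return log10(BCF), with BCF = C_org / C_w as in Eq. (26)."""
    if c_org_ug_per_kg <= 0 or c_water_ug_per_l <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(c_org_ug_per_kg / c_water_ug_per_l)

# Hypothetical concentrations: 5000 ug/kg in the organism, 2 ug/l in water.
value = log_bcf(5000.0, 2.0)
print(f"log BCF = {value:.3f}")
```

With the units given in the text (µg/kg and µg/l), the resulting BCF is expressed in l/kg.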

Photo-induced toxicity (log 1/LT50)

This term denotes the increase in toxicity of certain polycyclic hydrocarbons, such as benzenoids, on exposure to UV light, caused by the formation of free radicals and subsequent damage to macromolecules. It is calculated as:

$$ {\text{Photo - induced toxicity}} = \log \left( {\frac{1}{{{\text{LT}}_{50} }}} \right) $$
(28)

where LT50 is the median lethal time. Anthracene, pyrene, benzo[a]pyrene, dibenz[a,h]anthracene and benzo[ghi]perylene are among the most phototoxic compounds, whereas phenanthrene and triphenylene are not phototoxic. The data set included nine benzenoids, and after the descriptor selection steps a model with two descriptors was created:

$$ \begin{aligned} {\text{PIT}} & = 2.9275\left( { \pm 1.3189} \right) - 0.18886\left( { \pm 0.034743} \right)\delta _{{{\text{tot}}}}^{2} \\ & \quad+ 0.011162\left( { \pm 0.004434} \right)A_{{\text{s}}}^{ + } \\ \end{aligned} $$
(29)

The negative coefficient of \( \delta_{\text{tot}}^{2} \) shows that a larger \( \delta_{\text{tot}}^{2} \) decreases photo-induced toxicity, and the positive coefficient of \( {\text{A}}_{\text{s}}^{ + } \) demonstrates that benzenoids with more rings have higher photo-induced toxicity, because \( {\text{A}}_{\text{s}}^{ + } \) and \( {\text{A}}_{\text{s}}^{\text{tot}} \) are collinear. As seen in Fig. SI of the supplementary information, the agreement between calculated and experimental values for the biparametric correlation is only moderate because of the scarcity of data.

Polarographic half-wave reduction potentials (E 1/2)

Polarographic half-wave reduction potential data (E 1/2, in volts) were available for 27 benzenoids; two descriptors were selected and the following equation was obtained:

$$ \begin{aligned} E_{{\frac{1}{2}}} & = 2.9672\left( { \pm 0.79989} \right) - 0.16559\left( { \pm 0.024092} \right)\delta _{ - }^{2} \\ &\quad - 0.06761\left( { \pm 0.025846} \right)V_{{{\text{s,max}}}} \\ \end{aligned} $$
(30)

A larger \( \delta_{ - }^{2} \) decreases E 1/2. This is reasonable because points with more negative electrostatic potential repel electrons. For Eq. (30), benzo[a]perylene and dibenzo[a,i]pyrene are outliers, and when they were removed R 2 increased to 0.822. These descriptors have no collinearity problem (R 2 = 0.209). As seen in Fig. SJ of the supplementary information, the agreement between calculated and experimental values for the biparametric correlation is weak because E 1/2 does not depend on electrostatic interactions only.

Heat of formation (\( \Delta H_{\mathbf{f}} \))

Heat of formation data (in kJ/mol) were available for 20 benzenoids, and the following model was obtained:

$$ \Delta H_{\text{f}} = - 44.756\left( { \pm 10.618} \right) + 2.3549\left( { \pm 0.069023} \right)A_{\text{s}}^{ - } $$
(31)

Larger benzenoids have larger \( A_{\text{s}}^{ - } \) and also more bonds, which results in a larger heat of formation. The correlation between \( A_{\text{s}}^{ - } \) and \( A_{\text{s}}^{\text{tot}} \) is high (R 2 = 0.998), which helps explain the accuracy of Eq. (31). This correlation is slightly higher than that between \( A_{\text{s}}^{ + } \) and \( A_{\text{s}}^{\text{tot}} \) (R 2 = 0.997). In Eq. (31) benzo[a]pentaphene is an outlier, and removing it increases R 2 to 0.989. Fig. SK of the supplementary information also shows very good agreement between calculated and experimental values for the mono-parametric regression model.
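The effect of removing an outlier on R 2, as reported for several of the models, can be illustrated with a short Python sketch. The data below are synthetic, and flagging the point with the largest absolute residual is an assumed criterion; the outlier test actually used in this work is not restated here.

```python
import numpy as np

def fit_r2(x, y):
    """Fit y = a + b*x by least squares and return (a, b, R^2)."""
    b, a = np.polyfit(x, y, 1)  # polyfit returns the slope first for degree 1
    resid = y - (a + b * x)
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return a, b, 1.0 - ss_res / ss_tot

# Synthetic descriptor/property pairs; the point at index 4 is deliberately perturbed.
x = np.array([100.0, 120.0, 140.0, 160.0, 180.0, 200.0, 220.0, 240.0])
noise = np.array([1.0, -1.0, 0.5, -0.5, 30.0, 0.5, -1.0, 1.0])
y = 2.0 * x - 40.0 + noise

a, b, r2_all = fit_r2(x, y)
residuals = y - (a + b * x)
worst = int(np.argmax(np.abs(residuals)))  # assumed outlier criterion

mask = np.arange(len(x)) != worst
_, _, r2_trim = fit_r2(x[mask], y[mask])
print(f"outlier index: {worst}, R^2 all: {r2_all:.4f}, R^2 trimmed: {r2_trim:.4f}")
```

Any such removal should be reported together with the identity of the excluded molecule, as is done for each model in this section.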

Photolysis half-life (log t 1/2)

Photolysis is the most important decay process for PAHs. However, it is impractical to quantify the photochemical transformation of all PAHs experimentally because laboratory tests are expensive and time consuming. QSPR models, which correlate the properties of pollutants with their structural descriptors, may be used to study photolysis mechanisms and to predict photolysis reaction parameters efficiently. In this research, we used GIPF descriptors to create a QSPR model for predicting the photolysis half-life of benzenoids. The photolysis half-life data set (in hours) included seven benzenoids, and in the descriptor selection steps a combinatorial descriptor was selected, resulting in the following model:

$$ { \log }\;t_{1/2} = 3.1673\left( { \pm 0.45628} \right) - 0.00284\left( { \pm 0.000414} \right){\text{A}}_{\text{s}}^{ + } \overline{V}_{\text{s}}^{ + } $$
(32)

Equation (32) shows that the photolysis half-life decreases for benzenoid hydrocarbons that have more surface points with positive electrostatic potential and a high average positive potential. Because large benzenoids have more points with electron cloud deficiency, they have shorter photolysis half-lives (Table S1). Although few molecules have data, Fig. SL of the supplementary information shows good agreement between calculated and experimental values.

Molecular resonance energy (RE)

Resonance energy data were available for 20 benzenoids, and after the descriptor selection steps the following equation was obtained:

$$ \begin{aligned} {\text{RE}} & = 0.74732\left( { \pm 0.20896} \right) \\ &\quad + 9.01 \times 10^{{ - 7}} \left( { \pm 9.85 \times 10^{{ - 8}} } \right) \times \left( {{\text{A}}_{{\text{s}}}^{ - } \bar{V}_{{\text{s}}}^{ - } } \right)^{2} \\ \end{aligned} $$
(33)

According to this equation, resonance energy increases for benzenoids that have a larger area of negative electrostatic potential with a smaller average value. Benzenoids with more rings have greater \( A_{\text{s}}^{ - } \) and therefore more resonance energy. For Eq. (33) pentacene is an outlier, and when it was removed R 2 increased to 0.912. Fig. SM of the supplementary information shows a plot of the cross-validated versus experimental RE values for all 20 compounds studied. This figure also shows that the predicted values agree comparatively poorly with the experimental values, as reflected by the R 2CV value (only 0.794).
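A cross-validated R 2 of the kind quoted here is commonly obtained by leave-one-out (LOO) cross-validation. The Python sketch below assumes the usual definition, R 2CV = 1 - PRESS/SS tot, for a one-descriptor linear model; the cross-validation scheme actually used in this work may differ, and the data are invented for illustration.

```python
import numpy as np

def loo_r2cv(x, y):
    """Leave-one-out cross-validated R^2 for the model y = a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                      # drop molecule i
        slope, intercept = np.polyfit(x[keep], y[keep], 1)
        preds[i] = intercept + slope * x[i]           # predict the left-out point
    press = float(np.sum((y - preds) ** 2))           # predictive residual sum of squares
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - press / ss_tot

# Hypothetical descriptor/property values (not data from this study).
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_demo = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]
r2cv = loo_r2cv(x_demo, y_demo)
print(f"R2cv = {r2cv:.4f}")
```

Because each prediction is made for a molecule excluded from the fit, R 2CV is never larger than the fitted R 2 for an ordinary least-squares model, and a large gap between the two signals limited predictive ability.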

Correlation between RE and log H

Equations (25) and (33) share the same independent variable, which implies that these two properties are correlated with each other. Five molecules had data for both properties, and the following equation was obtained:

$$ {\text{RE}} = 2.0510\left( { \pm 0.0844} \right) - 0.4259\left( { \pm 0.0806} \right){ \log }\;H $$
(34)

n = 5, R 2 = 0.9030, R 2adj = 0.871, RMSE = 0.1595, F = 27.9136, R 2CV = 0.7933, RMSECV = 0.2635.
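The statistics reported for Eq. (34) are linked by standard regression identities, which gives a quick consistency check. The Python sketch below uses the textbook formulas for adjusted R 2 and the F statistic with one descriptor (p = 1), and assumes that the ± value on the slope is a standard error, which is not stated explicitly.

```python
import math

def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p descriptors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def f_statistic(r2, n, p):
    """Overall F statistic of a linear regression computed from R^2."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))

n, p, r2 = 5, 1, 0.9030        # values reported for Eq. (34)
r2_adj = adjusted_r2(r2, n, p)
f_val = f_statistic(r2, n, p)

# For a single descriptor, F equals the squared t-ratio of the slope,
# so |slope| is approximately sqrt(F) times its standard error (assumed to be 0.0806).
implied_slope = math.sqrt(27.9136) * 0.0806

print(f"R2adj = {r2_adj:.3f}, F = {f_val:.2f}, |slope| = {implied_slope:.4f}")
```

The recomputed R 2adj and F match the reported values to within rounding of R 2, and the implied slope magnitude is about 0.426.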

Conclusions

QSPR/QSAR methodologies based on the general interaction properties function (GIPF) family of descriptors were successfully applied to predict the physico-chemical properties and biological activities of benzenoid hydrocarbons, and these properties and activities depend on forces that are electrostatic in nature. These compounds can interact through both their positive and negative electrostatic potential regions to a similar extent, and they are lipophilic. The minimum and maximum R 2adj values for the QSPR/QSAR models are 0.637 (for E 1/2) and 0.993 (for log K OC), and the F values lie between 16.2 (for log 1/LT50) and 1462.1 (for BP). The QSPR model for boiling point has the largest RMSE because boiling point values are large in magnitude. For five models, \( R_{\max}^{2} \) was larger than 0.4 because only a few molecules have data for these properties.