Introduction

Molecular similarity is one of the most widely applied concepts in rational drug design. It assumes that structurally similar fragments will elicit a similar biological response. The concept of bioisosterism is closely related to molecular similarity: substructures may be interchanged with retention of some degree of biological activity. Manual assessment of molecular similarity is a tedious process. Molecular graphs and molecular fingerprints can be used systematically to find the chemical features responsible for the biological activity. Fingerprints are binary vectors representing specific key structures in a chemical entity. Each bit takes a value of either 1 or 0, where 1 represents the presence of a specific chemical feature and 0 represents its absence. Fingerprinting software typically hashes fragments into either a 32- or a 64-bit address space. In a 32-bit address space, the probability that two different fragments receive the same address is 1/2\(^{32}\), and in a 64-bit address space it is 1/2\(^{64}\). Using the larger address space minimizes the probability of such collisions. Molecular similarity may be expressed as a numerical value and applied in similarity measurement, virtual screening and cluster analysis. In the present study, five different types of binary fingerprinting methods were used to predict the structural requirements for the selective inhibition of Tankyrase I by a series of substituted flavones. The suitability of each fingerprint for the selected molecules is assessed by kernel-based partial least squares (KPLS) regression statistics. Although Tankyrases have become attractive targets for anticancer agents, there are few effective drugs that inhibit Tankyrase, which motivated this simulation study.
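To make the idea of expressing similarity as a number concrete, the following minimal sketch compares two binary fingerprints stored as sets of on-bit positions. The Tanimoto coefficient used here is one common choice and is an assumption of this sketch; the text above does not name a specific similarity metric, and the bit positions are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / union


# Hypothetical on-bit positions for two molecules (illustrative values only).
fp_1 = {3, 17, 42, 128, 901}
fp_2 = {3, 17, 55, 128}

similarity = tanimoto(fp_1, fp_2)
print(f"Tanimoto similarity: {similarity:.3f}")
```

Storing only the positions of the on bits keeps the example independent of the address-space size (32 or 64 bit).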

Cancer is one of the most concerning diseases of the modern world. All cancers share a characteristic feature, namely dysregulated cell cycle machinery. Among all cancers, colorectal cancer (CRC) has attracted our attention. The recent literature reports that CRC has the second highest mortality rate among cancers [1]. The exact molecular mechanisms for the development of colorectal cancer have been deduced. The single major factor responsible for the development of colorectal cancer is the overactivation of the Wnt signaling pathway by its central activator \(\upbeta \)-catenin [2, 3]. Under normal conditions, \(\upbeta \)-catenin levels are tightly regulated by a feedback mechanism involving a \(\upbeta \)-catenin degradation complex [4]. Gene mutation disrupts the components of this degradation complex. This leads to translocation of \(\upbeta \)-catenin into the nucleus, causing abnormal activation of transcription factors and gene networks responsible for the development of colorectal cancers. There is an urgent need to develop small molecule inhibitors for the selective inhibition of the Wnt signaling pathway, thereby increasing the levels of the \(\upbeta \)-catenin degradation complex to combat colorectal cancer.

Tankyrases are members of the poly ADP-ribose polymerase (PARP) family of proteins. They have attracted attention because of their role in axin downregulation and stabilization of \(\upbeta \)-catenin. The post-translational modification involves the cleavage of PARP proteins, which results in splitting of NAD into ADP ribose and nicotinamide units. Poly ADP-ribose polymerases are responsible for many important biochemical signaling processes in the cell. The recently developed Tankyrase inhibitors were able to increase axin levels and downregulate \(\upbeta \)-catenin levels [5].

Recent literature also gave us insight into the advantage of selective inhibition of Tankyrase I. Partial knockdown of Tankyrase I leads to the shortening of telomere length [6]. The combination of selective Tankyrase I inhibitors with other classes of drugs could be a viable strategy for the treatment of cancers such as colorectal cancer.

An extensive literature search for small molecule inhibitors of Tankyrase I aroused our interest in the flavone pharmacophore because of its diverse biological activity and simple scaffold. Flavones are naturally occurring secondary plant metabolites categorized under the broad class of flavonoids. They possess various degrees of free radical scavenging properties and are present in a wide variety of edible plants and vegetables. Flavonoids have also been shown to possess antitumor effects in various cancer cell lines. Inhibition of TNKS1 by flavone and its antiproliferative properties have already been reported. The present study is based on work reported in the literature [7].

Materials and methods

Selection of data set

In the present study, a data set of 25 out of 30 compounds was chosen from the literature [7] based on their structural diversity and activity. A training set of 19 molecules (\(76\%\) of the selected molecules) was used to generate a kernel-based partial least squares regression equation. The training set molecules were selected based on their structural diversity, an activity range spanning 3 log units and activities covering the entire range.

To assess the predictive accuracy of the generated kernel-based regression model, a set of 6 molecules was chosen for the test set. The test set was selected in such a way that it is representative of the training set. The training set was used to generate a 2D QSAR model, and the test set was used to validate the generated model.
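As an illustration of how predictive accuracy can be summarized, the sketch below computes a coefficient of determination from actual and predicted activities. Applying the same formula to the training set and to the held-out test set is an assumption of this sketch (the exact statistic reported by the modelling software may be defined differently), and the activity values are hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or len(actual) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("actual values are constant; the statistic is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical logarithmic activities for a small held-out set (not from the study).
actual_activity = [5.0, 6.0, 7.0, 8.0]
predicted_activity = [5.2, 5.9, 7.1, 7.8]

score = r_squared(actual_activity, predicted_activity)
print(f"R^2 on the held-out molecules: {score:.3f}")
```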

Activity values (IC50) from the literature were converted to pIC50 (logarithmic scale) using the options available in the calculator. The logarithmic activity is termed pact.
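The conversion can be sketched as follows, using the standard definition pIC50 = -log10(IC50 in mol/L). The concentration unit of the literature values is not stated here, so the `unit` argument and the example values are assumptions of this sketch.

```python
import math

# Conversion factors from the given unit to mol/L.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}


def pic50_from_ic50(ic50, unit="uM"):
    """Convert an IC50 value to pIC50 = -log10(IC50 expressed in mol/L)."""
    if unit not in UNIT_TO_MOLAR:
        raise ValueError(f"unknown unit: {unit}")
    if ic50 <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50 * UNIT_TO_MOLAR[unit])


# Hypothetical IC50 values (illustrative only).
print(round(pic50_from_ic50(1.0, "uM"), 2))
print(round(pic50_from_ic50(50.0, "nM"), 2))
```

A tenfold decrease in IC50 raises pIC50 by one unit, which is why an activity range is conveniently quoted in log units.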

Importing the energy minimized structures along with their logarithmic scale activity

The structures were imported into Maestro [8] and screened for errors, valence parameters, invalid chemistry and duplicate structures in order to avoid interference in the generation of the model.

Selection and incorporation of various molecular properties

Molecular descriptors were incorporated using the molecular properties option available in the Canvas 2.9 interface. There are four main types of descriptors in Canvas, namely physicochemical descriptors, topological descriptors, ligfilter descriptors and QikProp descriptors.

Feature selection

Feature selection is based on hierarchical clustering to identify subsets of properties that are representative of a larger set. Sixty descriptors were incorporated and are explained briefly below.

  1. Adsorbability index (AI): A molecular descriptor that predicts the activated carbon adsorption of chemical substances from aqueous solutions. The adsorbability index [9] for a molecule is calculated by the expression:

    $$\begin{aligned} {\hbox {AI}}=\sum \limits _{i} A_{i}+\sum \limits _{i} I_{i} \end{aligned}$$

    where A represents the atomic or group contributions of increasing or decreasing adsorbability in the chemical species and I represents the necessary correction factors.

  2. ALOGP: One of the most widely applied lipophilicity descriptors, derived directly from experimental data.

  3. Atomic composition indices: Zero-dimensional descriptors deduced directly from the composition of a chemical entity. They provide information on its molecular weight and atomic composition.

  4. Total information index on atomic composition \((I_{\mathrm{AC}})\): This descriptor calculates the total information on the atomic composition of a molecule directly from its molecular formula, including hydrogen atoms [10].

    The atomic composition index (\({I}_{\mathrm{AC}}\)) is calculated using the formula:

    $$\begin{aligned} I_{\mathrm{AC}}=A^{\mathrm{h}}\cdot \hbox {log}_{2} A^{\mathrm{h}}- \sum \limits _{g} A_{g}\cdot \hbox {log}_{2} A_{g} \end{aligned}$$

    where \(A^{\mathrm{h}}\) is the total number of atoms including hydrogens and \(A_{g}\) is the number of atoms belonging to the gth chemical element.

  5. Nuclear information content descriptor \((I_{\mathrm{NUCL}})\): A descriptor that gives information on the total number of protons and neutrons present in all the types of nuclei in a given molecule.

    It is expressed by the following formula:

    $$\begin{aligned} I_{\mathrm{NUCL}}={\mathop {\sum }\limits _{i=1}^{A}} I_{i}^{\mathrm{n,p}} \end{aligned}$$

  6. Information index on size \((I_{\mathrm{SIZE}})\): This descriptor provides the information content based on the number of atoms [11].

    It is expressed as:

    $$\begin{aligned} I_{\mathrm{SIZE}}=A^\mathrm{h}\cdot \hbox {log}_{2} A^\mathrm{h} \end{aligned}$$

    \({A}^{\mathrm{h}}\) is the atom number with or without considering hydrogen atoms.

  7. Autocorrelation descriptors \((\mathrm{AC}_{\mathrm{L}})\): Molecular descriptors based on the autocorrelation function, expressed as

    $$\begin{aligned} \hbox {AC}_{\mathrm{L}}={\mathop {\int }\limits _{a}^{b}} f(x)\cdot f(x+l) \cdot {\mathrm{d}x} \end{aligned}$$

  8. Topological electronic descriptors \((T^{\mathrm{E}})\): These descriptors [12] are based on partial atomic charges (q) and are expressed as

    $$\begin{aligned} T^{\mathrm{E}}={\mathop {\sum }\limits _{i=1}^{A-1}} {\mathop {\sum }\limits _{j=i+1}^{A}} {\frac{|q_{i}-q_{j}|}{r^{2}_{ij}}} \end{aligned}$$

  9. Partial charge weighted topological electronic index \((\hbox{PCWT}^{\mathrm{E}})\): A molecular descriptor based on the topological electronic index. It is expressed as

    $$\begin{aligned} \hbox {PCWT}^{\mathrm{E}}={\frac{1}{Q^{-}_{\mathrm{max}}}}\cdot {\mathop {\sum }\limits _{b=1}^{B}}\left( {\frac{|q_{i}-q_{j}|}{r^{2}_{ij}}}\right) _{b} \end{aligned}$$

  10. Local dipole index (D): This descriptor is the average magnitude of the charge difference over all bonded atom pairs (i–j) [13] and is expressed as

    $$\begin{aligned} D={\frac{\sum \nolimits _{b}|q_{i}-q_{j}|{_{b}}}{B}} \end{aligned}$$

  11. Atom in structure invariant index (ASII): It belongs to a class of charge-related indices derived from hydrogen-depleted molecular graphs. It is grouped under the category of global descriptors.

    It is expressed as

    $$\begin{aligned} {\hbox {ASII}}_{i}={\hbox {ASII}}_{i}^{0}-h_{i}+q_{i} \end{aligned}$$

  12. Charged partial surface area descriptors (CPSA): A set of descriptors [14] that combine shape and electronic information to identify chemical compounds and to quantify the polar interactions between molecules. They mainly use the van der Waals radius as the key feature for this quantification.

  13. Partial negative surface area \((\hbox{PNSA}_{1})\): The sum of the solvent-accessible surface areas of all the negatively charged atoms.

    It is expressed as

    $$\begin{aligned} {\hbox {PNSA}}_{1}=\sum \limits _{a-} \mathrm{SA}_{a}^{-} \end{aligned}$$

  14. Partial positive surface area \((\hbox{PPSA}_{1})\): The sum of the solvent-accessible surface areas of all the positively charged atoms.

    It is expressed as

    $$\begin{aligned} {\hbox {PPSA}}_{1}=\sum \limits _{a+} \mathrm{SA}_{a}^{+} \end{aligned}$$

    The sum is restricted to positively charged atoms \(\left( {a+} \right) \)

  15. Total charge weighted negative surface area \((\hbox{PNSA}_{2})\): The partial negative solvent-accessible surface area multiplied by the total negative charge (\(Q^{-}\)).

    It is expressed as

    $$\begin{aligned} {\hbox {PNSA}}_{2}=Q^{-} \cdot \sum \limits _{a-} \mathrm{SA}_{a}^{-} \end{aligned}$$

  16. Total charge weighted positive surface area \((\hbox{PPSA}_{2})\): The partial positive solvent-accessible surface area multiplied by the total positive charge (\(Q^{+}\)).

    It is expressed as

    $$\begin{aligned} {\hbox {PPSA}}_{2}=Q^{+} \cdot \sum \limits _{a+} \mathrm{SA}_{a}^{+} \end{aligned}$$

  17. Atomic charge weighted negative surface area \((\hbox{PNSA}_{3})\): The sum of the products of atomic solvent-accessible surface areas and partial charges \(q_{a}^{-}\) over all negatively charged atoms.

    It is expressed as

    $$\begin{aligned} {\hbox {PNSA}}_{3}=\sum \limits _{a-} q_{a}^{-}\cdot \mathrm{SA}_{a}^{-} \end{aligned}$$

  18. Atomic charge weighted positive surface area \((\hbox{PPSA}_{3})\): The sum of the products of atomic solvent-accessible surface areas and partial charges \(q_{a}^{+}\) over all positively charged atoms.

    It is expressed as

    $$\begin{aligned} {\hbox {PPSA}}_{3}=\sum \limits _{a+} q_{a}^{+}\cdot \mathrm{SA}_{a}^{+} \end{aligned}$$

  19. Difference in the charged partial surface area \((\hbox{DPSA}_{1})\): The difference between the partial positive solvent-accessible surface area and the partial negative solvent-accessible surface area.

    It is expressed as

    $$\begin{aligned} {\hbox {DPSA}}_{1}={\hbox {PPSA}}_{1}-{\hbox {PNSA}}_{1} \end{aligned}$$

  20. Difference in the total charge weighted surface area \((\hbox{DPSA}_{2})\): The difference between the total charge weighted positive solvent-accessible surface area and the total charge weighted negative solvent-accessible surface area.

    It is expressed as

    $$\begin{aligned} {\hbox {DPSA}}_{2}={\hbox {PPSA}}_{2}-{\hbox {PNSA}}_{2} \end{aligned}$$

  21. Difference in the atomic charge weighted surface area \((\hbox{DPSA}_{3})\): The difference between the atomic charge weighted positive solvent-accessible surface area and the atomic charge weighted negative solvent-accessible surface area.

    It is expressed as

    $$\begin{aligned} {\hbox {DPSA}}_{3}={\hbox {PPSA}}_{3}-{\hbox {PNSA}}_{3} \end{aligned}$$

  22. Relative negative charge (RNCG): The partial charge of the most negative atom divided by the total negative charge.

    It is expressed as

    $$\begin{aligned} {\hbox {RNCG}}={\frac{Q_{\mathrm{max}}^{-}}{Q^{-}}} \end{aligned}$$

  23. Relative positive charge (RPCG): The partial charge of the most positive atom divided by the total positive charge.

    It is expressed as

    $$\begin{aligned} {\hbox {RPCG}}={\frac{Q_{\mathrm{max}}^{+}}{Q^{+}}} \end{aligned}$$

  24. Relative negatively charged surface area (RNCS): The solvent-accessible surface area of the most negative atom divided by the relative negative charge (RNCG).

    It is expressed as

    $$\begin{aligned} {\hbox {RNCS}}={\frac{\hbox {SA}_{\mathrm{max}}^{-}}{{\hbox {RNCG}}}} \end{aligned}$$

  25. Relative positively charged surface area (RPCS): The solvent-accessible surface area of the most positive atom divided by the relative positive charge (RPCG).

    It is expressed as

    $$\begin{aligned} {\hbox {RPCS}}={\frac{\hbox {SA}_{\mathrm{max}}^{+}}{{\hbox {RPCG}}}} \end{aligned}$$

  26. Total hydrophobic surface area (TASA): The sum of the solvent-accessible surface areas of atoms whose partial charges have an absolute value less than 0.2.

    It is expressed as

    $$\begin{aligned} {\hbox {TASA}}=\sum \limits _{a} \hbox {SA}_{a} \end{aligned}$$

  27. Total polar surface area (TPSA): The sum of the solvent-accessible surface areas of atoms whose partial charges have an absolute value greater than or equal to 0.2.

    It is expressed as

    $$\begin{aligned} {\hbox {TPSA}}=\sum \limits _{a} \hbox {SA}_{a} \end{aligned}$$

  28. Relative hydrophobic surface area (RASA): The ratio of the total hydrophobic surface area (TASA) to the total molecular solvent-accessible surface area (SASA).

    It is expressed as

    $$\begin{aligned} {\hbox {RASA}}={\frac{{\hbox {TASA}}}{{\hbox {SASA}}}} \end{aligned}$$

  29. Relative polar surface area (RPSA): The total polar surface area (TPSA) divided by the total molecular solvent-accessible surface area (SASA).

    It is expressed as

    $$\begin{aligned} {\hbox {RPSA}}={\frac{{\hbox {TPSA}}}{{\hbox {SASA}}}} \end{aligned}$$

  30. RHTA index: The ratio of the number of hydrogen bond donor groups to the number of hydrogen bond acceptor groups.

    It is expressed as

    $$\begin{aligned} {\hbox {RHTA}}={\frac{{\hbox {HBD}}}{{\hbox {HBA}}}} \end{aligned}$$

  31. SSAH index: The total surface area of hydrogen atoms that can be readily donated.

    It is expressed as

    $$\begin{aligned} {\hbox {SSAH}}\equiv {\hbox {HDSA}}= \sum \limits _{d} \hbox {SA}_{d} \end{aligned}$$

  32. RSAH index: The average surface area of hydrogen atoms that can be donated.

    It is expressed as

    $$\begin{aligned} {\hbox {RSAH}} = {\frac{\sum \nolimits _{d} \hbox {SA}_{d}}{{\hbox {HBD}}}} \end{aligned}$$

  33. RSHM index: The fraction of the total molecular surface area associated with hydrogen atoms that can be readily donated.

    It is expressed as

    $$\begin{aligned} {\hbox {RSHM}} \equiv {\hbox {FHDSA}}={\frac{\sum \nolimits _{d} \hbox {SA}_{d}}{{\hbox {SASA}}}} \end{aligned}$$

  34. SSAA index: The sum of the surface areas of all hydrogen bond acceptor atoms.

    It is expressed as

    $$\begin{aligned} {\hbox {SSAA}}\equiv {\hbox {HASA}}= \sum \limits _{a} \hbox {SA}_{a} \end{aligned}$$

  35. RSAM index: The fraction of the total molecular surface area associated with H-bond acceptor groups.

    It is expressed as

    $$\begin{aligned} {\hbox {RSAM}} \equiv {\hbox {FHASA}}={\frac{\sum \nolimits _{a} \hbox {SA}_{a}}{{\hbox {SASA}}}} \end{aligned}$$

  36. HDCA index: The sum of the charge-weighted surface areas of hydrogen atoms that can be donated.

    It is expressed as

    $$\begin{aligned} {\hbox {HDCA}} = \sum \limits _{d} q_{d} \cdot \hbox {SA}_{d} \end{aligned}$$

  37. HBSA index: The sum of the surface areas of both the hydrogen atoms that can be donated and the hydrogen bond acceptor atoms.

    It is expressed as

    $$\begin{aligned} {\hbox {HBSA}}= {\hbox {HDSA}}+ {\hbox {HASA}} \end{aligned}$$

  38. Graph distance complexity: A molecular descriptor derived from the distance matrix D [15].

    It is expressed as

    $$\begin{aligned} H_{D}={\mathop {\sum }\limits _{i=1}^{A}}{\frac{\sigma _{i}}{I_{\mathrm{ROUV}}}} \cdot v_{i}^{d}= {\mathop {\sum }\limits _{i=1}^{A}} {\frac{\sigma _{i}}{2W}}\cdot v_{i}^{d} \end{aligned}$$

  39. Polar hydrogen factor \((Q_{\mathrm{H}})\): This descriptor relates the polarity of molecules to their C–H bonds. Its application is limited to halogenated hydrocarbons.

    It is expressed as

    $$\begin{aligned} {Q}_{\mathrm{H}}=\sum \limits _{b} \left[ \sum \limits _{C} k_{C}+ \sum \limits _{\alpha } k_{\alpha }+ \sum \limits _{\beta } k_{\beta }\right] \end{aligned}$$

  40. Q polarity index: A topological polarity index derived from the electrotopological intrinsic states of the atoms in a molecule.

    It is expressed as

    $$\begin{aligned} {Q}={\frac{A^{2}\cdot {\mathop {\sum }\nolimits _{i=1}^{A}} I_{i}^{\mathrm{ALK}}}{\left( {\mathop {\sum }\nolimits _{i=1}^{A}} I_{i}\right) ^{2}}} \end{aligned}$$

  41. Molecular polarizability effect index (MPEI): This descriptor works on the principle that molecules are polarized by electrostatic potential fields [16]. The index is calculated by summing the polarizability contributions from the different atoms in a molecule.

    It is expressed as

    $$\begin{aligned} \hbox {MPEI}={\mathop {\sum }\limits _{i=1}^{A}}\hbox { PEI}_{i} \end{aligned}$$

  42. Balaban distance connectivity index (J): A topological descriptor whose value does not vary substantially with the size of the molecule or the number of ring systems.

    It is expressed as

    $$\begin{aligned}&{J}={\frac{B}{C+1}}\cdot \sum \limits _{b} (\sigma _{i}\cdot \sigma _{j})_{b}^{-1/2}\\&\quad = {\frac{1}{C+1}}\cdot \sum \limits _{b} (\bar{\sigma }_{i}\cdot \bar{\sigma }_{j})_{b}^{-1/2} \end{aligned}$$

  43. Atomic charge (q): A widely used computational approach to calculating atomic charges is Mulliken population analysis [17]. The method allocates electrons to atoms, making the atomic charge a local descriptor.

    It is expressed as

    $$\begin{aligned} q_{a}={Z}_{a}-{\mathop {\sum }\limits _{\mu =1}^{{N}_{\mathrm{AO}}}} {\mathop {\sum }\limits _{v=1}^{{N}_{\mathrm{AO}}}} P_{\mu v}\cdot S_{\mu v} \end{aligned}$$

  44. Sub-molecular polarity parameter (SPP, \({}^{1}\Delta \)): An electronic descriptor [18] that mathematically determines the excess charge difference between a pair of atoms.

    It is expressed as

    $$\begin{aligned} {}^{1}\Delta =\left| Q_{\mathrm{max}}^{+}-Q_{\mathrm{max}}^{-}\right| \end{aligned}$$

  45. Second-order sub-molecular polarity parameter \(({}^{2}\Delta )\): Determines the second prime difference of excess charges [19].

    It is expressed as

    $$\begin{aligned} {DP}={\frac{\left| Q_{\mathrm{max}}^{+}-Q_{\mathrm{max}}^{-}\right| }{r_{\pm }^{2}}}= {\frac{{}^{1}\Delta }{r_{\pm }^{2}}} \end{aligned}$$

  46. Molar polarization \((P_{\mathrm{M}})\): The dipole moment induced per unit volume V is termed molar polarization. The Clausius–Mossotti equation describes this descriptor.

    It is expressed as

    $$\begin{aligned} P_{\mathrm{M}}&={\frac{\varepsilon -1}{\varepsilon +2}}\cdot {\frac{\hbox {MW}}{\varrho }}={\frac{4\pi }{3}}\cdot {N}_{\mathrm{A}} \cdot \alpha \\&\cong {\frac{n_{D}^{2}-1}{n_{D}^{2}+2}}\cdot {\frac{\hbox {MW}}{\varrho }}=\hbox {MR} \end{aligned}$$

  47. Atom–atom polarizability: A chemical reactivity index calculated solely on the basis of perturbation theory.

    It is expressed as

    $$\begin{aligned} {P}_{ab}=4\cdot \sum \limits _{i}\sum \limits _{j}\sum \limits _{\mu }\sum \limits _{v} {\frac{C_{i\mu ,a}\cdot C_{j\mu ,a}\cdot C_{iv,b}\cdot C_{jv,b}}{\varepsilon _{i}-\varepsilon _{j}}} \end{aligned}$$

  48. Anisotropy of the polarizability: It quantifies the deviation of the molecular polarizability from an equivalent spherical shape.

    It is expressed as

    $$\begin{aligned} \beta ^{2}={\frac{\left( \alpha _{xx}-\alpha _{yy}\right) ^{2}+\left( \alpha _{yy}-\alpha _{zz}\right) ^{2}+ \left( \alpha _{zz}-\alpha _{xx}\right) ^{2}}{2}} \end{aligned}$$

  49. Overall electronic constants: These are the Hammett substituent constants [20], which measure the total electronic effect of substituents in the meta and para positions of benzene derivatives with a functional group in the side chain.

    It is expressed as

    $$\begin{aligned} \sigma _{\mathrm{m,p}}={\frac{1}{\varrho }}\cdot \hbox {log} \left( {\frac{K^{\mathrm{X}}}{K_{0}}}\right) ={\frac{1}{\varrho }}\cdot \left( \hbox {p}K_{\mathrm{a}}^{0}-{p}K_{\mathrm{a}}^{\mathrm{X}}\right) \end{aligned}$$

  50. Information index on the molecular symmetry \((I_{\mathrm{SYM}})\): A molecular symmetry descriptor that is based solely on total information content.

    It is expressed as

    $$\begin{aligned} I_{\mathrm{SYM}}=A\cdot \hbox {log}_{2}A- {\mathop {\sum }\limits _{g=1}^{G}} A_{g} \hbox {log}_{2} A_{g} \end{aligned}$$

  51. Joshi steric descriptor \((\mathrm{JM}_{1})\): A descriptor that directly measures the steric effect [21] of substituents. It is grouped under the broad class of quantum chemical descriptors.

    It is expressed as

    $$\begin{aligned} \hbox {JM1}={\frac{\Delta {E}_{\mathrm{X}}}{\Delta {E}_{\mathrm{H}}}}\quad \hbox {log}(\hbox {JM1})=\hbox {log}(\Delta {E}_{\mathrm{X}})- \hbox {log}(\Delta {E}_{\mathrm{H}}) \end{aligned}$$

  52. Substituent front strain \((S_{\mathrm{f}})\): A steric descriptor [22] obtained using empirical force fields and calculated from the enthalpy of formation.

    It is expressed as

    $$\begin{aligned} S_{\mathrm{f}}= & {} \Delta H_{\mathrm{f}}^{0}\left[ \hbox {XC}(\hbox {CH}_{3})_{3}\right] -\Delta H_{\mathrm{f}}^{0}\left[ \hbox {XCH}_{3}\right] \\&+\,8.87\,\,[10^{4}\,\hbox {J/mol}] \end{aligned}$$

Fig. 1 Actual versus predicted test set (fingerprint: atom pairs)

Fig. 2 Actual versus predicted training set (fingerprint: atom pairs)

  53. Steric vertex topological descriptor (SVTI): The best performing steric descriptor for alkyl groups [23], characterized by their topological distance (d) in an H-depleted molecular graph.

    It is expressed as

    $$\begin{aligned} {\hbox {SVTI}}={\mathop {\sum }\limits _{j=1}^{A_{\mathrm{X}}}} d_{ij}\quad \forall d_{ij}\le 3 \end{aligned}$$

  54. Steric density descriptor \((\mathrm{SD}_{X})\): The substituent steric descriptor [24], which relates molecular mass to van der Waals volume.

    It is expressed as

    $$\begin{aligned} {\hbox {SD}}_{X}= & {} \left( {\frac{MW}{{V}_{\mathrm{VDW}}}}\right) _{\mathrm{X}}- \left( {\frac{MW}{{V}_{\mathrm{VDW}}}}\right) _{\mathrm{H}}\\= & {} \left( {\frac{MW}{{V}_{\mathrm{VDW}}}}\right) _{\mathrm{X}}-0.29 \end{aligned}$$

Table 1 KPLS results on training set

Table 2 KPLS results on test set

Table 3 Activity prediction (actual vs. predicted)

  55. Model of the frontier steric effect descriptor \((R_{\mathrm{S}})\): A theoretical descriptor used to estimate Taft’s steric constant [25], which is based on the fundamental characteristics of the constituent atoms.

    It is expressed as

    $$\begin{aligned} R_{\mathrm{S}}=-30 \cdot \hbox {log}\left( 1-{\mathop {\sum }\limits _{i=1}^{n}} {\frac{R_{i}^{2}}{4\cdot r_{i}^{2}}}\right) \end{aligned}$$

  56. Carbo similarity index (C): This descriptor [26] compares two molecules based on their electron density. It is also applied to compare any structural properties between molecules.

    It is expressed as

    $$\begin{aligned} {C}_{st}={\frac{{\mathop {\sum }\nolimits _{k=1}^{\mathrm{N}}} {P}_{sk}\cdot {P}_{tk}}{{\left( {\mathop {\sum }\nolimits _{k=1}^{\mathrm{N}}{P}_{sk}^{2}}\right) ^{1/2}}\cdot {\left( {\mathop {\sum }\nolimits _{k=1}^{\mathrm{N}}{P}_{tk}^{2}}\right) ^{1/2}}}} \end{aligned}$$

  57. Electrophilic atomic frontier electron density descriptor \((f_{a}^{-})\): A molecular descriptor reflecting the electron density in the HOMO of a compound.

    It is expressed as

    $$\begin{aligned} {f}_{a}^{-}=\sum \limits _{\mu }\left( c_{\mathrm{HOMO},\mu }\right) ^{2} \end{aligned}$$

  58. Hardness indices (\(\eta \)): A class of descriptors derived directly from density functional theory [27]. They correspond to the second-order derivative of the energy with respect to the total number of electrons in the system.

    It is expressed as

    $$\begin{aligned} \eta= & {} {\frac{1}{2}}\left( {\frac{\partial ^{2}{E}}{\partial {N}_{\mathrm{el}}{^{2}}}}\right) _{v({r})}= \left( {\frac{\partial \mu }{\partial {N}_{\mathrm{el}}}}\right) _{v({r})}\\= & {} \int {h}({r}){ \mathrm{d}r}= {\frac{1}{2\cdot {S}}} \end{aligned}$$

  59. Composite nuclear potential \((v(r))\): This descriptor defines the pattern of the nuclei of a molecule.

    It is expressed as

    $$\begin{aligned} v({r})={\mathop {\sum }\limits _{a=1}^{A}} {\frac{{Z}_{a}}{|{r}-{R}_{a}|}} \end{aligned}$$

  60. Kier alpha molecular flexibility index (\(\varPhi \)): A direct measure of molecular flexibility [28] derived from the Kier alpha adapted shape descriptors.

    It is expressed as

    $$\begin{aligned} \varPhi ={\frac{{}^{1} {K}_{\alpha } \cdot {}^{2}{K}_{\alpha }}{A}} \end{aligned}$$
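Several of the descriptors listed above reduce to short sums once the inputs are known. The sketch below evaluates the information indices \(I_{\mathrm{SIZE}}\) and \(I_{\mathrm{AC}}\) from a molecular formula, and a few of the CPSA descriptors from per-atom partial charges and solvent-accessible surface areas. The function names, the ethanol formula and the toy charges and areas are illustrative assumptions; they are not data from this study, and this is not how Canvas computes the values internally.

```python
import math


def information_indices(atom_counts):
    """Return (I_SIZE, I_AC) from element counts that include hydrogens."""
    total = sum(atom_counts.values())  # A^h, the total number of atoms
    i_size = total * math.log2(total)
    i_ac = i_size - sum(n * math.log2(n) for n in atom_counts.values() if n > 0)
    return i_size, i_ac


def cpsa(charges, areas):
    """Evaluate PNSA/PPSA/DPSA (1-3), RNCG and RPCG from per-atom charges and areas."""
    neg = [(q, sa) for q, sa in zip(charges, areas) if q < 0]
    pos = [(q, sa) for q, sa in zip(charges, areas) if q > 0]
    if not neg or not pos:
        raise ValueError("need at least one negative and one positive atom")
    q_neg = sum(q for q, _ in neg)  # total negative charge Q-
    q_pos = sum(q for q, _ in pos)  # total positive charge Q+
    out = {
        "PNSA1": sum(sa for _, sa in neg),
        "PPSA1": sum(sa for _, sa in pos),
        "PNSA3": sum(q * sa for q, sa in neg),
        "PPSA3": sum(q * sa for q, sa in pos),
        "RNCG": min(q for q, _ in neg) / q_neg,
        "RPCG": max(q for q, _ in pos) / q_pos,
    }
    out["PNSA2"] = q_neg * out["PNSA1"]
    out["PPSA2"] = q_pos * out["PPSA1"]
    for k in ("1", "2", "3"):
        out["DPSA" + k] = out["PPSA" + k] - out["PNSA" + k]
    return out


# Ethanol, C2H6O, used only as a small worked example.
i_size, i_ac = information_indices({"C": 2, "H": 6, "O": 1})
print(f"I_SIZE = {i_size:.4f}, I_AC = {i_ac:.4f}")

# Toy partial charges and surface areas (hypothetical values).
charges = [-0.4, 0.1, 0.3, -0.2, 0.2]
areas = [10.0, 20.0, 5.0, 15.0, 10.0]
demo = cpsa(charges, areas)
print({name: round(value, 3) for name, value in sorted(demo.items())})
```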

Incorporation of binary fingerprints

The seven available binary fingerprints were incorporated using the Canvas interface. Only the models with a significant contribution were retained for further processing.

Fig. 3 Hashing in atom pair fingerprinting

Fig. 4 Atomic contribution model for fingerprint: atom pairs

Fig. 5 Actual versus predicted training set (fingerprint: atom triplets)

Fig. 6 Actual versus predicted test set (fingerprint: atom triplets)

Kernel-based partial least squares regression using different binary fingerprints

  1. Binary fingerprint: atom pairs [29]

Description: pairs of atoms, differentiated by atom type and distance of separation.

Table 4 KPLS results on training set
Table 5 Results of KPLS on test set

This fingerprint is built on the concept of atom pairs. The molecule is broken down into atom-pair fragments, each of which is hashed to give an integer value, and the shortest distance between the two atoms of a pair is denoted d. The contribution of the model to the activity is assessed by kernel-based partial least squares regression, using its regression statistics (\(R^{2}\) and \(Q^{2}\) values).
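The atom-pair idea can be sketched in a few lines: enumerate every pair of atoms, record their types and shortest bond-count distance d, and hash the resulting key to a bit address. Using plain element symbols as atom types and `zlib.crc32` as the hash are assumptions made for illustration; the atom typing and hash function actually used by Canvas are not described here.

```python
import zlib
from collections import deque
from itertools import combinations


def bond_distances(adjacency, start):
    """Breadth-first search: number of bonds from `start` to each reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if neighbour not in dist:
                dist[neighbour] = dist[atom] + 1
                queue.append(neighbour)
    return dist


def atom_pair_keys(atom_types, adjacency):
    """Return the set of 'TypeA-d-TypeB' keys for all connected atom pairs."""
    dist_from = {i: bond_distances(adjacency, i) for i in range(len(atom_types))}
    keys = set()
    for i, j in combinations(range(len(atom_types)), 2):
        d = dist_from[i].get(j)
        if d is None:  # atoms in different disconnected fragments
            continue
        type_a, type_b = sorted((atom_types[i], atom_types[j]))
        keys.add(f"{type_a}-{d}-{type_b}")
    return keys


def hash_to_bit(key, n_bits=2**32):
    """Map a fragment key to a bit address (illustrative hash, not Canvas's own)."""
    return zlib.crc32(key.encode("utf-8")) % n_bits


# Heavy-atom graph of ethanol, C-C-O, used as a toy molecule.
atom_types = ["C", "C", "O"]
adjacency = {0: [1], 1: [0, 2], 2: [1]}

pair_keys = atom_pair_keys(atom_types, adjacency)
on_bits = {hash_to_bit(key) for key in pair_keys}
print(sorted(pair_keys))
```

Sorting the two atom types before building the key makes the pair independent of the order in which the atoms are visited.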

  2. Binary fingerprint: atom triplets [29]

Description: triplets of atoms and the three distances separating them.

The model is an extension of the atom-pair fingerprint. A triplet represents three atoms and the distances by which they are separated. A triplet can be written in 6 (3!) different ways, one for each permutation of its atoms. An operation is performed to exclude the bits which correspond only to the permutation Typea-dab-Typeb-dbc-Typec-dca. The assessment method is identical to that described under atom pairs.
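The six equivalent ways of writing a triplet can be enumerated directly. The sketch below generates all six Typea-dab-Typeb-dbc-Typec-dca representations and keeps the lexicographically smallest one as a single canonical form. Choosing the smallest representation is an assumption made for illustration; which permutation Canvas actually retains or excludes is not specified above.

```python
from itertools import permutations


def triplet_representations(types, dist):
    """All 6 (type_a, d_ab, type_b, d_bc, type_c, d_ca) orderings of a triplet."""
    reps = []
    for i, j, k in permutations(range(3)):
        reps.append((types[i], dist[i][j], types[j], dist[j][k], types[k], dist[k][i]))
    return reps


def canonical_triplet(types, dist):
    """Pick one representative so that equivalent triplets map to the same key."""
    return min(triplet_representations(types, dist))


# Hypothetical triplet: C, N and O with bond-count distances C-N = 1, C-O = 2, N-O = 3.
types = ["C", "N", "O"]
dist = [[0, 1, 2],
        [1, 0, 3],
        [2, 3, 0]]

all_forms = triplet_representations(types, dist)
key = canonical_triplet(types, dist)
print(len(all_forms), key)
```

Because every ordering of the same three atoms yields the same key, the triplet sets one bit instead of up to six.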

  3. Binary fingerprint: fp linear [29]

Description: linear fragments \(+\) ring closures.

The linear fingerprint fragments a molecule along every possible linear path. By default, paths of up to 7 bonds are considered. A hashing operation is performed on each linear fragment to generate the corresponding bit address. To handle closed ring systems, linear paths may be extended up to 14 bonds. The linear fingerprint can therefore be applied extensively to compounds with extended ring systems. The assessment method is the same as for atom pairs.
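A minimal sketch of linear path enumeration is given below: starting from every atom, simple paths are grown one bond at a time up to a maximum number of bonds, and each path is stored in a direction-independent form before hashing. Ring closures are ignored, element symbols stand in for atom types, and `zlib.crc32` stands in for the hash function; all three are simplifying assumptions rather than the Canvas implementation.

```python
import zlib


def linear_paths(atom_types, adjacency, max_bonds=7):
    """Return the unique linear fragments (as label tuples) with 1..max_bonds bonds."""
    found = set()

    def extend(path):
        if len(path) > 1:
            labels = tuple(atom_types[i] for i in path)
            # A path and its reverse describe the same fragment; keep one form.
            found.add(min(labels, labels[::-1]))
        if len(path) - 1 == max_bonds:
            return
        for neighbour in adjacency[path[-1]]:
            if neighbour not in path:  # simple paths only, no atom revisited
                extend(path + [neighbour])

    for start in range(len(atom_types)):
        extend([start])
    return found


def path_bit(labels, n_bits=2**32):
    """Hash a linear fragment to a bit address (illustrative hash)."""
    return zlib.crc32("-".join(labels).encode("utf-8")) % n_bits


# Heavy-atom graph of ethanol, C-C-O, used as a toy molecule.
atom_types = ["C", "C", "O"]
adjacency = {0: [1], 1: [0, 2], 2: [1]}

fragments = linear_paths(atom_types, adjacency)
fragment_bits = {path_bit(labels) for labels in fragments}
print(sorted(fragments))
```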

  4. Binary fingerprint: 2D molprint [29]

Description: The molprint fingerprint encodes each heavy atom present in a molecule together with its surrounding environment, which extends to a maximum of two bonds. The bit value is derived from stored data containing the heavy atom and the atoms separated from it by one or two bonds. The assessment method is the same as for atom pairs.
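The atom-environment idea can be illustrated as follows: for each heavy atom, the atom types found one bond and two bonds away are collected layer by layer. Representing an environment as the centre type plus a count of neighbour types per layer, and using element symbols as atom types, are assumptions of this sketch, not a description of the Canvas implementation.

```python
from collections import Counter


def atom_environments(atom_types, adjacency, max_depth=2):
    """For each atom, return (centre type, layer-1 type counts, layer-2 type counts, ...)."""
    environments = []
    for centre in range(len(atom_types)):
        seen = {centre}
        frontier = [centre]
        layers = []
        for _ in range(max_depth):
            next_frontier = []
            for atom in frontier:
                for neighbour in adjacency[atom]:
                    if neighbour not in seen:
                        seen.add(neighbour)
                        next_frontier.append(neighbour)
            counts = Counter(atom_types[i] for i in next_frontier)
            layers.append(tuple(sorted(counts.items())))
            frontier = next_frontier
        environments.append((atom_types[centre],) + tuple(layers))
    return environments


# Heavy-atom graph of ethanol, C-C-O, used as a toy molecule.
atom_types = ["C", "C", "O"]
adjacency = {0: [1], 1: [0, 2], 2: [1]}

envs = atom_environments(atom_types, adjacency)
for env in envs:
    print(env)
```

Each distinct environment would then be hashed to a bit address in the same way as the fragments of the other fingerprints.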

  5. Binary fingerprint: fp dendritic [29]

Description: Linear and branched fragments.

The dendritic fingerprint uses a combination of both linear and branched fragments up to a user-defined limit of 5 bonds. There is no special treatment for ring systems, which are considered as branched fragments. The assessment method is the same as for atom pairs.

Results and discussion

  1. Results of kernel-based partial least squares [30] regression on fingerprint: atom pairs (Figs. 1, 2)

The fingerprint atom pairs gave a statistically significant 2D QSAR model with excellent regression coefficient values and cross-validation coefficient values, as represented in Tables 1 and 2. The model also showed good predictive accuracy for both test and training set molecules, as demonstrated in Table 3. The hashing technique used is represented in Fig. 3.

For the assessment of atomic contribution to the model, three molecules were taken from each of the active and inactive sets. Atoms contributing positively to activity were colored blue and atoms contributing detrimentally to the model were colored yellow, as shown in Fig. 4.

Table 6 Activity prediction (actual vs. predicted)

Fig. 7 Hashing in atom triplets

  2. Results of kernel-based partial least squares regression on fingerprint: atom triplets (Figs. 5, 6)

Fig. 8 Atomic contribution model for fingerprint: atom triplets

Table 7 KPLS results on training set

The fingerprint atom triplets gave a statistically significant 2D QSAR model with excellent regression coefficient values and cross-validation coefficient values as represented in Tables 4 and 5. The model also showed good predictive accuracy in both test and training set molecules as evidenced in Table 6. The hashing pattern in the atom triplet is shown in Fig. 7.

Table 8 KPLS results on training set
Table 9 Activity prediction (actual vs. predicted)

To assess atomic contributions to the model, three molecules were taken from each of the active and inactive sets. Atoms contributing positively to activity are colored yellow and atoms contributing detrimentally are colored green, as shown in Fig. 8.

Fig. 9 Hashing in fp linear fingerprint

3. Results of kernel-based partial least square regression on fingerprint: fp linear (Figs. 10, 11)

The fp linear fingerprint gave a statistically significant 2D QSAR model with excellent regression coefficient and cross-validation coefficient values, as presented in Tables 7 and 8. The model also showed good predictive ability for both test and training set molecules, as evidenced in Table 9. The hashing pattern in fp linear is represented in Fig. 9.

To assess atomic contributions to the model [31], three molecules were taken from each of the active and inactive sets. Atoms contributing positively to activity are colored red and atoms contributing detrimentally are colored blue, as shown in Fig. 12.

4. Results of kernel-based partial least square regression on fingerprint: 2D molprint (Figs. 13, 14)

The 2D molprint fingerprint gave a statistically insignificant 2D QSAR model with a large difference between the regression coefficient and cross-validation coefficient values, as presented in Tables 10 and 11. 2D molprint is therefore not a suitable fingerprint for the selected flavone class. The model also showed poor predictive accuracy for both test and training set molecules, as shown in Table 12. The hashing pattern is represented in Fig. 15.

To assess atomic contributions to the model, three molecules were taken from each of the active and inactive sets. Atoms contributing positively to activity are colored maroon and atoms contributing detrimentally are colored blue, as shown in Fig. 16.

5. Results of kernel-based partial least square regression on fingerprint: fp dendritic

The fp dendritic fingerprint gave a statistically insignificant 2D QSAR model with a large difference between the regression coefficient and cross-validation coefficient values, as presented in Tables 13 and 14. Therefore, fp dendritic [10] is not a suitable fingerprint for the selected flavone class (Figs. 17, 18, 19). The model also showed poor predictive ability for both test and training set molecules, as evidenced in Table 15. The hashing pattern is represented in Fig. 20.

Fig. 10 Actual versus predicted test set (fingerprint: fp linear)
Fig. 11 Actual versus predicted training set (fingerprint: fp linear)
Fig. 12 Atomic contribution model for fingerprint: fp linear
Fig. 13 Actual versus predicted training set (fingerprint: 2D molprint)
Fig. 14 Actual versus predicted test set (fingerprint: 2D molprint)
Fig. 15 Hashing in 2D molprint (circular fingerprints)

To assess atomic contributions to the fp dendritic model, three molecules were taken from each of the active and inactive sets. Atoms contributing positively to activity are colored orange and atoms contributing detrimentally are colored maroon, as shown in Fig. 19.

Structure–activity relationship studies

Figure 21 shows that substitution with electron-donating groups (e.g., hydroxyl) on both rings A and C increased activity in all five fingerprint models. The only exception was molecule Tan 1, which is active even though it has no substituents on either ring A or ring C. Substitution with electron-withdrawing groups on both rings A and C decreased activity in all five fingerprint models. This may be attributed to the electronic effects exerted by the various functional groups on the aromatic rings.

Table 10 KPLS results on training set
Table 11 KPLS results on test set
Table 12 Activity prediction (actual vs. predicted)
Fig. 16 Atomic contribution model for fingerprint: 2D molprint

Table 13 KPLS results on training set
Table 14 KPLS results on test set
Fig. 17 Actual versus predicted training set (fingerprint: fp dendritic)
Fig. 18 Actual versus predicted test set (fingerprint: fp dendritic)
Fig. 19 Atomic contribution model for atom fingerprint: fp dendritic

Table 15 Activity prediction (actual vs. predicted)

Analysis of the binding of flavones in the active site of Tankyrase I shows that hydrogen bond donor interactions are prominent for effective binding. Consequently, electron-donating groups such as –OH and –NH2 were found to increase binding affinity toward Tankyrase I according to the developed QSAR model.

Fig. 20 Generation of hash codes in dendritic fingerprint
Fig. 21 Interpretation of QSAR model

Conclusions

Kernel-based partial least square regression was performed on a series of novel mono-substituted flavones using five binary fingerprinting methods. The contribution of each fingerprint model to the activity depends on several factors. In the present study, the atom pairs fingerprint gave a statistically significant 2D QSAR model with excellent regression values. The factors responsible for the success of the pairwise fingerprint are molecular size and molecular weight: all the selected test compounds were bicyclic fused systems with mono-substitutions on rings A and C, and their molecular weights were in the range of 200–400 Daltons. The atom triplets fingerprint also gave a statistically significant 2D QSAR equation, owing to atom triplets that occur at a particular site and at a particular distance. The three remaining fingerprints, namely linear, 2D molprint and dendritic, could not reach acceptable regression values. This failure may be attributed to the atom typing scheme and structural variation. Suitable fingerprints should be selected based on structure, molecular size, types of ring systems and the nature of extended branching.