1 Introduction

Fungal infections are one of the most important issues in healthcare. Their number is growing as a result of, among other reasons, continuing environmental pollution, an increase in background radioactivity, improper application of broad-spectrum antibiotics, growing use of cytostatic and immunosuppressive drugs, and increasingly frequent antifungal drug resistance [1,2,3]. Among these infections, invasive mycoses are becoming an increasingly important medical concern due to the growing number of immunocompromised patients [4,5,6]. The number of currently available and approved systemic antifungals is insufficient [7,8,9], and the development of novel antifungal drugs is not keeping pace with the growth of fungal diseases, including invasive fungal infections, which are an existential and growing problem for modern healthcare [10,11,12]. Effective use of antifungal drugs to treat various mycoses is an important factor in the fight against fungal infections.

One of the main issues affecting drug research is the cost of research and development, which can reach as high as 2.5 billion dollars [13]. The time it takes to develop a new drug is also a key issue, as a great deal of time is lost on drugs that ultimately do not pass pre-clinical or clinical trials.

One of the modern approaches to developing novel, highly effective, low-toxicity antifungal drugs with improved medical, biological, and biopharmaceutical properties is the chemical modification of existing antifungal drugs, chief among them polyene macrolide antibiotics [14,15,16]. In this chapter, we discuss this specific class of antibiotics, polyene macrolide antibiotics (PMA), which make up approximately a quarter of all existing antifungal antibiotics. The chemical structure of a PMA consists of a macrolide ring that contains conjugated double bonds on one side (forming the lipophilic side of the molecule) and a number of hydroxyl and keto groups on the other (forming the hydrophilic side of the molecule). Their biological target is ergosterol, one of the components of the phospholipid membrane of pathogenic fungi.

Amphotericin B is the drug of choice (gold standard) among all known PMA due to its high antifungal activity against the vast majority of known clinical forms of mycoses. PMA derivatives (PMAD) are chemically modified versions of existing PMA drugs which retain the biological activity of the initial drug while having lower toxicity. They can be an important topic of research in the fight against fungal drug resistance [17,18,19].

Software engineers can support the process of PMAD research by creating a software system that can predict antifungal activity and toxicity on the basis of the chemical structure of a molecule.

The goal is to develop models, and a software solution providing those models, that can reduce antifungal drug research time and cost by selecting PMAD that have lower toxicity while retaining their ability to bind to the biological target. Using such a program, a researcher can check the predicted toxicity and antifungal activity of a potential PMAD and select the PMAD with more favorable traits for pre-clinical trials. The program helps to cut the time and other resources spent on pre-clinical and clinical trials of PMAD that lack the desired pharmaceutical properties.

2 Description of the Software System

The software system contains interfaces for researchers, experts, and database administrators. It includes an intelligent data analysis subsystem, a subsystem of synthesis step selection, and databases providing them with the data they require to function (see Fig. 1).

Fig. 1 Architecture of the software system for predicting and researching antifungal antibiotics’ properties

In Fig. 1, LD50 is the lethal dose for half of the population (mg of drug per kg, oral intake, rats); T is a vector of predicted results of assays corresponding to toxicity signaling reactions; BL is the likelihood of binding to the biological target (%); D is the graph representation of the molecule; I is additional data to train the neural networks on; R is the results of this training process (AUC, MSE); S is the SMILES notation representation of the molecule’s structure; SV is a vectorized version of that notation; MD is a vector of molecular descriptor values generated from the SMILES notation representation; MF is the Morgan molecular fingerprint bit vector for the molecule; X is the description of the initial PMA; Xe is the result of modifying the structure of that PMA to create a PMAD; Y is the experimentally derived values acquired by testing that PMAD; and Z is the PMAD synthesis steps.

The data analysis subsystem consists of one acute toxicity model based on gradient boosted decision trees, 12 recurrent neural networks modeling one property each based on embedded vector representations of the elements of the SMILES notation of the molecule, and a deterministic algorithm for predicting biological activity based on pharmacophore filtering.

3 Data Analysis Subsystem Components

The acute toxicity model uses the gradient-boosted decision tree library CatBoost to predict toxicity. It is trained on data retrieved from ChemIDplus [20] in TSV form (approximately 6000 values). Of note is that, prior to training, we multiply each target LD50 value by the normalized (0, 1] value of the molecule’s logP in order to adjust somewhat for absorption differences due to lipophilicity. Each molecule is input as a SMILES string and processed using RDKit [21], which also provides the descriptors we use and the RDKit molecular fingerprint that also serves as input. These features are then fed into the CatBoost model. The model’s hyperparameters are as follows: iterations: 50,000, depth: 6, od_type: 'Iter', od_wait: 500, learning_rate: 0.07, random_strength: 40, l2_leaf_reg: 100, rsm: 0.3.

At prediction time, the predicted values are divided by the normalized logP value to recover the LD50 estimate. The normalizer model is stored alongside the CatBoost model.
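The logP scaling of the target can be illustrated with a minimal sketch. It assumes the normalizer is a min-max scaler with a small positive floor, so that the result stays in (0, 1]; the text does not state how logP is normalized, and the names `LogPNormalizer`, `scale_target`, `unscale_prediction`, and the `floor` value are hypothetical. The CatBoost model itself is omitted.

```python
from dataclasses import dataclass


@dataclass
class LogPNormalizer:
    """Assumed min-max scaler that maps logP into [floor, 1], a subset of (0, 1]."""
    lo: float
    hi: float
    floor: float = 0.05  # hypothetical lower bound, keeps the divisor non-zero

    @classmethod
    def fit(cls, logp_values, floor=0.05):
        # Remember the logP range seen in the training data.
        return cls(lo=min(logp_values), hi=max(logp_values), floor=floor)

    def transform(self, logp):
        if self.hi == self.lo:
            return 1.0
        scaled = (logp - self.lo) / (self.hi - self.lo)
        # Clip so that logP values outside the training range stay in [floor, 1].
        scaled = min(max(scaled, 0.0), 1.0)
        return self.floor + (1.0 - self.floor) * scaled


def scale_target(ld50, logp, normalizer):
    """Training-time step: multiply the LD50 target by normalized logP."""
    return ld50 * normalizer.transform(logp)


def unscale_prediction(prediction, logp, normalizer):
    """Prediction-time step: divide the model output by normalized logP."""
    return prediction / normalizer.transform(logp)
```

Because the same fitted normalizer is needed for both steps, it has to be persisted together with the trained model, which matches the storage note above.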

The assay-based toxicity prediction has a pre-processing step. First, we use the ChEMBL database [22] to obtain approximately 1.7 million SMILES representations of valid molecules. We then determine all of the unique elements that occur in the SMILES notation and one-hot encode them. We then train a skip-gram variant of word embeddings on these elements. The window size is 11 (5 elements to each side of the central element) and the embedded vector has 15 elements. The skip-gram network used to obtain the embedded vectors is presented in Fig. 2.
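The pre-processing steps can be sketched as follows. This is an illustration under stated assumptions, not the authors’ code: the text does not specify how a SMILES string is split into elements, so the regular expression below (bracket atoms, two-letter halogens, two-digit ring closures, then single characters) is an assumed tokenizer, and all function names are hypothetical. Training the embedding network itself is not shown.

```python
import re

# Assumed tokenizer; the source does not specify its tokenization rules.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into a list of elements (tokens)."""
    return SMILES_TOKEN.findall(smiles)


def build_vocab(token_lists):
    """Map every unique element found in the corpus to an integer index."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    return {tok: i for i, tok in enumerate(vocab)}


def one_hot(index, size):
    """One-hot vector of the given size with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def skipgram_pairs(tokens, half_window=5):
    """Return (center, context) pairs for all neighbours within half_window.

    half_window=5 gives the window of 11 elements described in the text.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - half_window)
        hi = min(len(tokens), i + half_window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```

In a skip-gram setup, each pair would be fed to the network as a one-hot center element (input) and a one-hot context element (target); the 15-element hidden layer then serves as the embedded vector.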

Fig. 2 Skip-gram encoding to obtain an information-rich embedded vector (represented here as the hidden layer)

We limit predictions to SMILES notations of at most 300 elements. If a SMILES notation is shorter than 300 elements, we append zero vectors to ensure all inputs are of identical (300, 15) shape. The neural network consists of a bi-directional GRU layer, represented in Fig. 3, and is trained on the Tox21 dataset [23].
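The padding step can be sketched in a few lines. This is a minimal illustration using plain Python lists; the function name `pad_embedded` is hypothetical, and rejecting over-long inputs with an exception is an assumption about how the 300-element limit is enforced.

```python
MAX_LEN = 300   # maximum number of SMILES elements, from the text
EMBED_DIM = 15  # length of each embedded vector, from the text


def pad_embedded(sequence, max_len=MAX_LEN, dim=EMBED_DIM):
    """Return a max_len x dim list of lists, zero-padded at the end."""
    if len(sequence) > max_len:
        raise ValueError(
            f"SMILES has {len(sequence)} elements; the limit is {max_len}"
        )
    for vec in sequence:
        if len(vec) != dim:
            raise ValueError(f"every embedded vector must have length {dim}")
    # Append zero vectors until the sequence reaches max_len rows.
    padding = [[0.0] * dim for _ in range(max_len - len(sequence))]
    return [list(vec) for vec in sequence] + padding
```

The padded result has the fixed (300, 15) shape that a recurrent layer expects when inputs are batched together.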

Fig. 3 Representation of a bi-directional GRU model for predicting one assay-based toxicity indicator. The figure represents the SMILES notation in its character form, though in the actual input each element is replaced by its embedded vector

The biological activity pharmacophore filtering model is a deterministic algorithm. First, RDKit fingerprints of a subset of polyene macrolide antifungal antibiotics are generated. Then, these are combined into a single fingerprint in such a way that only those features that exist in each of them are left in the resulting pharmacophore. This new fingerprint is treated as the minimum set of features required for antifungal activity. The algorithm to predict the probability of antifungal activity is as follows:

$$p = \frac{\sum_{i=1}^{2048} M_i \cdot \mathrm{PH}_i}{\sum_{i=1}^{2048} \mathrm{PH}_i},$$

where $p$ is the predicted value in [0, 1], corresponding to the probability that the input molecule will show antifungal activity;

$M_i$ is the i-th value of the 2048-bit vector, corresponding to the presence or absence of a structural element of the researched molecule;

$\mathrm{PH}_i$ is the i-th value of the 2048-bit vector, corresponding to the presence or absence of a structural element of the pharmacophore.

In order to use this algorithm to classify researched molecules as having or lacking antifungal activity, a cutoff value is used. The selected cutoff value is 0.95, meaning that at least 95% of all structural elements of the pharmacophore must be present in a researched molecule for it to be marked as an antifungal antibiotic.
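The pharmacophore filter can be sketched as follows, with fingerprints represented as plain lists of 0/1 bits. This is an illustration under stated assumptions: the function names are hypothetical, fingerprint generation with RDKit is omitted, and treating a score exactly equal to the cutoff as active is an assumption.

```python
def common_pharmacophore(fingerprints):
    """Keep only the bits that are set in every reference fingerprint."""
    if not fingerprints:
        raise ValueError("at least one reference fingerprint is required")
    length = len(fingerprints[0])
    if any(len(fp) != length for fp in fingerprints):
        raise ValueError("all fingerprints must have the same length")
    # zip(*...) groups the i-th bit of every fingerprint; all() is a bitwise AND.
    return [int(all(bits)) for bits in zip(*fingerprints)]


def activity_score(molecule_fp, pharmacophore):
    """Fraction of pharmacophore bits that are also set in the molecule."""
    if len(molecule_fp) != len(pharmacophore):
        raise ValueError("fingerprint lengths differ")
    required = sum(pharmacophore)
    if required == 0:
        raise ValueError("pharmacophore has no set bits")
    matched = sum(m * ph for m, ph in zip(molecule_fp, pharmacophore))
    return matched / required


def is_antifungal(molecule_fp, pharmacophore, cutoff=0.95):
    """Classify a molecule as active when its score reaches the cutoff."""
    return activity_score(molecule_fp, pharmacophore) >= cutoff
```

For example, with 4-bit toy fingerprints, the references `[1, 1, 0, 1]`, `[1, 1, 1, 0]`, and `[1, 1, 0, 0]` share only the first two bits, so a molecule with fingerprint `[1, 0, 1, 1]` matches one of the two required bits and scores 0.5, below the cutoff.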

4 Interpretation and Discussion of Research Results

The acute toxicity model’s root mean squared error was 58 mg/kg (LD50, oral intake, rats). This was acceptable given that the molar mass of antifungal antibiotics tends to be 600+ g/mol. The results of the assay-based toxicity prediction on the Tox21 dataset are presented in Table 1.

Table 1 Tox21 modeling results

The biological activity prediction pharmacophore filter was tested on a set of antifungal antibiotics as well as a set of drugs that are not antifungal antibiotics. With the selected cutoff point of 0.95, all of the antifungal antibiotics were correctly classified as such, and none of the non-antifungal drugs were classified as antifungal drugs.

5 Conclusion

We proposed an approach to designing a data analysis subsystem for a software system for predicting and researching the properties of antifungal antibiotics. The subsystem combines gradient-boosted decision tree models, recurrent neural networks, and non-statistical algorithms. The software solution is configurable to various types of antifungal antibiotics, and its models can be trained on more data on antifungal antibiotic derivatives to improve their accuracy. Testing was performed using sets of existing antifungal antibiotics as well as a number of recently synthesized novel antibiotics [14,15,16,18,19]. Testing supports the applicability of the system for predicting antifungal antibiotics’ properties.