Background & Summary

Nature’s vast chemical diversity1 has been a rich reservoir for various applications in personal care2, agriculture3, and health4. Methods to discover these valuable natural products have evolved from trial and error, to high-throughput screening5, and presently to the artificial intelligence revolution combined with modern bioinformatics and cheminformatics6,7. An enduring interest in the exploration and characterisation of natural products has yielded a diverse collection of valuable specialty chemicals exemplified by medicines8, herbicides9, and fragrances10. Natural products are typically synthesised through the concerted effort of multiple enzymes encoded by gene clusters in the microbial genome, also known as biosynthetic gene clusters (BGCs)11. Despite the vast chemical repertoire suggested by genomic information in observed BGCs, only a small subset can be obtained under laboratory conditions12, while the rest either remain unexpressed (silent) or have predicted compounds that go unobserved (cryptic)13.

To interrogate the untapped potential in these silent or cryptic BGCs, we have developed a multi-pronged activation strategy14 that combines integrase-mediated genetic activation15 with the “one strain many compounds” (OSMAC)16 cultivation-based approach, expanding the accessible metabolite space approximately 2-fold. A series of 54 actinobacterial strains isolated from soil and marine environments in Singapore17 were integrated with 5 different regulators (Table 1): cyclic AMP receptor protein (Crp)18, A-factor dependent protein A (AdpA)19, highly conserved Streptomyces antibiotic regulatory protein (SARP, RedD)20, fatty acyl CoA synthase (FAS)21, and a sporulation and antibiotics related gene A protein (SarA)22.

Table 1 List of global regulators examined, their descriptions, and accession codes.

The modifications yielded 459 mutants from 124 unique regulator-strain combinations. These native and engineered strains were then fermented in 3 to 5 media to yield a total of 2,138 fermentation extracts. High-throughput liquid chromatography-tandem mass spectrometry (LC-MS/MS) was employed to separate and characterize the complex chemical composition of these fermentation extracts, and the resulting data analyzed and organized into a curated dataset (Fig. 1).

Fig. 1

Overview of the LC-MS/MS dataset describing the metabolic profiling of the 54 actinobacterial strains and their 459 mutants. (a) The experimental workflow comprises the following steps: (1) an actinobacterial collection of 54 wild type strains is (2) genetically activated by integrating overexpression cassettes of 5 global regulators (Crp, AdpA, RedD, FAS, SarA), (3) followed by OSMAC fermentation of these strains and their mutants in 3–5 different media to generate 2,138 samples. (4) Subsequently, sample preparation is carried out through lyophilization and solvent extraction, (5) followed by LC-MS/MS analysis of the fermentation extracts and (6) overall molecular networking data analysis. (b) Phylogenetic tree of 16S rRNA sequences of 50 strains14 and 4 model Streptomyces (Streptomyces venezuelae, Streptomyces griseus, Streptomyces coelicolor and Streptomyces albidoflavus).

Here, we report a curated LC-MS/MS dataset23 describing the metabolic profiling of the 54 actinobacterial strains and their 459 mutants, as well as molecular networking analyses and suggested data applications not found in the original manuscript14. By analyzing the tandem mass spectra of 2,138 fermentation extracts using molecular networking (Fig. 2), we identified 743 distinct metabolites grouped into 69 clusters (each containing at least two metabolites) and an additional 126 orphan metabolites. Detailed information on these annotated metabolites and clusters is reported here for the first time24. All natural product spectral libraries from GNPS were referenced for comprehensive coverage, despite the potential risk of false-positive matches to natural products outside the actinobacterial metabolic space. The LC-MS/MS spectral collection of 2,138 fermentation extracts has been deposited on the Global Natural Products Social Molecular Networking (GNPS)25 website and is available as a MassIVE dataset23 with accession number MSV000092237.

Fig. 2

Molecular networking analysis performed on tandem mass spectra of 2,138 fermentation extracts from 54 actinobacterial strains and their 459 mutants in 3 to 5 media14. 743 unique metabolites arranged in 69 clusters (with 2 or more metabolites) and 126 orphan metabolites are visualized with their connecting edges specifying a cosine score of more than 0.7. Red = metabolites present only in mutant strains. Blue = metabolites present only in wild type strains. Grey = metabolites present in both wild type and mutant strains.

Although originally designed to investigate the chemical potential of silent and cryptic BGCs, this substantive collection of metabolite profiles also provides the opportunity to interrogate a diverse pool of potentially novel natural products for starting points toward new therapeutics26, natural colors27, or other biomolecules with desirable functional activity.

Methods

Fermentation, extraction, and sample preparation

The 54 wild type strains (A1090, A1123, A11345, A1137, A1301, A1532, A1636, A2056, A2278, A2705, A2957, A30639, A33995, A34001, A34053, A40707, A40926, A4217, A44034, A5252, A53961, A58051, A5858, A61715, A6562, A80510, A8274, A8567, ATCC 23862, ATCC 31975, T10, T108, T118, T1195, T12, T1236, T1312, T1415, T1416, T1425, T1628, T168, T175, T265, T271, T298, T302, T343, T354, T36, T39, T467, T4680, T676) and their 459 edited mutants were received from the Agency for Science, Technology and Research (A*STAR)’s Natural Organism Library17. They were cultured on ISP2 plates [malt extract 10 g/L, Bacto yeast extract 4 g/L, glucose 4 g/L, Bacto agar 20 g/L] at 28 °C for 5 days. Three agar plugs of 5 mm diameter from each culture plate were then used to inoculate 250 mL Erlenmeyer flasks, each containing 50 mL SV2 seed medium [glucose 15 g/L, glycerol 15 g/L, soya peptone 15 g/L, calcium carbonate 1 g/L, pH 7.0], which were incubated for 4 days at 28 °C with shaking at 200 rpm. A volume of 2.5 mL of the homogenized seed culture was then inoculated into 250 mL Erlenmeyer flasks, each containing 50 mL fermentation medium (Table 2). Marine actinomycete strains were fermented in the same media with the addition of 40 g/L sea salt; these media are annotated with the “M” prefix (e.g., MCA02LB instead of CA02LB). All cultures were fermented at 28 °C for 9 days with shaking at 200 rpm with a 50 mm throw. At the end of the incubation period, cultures were freeze dried. A total of 2,138 fermentation samples were prepared. The lyophilized samples were extracted overnight (16 h) with methanol (14 mL) with shaking at 150 rpm. The methanolic extract was passed through cellulose filter paper (Whatman Grade 4, 1004-185) and the filtrate was concentrated on a rotary evaporator. A 0.1 mg portion of the dried methanol extract was then submitted for LC-MS/MS analysis.

Table 2 Media compositions employed for the fermentation of 54 wild type and 459 mutant strains.

Liquid chromatography-tandem mass spectrometry (LC-MS/MS) data acquisition

Fermentation extract samples were analysed on an Agilent 1290 Infinity LC System coupled to an Agilent 6540 accurate-mass quadrupole time-of-flight (QTOF) mass spectrometer. 5 µL of extract was injected onto a Waters Acquity UPLC BEH C18 column, 2.1 × 50 mm, 1.7 µm. Mobile phases were water (A) and acetonitrile (B), both with 0.1% formic acid. The analysis was performed at a flow rate of 0.5 mL/min, under gradient elution of 2% B to 100% B in 8 min. LC-MS/MS data were acquired in positive electrospray ionization (ESI) mode. MS1 was acquired between m/z 100–2500 at a scan rate of 3 spectra/sec, while MS/MS was acquired between m/z 100–2000 at a scan rate of 4 spectra/sec. For MS/MS fragmentation, a ramped collision energy method was employed, whereby the collision energy was determined according to the following formula:

$$\text{collision energy (eV)}=\frac{\text{precursor } m/z\times 5}{100}+2.5$$
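As a worked illustration of the formula above, the following minimal Python sketch computes the collision energy for a few precursor m/z values. The function name and the example m/z values are hypothetical and chosen only for illustration; the slope (5 per 100 m/z) and offset (2.5 eV) are taken from the formula.

```python
def ramped_collision_energy(precursor_mz: float,
                            slope: float = 5.0,
                            offset: float = 2.5) -> float:
    """Return the collision energy in eV for a precursor m/z.

    Implements: collision energy = (precursor m/z * slope) / 100 + offset,
    with slope and offset defaulting to the values in the formula above.
    """
    if precursor_mz <= 0:
        raise ValueError("precursor m/z must be positive")
    return (precursor_mz * slope) / 100.0 + offset


if __name__ == "__main__":
    # Hypothetical precursor m/z values, for illustration only.
    for mz in (100.0, 500.0, 2000.0):
        print(f"m/z {mz:7.1f} -> {ramped_collision_energy(mz):6.1f} eV")
```

Under this formula, the collision energy rises linearly with precursor m/z, so larger precursors are fragmented at higher energy.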

The typical QTOF operating parameters were as follows: sheath gas nitrogen, 12 L/min at 325 °C; drying gas nitrogen flow, 12 L/min at 350 °C; nebulizer pressure, 50 psi; nozzle voltage, 1.5 kV; capillary voltage, 4 kV. Lock masses in positive ion mode: purine ion at m/z 121.0509 and HP-0921 ion at m/z 922.0098.

Molecular networking

MSConvert v3.0.22198-0867718 from Proteowizard28 was used for initial processing of raw LC-MS/MS data into an open-source file format (.mzML). All MS/MS signals with intensity values below 1000 were removed as background correction. Classical molecular networking was performed on the resulting MS/MS spectra using the online workflow from the GNPS website (http://gnps.ucsd.edu). All peaks within a ±17 Da window around the precursor ion mass were deleted to remove residual precursor ions, and peaks not among the 6 most intense peaks in a ±50 Da window were filtered out. The precursor ion mass tolerance was set to 0.02 Da and the MS/MS fragment ion tolerance was set to 0.02 Da. Nearly identical MS/MS spectra with precursor ion m/z within the mass tolerance were combined into a single representative spectrum via the MS-Cluster algorithm29 and annotated as individual metabolites. Only representative spectra created from a minimum of 2 MS/MS spectra were considered for molecular networking. A network was then created in which edges were filtered to have a cosine score above 0.7 and more than 6 matched peaks. Further, an edge between two nodes was kept in the network if and only if each node appeared in the other’s respective top 10 most similar nodes. Finally, the maximum size of a molecular family was set to unlimited. The spectra in the network were then searched against GNPS’ spectral libraries. The library spectra were filtered in the same manner as the input data. All matches kept between network spectra and library spectra were required to have a score above 0.7 and at least 6 matched peaks.
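The peak filtering and edge filtering rules described above can be sketched in plain Python as follows. This is an illustrative approximation and not the GNPS implementation: the function names and data layouts are hypothetical, cosine scores and matched-peak counts are assumed to be precomputed, and the mutual top-10 rule is assumed to be applied to pairs that already pass the score thresholds.

```python
from collections import defaultdict


def filter_peaks(peaks, precursor_mz, min_intensity=1000.0,
                 precursor_window=17.0, top_n=6, local_window=50.0):
    """Filter a spectrum given as a list of (mz, intensity) tuples."""
    # Remove background signals and peaks within +/-17 Da of the precursor.
    kept = [(mz, inten) for mz, inten in peaks
            if inten >= min_intensity
            and abs(mz - precursor_mz) > precursor_window]
    result = []
    for mz, inten in kept:
        # Count stronger peaks within +/-50 Da of this peak.
        stronger = sum(1 for mz2, inten2 in kept
                       if abs(mz2 - mz) <= local_window and inten2 > inten)
        if stronger < top_n:  # peak is among the top_n in its window
            result.append((mz, inten))
    return result


def filter_edges(scored_pairs, min_cosine=0.7, min_matched=6, top_k=10):
    """Filter (node_a, node_b, cosine, matched_peaks) tuples into edges."""
    # Thresholds follow the text: cosine above 0.7, more than 6 matched peaks.
    candidates = [(a, b, cos) for a, b, cos, matched in scored_pairs
                  if cos > min_cosine and matched > min_matched]
    neighbours = defaultdict(list)
    for a, b, cos in candidates:
        neighbours[a].append((cos, b))
        neighbours[b].append((cos, a))
    # Each node's top_k most similar neighbours.
    top = {node: {other for _, other in sorted(pairs, reverse=True)[:top_k]}
           for node, pairs in neighbours.items()}
    # Keep an edge only if each node is in the other's top_k list.
    return [(a, b, cos) for a, b, cos in candidates
            if b in top[a] and a in top[b]]
```

The mutual top-k requirement prunes edges to highly connected "hub" nodes, which keeps molecular families from merging through a single promiscuous spectrum.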

Data Records

The dataset, comprising (1) unprocessed raw Agilent LC-MS/MS data (.d) and (2) converted open-source file format (.mzML) copies for the 2,138 fermentation extracts from 54 actinobacterial strains and their 459 mutants in 3–5 media, has been deposited and is publicly accessible via MassIVE with the accession number MSV000092237 (https://doi.org/10.25345/C53X83W53)23. Detailed information on the 2,138 fermentation extracts analyzed, as well as the 743 individual metabolites and 69 clusters identified, is available on figshare (https://doi.org/10.6084/m9.figshare.26144116)24.

Technical Validation

LC-MS/MS retention time consistency

A combination of 4 compounds (Table 3) was used as quality control for retention time stability between different samples run over the 17-month period of data acquisition. The quality control samples were analyzed using the same experimental methodology as for fermentation extract analysis, identified via their unique precursor m/z, and their retention times recorded. Low coefficient of variation (%CV ≤ 1.3) indicates stable elution times between sample runs.
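For readers wishing to reproduce this kind of stability check on their own runs, the coefficient of variation can be computed as in the minimal sketch below. The retention times shown are hypothetical placeholders, not values from Table 3, and the use of the sample standard deviation (rather than the population form) is an assumption.

```python
from statistics import mean, stdev


def percent_cv(values):
    """Coefficient of variation in percent: 100 * sample SD / mean."""
    if len(values) < 2:
        raise ValueError("at least two measurements are required")
    return 100.0 * stdev(values) / mean(values)


if __name__ == "__main__":
    # Hypothetical retention times (min) of one reference compound
    # across several quality control injections.
    retention_times = [3.41, 3.43, 3.40, 3.42, 3.44]
    print(f"%CV = {percent_cv(retention_times):.2f}")
```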

Table 3 Performance of quality control sample consisting of four reference compounds collected over seventeen months.

Intra-study quality control samples

For each 96-well plate of samples analyzed via LC-MS/MS, a minimum of ten quality control samples and ten methanol blanks were run to ensure consistency in retention time and background noise across the samples analyzed in this study. However, no additional intra-study quality controls, such as pooled or representative fermentation extract samples, were run, which is a limitation of the experimental design.

Usage Notes

This dataset provides the opportunity to interrogate the chemical potential of a collection of 54 actinobacterial strains and their 459 activated mutants for novel natural products with desirable functional activity (e.g., antimicrobials, colorants). Specific examples of such usage include 1) identifying known molecules with a desired bioactivity, such as valinomycin for antibiotic activity, then investigating networked or spectrally similar metabolites for novel antibiotic analogues, or 2) leveraging the structural information captured in metabolite MS/MS data to perform spectral matching against known functional molecules, in order to find potentially novel natural products with similar structural characteristics that could demonstrate the desired functional activity. Additionally, the curated mass spectral dataset presented here can serve as a foundation for computational modelling applications, including artificial intelligence (AI) and machine learning. Because it includes spectral data for various strains as well as their corresponding “activated” mutants, the dataset can be used to identify patterns in the production of different classes of molecules affected by genetic- and cultivation-based activation. The dataset also reveals the metabolic diversity in actinobacteria and the impact of genetic- and cultivation-based activation on metabolite production; this comparative data could facilitate bioinformatics studies aimed at metabolite annotation and pathway reconstruction. Additional metabolite characterisation, such as unsupervised substructure discovery (e.g. MS2LDA30), natural product classification (e.g. MolNetEnhancer31), and network annotation propagation (e.g. NAP32), may also be explored to provide richer insights.
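As a starting point for such comparisons, the sketch below groups metabolites by whether they were detected only in mutants, only in wild type strains, or in both, mirroring the colour scheme of Fig. 2. The input layout, node identifiers, and label strings are hypothetical; the actual column names in the figshare tables may differ and would need to be mapped accordingly.

```python
from collections import Counter


def classify_metabolite(sample_types):
    """Classify a metabolite from the set of sample types it was detected in.

    sample_types is a set containing "wild_type" and/or "mutant"
    (hypothetical labels chosen for this sketch).
    """
    has_wild_type = "wild_type" in sample_types
    has_mutant = "mutant" in sample_types
    if has_wild_type and has_mutant:
        return "shared"
    if has_mutant:
        return "mutant_only"
    if has_wild_type:
        return "wild_type_only"
    raise ValueError("metabolite was not detected in any labelled sample")


def summarise(detections):
    """Count metabolites per class from {metabolite_id: set_of_sample_types}."""
    return Counter(classify_metabolite(types) for types in detections.values())


# Hypothetical detections, for illustration only.
example = {
    "node_1": {"wild_type", "mutant"},
    "node_2": {"mutant"},
    "node_3": {"mutant"},
    "node_4": {"wild_type"},
}

if __name__ == "__main__":
    print(summarise(example))
```

Metabolites falling in the mutant-only class are natural candidates for products of activated silent or cryptic BGCs, although confirmation would require further structural and genetic evidence.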