Introduction

Biological nitrogen fixation (BNF), the enzyme-catalyzed reduction of dinitrogen (N2) from air to bioavailable ammonium, is recognized as a major source of nitrogen in N-starved marine ecosystems [1, 2]. Multiple studies have highlighted the significance of N2 fixation in sustaining marine primary production, powering the biological carbon pump, and eventually accelerating carbon sequestration in the ocean [3,4,5]. Based on in situ experiments, the nitrogen fixed by diazotrophic communities was estimated to fuel up to half or more of primary production in the subtropical North Pacific and tropical North Atlantic regions [6, 7]. BNF is catalyzed by the nitrogenase enzyme complex in a select group of microorganisms termed diazotrophs. Nitrogenase is a highly conserved enzyme consisting of two multisubunit metalloproteins, namely, dinitrogenase reductase and dinitrogenase, encoded by nifH and nifKD, respectively [8]. Because its sequence is highly conserved, the nifH gene has been widely used for phylogenetic and ecological studies of diazotrophic communities in marine ecosystems [9].

Molecular techniques have revolutionized the traditional method of identifying N2 fixers by microscopy, and a wide range of diazotrophs have been separated and identified from marine environments [10]. Since then, studies of the biogeochemical, physio-ecological, and molecular aspects of diazotrophs have flourished. Among diazotrophic communities, Proteobacteria and cyanobacteria are identified as the most abundant groups inhabiting the euphotic layer of marine ecosystems [11]. Cyanobacteria have a broad geographic distribution and comprise a variety of morphological and phylogenetic groups, including Trichodesmium spp., diatom–diazotroph symbioses, and unicellular N2-fixing cyanobacteria (groups A, B, and C) [12]. Previous studies have mainly focused on cyanobacterial diazotrophs, which are widely distributed in tropical and subtropical oceans. More recently, Proteobacteria, including Alpha-, Beta-, Gamma-, and Deltaproteobacteria, were reported to be dominant in open oceans such as the northern South China Sea, the Southern Indian Ocean, and the Eastern Tropical South Pacific region [13,14,15]. The dominance of Proteobacteria in open oceans reveals that our understanding of the diazotrophic community is still limited. Further studies focusing on the physiological and ecological characterization of these widely distributed bacteria and their interactions with environmental gradients have important implications for understanding their contribution to nitrogen fixation and global nitrogen cycling [16].

As the third largest ocean in the world, the Indian Ocean plays a significant role in global climate change, energy flow, and material cycles [17]. The Indian Ocean is divided into two semi-enclosed basins in the north by the Indian subcontinent and extends to at least 26°N in the south. The northern Indian Ocean is heavily influenced by the South Asian monsoon system. The surface circulation changes seasonally in response to the prevailing monsoons, with the Somali Current flowing equatorward during the winter monsoon and poleward during the summer monsoon [18, 19]. Upwelling and convective mixing induced by the seasonally reversing monsoon winds can increase the upward supply of nutrients and cause seasonal algal blooms in the northern Indian Ocean [20, 21]. It is generally believed that the Bay of Bengal (BOB) is less productive than the Arabian Sea (AS), partly because freshwater input decreases surface salinity and strengthens stratification [20, 22,23,24]. Winds over the northern Indian Ocean are generally weak during the pre-southwest monsoon period, and the stratified water column limits the supply of nutrients from deeper water. Strong stratification, calm water, sufficient solar heating, and extremely oligotrophic conditions make the northern Indian Ocean an ideal habitat for the diazotroph community during the pre-southwest monsoon period. In fact, surface blooms of Trichodesmium spp. are well documented in both the AS and the BOB [25,26,27], and heterotrophic bacteria have been reported to be dominant among diazotrophs in the AS during the winter monsoon [14, 28]. However, no detailed studies have investigated the composition and distribution of the diazotrophic communities in the BOB and the eastern equatorial region.

Compared to the North Pacific [29,30,31] and North Atlantic [32,33,34,35], the diazotrophic community in the Indian Ocean is understudied, and currently available data are primarily focused on the AS [14, 28]. In the present study, we collected water samples in the BOB and the open region of the eastern equatorial Indian Ocean during the pre-southwest monsoon of 2017. To investigate the composition and spatial distribution of the diazotrophic communities in the EIO, molecular approaches including high throughput sequencing analysis and real-time fluorescent quantitative polymerase chain reaction (qPCR) assays were applied to samples collected from different locations and depths. The high throughput sequencing analysis provided detailed information on the diazotrophic community structures, including dominant and rare species as well as uncultured populations. Based on the results of high throughput sequencing, qPCR assays were used to quantify the dominant diazotrophic groups in the EIO. In addition, multivariate statistical analyses relating communities to environmental parameters were conducted to determine the factors controlling diazotroph community structure.

Materials and Methods

Station Location, Sampling, and Physicochemical Analysis

The cruise was carried out in the EIO onboard R/V “Shiyan 3” from 9 March to 7 April 2017. Samples for molecular analyses were collected from 10 sampling sites at seven depths (0 m, 25 m, 50 m, 75 m, 100 m, 150 m, and 200 m) at each site. The sampling sites were located along four main transects targeting the BOB (I1, I2), the Equator region (I4), and the transect (I5) parallel to the coastline of Sumatra (Fig. 1). Among all the stations, we chose six stations (I105, I504, I203, I210, I405, and I413) with three depths (0 m, 75 m, and 200 m) for high throughput sequencing (Fig. 1). For qPCR analysis, all samples were included (10 stations and 7 depths, as shown in Fig. 1 and Table 1). Surface water was collected using a 10% HCl-rinsed polyethylene (PE) bucket at all the stations. Water from deeper layers was collected using 5 L Teflon-coated Go-Flo bottles (General Oceanics, Miami, Florida, USA) attached to a rosette multisampler, on which conductivity, temperature, and depth (CTD) probes were installed (Seabird SBE 911Plus, Sea-Bird Electronics, Washington, USA). Vertical profiles of temperature and salinity were recorded by the CTD profiler.

Fig. 1

The map showing the sampling stations in the EIO. Miniature map at lower left described the transects and stations for hydrography measurement (blue dots in red box). In enlarged map of the EIO, 10 sites (red dots) were used for qPCR analysis and 6 sites (circled) were subjected to high throughput sequencing

Table 1 Temperature (T), salinity (S), chlorophyll a (Chl a) concentration, and dissolved inorganic nutrients (ammonium, nitrate, nitrite, phosphate) in surface water at sampling stations

Sample Collection and Environmental Parameter Measurements

Water samples from different layers were transferred individually into 15 L PE buckets that had previously been washed with 10% HCl and rinsed thrice with Milli-Q water. Samples for nutrient determination were subsampled into 100 mL 10% HCl-rinsed PE bottles and stored at 4 °C until analysis. In the laboratory, nutrient concentrations including ammonium (NH₄⁺), nitrate (NO₃⁻), nitrite (NO₂⁻), phosphate (PO₄³⁻), and silicate (SiO₃²⁻) were determined using a Technicon AA3 Auto-Analyzer (Bran+Luebbe, Norderstedt, Germany) based on the classical colorimetric methods [36]: NH₄⁺, dissolved inorganic nitrogen (DIN), PO₄³⁻, and SiO₃²⁻ were measured by the indophenol blue method, the copper-cadmium column reduction method, the phosphor-molybdate complex method, and the silico-molybdate complex method, respectively.

For Chl a analysis, 500 mL of water from each layer was vacuum-filtered (< 10 mmHg) through a 25-mm GF/F filter (Whatman, Florham Park, NJ, USA). The filters were placed into aluminum foil bags and stored in the dark at −20 °C until analysis. In the lab, filters were placed in 20 mL vials and pigments were extracted with 90% acetone (Guanghui, Yixing, Jiangsu, China) for 24 h at 4 °C. Chlorophyll concentrations were then determined using a Turner® Trilogy (CHL NA, model no. 046) fluorometer (Turner Designs, San Jose, CA, USA). For molecular analyses, 2–4 L of seawater was filtered through 0.22-μm GTTP filters (Millipore, Eschborn, Germany) under low-pressure vacuum. The filters were placed into 2-mL microtubes and immediately flash frozen in liquid nitrogen on board. In the lab, the filters were transferred to a −80 °C freezer until DNA extraction.

DNA Extraction, nifH Gene Amplification, and Sequencing

Genomic DNA was extracted by the CTAB method as previously described, with minor modifications [37, 38]. Briefly, filters were cut into small pieces and placed into 1.5-mL sterile centrifuge tubes. For each extraction, 600 μL of CTAB extraction buffer and 15 μL of 2% β-mercaptoethanol (Goaobio, Beijing, China) were added, and the mixture was incubated at 60 °C in a water bath for 60 min. Proteins, polysaccharides, and other impurities were separated by adding an equal volume of 24:1 chloroform:isoamyl alcohol (Solarbio, Beijing, China) and centrifuging at 14000 rpm for 20 min at 4 °C. The supernatant was transferred into a new clean tube, and this step was repeated once to ensure complete removal of impurities. The genomic DNA was precipitated overnight at −20 °C by adding 2/3 volume of isopropanol (Sangon Biotech, Shanghai, China) and then washed twice with 75% ethanol (Hushi, Shanghai, China). The air-dried genomic DNA pellet was eluted with 70 μL of sterile ultra-pure water and stored at −20 °C until further processing. All solutions used in our experiments were of molecular grade. The quantity and quality of genomic DNA were checked using an ND-2000 Nanodrop spectrometer (Thermo Fisher Scientific, Wilmington, Delaware, USA).

The nifH gene fragments were amplified from the genomic DNA through nested polymerase chain reaction (nested PCR) according to a previous protocol [10]. PCRs were performed using a Veriti 9902 thermocycler (Applied Biosystems, Foster City, CA, USA) in 10 μL reaction volumes containing 1× PCR buffer, 4 mM MgCl2, 400 mM dNTPs, 1 μM forward and reverse primers (nifH3 and nifH4 for the primary PCR, nifH1 and nifH2 for the secondary PCR), 0.2 units of KOD FX Neo polymerase (Toyobo, Osaka, Japan), and 1 μL of template DNA (genomic DNA for the first round, and PCR products from the primary PCR for the second round). Negative controls were set up by replacing template DNA with nuclease-free water. The thermal profile used for nifH gene amplification was an initial denaturation (95 °C, 5 min), followed by 38 (primary) or 40 (secondary) cycles of denaturation (94 °C, 1 min), annealing (52 °C for the first round and 59 °C for the second round, 1 min), and extension (72 °C, 1 min), with a final extension (72 °C, 7 min). Notably, the primers used in the secondary PCR were composed of dual-indexed barcodes, Illumina linkers, a sequencing primer binding region, and gene-specific sites. PCR products were checked by 1.8% agarose gel (BioWest, Castropol, Spain) electrophoresis after amplification, and products with bands of approximately 360 bp were used for high throughput sequencing. All libraries were constructed and sequenced via paired-end chemistry (PE250) on an Illumina Hiseq2500 platform (Illumina, San Diego, CA, USA) at Biomarker Technologies, Beijing, China.

Quality Control and Sequencing Data Processing

The raw image data files obtained from the Illumina Hiseq2500 platform were transformed into sequenced reads via base calling, and the results were stored in FASTQ format, which includes the raw sequence data and the corresponding sequencing quality. The raw sequence data were then separated by sample based on their barcodes, permitting up to one mismatch [39]. The raw sequence data were de-multiplexed, quality filtered, and analyzed through the open-source software pipeline QIIME [40], and the paired reads were merged into full-length sequences by FLASH v1.2.7 software [41]. For each sample, the raw tags were quality filtered to obtain high-quality clean tags via Trimmomatic v0.33 software [42], including removal of sequences shorter than 300 bases, sequences containing homopolymers (homopolymers ≥ 8 bases), and sequences containing ambiguous bases [43, 44]. The barcodes, linker sequences, and primers were also removed in this process. Effective tags were subsequently obtained after removing chimeras with UCHIME v4.2 software [45]. The remaining effective tags were clustered using USEARCH v10.0 at 97% similarity to generate operational taxonomic units (OTUs) [46]. Low-abundance OTUs containing fewer than 20 sequences across all samples were excluded from further processing.
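The filtering criteria described above (minimum length, homopolymer runs, ambiguous bases, and the low-abundance OTU cutoff) can be illustrated with a minimal Python sketch. This is not the pipeline used in the study, which relied on QIIME, Trimmomatic, UCHIME, and USEARCH; the function names and the data layout (a dictionary of per-sample counts for each OTU) are assumptions made purely for illustration.

```python
import re

MIN_LENGTH = 300        # tags shorter than 300 bases are removed
MAX_HOMOPOLYMER = 8     # tags with a run of >= 8 identical bases are removed
MIN_OTU_SIZE = 20       # OTUs with < 20 sequences across all samples are removed

# One base followed by at least (MAX_HOMOPOLYMER - 1) copies of itself.
_HOMOPOLYMER = re.compile(r"([ACGT])\1{%d,}" % (MAX_HOMOPOLYMER - 1))


def passes_quality_filter(seq: str) -> bool:
    """Return True if a merged tag satisfies the three sequence-level criteria."""
    seq = seq.upper()
    if len(seq) < MIN_LENGTH:
        return False
    if set(seq) - set("ACGT"):          # any ambiguous base (N, R, Y, ...)
        return False
    if _HOMOPOLYMER.search(seq):
        return False
    return True


def drop_rare_otus(otu_counts: dict) -> dict:
    """Keep OTUs whose total count across all samples reaches MIN_OTU_SIZE.

    otu_counts maps an OTU id to a {sample id: count} dictionary
    (a hypothetical layout chosen for this sketch).
    """
    return {
        otu: per_sample
        for otu, per_sample in otu_counts.items()
        if sum(per_sample.values()) >= MIN_OTU_SIZE
    }
```

Applying `passes_quality_filter` to each merged tag before clustering, and `drop_rare_otus` to the resulting OTU table, mirrors the order of steps given in the text.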

To identify dominant taxa, the top OTUs in each sample were selected for subsequent phylogenetic analysis. The most common sequence in each OTU was selected as its representative sequence. These sequences were first translated into amino acid sequences and searched against the protein database at the National Center for Biotechnology Information (NCBI) via BLASTX v2.8.1+ to identify the most closely related sequences [47]. The representative sequences and the most closely related sequences were then aligned with ClustalW and used to construct a neighbor-joining phylogenetic tree in MEGA v7.0 [48, 49]. Cluster stability was tested by bootstrap resampling with 1000 replicates, and the phylogenetic tree was further edited with the online tool iTOL [50]. The raw sequences obtained from this study were deposited in the NCBI Sequence Read Archive with accession no. SUB3781024.

Quantification of Two Cyanobacterial and Two Non-Cyanobacterial nifH Phylotypes

According to the results of high throughput sequencing, the abundances of four top diazotrophs, namely, Trichodesmium spp., Crocosphaera watsonii, Sagittula castanea, and γ-24774A11, were quantified by qPCR using an ABI Step One Plus Real-Time PCR System (Applied Biosystems, Foster City, CA, USA). The specific primers and probes shown in Table 2 were designed in previous studies and were used with minor modifications to correct mismatches [33, 51]. The specificity of the primers and probes was checked in the abovementioned studies, and no significant cross-reactivity was observed. The TaqMan probes were 5′-labeled with the fluorescent reporter FAM (6-carboxyfluorescein) and 3′-labeled with the quenching dye TAMRA (6-carboxytetramethylrhodamine). All primers and probes used in the present study were synthesized by Sangon Biotech, Shanghai, China. For all TaqMan PCRs, duplicate or triplicate 10 μL reactions were performed with 5 μL of 2 × Premix Ex Taq™ (Takara Bio, Tokyo, Japan), 0.2 μM of the forward and reverse primers, 0.4 μM of TaqMan probe, 0.2 μL of 50 × ROX reference dye, 1 μL of template DNA, and 3 μL of nuclease-free water. The PCR conditions were 95 °C for 30 s, followed by 45 cycles of 95 °C for 5 s and 60 °C for 30 s. Standard curves were determined by analyzing 10-fold serial dilutions of plasmids carrying the target nifH gene insert, with final gene copy numbers ranging from 10¹ to 10⁷ per reaction. The r² values of the standard curves ranged from 0.98 to 1.00, and PCR amplification efficiency ranged from 90 to 110%. The copy numbers of each diazotroph in the environmental samples were calculated based on mean Ct values. Furthermore, non-target templates were also tested under the same conditions, and negative controls with fewer than 10 gene copies or no detectable signal were considered contamination-free. The detection limit of the qPCR reactions was approximately 50 nifH genes in 1 μL of template DNA, which was equivalent to approximately 3500 copies in the final elution volume of 70 μL used in the present study.
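The arithmetic behind the standard curve can be sketched as follows. The sketch assumes the usual linear model Ct = slope × log10(copies) + intercept and the common efficiency definition E = 10^(−1/slope) − 1; neither formula is spelled out in the text, and the function names are hypothetical. Only the 70 μL elution volume and the use of mean Ct values come from the description above.

```python
def fit_standard_curve(log10_copies, ct_values):
    """Least-squares fit of Ct = slope * log10(copies) + intercept."""
    n = len(log10_copies)
    mean_x = sum(log10_copies) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_copies)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_copies, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def amplification_efficiency(slope):
    """Efficiency as a fraction (1.0 = 100%), assuming E = 10**(-1/slope) - 1."""
    return 10 ** (-1.0 / slope) - 1.0


def copies_per_reaction(replicate_cts, slope, intercept):
    """Copy number in one reaction, computed from the mean Ct of the replicates."""
    mean_ct = sum(replicate_cts) / len(replicate_cts)
    return 10 ** ((mean_ct - intercept) / slope)


def copies_in_extract(copies_per_ul, elution_volume_ul=70.0):
    """Scale copies per microlitre of template to the whole DNA extract."""
    return copies_per_ul * elution_volume_ul
```

With 1 μL of template per reaction, `copies_in_extract(50)` reproduces the stated detection limit of roughly 3500 copies per 70 μL extract. Converting to copies per litre of seawater would additionally require the filtered volume (2–4 L) of each sample.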

Table 2 Primers and probes used for TaqMan–qPCR analysis targeting two cyanobacterial and two non-cyanobacterial nifH phylotypes

Statistical Analysis

Richness and diversity indices, including the Chao1 richness estimator, the Ace richness estimator, the Shannon diversity index, and the Simpson diversity index, were calculated with R v3.3.2 software [52]. To evaluate the coverage of sequencing, the abundance data were standardized and coverage was calculated based on entropy (Q statistics) in the online software iNEXT [53, 54]. Rarefaction curves were also generated in this online software following the standard operating procedure given on the website (https://chao.shinyapps.io/iNEXTOnline/). Non-metric multidimensional scaling (NMDS) analysis and cluster analysis were performed in PRIMER v6.0 software to show the vertical and horizontal distribution patterns of the diazotrophic communities [55]. The diazotroph community data were first square-root transformed, and a lower triangular resemblance matrix was then created based on Bray–Curtis similarity. Subsequently, a hierarchical cluster tree and an NMDS biplot were constructed from this matrix. To reveal the correlations between the diazotrophic communities and environmental factors, detrended correspondence analysis (DCA) was first carried out with the decorana function in vegan to determine whether redundancy analysis (RDA) or canonical correspondence analysis (CCA) was more suitable [56, 57]. Because the gradient length of the first axis was greater than 3.0, CCA was selected to explore correlations between community structure and environmental gradients and was run in R v3.3.2 software. Highly correlated environmental factors were removed by a co-linearity test conducted with the vif.cca function in vegan, and the variance inflation factors (VIFs) of the remaining environmental factors were all less than 20 [58]. Depth-integrated (0–200 m) gene abundances were calculated by trapezoidal integration over the sampled depths of the euphotic zone, and the results were plotted with ODV v5.0.0 [59]. Significant differences in depth-integrated gene abundances (log-transformed) of diazotrophs among regions were evaluated by t test in IBM SPSS Statistics 25.
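The trapezoidal depth integration mentioned above can be illustrated with a short Python sketch. It assumes that volumetric abundances are given in copies L−1 and are converted to copies m−3 (1 m³ = 1000 L) before integration, so that the result is in copies m−2; the function name and the input layout are hypothetical and are not taken from the study.

```python
LITRES_PER_CUBIC_METRE = 1000.0


def depth_integrate(depths_m, copies_per_litre):
    """Trapezoidal integration of a gene-abundance profile over depth.

    depths_m         : sampling depths in metres, strictly increasing
    copies_per_litre : nifH gene copies per litre at each depth
    Returns the depth-integrated abundance in copies per square metre.
    """
    if len(depths_m) != len(copies_per_litre):
        raise ValueError("depths and abundances must have the same length")
    if len(depths_m) < 2:
        raise ValueError("at least two depths are needed")

    total = 0.0
    for i in range(len(depths_m) - 1):
        dz = depths_m[i + 1] - depths_m[i]
        if dz <= 0:
            raise ValueError("depths must be strictly increasing")
        # Mean concentration of the layer, converted to copies per cubic metre.
        layer_mean = 0.5 * (copies_per_litre[i] + copies_per_litre[i + 1])
        total += layer_mean * LITRES_PER_CUBIC_METRE * dz
    return total
```

For the seven sampling depths used here (0, 25, 50, 75, 100, 150, and 200 m), each pair of adjacent depths contributes one trapezoid, so unevenly spaced layers are weighted by their thickness.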

Results

Hydrography and Environmental Parameters

The surface hydrographic characteristics for all stations are shown in Table 1 and Table S1. During the study period, sea surface temperature (SST) varied from 28.5 to 31.1 °C, and the BOB had higher SST than other regions (p < 0.01). Sea surface salinity (SSS) ranged from 30.4 to 34.3, and the BOB had lower SSS than other regions (p < 0.05). The maximum and minimum SSS values were 34.3 ppt at Stn. I404 and 30.4 ppt at Stn. I201, respectively. The surface regime could be distinguished by the T-S properties of the upper 200 m at the sampling stations, where lower salinity and slightly higher temperature were observed at the BOB stations (Fig. 2). The potential density (σ0) was less than 21.0 kg m−3 in near-surface water at Stn. I110, I203, I206, and I210, and close to 21.0 kg m−3 at the other stations, apart from some unstable data. Chl a concentrations ranged from 0.067 to 0.407 μg L−1, and there was no significant difference in surface Chl a concentrations within the study area. Across all the sampling stations, concentrations of dissolved inorganic nutrients were quite low, suggesting that the EIO was a typical oligotrophic ocean. For instance, nitrate, nitrite, ammonium, and phosphate concentrations ranged from 0 to 2.043 μM, 0.129 to 0.257 μM, 0 to 1.379 μM, and 0 to 0.155 μM, respectively (Table S1).

Fig. 2

Potential temperature (θ) versus salinity scatters (T-S properties) in the upper 200 m water column in the EIO

Vertical profiles of temperature, salinity, dissolved inorganic nutrients, and Chl a are shown in Fig. 3 and Fig. S1. The study region was characterized by stable surface waters with higher temperature and lower salinity than deep waters. The thermocline and halocline in the Equator region were shallower than in other areas and deepened gradually from the open ocean to the offshore region. The halocline was clearly observed in the BOB and the Equator region but was not obvious in the offshore section. Vertical profiles of nitrate and phosphate at the survey stations showed consistent patterns, with low concentrations in the top 50 m and a dramatic increase in deep waters. Nitrite concentrations peaked at 75 m but remained nearly the same in other layers. The Chl a maximum layer was at 75 m except at Stn. I105 (50 m instead), and integrated Chl a concentrations for the top 200 m of the water column ranged from 27.83 μg L−1 (Stn. I409) to 40.70 μg L−1 (Stn. I105). In contrast to the other nutrients, no consistent pattern was observed in the vertical distribution of ammonium.

Fig. 3

Depth profiles of temperature and salinity in the BOB (I1, I2), the Equator region (I4), and the transect (I5) parallel to the coastline of the Sumatra

Sequencing Statistics and Diversity Estimates

In total, 1,185,461 effective tags were included in our study after quality control, and the detailed information for each sample is listed in Table 3. The sequencing coverage (C) was greater than 99.9% for all samples, suggesting that the sequencing effort was deep enough to cover nifH gene diversity. On average, 65,859 sequences per sample were obtained, with an average length of 320 bp. Based on 97% similarity, a total of 218 OTUs were obtained after excluding rare OTUs (< 20 sequences across all samples). The rarefaction curves, which reached a plateau, are shown in Fig. S2. The OTU richness ranged from 17 to 72, and the minimum and maximum numbers of OTUs were observed at Stn. I405 (200 m) and Stn. I210 (75 m), respectively. Combining all alpha diversity indices, the lowest and highest diversities occurred at the same station, Stn. I203, but in different layers, 0 m and 75 m, respectively. Overall, Stn. I210 had the highest diversity indices (H = 2.39, D = 0.86), while the lowest indices occurred at Stn. I405 (H = 1.28, D = 0.63).
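For readers unfamiliar with the indices reported here, the following Python sketch shows one common way to compute them from a vector of OTU counts for a single sample. It assumes the Shannon index H = −Σ p ln p, the Simpson index in its Gini–Simpson form D = 1 − Σ p², and the classic Chao1 estimator with the bias-corrected form when no doubletons are present. The study computed its indices in R, and the exact variants used there are not stated, so these definitions are assumptions.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over OTUs with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson_index(counts):
    """Gini-Simpson diversity D = 1 - sum(p ** 2) (assumed variant)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def chao1(counts):
    """Chao1 richness estimate from singleton (F1) and doubleton (F2) counts."""
    observed = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    if f2 > 0:
        return observed + f1 * f1 / (2.0 * f2)
    # Bias-corrected form used when there are no doubletons.
    return observed + f1 * (f1 - 1) / 2.0
```

Under these definitions, a sample dominated by a single OTU gives values of H and D close to zero, whereas an even community of many OTUs gives higher values of both.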

Table 3 Diversity and predicted richness of nifH genes recovered from the EIO by Illumina Hiseq2500 platform

Phylogeny and Composition of Diazotrophs

Because a large number of OTUs were recovered and most of them were unidentifiable, we focused only on the top OTUs in the present study. OTUs containing ≥ 500 sequences across all the samples were defined as top OTUs. The top 30 OTUs accounted for more than 97% of all the sequences and were included in the subsequent phylogenetic analysis (Fig. 4). Reference sequences from NCBI were all retrieved from marine environments, and unknown species were labeled with the names of the oceans from which they originated. The top 30 OTUs were grouped into three defined clusters of nifH genes [2], of which 28 OTUs belonged to cluster I, 1 OTU (Phasecolarcto bacterium) belonged to cluster II, and 1 OTU (Verrucomicrobiae bacterium) belonged to cluster III (Fig. 4). Nearly half of the OTUs were closely related to Alphaproteobacteria (14/30 OTUs), followed by Betaproteobacteria (5/30 OTUs), Gammaproteobacteria (5/30 OTUs), and Cyanobacteria (4/30 OTUs).

Fig. 4

A neighbor-joining phylogenetic tree (left) and a bubble map to show OTU relative abundance (right) at each site. The tree was constructed by nifH gene amino acid sequences obtained from this study and reference sequences from GenBank. The topology of the tree was inferred from 1000 bootstrap resampling, and bootstrap values greater than 50% were shown with red labels at branches. The bubble sizes corresponded to relative abundance of each OTU

Relative abundance of diazotrophs is shown in Figs. 4 and 5. Overall, Proteobacteria was clearly the most dominant group in the EIO, followed by Cyanobacteria. Within all the diazotrophic community, OTU 550 shared 100% similarity with a recently isolated Sagittula (Rhodobacteraceae, Alphaproteobacteria) [60] and dominated diazotrophs across all samples. OTU 550 was detected in all the samples but was more abundant in the Equator region (Stn. I405 and I413) and deep layers (Fig. 4). Besides OTU 550, other Alphaproteobacteria such as OTU 115 (Novosphingobium malaysiense), OTU 881, and OTU 537 (Bradyrhizobium sp.) were also commonly detected. OTU 702 was the second most dominant diazotroph which shared 100% similarity with an uncultured Betaproteobacterium, and its occurrence was noted in deep waters. OTU 536, closely related to Gammaproteobacterium γ-24774A11, was also recovered in the present study though in low abundance. Apart from γ-24774A11, other Gammaproteobacteria including OTU 1136, OTU 1128 (Vibrio diazotrophicus), OTU 659, and OTU 388 (Pseudomonas stutzeri) were also detected.

Fig. 5

Depth profiles of nifH gene abundances (log10 copies L−1) for Sagittula castanea (alpha-HQ586648), γ-24774A11, Trichodesmium, and UCYN-B. For convenience, one gene copy was used to represent where nifH genes were under detection

In addition, four OTUs clustered with cyanobacteria, including OTU 229 (Trichodesmium spp.), OTU 325 (Crocosphaera watsonii), OTU 888 (UCYN-A3), and OTU 301 (Cyanothece sp. WH 8904). Among these, Trichodesmium spp. was the most abundant cyanobacterium, while the other cyanobacteria exhibited low abundances (Fig. 4). Trichodesmium sequences were recovered from all surface waters except at Stn. I405. Compared to the Equator area, Trichodesmium was more abundant in the BOB, especially at Stn. I203, where Trichodesmium blooms might have occurred. Meanwhile, Crocosphaera watsonii, UCYN-A3, and Cyanothece sp. WH 8904 were mainly detected in surface waters at Stn. I210 and I413.

Quantification of Dominant nifH Phylotypes

Abundances of four major nifH phylotypes (Sagittula castanea, γ-24774A11, Trichodesmium spp., and UCYN-B) were quantified by qPCR. With non-target templates, amplification was either undetected or gave high Ct values: Sagittula castanea (not detected), γ-24774A11 (Ct = 38), Trichodesmium (not detected), and UCYN-B (Ct = 35.4). The sensitivity and accuracy of the standard curves for the different targets are shown in Table 4.

Table 4 Sensitivity and accuracy of the standard curves determined by primer pairs and probes for different targets

Depth profiles of nifH genes for the four dominant phylotypes are shown in Fig. 5, and the depth-integrated abundances (down to 200 m) are shown in Fig. 6. The highest abundances were usually detected in the upper layer from 0 to 50 m. Sagittula castanea was detected at all stations and all depths, and peaked at 25 m across all sampling sites (Fig. 5). Based on depth-integrated gene abundances, Sagittula castanea was more abundant in the Equator region and the offshore than in the BOB (p < 0.05) (Fig. 6). Trichodesmium, UCYN-B, and γ-24774A11 were generally concentrated in the upper water layers. For instance, nifH genes of Trichodesmium spp. reached up to 3 × 10⁸ copies L−1 in the surface water at Stn. I203 (Fig. 5). Trichodesmium and UCYN-B showed similar patterns in depth-integrated gene abundances, with higher gene copy numbers in the BOB and lower numbers in the Equator region (p < 0.05) (Fig. 6). For γ-24774A11, no significant differences were observed among regions. In summary, our results demonstrated contrasting spatial patterns for Trichodesmium (higher in the BOB) and Sagittula castanea (higher in the Equator region and the offshore) in the EIO (Fig. 6).

Fig. 6

Depth-integrated (0–200 m) gene abundances (× 10⁶ copies m−2) for Sagittula castanea, γ-24774A11, Trichodesmium, and UCYN-B in the EIO. Bubble sizes corresponded to gene abundances at each sampling station

Statistical Analysis

Spatial distribution patterns (both horizontal and vertical) of the diazotrophic communities in the EIO are presented in Fig. 7. Horizontally, the surface samples from the same transect mostly grouped together (Fig. 7a), showing similar community structures within the same area. The surface samples in the BOB were mainly composed of Cyanophyceae, while samples in the Equator region and the offshore were mainly composed of Alphaproteobacteria (Fig. 7b). Vertically, the surface samples and the deeper samples were separated at 30% similarity (Fig. 7a). This indicated that the surface samples were distinct from the deeper samples and that the composition of the diazotrophic communities varied along the vertical gradient. In Fig. 7a, samples from 75 m and 200 m mostly grouped together, except for those from Stn. I405. The two samples from Stn. I405 were mainly composed of Alphaproteobacteria, whereas Betaproteobacteria was also an important component of the other deeper samples (Fig. 7b).
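The similarity levels quoted above refer to Bray–Curtis similarity computed on square-root transformed abundances, as stated in the Methods. A minimal Python sketch of that resemblance measure for one pair of samples is given below. The formula S = 100 × (1 − Σ|x − y| / Σ(x + y)) is the standard percentage form of Bray–Curtis similarity; the function name is hypothetical, and the sketch is not the PRIMER implementation used in the study.

```python
import math


def bray_curtis_similarity(sample_a, sample_b):
    """Bray-Curtis similarity (%) between two samples after sqrt transformation.

    sample_a, sample_b : OTU abundances listed in the same OTU order
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must cover the same set of OTUs")

    # Square-root transformation down-weights the most abundant OTUs.
    a = [math.sqrt(x) for x in sample_a]
    b = [math.sqrt(y) for y in sample_b]

    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    if denominator == 0:
        raise ValueError("similarity is undefined when both samples are empty")
    return 100.0 * (1.0 - numerator / denominator)
```

Two samples with identical composition score 100%, and two samples sharing no OTUs score 0%, so a 30% threshold groups samples that share only a modest fraction of their transformed abundance.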

Fig. 7

Non-metric multidimensional scaling (NMDS) analysis of diazotrophic communities (stress = 0.14, similarity = 30%) from the EIO (a) and corresponding community structures at class level (b). Different classes and samples from different depths were color coded

Correlations between the diazotrophic community and associated environmental factors were analyzed by CCA (Fig. 8). Temperature, salinity, phosphate, and ammonium were included in the CCA after excluding environmental factors with VIFs > 20. The first two axes explained 83.63% of the total variance in the diazotrophic community distributions. Temperature (p = 0.001), salinity (p = 0.001), and phosphate (p = 0.001) contributed significantly to the total variance and were closely associated with the first and second axes (999 Monte Carlo permutations). The sample distributions in the CCA were similar to those in the NMDS, which also showed a vertical separation. Temperature was positively correlated with the diazotrophic distributions in surface waters, while salinity and phosphate were driving the distribution of the diazotrophic communities in deep water samples. Cyanobacteria were all positively correlated with temperature, while most Proteobacteria were positively related to salinity and phosphate. Interestingly, γ-24774A11 plotted together with cyanobacteria, suggesting that γ-24774A11 possibly shared the same ecological niches as cyanobacteria.

Fig. 8

Canonical correspondence analysis (CCA) based on diazotrophic composition and biotic/abiotic parameters as explanatory variables. The two CCA axes (CCA1 and CCA2) explained 83.63% of total variations in abundance data. Arrows represented environmental parameters. Different colors of circle and triangle symbols represented different samples and taxa, respectively. Only the top 30 OTUs were included. Significance (**) was determined by 999 Monte Carlo permutation tests with R V.3.0 software

The correlations between nifH gene abundances and various environmental parameters are listed in Table 5. Among these, the nifH gene abundances of γ-24774A11, Trichodesmium, and UCYN-B all exhibited similar relationships with the environmental parameters: significant negative correlations with water depth (p < 0.01), salinity (p < 0.01), and nutrients (p < 0.01), and a positive correlation with temperature (p < 0.01). In contrast, Sagittula castanea was homogeneously distributed in the water column and therefore exhibited no correlation with depth or temperature, but showed significant negative correlations with nitrate and phosphate (p < 0.05) (Table 5).

Table 5 Correlation analysis of nifH gene abundances (log10 copies L−1) (Trichodesmium spp., UCYN-B, Sagittula castanea, and γ-24774A11) and environmental parameters

Discussion

Recent studies have shown that putative N2-fixing phylotypes of heterotrophic diazotrophs appear to be ubiquitous in diverse marine and estuarine environments, even in deep, cold, or high-latitude waters where cyanobacteria are scarce or absent [61,62,63,64]. Because of their widespread distribution, heterotrophic diazotrophs are potentially important N2 fixers, and global N inputs estimated solely from cyanobacterial diazotrophs are likely underestimated. Similar to previous studies, our data show that Proteobacteria are the most abundant diazotrophs in the EIO, especially in the equatorial region and deep waters. The prevalence of diverse heterotrophic N2-fixing bacteria in the oceans has been attributed to the overproduction of phytoplankton and the accumulation of DOC in a previous study [65]. During our cruise, we observed unusually high phytoplankton primary productivity (Liu HJ et al. unpublished data) in the equatorial region, which could be influenced by the Wyrtki Jets [66]. The periodic Wyrtki Jets carry high-salinity and high-nutrient waters eastward from 60° E along the equator [67]. Rahav et al. explained that the addition of polysaccharides may increase the efficiency of ectoenzymes such as β-glucosidase, generating bioavailable molecules through enhanced polysaccharide hydrolysis [68]. Further, numerous studies have demonstrated that heterotrophic diazotrophs are stimulated by the addition of nutrients or dissolved organic matter (DOM) [69, 70]. However, the distribution of heterotrophic diazotrophs and their environmental drivers in marine ecosystems are still poorly understood. Further studies on biogeochemical, physio-ecological, and molecular aspects are needed to evaluate the roles that heterotrophic diazotrophs play in nutrient and carbon cycling in the global oceans.

Alphaproteobacterial genes contributed the largest fraction of nifH gene diversity in our study, although previous studies have revealed that Gammaproteobacteria dominate the global oceans [11, 16]. The dominance of Alphaproteobacteria and Betaproteobacteria in diazotrophic communities was also reported in the Arabian Sea and the Western Equatorial region during the northeast monsoon [14]. In the present study, OTU550 was the most abundant alphaproteobacterium as well as the dominant diazotroph in the EIO. The representative sequence of OTU550 shared 100% similarity with Sagittula castanea, a new species isolated from the oxygen minimum zone off Peru in the eastern tropical South Pacific (ETSP) [60]. Sagittula castanea has been confirmed to be capable of fixing N2 in laboratory experiments, but it is still uncertain whether this species actively fixes N2 in natural environments [60]. Sagittula-related nifH sequences have been reported to be dominant in both shelf and open-ocean waters of the ETSP, in surface waters of the Indian Ocean, and in deep waters of the South China Sea, where N2 fixation was measurable [14, 51]. Presumably, non-cyanobacterial diazotrophs were responsible for N2 fixation in these regions. Sagittula castanea was inferred to have a particle-attached lifestyle based on metabolic pathway analysis [60]. The high C:N ratio of particles provides sufficient organic carbon for heterotrophic diazotrophs. Because Sagittula castanea was distributed throughout the euphotic zone (Figs. 4 and 5), this species exhibited no correlation with water depth, temperature, or salinity (Table 5). However, the negative correlations between Sagittula castanea and nitrate/phosphate suggested that the growth of Sagittula castanea might be limited by factors other than nutrients (Table 5).

In this study, γ-24774A11, a typical gammaproteobacterial phylotype with a broad geographic distribution, was also retrieved. Our qPCR results showed that the depth-integrated gene abundance of γ-24774A11 was two to three orders of magnitude lower than that of Sagittula castanea and Trichodesmium spp. (Fig. 6). Although γ-24774A11 was not a dominant diazotroph in our study, its abundance was comparable to that reported in other studies of the global oceans [14, 16]. γ-24774A11 always occurred in the upper layers of the water column and was undetectable below approximately 100 m (Fig. 5). The correlation analysis showed that γ-24774A11 abundances were significantly related to depth and dissolved inorganic nitrogen (Table 5), which agreed well with a previous study conducted in the South Pacific Ocean [15]. Interestingly, the distribution of depth-integrated gene abundances of γ-24774A11 was similar to that of Trichodesmium and UCYN-B, but contrasted with that of Sagittula castanea. Our CCA also indicated that γ-24774A11 and cyanobacteria occupied overlapping ecological niches. As stated above, the availability of DOC can be a limiting factor for the growth of heterotrophic N2-fixing bacteria, and therefore a positive correlation was expected between phytoplankton production and the abundance of γ-24774A11. However, in this study, γ-24774A11 was negatively correlated with Chl a. The same observation was reported by Moisander et al. (2014), who suggested that environmental conditions favoring Trichodesmium and Crocosphaera watsonii would also benefit the growth of γ-24774A11 [15].
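
Depth-integrated abundances such as those compared above can be derived from discrete-depth qPCR values in several ways. The following is a minimal sketch assuming trapezoidal integration between sampled depths; the integration scheme, the function name depth_integrate, and the example profile are assumptions made for illustration and are not taken from this study.

```python
def depth_integrate(depths_m, copies_per_l):
    """Trapezoidal depth integration of a gene-abundance profile.

    depths_m     : sampling depths in metres
    copies_per_l : nifH gene copies per litre at each depth
    Returns the integrated abundance in copies per square metre.
    """
    if len(depths_m) != len(copies_per_l) or len(depths_m) < 2:
        raise ValueError("need at least two matching depth/abundance pairs")
    # Sort by depth so the profile can be supplied in any order.
    pairs = sorted(zip(depths_m, copies_per_l))
    total = 0.0
    for (z0, c0), (z1, c1) in zip(pairs, pairs[1:]):
        mean_copies_per_m3 = 0.5 * (c0 + c1) * 1000.0  # 1 m^3 = 1000 L
        total += mean_copies_per_m3 * (z1 - z0)
    return total


# Hypothetical profile (NOT data from this study).
example_depths = [5, 25, 50, 75, 100]
example_copies = [3.0e3, 2.5e3, 1.0e3, 2.0e2, 0.0]

integrated = depth_integrate(example_depths, example_copies)
print(f"{integrated:.3e} copies m-2 between 5 and 100 m")
```

Because integration weights each sample by the thickness of the layer it represents, a phylotype confined to the upper 100 m can have a much lower integrated abundance than one spread through the whole euphotic zone, even when their surface concentrations are similar.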

Trichodesmium spp. were more abundant in the BOB than in the Equator region and along the transect parallel to the coast of Sumatra, although hydrological conditions seemed favorable for Trichodesmium across the whole study area. In particular, at Sta. I203, Trichodesmium nifH gene abundance reached 1.5 × 10^8 copies L−1 in surface water according to qPCR (Fig. 5). This difference in Trichodesmium distribution is difficult to explain given the similar hydrological conditions across the study area. Many studies have shown that iron (Fe) is an important cofactor of nitrogenase and plays a crucial role in the synthesis and expression of nitrogenase in diazotrophs [14, 30]. According to the literature, dissolved Fe (dFe) in surface waters ranged from 0.1 to 0.62 nM in the north-eastern Indian Ocean, and the concentration decreased dramatically from the BOB to the Equator region [71, 72]. We therefore hypothesize that Fe availability might limit the growth of Trichodesmium in the Equator region, whereas Fe was not a limiting factor in the BOB. A similar result was reported by Shiozaki et al. (2014), who observed high Fe concentrations coupled with high Trichodesmium abundance in the AS, but severely limited Fe concentrations coupled with undetectable Trichodesmium in the Equator region [14]. Another possible explanation is the mesoscale eddy circulation that frequently occurs in the BOB during the pre-southwest monsoon. Anti-cyclonic (warm-core) eddies are common in the Indian Ocean during the pre-southwest monsoon, and they can influence the distribution of phytoplankton by regulating the vertical structure of environmental parameters [73]. Jyothibabu reported that most Trichodesmium blooms recorded in the Indian Ocean occurred during the pre-southwest monsoon, and that warm-core eddies caused downwelling of surface waters and provided optimal growing conditions for Trichodesmium [74]. Furthermore, the gas vesicles of Trichodesmium provide buoyancy, which helps the cells migrate vertically in the water column to obtain phosphate and other nutrients from deeper waters and to float at the sea surface, forming patches, bands, or mats depending on the state of the sea [75].

In addition to Trichodesmium, other cyanobacteria such as UCYN-A3, Crocosphaera watsonii, and Cyanothece sp. WH 8904 were retrieved in the present study via high throughput sequencing. UCYN-A has been reported to be widely distributed in tropical and subtropical oceans and to contribute significantly to BNF in these regions [76, 77]. According to nifH phylogeny, the UCYN-A lineage is divided into at least four main sublineages, namely, UCYN-A1, UCYN-A2, UCYN-A3, and UCYN-A4 [78]. However, in the present study, only UCYN-A3 was recovered via high throughput sequencing. Turk-Kubo et al. (2016) reported that the UCYN-A3 sublineage is more widely distributed in oligotrophic waters than the other three sublineages [79]. The optimum temperature for the UCYN-A lineage has been reported to be 24 °C, whereas water temperature in the present study ranged from 29.3 to 31.1 °C, which is generally unfavorable [31]. Therefore, water temperature and nutrient levels likely explain the low occurrence of UCYN-A in our study. Crocosphaera watsonii was also present at low abundance in our study. Fu et al. (2014) reported that the thermal limits of Crocosphaera watsonii ranged from 24 to 32 °C, and that the optimum growth temperature was 30 °C [80]. Our correlation analysis also showed that the gene abundance of Crocosphaera watsonii was positively related to water temperature and negatively related to salinity and phosphate. Although the temperature in the EIO seemed favorable for the growth of Crocosphaera watsonii, it was still present at low abundance. Shiozaki et al. [14] suggested that shallower nitracline depths, which could result in higher upward fluxes of nutrients to the surface water, were responsible for the low abundance of UCYN-B in the EIO [52]. To date, little is known about the factors that control the distribution of Cyanothece sp. WH 8904.

Our results indicated that temperature, salinity, and phosphate were the major environmental factors explaining the variation and distribution of the diazotrophic community in the EIO. Similarly, salinity and phosphate were reported to be responsible for spatio-temporal variations of the diazotrophic communities in the northern South China Sea [11]. The diazotrophic communities may also exhibit vertical distribution patterns with water depth; however, Loescher et al. (2014) suggested that water depth integrated temperature, salinity, and other physical variables, as they were all highly collinear with depth [70]. Nevertheless, the distribution and composition of the diazotrophic communities were strongly influenced by these environmental variables. Ammonia was also included in our CCA, although its influence on the diazotrophic communities was not significant. In addition, because of their high collinearity with other factors, nitrate and nitrite were not included in our CCA. This does not rule out a potential role of such inorganic nutrients in structuring the spatial heterogeneity of the diazotrophic communities. In fact, it is generally believed that reactive inorganic nutrient concentrations regulate the abundance of diazotrophs as well as N2 fixation [16]. Presumably, these widespread heterotrophic diazotrophs have adaptive mechanisms to cope with high concentrations of various inorganic nutrients. For instance, a recent comparative genomics study revealed that nitrogenase in an isolated heterotrophic bacterium can be activated at high ammonia concentration in the absence of functional CbbP or DraT2 proteins [81]. Nevertheless, how heterotrophic diazotrophs respond to ambient nutrients warrants future study.
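
The exclusion of collinear variables before ordination, as described above for nitrate and nitrite, can be illustrated with a simple screening step. The sketch below assumes a pairwise Pearson correlation filter with a cutoff of |r| ≥ 0.9; the criterion used in this study is not stated here, so the cutoff, the function names, and the station values are all hypothetical and serve only to show the idea.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_collinear(variables, threshold=0.9):
    """Keep variables, in the order given, that are not collinear with
    any variable already kept (|r| below the threshold)."""
    kept = []
    for name, values in variables.items():
        if all(abs(pearson_r(values, variables[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical station values (NOT data from this study). Nitrate is
# constructed as a linear function of temperature so that it is collinear.
env = {
    "temperature": [30.0, 29.0, 27.0, 24.0, 20.0, 16.0],
    "salinity": [33.5, 34.8, 33.2, 34.9, 33.9, 34.6],
    "phosphate": [0.10, 0.05, 0.60, 0.20, 0.90, 0.50],
    "nitrate": [0.1, 2.1, 6.1, 12.1, 20.1, 28.1],
}

kept = drop_collinear(env)
print("variables retained for ordination:", kept)
```

In this toy case nitrate is dropped because it duplicates the information carried by temperature. A variable removed this way can still be ecologically important; the filter only prevents redundant predictors from destabilizing the ordination.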

Conclusion

To the best of our knowledge, this study presents the first evidence of the composition and distribution of diazotrophic communities in the EIO during the pre-southwest monsoon. Because of logistical constraints, our results were all generated from molecular data, and direct cell counts were not included. In addition, because of the possibly low abundance of diazotrophs, we applied a high number of PCR cycles in gene amplification. Although this is a well-developed protocol [e.g., 59], we recognize that a high number of PCR cycles might introduce PCR bias and skew the characterization of community structure. Furthermore, the primers used in our nested PCR were recently reported to contain one mismatch that might influence the results of high throughput sequencing [64]. Despite these challenges, we observed diverse groups of diazotrophs belonging to the Alpha-, Beta-, and Gammaproteobacteria, the cyanobacterial cluster, and Firmicutes. Among these, Proteobacteria were the dominant diazotrophs in the EIO during the pre-southwest monsoon, potentially contributing significantly to BNF in this oligotrophic ecosystem. Our results were in good agreement with previous studies conducted in the Western Indian Ocean, where heterotrophic bacteria were also reported to be dominant within the euphotic zone. To date, the ecology of marine heterotrophic diazotrophs and their actual contribution to BNF in the EIO remain poorly understood. Further investigations of niche specialization and eco-physiological characterization of diazotrophs are needed to better understand the interactions between heterotrophic and cyanobacterial diazotrophs and how these interactions influence global nitrogen cycling.