Introduction

Microalgae and cyanobacteria are well known for their tremendous potential to colonize a wide range of habitats, from the most benign to extreme environments such as thermal springs, cold springs, and hypersaline systems (Badger et al. 2006; Ward et al. 2012; Singh et al. 2018; Malavasi et al. 2020). These organisms are extremely important for sustaining vital processes of aquatic and terrestrial systems, including agroecosystems (Sharma et al. 2012; Sharma 2015; Singh et al. 2014; Rai et al. 2019; Chittora et al. 2020). They may play an important role in providing resilience to the rhizospheric microbial community (Ahmed et al. 2010), as they do in aquatic ecosystems (Akins et al. 2018). However, microalgae and their community composition also change under the influence of environmental disturbances (Chaurasia 2015; Lear et al. 2017; Ward et al. 2017). Hence, the biodiversity of microalgae and cyanobacteria and their responses to diverse physicochemical factors should be explored appropriately (Chaurasia 2015). Unfortunately, microalgae and cyanobacteria remain a marginalized component of biodiversity research (Nabout et al. 2013; Rejmánková et al. 2004; Chaurasia 2015; Suganya et al. 2016), and the role of their community composition has often been ignored in the modelling of agricultural ecosystem processes (Allison and Martiny 2008).

Owing to their small size and highly variable distribution patterns, adequate assessment of the biodiversity of microorganisms is more challenging than that of higher plants. Disparities in distribution patterns arise from variations in prevailing environmental parameters, which these organisms may experience even at proximate sites because of vertical and horizontal gradients in factors such as pH, temperature, humidity and nutrient availability (Armitage et al. 2012; Shen et al. 2015; O’Brien et al. 2016; Van Der Putten 2017; Banerjee et al. 2020). Sometimes, temporal variations are crucial in shaping the structure of a microbial community. For example, in early spring, the diatom populations in freshwater systems increase due to the availability of nutrients, but when light intensity increases during summer, the species richness of green algae and cryptophytes rises. Subsequently, green algae are replaced by large-sized diatom species and cyanobacteria during early autumn, and the community is again dominated by diatoms with the return of winter (Sommer et al. 1986, 2012). Similar changes in the microbial community of agricultural systems can also be expected across different crop seasons. Thus, the timing of a field study becomes critically important in biodiversity research and needs adequate attention from researchers (Fattorini 2003; Pagliarella et al. 2018).

In biodiversity research, plant and animal ecologists generally employ advanced statistical concepts and tools, such as sampling designs, categorization, normalization, data extrapolation, regression analysis, ANOVA, factor analysis, cluster analysis, logistic regression, and generalized linear and generalized additive modelling, to enhance the quality of their studies (Guisan et al. 2002; Fattorini 2003; Chiarucci et al. 2011; Pagliarella et al. 2018). However, very few microbiologists have given suitable consideration to these tools when assessing the microbial diversity of different habitats (Ampe and Miambi 2000; Oliveira et al. 2020; Banerjee et al. 2020). Counting the individuals present in a sampling area of a specified quadrat size is recognized as a standard procedure for studying the diversity of higher plants (Misra 1968; Kershaw 1973; Cox 1990; Elzinga et al. 2001). However, this approach cannot be straightforwardly employed for exploring microbial biodiversity. Here, we either depend on haemocytometer-based microscopic counting or are restricted to advanced but expensive molecular and omics approaches (Hill et al. 2003; Gupta et al. 2013; Emerson et al. 2017; Kushwaha et al. 2020). The development of molecular tools has changed the way microbial diversity is studied. It has greatly helped in revealing the existence of non-culturable microbial forms, which are considered major components of the soil ecosystem (Schleifer 2004; Bodor et al. 2020). However, to extract comprehensible information, both haemocytometer-based and molecular tool-based biodiversity assessments ultimately require mathematical, statistical and software-based data analysis (Ampe and Miambi 2000; Hill et al. 2003; Oliveira et al. 2020; Banerjee et al. 2020).

The present study provides an overview of the biodiversity of microalgae and cyanobacteria in crop fields. It includes a specific discussion of the biodiversity concept, its types, components and different measures. Modern statistical tools, such as principal component analysis (PCA) and canonical correspondence analysis (CCA), are covered because they help identify the major factors influencing the distribution of microorganisms in a specific habitat or community. ANOSIM and SIMPER are also discussed since they are useful in the comparative analysis of biodiversity between communities and at the landscape level. Efforts have also been made to describe the merits and shortcomings of some available software packages.

Diversity of microalgae and cyanobacteria in the agricultural fields

The importance of cyanobacteria in enhancing the fertility of agricultural fields or reclamation of usar lands has long been very well established (Singh 1950, 1961). The potential of microalgae and cyanobacteria in agriculture and other applied fields was recently reviewed by several researchers (Abdel-Raouf et al. 2012; Singh et al. 2014, 2016; Abinandan et al. 2019; Rai et al. 2019; Chittora et al. 2020). A list of microalgae and cyanobacteria often reported from agriculture fields can be seen in Table 1, with a brief mention of major mechanisms through which they may contribute to enhancing the quality of agricultural lands.

Table 1 Some agriculturally important cyanobacteria

Based on laboratory and pilot-scale experiments, the beneficial effects of selected microalgal and cyanobacterial species in improving the phosphorus, nitrogen, and carbon content of the soil have been regularly documented (Karthikeyan et al. 2009; Prasanna et al. 2012; Natarajan et al. 2012; Swarnalakshmi et al. 2013). However, the actual biodiversity of microalgae and cyanobacteria in agricultural fields has been investigated only sporadically (Ahmed et al. 2010; Alvarez et al. 2021). Irissari et al. (2001) reported the diversity of unicellular, heterocystous and non-heterocystous cyanobacteria in the paddy fields of Uruguay. The diversity of N2-fixing cyanobacteria in agricultural fields of Thailand increased with crop rotation and was affected by environmental factors and season (Chunleuchanon et al. 2003). In the rice fields of Fujian (China), 11 genera of cyanobacteria were identified using 16S rRNA gene sequencing (Song et al. 2005). A survey of cyanobacteria and microalgae in the cornfields of north-eastern Italy showed a decrease in cyanobacterial diversity due to prolonged use of chemical fertilizers (Zancan et al. 2006). Hendrayanti et al. (2018) chronicled various cyanobacterial representatives from the paddy fields of Serang Mekar Village, Ciparay-South Bandung, West Java, Indonesia. The cyanobacterial richness in the agricultural lands of Al Diwaniyah city (Iraq) was represented by 96 species, mostly belonging to N2-fixing unicellular and filamentous forms (Alghanmi and Jawad 2019).

Several unicellular, heterocystous and non-heterocystous cyanobacteria have been documented from the rice fields of different states of India, such as Kerala, Meghalaya, Tamil Nadu, Assam, Bihar, Orissa, Uttar Pradesh, Telangana and Maharashtra (Prasanna and Nayak 2007; Srivastava et al. 2009; Dey et al. 2010; Bharadwaj and Baruah 2013; Singh et al. 2014; Khare et al. 2014; Vijayan and Ray 2015; Srinivas and Aruna 2016). Anabaena circinalis showed the maximum relative abundance among the diverse cyanobacterial species reported from the rice fields of Assam (Bharadwaj and Baruah 2013). The rice-based cropping systems of north and eastern India exhibited the presence of Nostoc, Anabaena and Phormidium, with a predominance of heterocystous forms (Prasanna et al. 2013b). Singh et al. (2014) recorded 29 cyanobacterial strains from the paddy fields of Chhattisgarh, of which 15 were non-heterocystous. Likewise, 19 species of cyanobacteria (11 heterocystous and 7 non-heterocystous) were identified from paddy fields of Bihar, India (Khare et al. 2014). In the Kuttanadu Paddy Wetlands (Kerala, India), 45 species of cyanobacteria were documented. Here, Chroococcus turgidus showed the maximum relative abundance, while the highest species richness was observed during the monsoon season, when the paddy crop attained the panicle growth stage (Vijayan and Ray 2015). Srinivas and Aruna (2016) reported members of Nostocaceae, Chroococcaceae, Scytonemataceae and Oscillatoriaceae in the rice fields of Telangana, India. Anabaena and Oscillatoria were abundant in the paddy fields of Patan and Karad (Maharashtra, India; Ghadage and Karande 2019).

Some researchers have documented cyanobacterial species from soils other than paddy fields. Zancan et al. (2006) reported the cyanobacterial diversity of cornfields of north-eastern Italy. Ahlesaadat et al. (2017) characterized the cyanobacterial diversity of wheat fields of Yazd province, Iran. More recently, Alghanmi and Jawad (2019) explored cyanobacterial diversity in soils under a variety of crops in Al Diwaniyah city, Iraq. Rai et al. (2018) investigated the diversity of cyanobacterial forms along a rural-urban gradient. These latter authors concluded that urbanization adversely affected the diversity and microbial community composition but favoured heterocystous forms.

It is concerning to note that the majority of the above-mentioned studies are restricted to cyanobacteria and totally ignore the microalgal component of the ecosystem. Further, most of these studies lack adequate quantitative estimates of microalgal and cyanobacterial diversity and fail to furnish sufficient sampling details. The fact that most field studies underestimate cyanobacterial diversity is attributable, inter alia, to (i) low sampling effort, (ii) the sensitivity of the molecular markers used, and (iii) researcher-dependent definitions of species (Dvořák et al. 2015).

Biodiversity and its types

Biodiversity (Wilson 1988) refers to all kinds of variation in organisms, from the gene to the biosphere level. One may come across a variety of terms, such as genetic diversity, phylogenetic diversity, species diversity, ecological diversity, ecosystem diversity, functional diversity, etc., in the existing literature. All these terms are used either to express the different levels of understanding of biodiversity or to reflect its diverse ecological and functional perspectives. Whittaker (1972) introduced the concept of alpha, beta and gamma diversity, which describe biodiversity within a community, between communities and at the landscape level, respectively. Gamma diversity represents the total diversity of the landscape, while alpha diversity is the diversity of the sub-communities residing at a local scale. These two diversities are straightforward to comprehend and measure. Beta diversity, however, is comparative and represents the differences between two sub-communities. Ecologists first estimate alpha and gamma diversity and then derive beta diversity from these two. Initially, beta diversity was proposed to involve multiplicative partitioning (i.e., DαDβ = Dγ); later, an additive formulation was proposed (i.e., Dα + Dβ = Dγ), taking into account that alpha and beta diversity are not necessarily independent (Daly et al. 2018). It seems worth mentioning here that although additive and multiplicative partitioning of biodiversity are appreciated and widely used because they offer a single set of values of alpha and beta diversity, both methods suffer from the disadvantage of a significant loss of information (Daly et al. 2018).
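
The two partitioning schemes can be illustrated with a minimal Python sketch. It assumes that diversity is measured simply as species richness and that alpha is taken as the mean richness per site; both are simplifying assumptions made for illustration, and the function name and site records are hypothetical.

```python
def partition_richness(sites):
    """Partition species richness into alpha, beta and gamma components.

    `sites` is a list of sets, each holding the species recorded at one site.
    Assumption: diversity D is plain species richness, alpha is the site mean.
    """
    gamma = len(set().union(*sites))                  # total richness of the landscape
    alpha = sum(len(s) for s in sites) / len(sites)   # mean richness per site
    return {
        "alpha": alpha,
        "gamma": gamma,
        "beta_multiplicative": gamma / alpha,         # from D_alpha * D_beta = D_gamma
        "beta_additive": gamma - alpha,               # from D_alpha + D_beta = D_gamma
    }


# Hypothetical records from two crop-field sites
sites = [
    {"Nostoc", "Anabaena", "Chlorella"},
    {"Nostoc", "Oscillatoria", "Chlorella"},
]
result = partition_richness(sites)
```

With these two sites, gamma is 4 and alpha is 3, so the multiplicative beta is 4/3 while the additive beta is 1; the two schemes express the same turnover on different scales.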

Biodiversity which focuses on the functional roles of species in communities and ecosystems is termed functional diversity (Laureto et al. 2015). The functional diversity of habitat, niche space, community or ecosystems is of immense importance as it is directly related to the diverse aspects of ecosystem processes, such as productivity, nutrient cycling, ecosystem stability and sustainability (Petchey and Gaston 2006; Costanza et al. 2007; Laureto et al. 2015). The idea of plant functional traits has emerged from here and now has attracted a great deal of attention of modern-day ecologists, working in the field of higher plants diversity (Petchey and Gaston 2006; Laureto et al. 2015). Different kinds of models, such as the sampling effect model and niche differentiation model, have been proposed by ecologists to assess the effects of functional diversity on the productivity of the ecosystem. Species redundancy hypothesis and niche complementarity model help understand the relationship between functional diversity and ecosystem processes (Goswami et al. 2017). The two widely used models, rivets and idiosyncratic are useful in comprehending the interdependency of species richness and functional diversity for the stability of an ecosystem (Ehrlich and Ehrlich 1981; Lawton 1994). Nevertheless, the concept of functional diversity has not been adequately explored in the case of microalgae and cyanobacteria (Goswami et al. 2017). Most of the microbial studies conducted so far are devoted largely to discovering the species and phylogenetic diversities of the microbial communities in question.

Basic components of biodiversity

Species richness and species concept in cyanobacteria and microalgae

Species richness and evenness are the two primary components of biodiversity. Almost all indices incorporate these two components to provide a quantitative assessment of biodiversity. Another component, which has gained less attention from researchers, is disparity (Daly et al. 2018). The total number of species present in the community under study is called species richness. By and large, species richness is straightforward to determine for higher plants, for which the taxonomic identification and description of a new species are well established. However, in the case of microalgae and cyanobacteria, both the species concept and the criteria for taxonomic identification of a new species are ambiguous (Gupta et al. 2013; Chaurasia 2015; Dvořák et al. 2015; Komárek 2016). Identification of microalgal and cyanobacterial species based only on morphological features is no longer favoured because of phenotypic plasticity (in response to different environments and culture media) and the presence of cryptic species (Hadi et al. 2016). As cyanobacteria reproduce asexually, their identified forms also do not fully satisfy the criteria of the biological species concept. Rippka et al. (1979) classified cyanobacterial forms into different groups, but this grouping is inadequate in view of the taxonomic species concept.

Classification based on 16S rRNA gene sequence is used for molecular identification of cyanobacteria (Hoffmann et al. 2005). Komárek (2006) advocated the use of molecular criteria for the identification of cyanobacterial species. However, the use of this approach becomes debatable when the outcome is correlated with morphological features, particularly considering the adaptations of cyanobacteria to the changing environmental conditions (Gupta et al. 2013). Moreover, molecular methods used for identification also do not fully justify the biological species concept. This approach may also reveal variable species identification of the same specimen by employing different molecular markers. Recently, DNA barcoding was employed by some researchers for the identification of microalgal and cyanobacterial species (Dvořák et al. 2015; Ballesteros et al. 2021). However, Eckert and his team reported barcoding gaps in more than half of the studied cases (Eckert et al. 2015). Hence barcoding needs proper validation before using it as a tool for the identification of microalgal or cyanobacterial species.

The polyphasic approach that takes into account morphological, genetic and ecological attributes of cyanobacteria and microalgae for species characterization has been employed by many researchers (Komárek and Kaštovský 2003; Zapomělová et al. 2013; Hauer et al. 2014; Komárek 2016; Sciuto and Moro 2015; Renuka et al. 2018). Several recent taxonomic revisions of cyanobacteria are based on this approach. However, some serious concerns are associated with this approach too. Dvořák and co-workers have provided an elegant discussion on the species concept and taxonomic diversity of cyanobacteria in their review (Dvořák et al. 2015). Moreover, the status of species characterization in the case of cyanobacteria and microalgae is still puzzling. Thus, this aspect demands sincere efforts as without framing a sound basis of species concept and identification of cyanobacteria and microalgae, it would not be possible to realize their actual diversity in any habitat or community, including agricultural ecosystems (Palinska and Surosz 2014; Chaurasia 2015; Komárek 2016).

Species evenness

In ecology, the equitability with which individuals are distributed among the species inhabiting a community is termed its evenness. If all the species inhabiting the community are present in equal proportions, the community is called even. In contrast, if species are disproportionately present, with one or two species dominating the community, it is referred to as uneven (Wittebolle et al. 2009). This concept is straightforward in the case of higher plants, but it again becomes complicated for cyanobacteria and microalgae due to the imprecise nature of the species concept and of their taxonomic identification. Evenness is a key factor that regulates the functional stability of ecosystems. It is also important for understanding the representation of the functional traits of each species. Communities with an uneven distribution of species are often believed to be susceptible to invasion and less resilient to stresses and disturbances (Wittebolle et al. 2009; Daly et al. 2018).

Species disparity

The third but often ignored component of diversity is disparity (Daly et al. 2018). Species richness and evenness are species-neutral measures of diversity; that is, they assume that distinct species have nothing in common and do not account for any disparity between species. Under this view, a community of five markedly different species is not considered more diverse than a community of five species of the same genus. However, this might not always hold in a natural community, particularly a microbial one. Various species of the same genus may possess several common attributes and thus might greatly influence the functional stability of the community. Disparity therefore accounts for both the similarity and the dissimilarity that exist between species. The similarity or dissimilarity between species can be measured on genetic, functional, morphological and phylogenetic grounds.

Functional diversity

Villéger et al. (2008) introduced the concept of enumerating functional diversity. This idea takes an inclusive approach and integrally involves the issue of disparity. These authors recommended the enumeration of functional richness, functional evenness and functional divergence. Since these indices are independently calculated, they do not influence each other, similar to Whittakerian measures like alpha, beta and gamma diversity. In addition, functional diversity measures are complementary indices. Functional richness has the merit of considering the niche and the niche volume of a particular species in a community (Mason and Mouillot 2013). Functional evenness gives weightage to species abundance when the functional space is filled by species (Villéger et al. 2008). Functional divergence analyzes the divergence of species in their functional space from the centre of gravity and also prioritizes abundance. Thus, these indices independently describe the arrangement of species (relative abundance and orientation) in a multidimensional functional space and bring to light biodiversity–environment–ecosystem relationships (Villéger et al. 2008). The mathematical expressions and other details of functional diversity indices can be found elsewhere (Villéger et al. 2008; Mason and Mouillot 2013). However, the concept of functional diversity is largely unexplored for microbial communities and thus demands adequate attention.

Biodiversity measurement within the community

Any important study of biodiversity, no matter which aspect is in focus, must include a quantitative evaluation. However, it is a complicated task both theoretically and practically. Biodiversity is enumerated by developing mathematical functions, usually known as biodiversity indices. The use of such indices allows comparison between spatial regions, temporal periods, taxa, niches or trophic levels. Biodiversity indices measure the taxonomical and phylogenetical relationship of the species and are the numerical, partial inter-changeable tools to quantify diversity (Clarke and Warwick 2001; Contoli and Luiselli 2015). After employing molecular tools for the identification of cyanobacteria and microalgae, indices are applied to enumerate the diversity of the region under study. For the metagenomics approach of diversity analysis, the indices may be calculated by counting the Operational Taxonomic Units (Hill et al. 2003; Rasheed et al. 2013). In literature, various kinds of diversity indices, such as Simpson (1949), Shannon and Weaver (1949) and Margalef (1958), have been suggested. The data collected either in binary form (i.e., presence or absence of species at a study site) or in a quantitative form, which contain many zero values for absent species, are required for the calculation of the indices. A meaningful discussion on the mathematical formulation of different indices and their grouping as classical, effective numbers, similarity sensitive and parametric families can be found elsewhere (Daly et al. 2018). Moreover, all these indices comprise certain strengths and may suffer from some kinds of constraints as well. Mathematical expression and parameter details of some useful biodiversity indices are listed in Table 2.

Table 2 Different measures used in biodiversity estimation

Species richness is the measure of the number of species present in a community and does not emphasize the number of individuals of a species present in the community. With the involvement of spatial diversity, species richness is regarded as the key measure of biodiversity (Elo et al. 2018). The most commonly used species richness indices are Margalef’s and Menhinick’s. These indices are easy to calculate and have a direct relationship with the number of species and sample size (Magurran 2004). The total number of species generally increases with increasing the sampling area. Nevertheless, species richness is the simplest index and is still being used by ecologists as a measure of diversity even though it does not throw any light on the relative abundance of documented species. Species abundance distribution can give an insight into the processes that decide the biological diversity of the communities. It reflects the competition for limiting resources among species (Magurran 2004). A precise study of temporal and spatial changes in a community should be done as it can provide information regarding variations in species abundance. Certain models such as the log-normal, the log-series, the broken-stick model and the geometric-series are used by researchers for such purposes (Tokeshi 1993; Hill et al. 2003). However, these models have been generally applied to higher plant communities and are rarely explored for microbial systems like cyanobacterial and microalgal communities.
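
As a minimal illustration of the two richness indices named above, the following Python sketch uses their commonly cited forms, Margalef's index as (S − 1)/ln N and Menhinick's index as S/√N, where S is the number of species and N the total number of individuals counted. The function names and the count data are hypothetical.

```python
import math


def margalef(counts):
    """Margalef's richness index: (S - 1) / ln(N)."""
    s = sum(1 for c in counts if c > 0)   # number of species actually observed
    n = sum(counts)                       # total number of individuals
    if n < 2:
        raise ValueError("at least two individuals are needed (ln N must be > 0)")
    return (s - 1) / math.log(n)


def menhinick(counts):
    """Menhinick's richness index: S / sqrt(N)."""
    s = sum(1 for c in counts if c > 0)
    n = sum(counts)
    return s / math.sqrt(n)


# Hypothetical cell counts for four taxa in one sample
counts = [50, 30, 15, 5]
d_margalef = margalef(counts)
d_menhinick = menhinick(counts)
```

Both indices scale the species count by sample size, which is why values from samples with very different counting effort should be compared with caution.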

Shannon’s index is the most commonly used measure for the estimation of ecological diversity (Tandon et al. 2007; Pandey and Kulkarni 2006). It is a mathematical measure of community composition, i.e., the number of species and the commonness of species in a community. It measures the degree of uncertainty in predicting the species of a random individual from a community with S species and N individuals. It is strongly influenced by rare species and species richness. Since this index is sensitive to slight variations in diversity that reflect the actual state of the environment, it is preferred over the other available indices. Yadav et al. (2018) used the Shannon diversity index for evaluating the effect of nutrient enrichment on the species composition of periphytic algal communities colonizing chemical diffusing substrates. Likewise, Ikram’s group successfully employed Shannon diversity to study changes in microalgal and cyanobacterial communities along the gradients of temperature and other physicochemical factors in two hot springs of Garhwal Himalaya, India (Ikram et al. 2021a). These latter authors showed that the Shannon diversity decreased considerably as water temperature exceeded 50 °C in the studied hot springs.
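
The uncertainty interpretation can be made concrete with a short Python sketch of the usual formulation, H′ = −Σ p_i ln p_i, where p_i is the proportion of individuals belonging to species i. The natural logarithm is an assumption here (other bases are also in use), and the counts are hypothetical.

```python
import math


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over the observed species."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]   # absent species contribute nothing
    return -sum(p * math.log(p) for p in proportions)


# Hypothetical communities with the same richness but different evenness
even_community = [25, 25, 25, 25]
uneven_community = [97, 1, 1, 1]
h_even = shannon_index(even_community)
h_uneven = shannon_index(uneven_community)
```

For four equally abundant species the index reaches its maximum, ln 4, whereas the community dominated by a single species yields a much lower value although its richness is identical.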

The other important index to measure biodiversity is the Simpson index. This index attaches importance to the evenness of common species and highlights the species that are dominant in the community (Simpson 1949). However, as higher values indicate lower diversity, this index is not considered a very natural measure of biodiversity. The reciprocal form of Simpson’s original index measures evenness but suffers from the constraint that it varies with species richness. The Gini-Simpson diversity index, also known as the probability of interspecific encounter, has an advantage over the other indices derived from Simpson’s original index as it is less sensitive to species richness and emphasizes the most abundant species in a community (Daly et al. 2018).
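
The relationship between the three Simpson-type measures can be sketched in Python as follows. The sketch assumes the simple proportion-based form D = Σ p_i² for Simpson's original index (a finite-sample variant based on n(n − 1)/N(N − 1) also exists); the function name and counts are hypothetical.

```python
def simpson_measures(counts):
    """Return Simpson's dominance D, Gini-Simpson (1 - D) and the reciprocal (1 / D)."""
    total = sum(counts)
    dominance = sum((c / total) ** 2 for c in counts)   # D = sum of squared proportions
    return {
        "simpson_dominance": dominance,     # higher value means lower diversity
        "gini_simpson": 1 - dominance,      # probability that two random individuals differ
        "inverse_simpson": 1 / dominance,   # equals S when all species are equally abundant
    }


# Hypothetical community of four equally abundant taxa
measures = simpson_measures([25, 25, 25, 25])
```

For this even community D is 0.25, the Gini-Simpson index is 0.75 and the reciprocal form equals the number of species, 4, which shows why the reciprocal form varies with species richness.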

Biodiversity between communities

The functioning of ecosystems, which underpins biodiversity conservation and ecosystem management, can be better understood by measuring beta diversity. Beta diversity represents the dissimilarity in species composition between sites in a landscape or geographical region (Whittaker 1960). Beta diversity can be estimated by computing diversity indices for each site and testing hypotheses about the environmental factors that may explain the variations existing among sites. An alternative approach involves a direct analysis of the community composition data over the study sites in relation to sets of environmental and spatial variables (Legendre et al. 2005). Statistical methods that partition the variation in the diversity indices or the community composition data among environmental and spatial variables are very useful for accomplishing such tasks (Peres-Neto et al. 2006). Bray–Curtis dissimilarity and Jaccard’s index are two popular statistical measures used by researchers for comparing diversity between two communities (Schroeder and Jenkins 2018). However, these concepts have only sporadically been explored for diversity assessment of microalgae and cyanobacteria.
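
A minimal Python sketch of the two pairwise measures is given below. It uses the standard abundance-based form of Bray–Curtis dissimilarity, Σ|x_i − y_i| / Σ(x_i + y_i), and the presence-absence form of Jaccard's index, shared species divided by total species; the function names and the abundance vectors are hypothetical.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors (0 = identical, 1 = no overlap)."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator


def jaccard_similarity(x, y):
    """Jaccard similarity based on presence-absence: shared species / total species."""
    present_x = {i for i, v in enumerate(x) if v > 0}
    present_y = {i for i, v in enumerate(y) if v > 0}
    return len(present_x & present_y) / len(present_x | present_y)


# Hypothetical abundances of the same three taxa at two sites
site_a = [10, 0, 5]
site_b = [5, 5, 5]
bc = bray_curtis(site_a, site_b)
jac = jaccard_similarity(site_a, site_b)
```

Bray–Curtis responds to differences in abundance, whereas Jaccard only registers whether a taxon is present, so the two can rank site pairs differently.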

Diversity of a landscape

Ecologists term the overall species richness of a landscape or geographical area gamma diversity. According to Whittaker (1972), alpha and beta diversity are its two independent components. However, modern ecologists prefer to use the term landscape diversity, which not only includes total species richness but also takes into account patch diversity, such as patch number, patch shape, landscape fragmentation, patch edge, and the diverse functional aspects of the inhabiting species (Bojie and Liding 1996). Thus, landscape diversity involves a holistic approach to exploring the biodiversity and ecosystem functioning of a geographical area. It is an essential concept for restoring and maintaining the sustainable and resilient features of agricultural landscapes (Schaller et al. 2018). Microalgal and cyanobacterial diversity, which has hitherto been ignored by plant ecologists despite its valuable ecological functions, needs to be given due emphasis in any programme aimed at measuring landscape diversity.

Zeta diversity

Beta diversity, whether derived through the multiplicative or additive partitioning approach, is commonly used by researchers to understand similarity in the species composition of two different sites. However, beta diversity does not present a holistic view of the actual pattern of diversity if the study area involves multiple sites. Therefore, other kinds of relationships, such as the species-area curve, interspecific distribution, and rarity and endemism patterns, are required to understand the phenomenon in totality (Gaston and Blackburn 2000; McGill 2010). Zeta diversity, which quantifies the number of species shared by any given number of sites, captures the full set of these biodiversity components and systematically describes the spatial distribution of multispecies groups. It simultaneously provides information regarding the species-area relationship, multispecies occupancy patterns and the ranking of species endemism. The exponential and power-law expressions of zeta diversity are also capable of yielding information on niche assembly processes. Thus, zeta diversity is regarded as a pertinent measure for providing all-inclusive insights into biodiversity distribution patterns, the processes that regulate them, and their response to changing environmental factors (Hui et al. 2014, 2018). However, this measure of diversity has received meagre attention from researchers working in the area of microbial ecology.
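
Zeta diversity of order i is usually computed as the mean number of species shared by all combinations of i sites, so that order 1 is simply the mean richness per site. A minimal Python sketch under that definition follows; the function name and the site records are hypothetical, and exhaustive enumeration of combinations is only practical for a small number of sites.

```python
from itertools import combinations


def zeta_diversity(sites, order):
    """Mean number of species shared across every combination of `order` sites."""
    shared_counts = [
        len(set.intersection(*combo))      # species common to all sites in this combination
        for combo in combinations(sites, order)
    ]
    return sum(shared_counts) / len(shared_counts)


# Hypothetical species lists from three sampling sites
sites = [
    {"Nostoc", "Anabaena", "Chlorella"},
    {"Nostoc", "Anabaena", "Oscillatoria"},
    {"Nostoc", "Phormidium", "Scenedesmus"},
]
zeta_values = [zeta_diversity(sites, k) for k in (1, 2, 3)]
```

The decline of these values with increasing order (here 3, 4/3 and 1) is the pattern to which the exponential or power-law forms mentioned above are fitted.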

Useful statistical tools

Sampling methods and collection of basic data

The collection of data relating to the abundance of various species is essential for calculating all kinds of biodiversity indices. Such data provide primary information of the community under the study. Since all kinds of further analyses are based on data collected from sampling, it must be done with utmost care. The size and strategical procedure of sample collection need to be decided very carefully keeping in view the statistical concepts and apparent features of the study area. Of the different sampling techniques prescribed by statisticians, such as simple random sampling, stratified sampling, cluster sampling, multi-stage sampling, etc., the most suitable one can be selected.

During the biodiversity estimation of higher plants in forests, grasslands or shrublands, the widely used sampling approaches are quadrat, transect, and plotless methods (Misra 1968; Kershaw 1973; Elzinga et al. 2001). However, a similar standardization of sampling procedure is lacking for microalgae and cyanobacteria. A majority of studies focusing on the biodiversity of microalgae and cyanobacteria do not appropriately describe the sampling procedure used. Some researchers have used a quadrat size of 400 cm² for sampling microalgae from thermal springs (Sompong et al. 2005), while others have advocated a size of 100 cm² without giving any valid reason for this choice (Ikram et al. 2021a). Considering the microscopic size of cyanobacteria and microalgae, these quadrat sizes could be regarded as unreasonably large. The quadrat size should be neither very large nor very small: the former makes the study tiresome, while the latter may provide imprecise results.

For estimating relative abundance and calculating various diversity indices, the primary requisite is to count the number of individuals of each species present in the sampled quadrats. In the case of microalgae and cyanobacteria, individuals can be counted with the help of a haemocytometer or a similar device. However, this is not as simple as it may seem. Because microalgal and cyanobacterial forms may be unicellular or multicellular, deciding what constitutes an individual is often difficult. In the case of unicellular algae and diatoms, each cell can be taken as an individual unit. However, this practice cannot be applied per se to large filamentous algal forms. Earlier researchers preferred to count each cell of a filament as a unit, but the variable length of filaments and their curvature create trouble in haemocytometer-based counting. Statistical tools can play an important role in standardizing such procedures. Lawton et al. (1999) and Olson (1950) suggested some statistical corrections that should be taken into account when counting filamentous algal forms. A definite length of the filament is considered as a unit for small filamentous cyanobacterial forms in which septa are hardly visible (DeNicola et al. 2006; Passy and Larson 2011; Yadav et al. 2018). Colonial and aggregate-forming taxa also make it difficult to decide what counts as an individual in the community. Some researchers have considered a specified area as representing one algal cell unit. However, such a methodology cannot be considered a true representation of the individual share in the community and hence needs biological and statistical justification before generalization.

Sometimes, even a very small amount of a microalgal or cyanobacterial sample may contain so many cells that counting them under the microscope becomes practically difficult. Researchers generally dilute the samples to overcome this problem. Depending on the density of algal cells, some researchers counted 100 to 500 cells per sample (Yadav et al. 2018). However, this is an intuitive choice, and the minimum number of cells that must be counted to represent reasonably the share of an individual species in the community should be justified statistically. A variety of statistical tools are available to help in this context. Generally, enough cells should be counted that the standard error of the data remains < 10% (Gotelli and Ellison 2004).
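
As a rough illustration of how such a counting target might be justified, the sketch below assumes that cells are randomly dispersed in the counting chamber, so that the count of a taxon follows a Poisson distribution and its relative standard error is approximately 1/√n. The Poisson assumption and the function names are illustrative and are not taken from the cited works; clumped, colonial or filamentous forms will usually violate the assumption and may need larger counts.

```python
import math


def relative_standard_error(cells_counted: int) -> float:
    """Relative standard error of a count, ASSUMING Poisson-distributed cells."""
    if cells_counted <= 0:
        raise ValueError("cells_counted must be a positive integer")
    # For a Poisson count n, the standard deviation is sqrt(n),
    # so the relative error is sqrt(n) / n = 1 / sqrt(n).
    return 1.0 / math.sqrt(cells_counted)


def min_cells_for_precision(target_rse: float) -> int:
    """Smallest count whose Poisson relative standard error is <= target_rse."""
    if not 0.0 < target_rse < 1.0:
        raise ValueError("target_rse must lie between 0 and 1")
    return math.ceil(1.0 / target_rse ** 2)


# Hypothetical check of the counting range mentioned in the text.
for n in (100, 200, 500):
    print(n, "cells counted, relative SE =", round(relative_standard_error(n), 3))
```

Under this assumption, about 100 counted cells of a taxon correspond to a relative standard error of about 10%, and the precision applies to the taxon being counted rather than to the pooled count of all taxa.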

Plotless or distance-based sampling techniques, such as point-centred quarter, nearest neighbour and closest individual methods, have been thoroughly worked out in the case of higher plants (Elzinga and Salzer 1998; Hijbeek et al. 2013). These methods are generally applied in forests but can be used in grasslands and shrublands as well. They are used to estimate the density and distribution of plants by considering the average space occupied by an individual in the study area. Plotless techniques have several advantages over quadrat-based sampling: they are usually quick, need less equipment, and do not involve determining or adjusting the size and number of quadrats. However, these techniques have hardly been employed, or adequately modified and optimized, for exploring the microbial diversity of soil ecosystems.

Detecting outliers and errors in collected data

Outliers are recorded values of measurements or observations that are outside the range of the bulk of the data (Gotelli and Ellison 2004). On the other hand, errors are recorded values that do not represent the original measurements or observations (Gotelli and Ellison 2004; Osborne and Overbay 2004). Some, but not all, errors are also outliers. Conversely, not all outliers in a dataset are errors. Detection of errors and outliers in a biodiversity data set is another important task of an ecologist, as they can have a prominent influence on the results of statistical tests by increasing variance in the data (Gotelli and Ellison 2004). Some researchers consider outliers as noise, but outliers may be more than that in biodiversity-based studies. They can reflect key biological functions of species in an ecosystem, and careful consideration of their presence may lead to new hypotheses, ideas, or the discovery of an entirely new species. In some cases, a few data values appear to be outliers just because of the forced normalization of the dataset, and an appropriate transformation of the data can resolve such issues. Three simple techniques can be employed for detecting outliers and errors: calculating column statistics, checking the ranges and precision of column values, and graphical exploration of the data. Simple column statistics are a straightforward way to find unusually low or high values in spreadsheet data. Simple statistical parameters such as the mean, median, standard deviation, and variance provide a quick overview of the range of values in the data set, so that suspicious minimum and maximum values can easily be identified. Most spreadsheet software packages include functions that calculate these values. Further, spreadsheet functions can also be used to check that all the data values in a column are within reasonable boundaries.
Graphical exploratory data analysis (Graphical EDA) is another way to hunt for outliers and errors (Gotelli and Ellison 2004). Three types of graphs namely (i) box plots, (ii) stem-and-leaf plots and (iii) scatter plots are very popular to find out the unexpected trends or patterns in the data sets. The first two are used for the plotting of a single variable. On the other hand, scatter plots are used for bivariate or multivariate data.
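
The column checks described above can be sketched in a few lines. The example below is a minimal illustration, not a prescribed procedure: it computes simple column statistics and flags values lying more than 1.5 interquartile ranges beyond the quartiles, which is the convention commonly used for drawing box-plot whiskers. The function names, the multiplier `k` and the cell counts are hypothetical.

```python
import statistics


def column_summary(values):
    """Simple column statistics for a quick overview of one variable."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "variance": statistics.variance(values),
    }


def flag_outliers_iqr(values, k=1.5):
    """Return values beyond k * IQR from the quartiles (box-plot convention)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical cell counts per quadrat; 950 may be a typing error or a bloom.
cell_counts = [112, 98, 105, 120, 101, 109, 950]
print(column_summary(cell_counts))
print("flagged:", flag_outliers_iqr(cell_counts))
```

A flagged value is only a candidate for inspection. Whether it is an error or a biologically meaningful outlier has to be decided from the field notes and the original records.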

Transformation of biodiversity data

When the purpose is to know how different environmental factors influence the distribution of various genera and species inhabiting a community or landscape along space and time, we need to normalize the whole data set for analyses. Transformation makes this possible by converting the data into a form that is more understandable, more communicable and more appropriate for meeting the assumptions of statistical tests. For transforming data, a mathematical function is simply applied to all the observations of a particular variable (Gotelli and Ellison 2004; Legendre and Legendre 2012). Most transformations are simple algebraic continuous monotonic functions. This valuable tool is also important for exploring variations in the species composition of microalgal and cyanobacterial communities along gradients of diverse physicochemical factors (Sompong et al. 2005; Ikram et al. 2021a, b). Because monotonic functions are involved, a transformation does not change the rank order of the data (a decreasing function such as the reciprocal exactly reverses it) but does affect the variance and the shape of the probability distribution. Transformations are often used to convert non-linear relationships into linear ones, which are more comprehensible. The logarithmic, square-root, reciprocal and Box-Cox transformations are some useful examples.
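
The effect of a monotonic transformation can be illustrated with a small sketch. The abundance values and function names below are hypothetical, and the use of log(x + 1) rather than log(x) is an assumption made here so that zero counts can be handled. The example checks that the rank order of the observations is unchanged while the variance shrinks.

```python
import math
import statistics


def log_transform(values):
    """log(x + 1) transformation; the +1 is an assumption to allow zero counts."""
    return [math.log1p(v) for v in values]


def sqrt_transform(values):
    """Square-root transformation, often applied to count data."""
    return [math.sqrt(v) for v in values]


def rank_order(values):
    """Indices of the observations sorted from smallest to largest value."""
    return sorted(range(len(values)), key=values.__getitem__)


# Hypothetical abundances of one species at five sites (strongly right-skewed).
abundance = [0, 3, 12, 150, 2400]

for name, transformed in (("log(x+1)", log_transform(abundance)),
                          ("sqrt", sqrt_transform(abundance))):
    same_order = rank_order(transformed) == rank_order(abundance)
    print(name, "rank order preserved:", same_order,
          "| variance:", round(statistics.variance(transformed), 2))
```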

Analysis of biodiversity data

Regression

Regression is a powerful tool for studying and modelling the spatial distribution of species in relation to various environmental factors (Gotelli and Ellison 2004). The most basic regression describes the linear relationship between an independent variable and a dependent variable. From a statistician’s point of view, regression and correlation are different tools: the former describes the association between dependent and independent variables on the basis of a cause-and-effect relationship, while the latter deals with variables that are merely associated, without any such cause-and-effect assumption. Although different statistical models have been developed for regression and correlation, some researchers consider the distinction arbitrary and often just semantic. Moreover, environmentalists do not pursue correlations between variables unless they suspect some kind of cause-and-effect relationship. Non-parametric extensions and several other modifications of classical regression are widely used in biodiversity studies for developing a variety of ecological models for different responses, such as species richness, abundance classes, and presence-absence data (Lehmann et al. 2002).
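
The most basic case, a straight line fitted by ordinary least squares, can be sketched as follows. This is only an illustration of the linear model described above; the function name and the pH and species-richness values are hypothetical and are not taken from any cited study.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Coefficient of determination: share of variance in y explained by x.
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return slope, intercept, 1.0 - ss_resid / ss_total


# Hypothetical data: soil pH (independent) and species richness (dependent).
ph = [5.5, 6.0, 6.5, 7.0, 7.5]
richness = [4, 6, 7, 9, 11]
slope, intercept, r2 = fit_line(ph, richness)
print(f"richness = {intercept:.2f} + {slope:.2f} * pH (R^2 = {r2:.3f})")
```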

Multivariate analysis

Multivariate techniques are used for studying the relationship between the distribution pattern of species and environmental parameters. For this, similarity coefficients are calculated, and the data are subsequently classified by clustering or mapped into two- or three-dimensional plots, known as ordination plots. The ordination plots represent the relative dissimilarity of species composition. The arrangement of samples in these plots is based on the distances between pairs of samples of the communities. Points placed near each other on an ordination map are said to have similar communities. Ecologists have attempted to obtain information about species-environment relations from community data obtained in field surveys (ter Braak 1988). The data of biological communities typically show a skewed distribution, and the relationship between the environmental variables and the species is non-linear. Species in a community generally show a unimodal response to the environmental variables (Whittaker 1956, 1967). The clustering and ordination techniques are used to summarize multi-species data into similar clusters or ordination axes, which are then interpreted according to what is known about the environment and the species of the area under study (ter Braak 1988). The interpretation of the relationship between the species and the environment by cluster and ordination techniques has been termed indirect gradient analysis by Whittaker (1967). In contrast, the technique of direct gradient analysis, also put forward by Whittaker (1967) and also known as regression, describes the relationship of environmental variables with the species directly.

The common multivariate techniques used by ecologists are ANOVA, ANOSIM, cluster analysis, principal component analysis, canonical correspondence analysis, and multidimensional scaling. Moreover, multiple correlations can also be applied to find out the interdependency of different environmental variables.

Analysis of variance (ANOVA)

ANOVA is used to test the difference between the means of the studied samples. In ecological studies, it tests the null hypothesis of no difference in mean diversity between the sites (Gotelli and Ellison 2004). The result is said to be statistically significant when the p-value (probability) is less than the chosen significance level (usually 0.05, corresponding to a 95% confidence level). The null hypothesis is then rejected, indicating a difference in mean diversity between the sites. However, ANOVA is not much preferred for testing differences between sites or samples because of the complex interdependence among species and the general inappropriateness of the normality assumption. When species data contain a large number of zero values, transformation to achieve normality is not possible. In such circumstances, other multivariate techniques are applied for the interpretation of the data and for determining the relationship of species with environmental parameters.
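
The test statistic behind a one-way ANOVA is the ratio of the between-group to the within-group mean squares. The sketch below computes this F ratio and its degrees of freedom for a set of groups; the function name and the diversity values are hypothetical. The p-value is deliberately left out because it requires the F distribution, which is available in standard statistical software.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups: list of lists, one list of observations per site or treatment.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of observations around their own group mean.
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)

    df_between, df_within = k - 1, n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical diversity index values from three replicate quadrats at three sites.
site_a = [1.0, 2.0, 3.0]
site_b = [2.0, 3.0, 4.0]
site_c = [5.0, 6.0, 7.0]
print(one_way_anova_f([site_a, site_b, site_c]))
```

The F ratio is then compared with the F distribution for the reported degrees of freedom to obtain the p-value discussed above.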

Analysis of similarities (ANOSIM)

ANOSIM is a non-parametric permutation technique, which has been aptly described by Clarke and Green (1988). It is applied to the (rank) similarity matrix underlying the ordination or classification of samples. It is considered a more valid and informative test because it relies on a large number of replicates and permutations. Interpretations are drawn when there are significant differences between the groups. An R-value is obtained for each pairwise comparison. The pairwise test between samples gives a p-value, which denotes how significantly the samples differ, and an R-value, which shows how strongly they differ from each other. An R-value of 1 indicates complete dissimilarity between communities, while values tending towards 0 reveal a close similarity.
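
A minimal sketch of the calculation is given below. It uses the usual definition of the ANOSIM statistic, R = (mean between-group rank - mean within-group rank) / (M/2), where M = n(n - 1)/2 is the number of sample pairs and the ranks are those of the pairwise dissimilarities, and it obtains the p-value by randomly permuting the group labels. The function names, the number of permutations and the dissimilarity values are hypothetical.

```python
import itertools
import random


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def anosim_r(dissim, labels):
    """ANOSIM R from a square dissimilarity matrix and one group label per sample."""
    pairs = list(itertools.combinations(range(len(labels)), 2))
    ranks = average_ranks([dissim[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if labels[i] == labels[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if labels[i] != labels[j]]
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return (mean_between - mean_within) / (len(pairs) / 2)


def anosim_test(dissim, labels, permutations=999, seed=0):
    """Return (R, p); p is the share of label permutations with R >= observed R."""
    rng = random.Random(seed)
    observed = anosim_r(dissim, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if anosim_r(dissim, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical dissimilarities among four samples from two sites (A and B).
d = [[0.00, 0.10, 0.80, 0.90],
     [0.10, 0.00, 0.70, 0.85],
     [0.80, 0.70, 0.00, 0.20],
     [0.90, 0.85, 0.20, 0.00]]
print(anosim_test(d, ["A", "A", "B", "B"]))
```

With only four samples there are very few distinct ways to regroup the labels, so the p-value cannot become small however large R is. This illustrates why adequate replication matters for permutation tests.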

Multidimensional scaling (MDS)

This technique is based on the similarity or dissimilarity between the samples. Non-metric MDS ordination is a visual display of the pattern of proximities. It utilizes the ranks of the similarities to display the samples in the plot. The most similar samples are placed close together in the ordination plot, while samples lying far apart differ strongly from each other. The goodness of fit in MDS is known as stress. The stress value ranges between 0 and 1, and a value near zero represents a good fit of the model. Ecologists use MDS to illustrate the similarities between samples in a small number of dimensions.

Cluster analysis

Cluster analysis is the method employed for presenting variation amongst communities and samples. It aims to detect the “natural grouping” of samples based on their similarity to each other. When comparing different sites (or subsites of sites), the similarity matrix of species is used to identify the species that co-occur in a similar way across the sites or subsites (Fig. 1). The hierarchical agglomerative method is the most commonly used clustering technique. The samples are successively fused into clusters based on a similarity matrix (such as Bray-Curtis) until a single cluster is formed. The result is generally represented by a dendrogram (i.e., tree diagram) (Clarke and Warwick 2001).

Fig. 1 A schematic representation of cluster analysis
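
The agglomerative procedure can be sketched as follows: Bray-Curtis dissimilarities are calculated between samples, and the two closest clusters are fused repeatedly until a single cluster remains. The choice of group-average linkage, the function names and the abundance data are assumptions made for this illustration; other linkage rules would give different merge distances.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors (0 = identical)."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0


def agglomerate(samples):
    """Hierarchical agglomerative clustering with group-average linkage.

    samples: dict mapping sample name -> list of species abundances.
    Returns the merge history as (cluster_1, cluster_2, dissimilarity) tuples,
    which is the information a dendrogram displays.
    """
    clusters = {name: [name] for name in samples}
    merges = []

    def linkage(c1, c2):
        # Average dissimilarity over all pairs of members (assumed linkage rule).
        dists = [bray_curtis(samples[a], samples[b])
                 for a in clusters[c1] for b in clusters[c2]]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        keys = list(clusters)
        dist, a, b = min((linkage(a, b), a, b)
                         for i, a in enumerate(keys) for b in keys[i + 1:])
        clusters[f"({a},{b})"] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, round(dist, 3)))
    return merges


# Hypothetical abundances of three species at four sites.
sites = {
    "S1": [10, 0, 5],
    "S2": [9, 1, 5],
    "S3": [0, 12, 1],
    "S4": [1, 10, 2],
}
for step in agglomerate(sites):
    print(step)
```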

Principal component analysis (PCA)

PCA is used to reduce the number of variables describing the samples and to identify the most significant ones in a large pool of data. It aims to conserve as much information as possible while reducing multidimensional data to fewer dimensions. Orthogonal transformations are used to convert possibly correlated variables into linearly uncorrelated variables. These linearly uncorrelated variables are known as “principal components” and account for most of the variance in the sample. The first principal component represents the highest possible variance. The ordination plot of PCA shows how closely two factors are associated. PCA plots the transformed data (obtained using eigenvectors) in the ordination plot rather than the real data, because the relationships and patterns among the points cannot be interpreted by plotting the real data (Clarke and Warwick 2001). PCA is better suited to environmental variables than to species data, because ecological data (specifically in the case of microbes) contain many zero counts, whereas environmental variables do not need such special treatment (data normality test).
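
To make the idea of a principal component concrete, the sketch below extracts only the first component from a small table of environmental variables. It builds the covariance matrix and finds its leading eigenvector by power iteration, a simple numerical method chosen here for brevity; statistical packages normally use a full eigen- or singular value decomposition. The function names and data are hypothetical, and variables measured in different units would usually be standardized first.

```python
import math


def covariance_matrix(rows):
    """Return (covariance matrix, mean-centred rows) for samples x variables data."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    return cov, centred


def first_principal_component(rows, iterations=500):
    """Return (loadings, explained variance share, sample scores) for PC1."""
    cov, centred = covariance_matrix(rows)
    p = len(cov)
    # Asymmetric start vector to avoid starting orthogonal to the solution.
    vec = [1.0 / (j + 1) for j in range(p)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            raise ValueError("data have no variance; PCA is undefined")
        vec = [v / norm for v in nxt]
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(p))
                     for i in range(p))
    explained = eigenvalue / sum(cov[i][i] for i in range(p))
    scores = [sum(c[j] * vec[j] for j in range(p)) for c in centred]
    return vec, explained, scores


# Hypothetical environmental data: columns are pH, temperature, nitrate.
env = [[6.1, 22.0, 1.2],
       [6.8, 24.5, 1.9],
       [7.2, 27.0, 2.4],
       [7.9, 29.5, 3.1]]
loadings, share, scores = first_principal_component(env)
print("loadings:", [round(v, 3) for v in loadings])
print("share of variance on PC1:", round(share, 3))
```

The scores, not the raw measurements, are what an ordination plot displays along the first axis.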

Canonical correspondence analysis (CCA)

CCA is used when a large number of species are present in the community and great intrinsic variability may prevail in the system. Ecological data are either quantitative (abundance, i.e., number of individuals) or of the incidence type (presence/absence), and the species-environment relationship is non-linear and non-monotonic. In the light of such characteristics, CCA is more appropriate than traditional linear multivariate techniques (ter Braak and Verdonschot 1995). This method helps ecologists to decipher the response of species to environmental variables or their distribution patterns along environmental gradients. CCA can also be used for examining spatial and seasonal disparities in communities (Snoeijs and Prentice 1989; Bakker et al. 1990; Anderson et al. 1994). In the biplot of CCA (Fig. 2), arrows indicate the quantitative environmental variables. The length of an arrow indicates the importance of the variable, and its direction indicates a positive or negative association with the axis (Abrantes et al. 2006). The angle between vectors indicates the correlation among environmental variables. The locations of site points in the plot represent their compositional similarity to each other, and sites are dominated by the species that are projected near them in the CCA plot. The locations of species points indicate their distributional similarity to each other.

Fig. 2 A hypothetical scheme depicting canonical correspondence analysis (CCA) between the environmental components and the species present at the study sites

Multiple correlations

It is a statistical technique applied to estimate the correlation and interdependency of different environmental parameters. The effect of different factors on one factor and the strength of the relationship between them can be inferred by the multiple correlation method. A strong correlation represents a prominent effect of different factors on a single factor. Conversely, a poor correlation reveals that the effect of other factors on the factor under consideration is unimportant.
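
For the simplest case of one factor explained by two others, the multiple correlation coefficient can be calculated from the three pairwise Pearson correlations as R² = (r_y1² + r_y2² - 2 r_y1 r_y2 r_12) / (1 - r_12²). The sketch below implements this two-predictor formula only; the function names and the environmental values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def multiple_correlation(y, x1, x2):
    """Multiple correlation R of y with two predictors x1 and x2."""
    r_y1, r_y2, r_12 = pearson_r(y, x1), pearson_r(y, x2), pearson_r(x1, x2)
    if abs(r_12) == 1.0:
        raise ValueError("x1 and x2 are perfectly collinear")
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    # Guard against tiny floating-point excursions outside [0, 1].
    return math.sqrt(min(1.0, max(0.0, r_squared)))


# Hypothetical measurements at six sites.
nitrate = [1.2, 1.9, 2.4, 3.1, 3.3, 4.0]       # factor under consideration
ph = [6.1, 6.8, 7.2, 7.9, 7.7, 8.3]            # first explanatory factor
temperature = [22.0, 21.5, 25.0, 24.0, 28.5, 27.0]  # second explanatory factor
print(round(multiple_correlation(nitrate, ph, temperature), 3))
```

A value near 1 corresponds to the strong joint effect described above, and a value near 0 to a negligible one.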

Software packages

Nowadays, a variety of statistical software packages are available, namely, ANALYTICA, IBM-SPSS, STATISTICA, STATA, SIGMA PLOT, MATLAB, OriginPro, XLSTAT, R, BIOTA, CANOCO, PAleontological STatistics (PAST) and PC-ORD. Of these, some packages are used exclusively for mathematical and statistical operations, while others are meant for biodiversity assessment. Researchers have employed different software for determining microalgal and cyanobacterial diversity (Omelon et al. 2007; Barinova et al. 2011; Zhan and Sun 2012; Kühl et al. 2012; Roy et al. 2015; Schulz et al. 2016; Gaikwad et al. 2016; Mogul et al. 2017; Zhang et al. 2021). The development of software packages has made the use of statistical tools very easy and comfortable. However, considerable caution needs to be exercised while using such software, as ignorance of basic statistical concepts may lead to incorrect interpretation of the results as well as to an unsound experimental design. Hence, care should be taken while employing a statistical design or analyzing results through software packages. It would be still better if a proficient statistician were directly involved right from the time of designing the experiment. Some software packages are not user-friendly because of their complex ways of data entry. Table 3 lists the advantages and drawbacks of some statistical packages. Since the selection of statistical software depends mainly on its user-friendly features, the focus should now be on developing subject-specific statistical software packages for the assessment of microbial biodiversity.

Table 3 Characteristics of some commonly used statistical software packages

Conclusions and future perspectives

Microalgae and cyanobacteria are extremely important for maintaining the vitality of agroecosystems. However, they have remained a neglected component in biodiversity studies. The concept of functional diversity has several merits, but it has not been effectively explored for microalgal and cyanobacterial communities. A majority of previous research on microalgae and cyanobacteria has focused on identifying species and phylogenetic diversity. However, the species concept and the criteria for taxonomic identification of a new species in microalgae and cyanobacteria are very confusing. Thus, this particular aspect demands precise and critical efforts. Landscape diversity assessment is imperative for restoring and maintaining the sustainable and resilient features of agricultural landscapes. Thus, any programme developed to estimate landscape diversity needs to include work components for measuring microalgal and cyanobacterial diversity. A majority of studies dealing with microalgal and cyanobacterial diversity do not appropriately describe the details of the sampling procedure. Detection of errors and outliers in a biodiversity data set is crucial for gaining a better insight into it and for performing various statistical analyses. While regression and multiple correlations help in understanding the relations among different environmental factors, ANOVA, ANOSIM, MDS and cluster analysis are powerful techniques for understanding the compositional differences among microbial communities. PCA and CCA are effective in interpreting the influence of environmental factors on the distribution of microalgal and cyanobacterial species in a study area. Statistical software packages are the backbone of current research activities. Some of the available software packages are simple to use, while others involve operational complexities. Thus, before employing such software programmes in biodiversity-based investigations, gaining a sound understanding of them is strongly recommended. As the selection of statistical software depends mainly on its user-friendly features, it is high time to shift the major focus to developing biodiversity statistical packages specific to microorganisms.