Introduction

Freshwater exotic species are an issue of growing management concern (Vander Zanden and Olden 2008). Their impacts on natural ecosystems are among the most harmful and least reversible, as they can change the local fauna and flora all around the world (Vitousek et al. 1996; Ricciardi and MacIsaac 2000). Exotic species can lower ecological quality through changes in the biological, chemical and physical properties of aquatic ecosystems (Olenin et al. 2007). On top of this, invasive species cause socio-economic damage, which is another important issue.

Azolla filiculoides (Lam.) is an alien species that is already widely distributed in tropical, sub-tropical and warm temperate regions, particularly in Southeast Asia (Sweet and Hills 1971). It is a fast-growing fern with a doubling time of 2–5 days (Lumpkin and Plucknett 1982; Zimmerman 1985; Van Hove and Lejeune 2002; Taghi-Ganji et al. 2005). Azolla is a unique species among free-floating plants. Although the scientific literature mentions various benefits of these floating ferns (e.g. nitrogen fixation and the use of Azolla as green fertilizer), they are considered troublesome weeds in invaded ecosystems.

A. filiculoides (Barreto et al. 2000) is a particular example of an alien species in many countries, including Iran, where it occurs in Anzali wetland in the north of the country (Delnavaz and Ataei 2009). This floating water fern was intentionally brought from the Philippines by the Ministry of Agriculture for its nitrogen-fixation capacity. Since then, it has spread to many wetlands in northern Iran, particularly the Anzali wetland complex (JICA 2005). Due to its massive spread, some native macrophytes such as duckweed (Lemna minor) were completely wiped out and others are on the verge of extinction (JICA 2005).

For wetland restoration and conservation management, it is very important to understand the habitat preferences of exotic aquatic species. It is therefore necessary to examine the relationship between wetland characteristics and the habitat requirements of exotic species. To achieve this goal, suitable ecological methods are needed to predict the habitat preferences of the target invasive species successfully. Predictive modelling is one of the most important steps in the development of a standard habitat assessment protocol (Parsons et al. 2004). Ecological modelling can allow for the integration of physical, chemical and biological characteristics into measures, rather than just observations of causes and effects (Goethals and De Pauw 2001).

In recent years, various modelling techniques such as artificial neural networks (ANNs) (Robert 2003; Zhengfu and Fernando 2007; Gooyong et al. 2014), evolutionary polynomial regression (Elshorbagy and Baroudy 2009; Giustolisi and Savic 2009; Savic et al. 2009), classification trees (CTs) (Dzeroski et al. 2000; Dakou et al. 2007; Zarkami et al. 2010, 2014) and support vector machines (SVMs) (e.g. Zarkami 2011) have been shown to be very powerful methods for assessing the habitat requirements of organisms. CTs (Quinlan 1993) are powerful tools that can predict different characteristics such as presence/absence, biomass and abundance of various kinds of aquatic organisms. They are particularly useful for developing ecological data mining methods when datasets are limited (Goethals et al. 2007).

CTs can give insight into complex, unbalanced, non-linear ecological data where commonly used exploratory and statistical modelling techniques often fail to find meaningful ecological patterns (De’ath and Fabricius 2000). They yield reliable classifications with a transparent set of rules (Hoang et al. 2010). Due to their transparency and flexibility in use, CTs have recently gained in popularity (Hoang et al. 2010; Zarkami et al. 2012; Haghi Vayghan et al. 2015) and have been applied in a variety of ecological studies (Dakou et al. 2007; Everaert et al. 2011; Zarkami et al. 2012).

SVMs (Vapnik 1995) are also a powerful method, which can be trained with a sequential minimal optimization (SMO) algorithm for support vector classifiers using kernel functions (Platt 1998; Keerthi et al. 2001). They consist of input and output layers connected with weight vectors. SVMs maximise the margin around a hyperplane that separates two classes by mapping the input space into a high-dimensional feature space. The mapping is determined by a kernel function such as a polynomial kernel. SVMs have been applied successfully to assess different types of organisms, e.g. macro-fauna community types (Akkermans et al. 2005) and pike (Esox lucius) (Zarkami et al. 2012).

Multivariate analysis comprises a set of techniques dedicated to the analysis of data sets with more than one variable. This makes multivariate techniques suitable for analyzing ecological data, which comprise a number of environmental and biological variables. Principal component analysis (PCA) is a multivariate data analysis technique that is often used in different fields. In general, this method finds hypothetical variables (components) accounting for as much as possible of the variance in the multivariate data (Davis 1986; Harper 1999). Discriminant analysis (DA) provides a discriminant test for two or more groups (the latter case is sometimes called canonical variates analysis). It aims to find a transformation of input variables into latent variables (features) with maximum class separability (Fukunaga 1990). A scatter plot of specimens along the first two canonical axes produces maximal and second to maximal separation between all groups. The axes are linear combinations of the original variables as in PCA, and eigenvalues indicate the amount of variation explained by these axes.

Since exotic Azolla may threaten the rich original biological diversity, assessing its habitat preferences can help to restore, conserve and manage wetland ecosystems.

The present study primarily aims to compare the reliability of the applied models (CTs and SVMs) using two performance criteria (CCI% and k). This is important for deciding upon the most influential predictors driving the habitat preferences of the target exotic species in the wetland. The main aim of the present research is to analyze the habitat preferences of A. filiculoides in Selkeh Wildlife Refuge (a protected area in Anzali wetland, north of Iran) using these two data-driven techniques. These methods were chosen because both can be used when datasets are small. Finally, the present paper applies multivariate techniques (PCA and DA) in order to assess the most important variables for the occurrence of Azolla in the wetland.

More specifically, this type of research will be useful for monitoring the most important variables for the target species in the future.

Materials and methods

Description of study area

Anzali wetland complex is ecologically and internationally known as an important wildlife refuge, particularly for a large number of birds. The wetland comprises three important areas: Siahkeshim Protected Area, Sorkahnkol Wildlife Refuge and Selkeh Wildlife Refuge (the study area). This international wetland consists of large, shallow, eutrophic freshwater lagoons (eutrophic because of excessive nutrient input into this part of the wetland), shallow impoundments, marshes and seasonally flooded grasslands in the southwest Caspian lowlands. The wetland supports various species of fauna: it is a breeding and wintering area for 77 migratory bird species (Mansoori 1995), a nursery and spawning habitat for 49 fish species (Abbasi et al. 1999), and a habitat for 31 species of mammals which reside in the Anzali watershed (JICA 2005). More than 100 plant species occur in the wetland. Three types of plant communities (JICA 2005) are commonly found throughout the wetland: submerged (e.g. Ceratophyllum demersum), floating (e.g. A. filiculoides and Nymphaea odorata) and emergent (e.g. Phragmites australis). The whole area of the lagoon is covered by the submerged plant community. Some species of submerged plants such as C. demersum are very useful in Anzali wetland since they can accumulate the highest concentration of heavy metals (Ahmad et al. 2016). A. filiculoides forms a dense mat that covers approximately a quarter of Anzali wetland. Only the lagoon is spread of A. filiculoides. Phragmites is found in roughly a quarter of the wetland except for the lagoon. This kind of community generally lives in the shallow area of the eastern wetland and is widely distributed.

Various point and non-point sources of pollutants enter Anzali wetland. They originate from direct discharge of sanitary wastewater (produced by coastal cities), direct discharge of industrial wastewater (with little or no treatment), and agricultural activities in the surrounding areas. Pollutant loads negatively affect water quality and degrade structural characteristics of the wetland.

Figure 1 shows the study area (Selkeh Wildlife Refuge), which is an important refuge for migratory birds in Anzali wetland. Selkeh has a total area of 360 ha and is located between latitudes 37°22′58″ and 37°26′51″N and longitudes 49°27′09″ and 49°28′30″E. Two rivers enter this area, namely Hendekhaleh and Trabkhaleh, both discharging various types of pollutants (e.g. untreated domestic, industrial and agricultural wastewaters) into this protected area (JICA 2005). Rice is one of the main crops cultivated in the surrounding area. Application of fertilizers, pesticides and herbicides in the paddy fields has caused eutrophication in this aquatic ecosystem. Moreover, various industrial and urban wastes have negatively impacted the wetland (JICA 2005).

Fig. 1

(Source Guilan Environment Protection Bureau 2007)

Location of the sampling sites in Selkeh Wildlife Refuge, south of Anzali wetland, northern Iran

Sampling sites and study design

Seven sampling sites were sampled monthly over a 1-year study period, in 12 sampling campaigns (from October 2007 to September 2008), yielding a dataset of 84 instances (7 sites × 12 campaigns).

The sites were selected on the basis of variations in natural characteristics (e.g. drying out of the wetland in the dry season), anthropogenic influences (e.g. from domestic, industrial and non-point sources), sampling feasibility (e.g. accessibility for boat transport) and ecological considerations (e.g. distribution of Azolla over the entire wetland). Systematic sampling (as described by Buckland et al. 2007) was used to take the samples at fixed intervals along the length of the wetland. To do so, two parallel transects, spaced at 500 m intervals, were run across the west-east gradient of the wetland (as illustrated in Fig. 1). The first transect consisted of sampling points 1, 3 and 4, whereas the second transect comprised sampling points 5, 7 and 6. Since point 1 was located in the river, one additional sampling point (no. 2) was placed between points 1 and 5 in order to obtain more wetland-related information. As a whole, the seven sites differed completely from each other in ecological and geographical conditions. Sampling site 1, located near the river, had no direct connection to the other sampling sites. The other six sites (the sites of the first and second transects), however, were located in the wetland and were separated from each other by a geographical barrier such as a ridge. The sites located within each transect were also independent of each other. For instance, in the first transect, one site was surrounded by reeds (e.g. Phragmites australis and Typha latifolia), another was located in the open area and the third was in a location dominated by Trapa natans. The independence of sites was also taken into account for the second transect.

As stated above, the distance between sampling sites was approximately 500 m. The exact geographic location of each sampling site was determined using a GPS (GARMIN, etrex Vista). The selected sites were considered the optimal sampling design for measuring the biotic (cover percentage of the exotic fern) and abiotic variables in the wetland. Other parts of the wetland were not considered for site selection for several reasons. Dense cover of some floating (e.g. Nelumbium nuciferum) and emergent plants (e.g. Phragmites spp. and Typha spp.) made boat passage through the wetland impossible. The shallow depth of the wetland (particularly in dry seasons) intensified the problem of boat transport. On top of this, sampling was impossible in some parts of the wetland because they dried out in dry seasons.

At each sampling site, a set of wetland characteristics including water quality and physical-habitat variables was measured (Table 1). Since the data-mining techniques applied in the present study (CTs and SVMs) are less affected by multicollinearity (high correlation between two or more predictor variables), all variables were used as inputs to these techniques. For the multivariate techniques (PCA and DA), one variable of each highly correlated pair was eliminated because it had no added value for analyzing the habitat preferences of the exotic fern. In total, 12 variables were ultimately used for the multivariate techniques.

Table 1 Summary statistics (minima, maxima, median, average and standard deviation) of the input variables used for modelling habitat requirements of A. filiculoides in seven sampling sites of Selkeh Wildlife Refuge (Anzali wetland, northern Iran) over a 12 months’ study period (2007–2008) (n = 84)

Among the abiotic variables, it was assumed that air temperature and sunlight (the number of sunny hours) might stimulate Azolla’s growth in the study area. As shown in Fig. 2, the highest air temperature and the maximum number of sunny hours were recorded in July and June, respectively.

Fig. 2

The trend of A.T °C (air temperature), sunlight (the number of sunny hours) and depth in the Selkeh Wildlife Refuge (Anzali wetland, north of Iran)

The water quality samples were measured on a monthly basis near the wetland surface. In order to have a regular measurement, the samples were taken in the middle of each month. The chemical variables were analyzed based on standard methods (as described by APHA/AWWA/WEF 1998).

The physical-habitat variables were recorded monthly by the Bandar-E Anzali weather station, northern Iran (latitude 37°28′21″N and longitude 49°27′43″E). The water depth of the wetland was measured using a yardstick.

Data on A. filiculoides (cover percentage) were used as the response variable for all methods. This variable was sampled simultaneously with the abiotic data each month. It was measured using percentage cover classes based on a modified Braun-Blanquet cover-abundance scale (Sumners and Archibold 2007). Cover percentage of the invasive species was measured using a 1 m² quadrat. In order to obtain an accurate estimate of cover percentage, the quadrat was divided into a grid of 100 squares. At each sampling site, three quadrats were sampled, and their average (observed cover percentage) was taken as the final estimate. Depending on the extent of occurrence in the sampling sites, and to ease interpretation of the results, the cover percentage of A. filiculoides was divided into three ordinal classes. The numbers of observations in the low, medium and high classes were 40, 22 and 22, respectively.

Data processing and analysis methods

Before applying the data-driven and multivariate techniques, the data were first tested for normality. The data of some variables such as SO₄²⁻, NH₄⁺, organic nitrogen and o-PO₄³⁻ were not normally distributed (the outliers of these variables are shown in the box-and-whisker plots). Since the outcomes of classification tools such as CTs and SVMs are not strongly influenced by wide ranges of data values, no transformation of the data was applied for these models. In contrast, log-transformation was applied only for the multivariate techniques used in the present paper.

The Pearson correlation (r) was used to find the correlation between pairs of input variables and also between input variables and cover percentages of the target species. No collinear variables were dropped for the data mining techniques, while the removal of one of the highly correlated variables was only considered for the multivariate techniques.
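The correlation step can be illustrated with a minimal sketch of the Pearson coefficient. The function name and the monthly values below are hypothetical and serve only to show the calculation; they are not the measurements of this study.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values (NOT the study data): air temperature vs. cover percentage
air_temp = [8.0, 10.0, 14.0, 19.0, 24.0, 27.0]
cover = [5.0, 8.0, 20.0, 35.0, 60.0, 70.0]
r = pearson_r(air_temp, cover)
print(f"r = {r:.2f}")
```

Applied to every pair of input variables, such a function gives the correlation matrix from which highly correlated pairs can be identified before the multivariate analyses.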

Then data mining techniques (CTs and SVMs), as implemented in the WEKA software (Witten et al. 2011, version 3.6.6), were used as the main models to analyze the habitat preferences of A. filiculoides. The output variable for the data-driven methods was the cover percentage of A. filiculoides, whereas the input variables were a set of water quality and physical-habitat characteristics of the wetland. First, CTs were applied using all input predictors. For the training and validation of CTs, cross-validation with different numbers of folds (from 2 to 10) was tested in order to get a reliable estimate of the model error and to avoid overfitting. On this basis, CTs were most stable with 3-fold cross-validation, which gave the highest agreement between predicted and observed values. J48 trees were then induced with different pruning intensities by setting the pruning confidence factor (PCF) to 0.01, 0.10, 0.25 and 0.50.
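The fold-testing procedure can be sketched as follows. This is only an illustration of how 84 instances may be partitioned into k folds for k = 2 to 10; the function name, the fixed random seed and the simple (unstratified) shuffling are assumptions and do not reproduce the exact partitioning used by the modelling software.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Return k (train, test) index splits from a shuffled partition of the data."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed seed, for reproducibility only
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


# 84 instances, as in the dataset described above; fold counts from 2 to 10
for k in range(2, 11):
    test_sizes = [len(test) for _, test in k_fold_indices(84, k)]
    print(k, test_sizes)
```

Each instance is used for testing exactly once per setting, so the error estimate for a given k is based on all 84 instances.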

As for CTs, the output and input variables of the SVMs were the cover percentage and wetland characteristics, respectively. In the present research, the polynomial kernel was used. The parameter settings were default values except for the exponent of the polynomial kernel (exponents from 1 to 10 were tested) (Witten et al. 2011).
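The role of the exponent can be illustrated with a small sketch of a polynomial kernel. The homogeneous form K(x, y) = (x · y)^p without lower-order terms is assumed here, and the two input vectors are hypothetical; the exact kernel variant and any input normalisation applied in the study are not specified.

```python
def polynomial_kernel(x, y, exponent=1):
    """Homogeneous polynomial kernel K(x, y) = (x . y) ** exponent (assumed form)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot ** exponent


# Two hypothetical, already-scaled feature vectors (NOT the study data)
site_a = [0.2, 0.5, 0.9]
site_b = [0.4, 0.1, 0.7]

# Exponent 1 reduces to the plain dot product, i.e. a linear SVM
for p in range(1, 11):
    print(p, polynomial_kernel(site_a, site_b, exponent=p))
```

With an exponent of 1 the decision boundary is a hyperplane in the original input space; higher exponents allow increasingly curved boundaries.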

The percentage of correctly classified instances (CCI%) and Cohen’s kappa statistic (k) (Cohen 1960) were used to assess the two techniques. Both criteria were based on the confusion matrix (observed vs. predicted values) (Table 2). The predictive performance of both techniques was identified from the true positive (TP = a), false positive (FP = b), false negative (FN = c) and true negative (TN = d) cases obtained from each model (Fielding and Bell 1997) (Table 3). The degree of agreement in k was based on the following ranges (Landis and Koch 1977): ≤0 (poor); 0–0.20 (slight); 0.20–0.40 (fair); 0.40–0.60 (moderate); 0.60–0.80 (substantial) and 0.80–1 (almost perfect). Models with CCI > 70% and k > 0.40 were considered reliable.

Table 2 The confusion matrix as a basis for the performance measures with true positive values (TP), false positives (FP), false negatives (FN) and true negative values (TN)
Table 3 Measures of predictive accuracy calculated from a confusion matrix
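The two performance criteria can be computed from a confusion matrix as in the following minimal sketch. The matrix is hypothetical: only its row totals (40, 22 and 22 observations for the low, medium and high classes) follow the class sizes reported above, while the individual cell counts are invented for illustration. Assigning each boundary value of the Landis and Koch ranges to the lower category is also an assumption.

```python
def cci_percent(matrix):
    """Percentage of correctly classified instances (diagonal sum / total)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * correct / total


def cohen_kappa(matrix):
    """Cohen's kappa, (p_o - p_e) / (1 - p_e), for a square confusion matrix."""
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    p_observed = sum(matrix[i][i] for i in range(n_classes)) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(row[j] for row in matrix) for j in range(n_classes)]
    # Agreement expected by chance from the marginal totals
    p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement equals 1")
    return (p_observed - p_expected) / (1.0 - p_expected)


def agreement_label(kappa):
    """Landis and Koch (1977) categories; boundary handling is an assumption."""
    if kappa <= 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical matrix: rows = observed (low, medium, high), columns = predicted
confusion = [[33, 5, 2],
             [6, 12, 4],
             [1, 5, 16]]
cci = cci_percent(confusion)
kappa = cohen_kappa(confusion)
reliable = cci > 70 and kappa > 0.40  # reliability thresholds stated above
print(f"CCI = {cci:.1f}%, k = {kappa:.2f} ({agreement_label(kappa)}), reliable: {reliable}")
```

Unlike CCI, kappa corrects for the agreement expected by chance, which matters here because the three cover classes are not equally frequent.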

Paired Student’s t-tests (a two-tailed test with a 95% confidence interval) were conducted for the comparison of the predictive performance of the two applied methods.
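A paired comparison of this kind can be sketched as below. The ten CCI values per method are hypothetical and do not come from the study, and pairing the two methods run by run is an assumption about the design. The critical value 2.262 is the standard two-tailed value of Student's t for 9 degrees of freedom at the 5% level.

```python
import math


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for a paired Student's t-test."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var_d == 0:
        raise ValueError("t is undefined when all differences are identical")
    return mean_d / math.sqrt(var_d / n), n - 1


# Hypothetical CCI% values from ten paired runs (NOT the study results)
svm_cci = [82.1, 79.8, 85.7, 83.3, 78.6, 84.5, 81.0, 86.9, 80.9, 83.4]
ct_cci = [72.0, 71.4, 72.6, 73.8, 70.2, 72.6, 71.4, 73.8, 71.4, 72.6]

t_stat, df = paired_t_statistic(svm_cci, ct_cci)
T_CRIT_DF9 = 2.262  # two-tailed critical value, alpha = 0.05, df = 9
significant = abs(t_stat) > T_CRIT_DF9
print(f"t = {t_stat:.2f}, df = {df}, significant at 5%: {significant}")
```

The same calculation would be repeated for k to compare the two methods on both criteria.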

Multivariate techniques (PCA and DA) were used to examine the environmental variables affecting the prevalence of Azolla in the study area. Both techniques were run in the PAST program package (Paleontological Statistics, version 3) (Hammer 2013). PCA based on the first and second components was used to determine the most important variables for the occurrence of the exotic fern in the Selkeh Wildlife Refuge. DA was performed on the specimens grouped into the three Azolla cover classes to find the relation between cover class, season and sampling site in the study area.

Results

Correlation analysis

The correlation analysis showed that Azolla’s cover percentage was strongly and positively correlated with the number of sunny hours (r = 0.79; p < 0.01) and air temperature (r = 0.86; p < 0.01). Accordingly, the highest and lowest cover percentages of Azolla were observed in the dry (e.g. summer) and wet (e.g. winter) seasons, respectively. The correlation analysis of wetland characteristics also showed that some chemical variables were strongly correlated. For instance, high correlations were found between total nitrogen and NH₄⁺ (r = 0.71; p < 0.01) and between total nitrogen and organic nitrogen (r = 0.73; p < 0.01).

Among all variables recorded in the wetland, mild and extreme outliers were mainly found in water quality variables such as SO₄²⁻, NH₄⁺, organic nitrogen and o-PO₄³⁻ (Fig. 3). As illustrated there, the outliers lie in the upper part of the box plots. Most of these values were probably accurate measurements, reflecting contamination of the sites by these nutrients.

Fig. 3

Box-and-whisker plots (showing the minimum, median, maximum, lower quartile (Q1 = 0.25 (N + 1)) and upper quartile (Q3 = 0.75 (N + 1))) representing the distribution ranges of the water quality variables recorded in the Selkeh Wildlife Refuge (n = 84) (Org-N organic nitrogen)

Comparison of predictive performances of CTs and SVMs

Among the nine cross-validation settings tested (2- to 10-fold), the best predictive performance of CTs for Azolla was obtained with 3-fold cross-validation. This setting was therefore used for testing different PCF levels on the entire dataset. The outcomes of the CTs at the four pruning levels are presented in Table 4. As seen there, the pruning level affected the number of leaves, the tree size and the performance criteria.

Table 4 The predictive outcomes of J48 for Azolla at Selkeh Wildlife Refuge (Anzali wetland, northern Iran) with four pruning confidence factors (PCFs)

However, the predictive results were relatively stable over the four PCF levels (0.50, 0.25, 0.10 and 0.01). The average CCI for the habitat needs of Azolla (CCI = 71 ± 1.40%) indicated that more than 70% of instances were correctly classified, so the CCI met the reliability threshold. The average k was also reliable since it exceeded the threshold value of 0.40 (k = 0.55 ± 0.02).

In addition to predictive performance, the number of leaves and the size of each tree were considered in order to assess the complexity of the induced model. A very complex tree was constructed at PCF 0.50: its number of leaves (12) and tree size (23) made it difficult to use for interpreting Azolla’s habitat requirements. Stronger pruning made the habitat requirements of the species easier to understand. At PCF 0.01, by contrast, only five leaves (with a tree size of 9) were constructed, but this level was not considered for interpretation because it provided less information on Azolla in the sampling sites. PCF 0.10 was less reliable in terms of the two evaluation criteria used here and was therefore not used for interpretation either. Although there was no significant difference between the four pruning levels, PCF 0.25 was finally selected as the optimal pruning level since it gave the highest predictive results (CCI = 72.00 ± 0.65% and k = 0.60 ± 0.01). This value was therefore used for subsequent model application and evaluation of the occurrence of Azolla in the sampling sites. The confusion matrix (Fig. 4a) presents the performance of the induced tree by visualizing the distribution of the instances around the diagonal of the matrix. About 72% of instances were correctly classified and the remaining 28% were misclassified.

Fig. 4

a, b Confusion matrices of J48 (a) (PCF 0.25) and SMO (b) (exp = 1) for Azolla’s cover percentage classes

For SVMs, the best predictive results (according to the two criteria) were obtained with 4-fold cross-validation. This setting was then used to test different exponents (from one to ten). The results showed that the performance of SVMs was sensitive to the exponent. The best predictive outcomes were obtained using an exponent of one (i.e. a linear support vector machine). Higher exponents of the polynomial kernel increased the risk of over-fitting and lowered the predictive performance of the models. SMO with exp = 1 was therefore run ten times after randomisation in order to check for robustness and reproducibility.

The model outcomes showed that the habitat preferences of Azolla were successfully predicted by SVMs (CCI = 82.10 ± 3.35% and k = 0.72 ± 0.05). As illustrated in the confusion matrix (Fig. 4b), more instances were correctly classified with the SVMs than with the CTs.

The paired Student’s t-tests (p value = 0.001) showed a significant difference between the CTs and SVMs for the two predictive criteria: the SVMs performed significantly better than the CTs for both. In other words, a better predictive result was obtained for A. filiculoides with the SVMs.

Rule induction with the CTs (J48 algorithm)

In total, among all variables introduced to the CTs, only nine predictors (four physical-habitat and five water quality variables) were selected by the induced tree as the most important predictors of the occurrence of Azolla (Fig. 5). As shown there, the tree was first split into two portions based on air temperature. The left side of the induced tree relates to the wet seasons (when air temperature drops below 12 °C), while the right side relates more to the dry seasons, when air temperature gradually increases. This means that air temperature was the most important driving predictor for the habitat needs of Azolla in the wetland. The aquatic fern appeared to prefer a roughly medium air temperature (12 °C) for its growth in the Anzali wetland (mean A.T °C ± SD = 16.18 ± 6.78 °C). When air temperature fell below 12 °C, relative humidity played the key role in its occurrence in this valuable ecosystem. At higher humidity (>82%), the prevalence of Azolla was low, while at relative humidity ≤82%, total suspended solids (TSS) concentration became more important. Here, Azolla showed a low prevalence in the sampling sites when TSS concentration exceeded 8 mg/l, and a medium prevalence when TSS was ≤8 mg/l.
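The rules of the left (low-temperature) side of the tree, as described in the paragraph above, can be written out as a short sketch. The function name is hypothetical, and assigning an air temperature of exactly 12 °C to the right side of the tree is an assumption, since only "below 12 °C" is described. The right side of the tree is not encoded here.

```python
def predict_cover_low_temperature(air_temp_c, humidity_pct, tss_mg_l):
    """Cover class from the low-temperature branch of the tree, or None otherwise.

    Thresholds follow the rules described in the text; treating exactly
    12 degrees C as belonging to the other branch is an assumption.
    """
    if air_temp_c >= 12:
        return None  # handled by the right side of the tree (not encoded here)
    if humidity_pct > 82:
        return "low"
    if tss_mg_l > 8:
        return "low"
    return "medium"


# Hypothetical winter conditions (NOT a record from the dataset)
print(predict_cover_low_temperature(air_temp_c=9.5, humidity_pct=80, tss_mg_l=6))
```

Written this way, the transparency of the tree is apparent: each prediction can be traced to at most three threshold comparisons.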

Fig. 5

J48 (PCF 0.25) for predicting the habitat preferences of A. filiculoides in Selkeh Wildlife Refuge (COD chemical oxygen demand, TSS total suspended solids; L, M and H represent low, medium and high cover percentage of A. filiculoides, respectively). Values between brackets in the rectangles indicate instances in which rules are true/false

Depth of the water layer was also of great importance in the study area. This variable followed air temperature in the tree, confirming its high ecological significance in the wetland. All other variables were related to water depth. The minimum and maximum depths recorded in Selkeh Wildlife Refuge were 0.30 and 1.20 m, respectively (mean depth ± SD = 0.53 ± 0.20 m). Apparently, Azolla preferred a medium depth (0.50 m) for its growth in the wetland. In the deeper parts of the wetland (>0.50 m), Azolla cover depended only on water quality variables (such as organic nitrogen and chemical oxygen demand). Increasing organic nitrogen concentration led to a medium prevalence of this species. In addition, rising chemical oxygen demand in the wetland (>35 mg/l) would lower Azolla distribution, and vice versa. In shallow water (<0.50 m), the amount of sunlight contributed to the prediction. The invasive aquatic fern might expand its growth in the wetland when sunlight exceeded 207.8 sunny hours per month. When sunlight dropped below 207.8 h, the occurrence of Azolla became dependent on orthophosphate and nitrate concentrations, in that order. If orthophosphate concentration was higher than 0.08 mg/l, the invasive fern showed a medium prevalence. Below this sunlight threshold, nitrate also played an important role: Azolla tended to show a high prevalence when nitrate concentration fell below 0.24 mg/l.

Weighing attributes using SMO

The SVM models assigned a different attribute weight to each input variable. Variables with an absolute weight >0.50 were considered very important predictors. The depth of the wetland, dissolved oxygen, sulphate, air temperature, the number of sunny hours, humidity, orthophosphate, biological oxygen demand and TSS played an important role in assessing the habitat needs of Azolla in the study area (each with an absolute weight >0.50). Salinity, ammonium and total nitrogen made an intermediate contribution to the prediction (0.40 < attribute weight < 0.50). Other variables (number of freezing days, total dissolved solids (TDS), organic nitrogen, potassium and total phosphate) had less effect on the growth of this invasive species (each with a weight <0.10). Based on the attribute weights provided by the SVMs, most habitat variables (except freezing days) were more important than the water quality variables.

PCA

To examine the impact of spatio-temporal patterns on the habitat conditions of Azolla, a PCA biplot (using log-transformed values) was produced from 12 environmental variables measured at the different sampling sites in dry and wet seasons (Fig. 6). The distribution of the samples along the first and second principal components showed a clear seasonal pattern, with dry-season and wet-season samples separated from each other. The first component describes 51.3% (eigenvalue 0.35) and the second component 22% (eigenvalue 0.15) of the total variation.
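The relation between eigenvalues and explained variation can be checked with a short sketch. It uses only the two eigenvalues and percentages reported above; the remaining eigenvalues are not reported, so the total variance is inferred rather than taken from the analysis. The function names are hypothetical.

```python
def explained_percent(eigenvalues):
    """Percentage of total variance carried by each component."""
    total = sum(eigenvalues)
    return [100.0 * ev / total for ev in eigenvalues]


def implied_total_variance(eigenvalue, percent):
    """Total variance implied by one eigenvalue and its explained percentage."""
    return eigenvalue / (percent / 100.0)


# Values reported for the first two principal components
total_from_pc1 = implied_total_variance(0.35, 51.3)
total_from_pc2 = implied_total_variance(0.15, 22.0)
print(f"implied total variance: {total_from_pc1:.3f} (PC1), {total_from_pc2:.3f} (PC2)")
```

Both components imply nearly the same total variance, so the reported eigenvalues and percentages are mutually consistent, and together the first two components account for about 73% of the variation.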

Fig. 6

PCA biplot of the samples collected in the dry (D) and the wet (W) seasons in the Selkeh Wildlife Refuge using log transformed values of water quality and habitat variables. D1, D2, W3 and W4 represent the sampling seasons which took place in spring, summer, autumn and winter, respectively. Si represents the number of sampling sites from 1 to 7. The first component explains 51.3% (with eigenvalue of 0.35) and second component describes 22% (with eigenvalue of 0.15) of total variations

From the biplot, it can be observed that the most important environmental variables influencing the habitat preferences of the exotic fern in the Selkeh wetland were a combination of physical-habitat and water quality variables. TSS, sulphate, air temperature and the number of sunny hours were among the variables affecting the occurrence of the Azolla population, mostly in the spring and summer periods. Conversely, low cover percentage of Azolla was mainly associated with the wet seasons (particularly the winter period). In contrast to the temporal patterns, there were no clear spatial patterns for the invasive fern across the sampling sites. Nevertheless, sites five and seven provided relatively suitable habitat conditions for Azolla in the wetland, particularly in dry seasons.

DA

As with PCA, DA was performed with 12 environmental variables (using log-transformed values) collected at the same sampling sites and seasons (Fig. 7). The multivariate DA showed that the three groups of Azolla cover class were distinctly classified. The first discriminant axis describes 82.35% and the second axis 17.65% of the total variability. The projection on the two canonical variates displays a clear separation of the low cover class, with only a small area of overlap with the cluster of the high cover class, while the medium cover class has a relatively large area of overlap with both the low and high cover classes. From the DA biplot, it can be seen that the most important environmental variables associated with the high occurrence of the Azolla population in dry seasons (depicted with the green convex hulls) were TSS, SO₄²⁻, air temperature and the number of sunny hours. In contrast, wet seasons (in particular the winter period) supported a smaller population of Azolla (low prevalence is shown with the red convex hulls). Chemical variables such as dissolved oxygen, organic nitrogen and nitrate concentrations and physical-habitat variables (e.g. depth and humidity) pointed in the opposite direction to the high cover class.

Fig. 7

Scatter plot of the DA showing the three Azolla cover classes based on 12 water quality and physical-habitat variables (green, blue and red convex hulls represent the high, medium and low cover classes of Azolla, respectively). The first axis describes 82.35% of the total variability (eigenvalue of 2.4) and the second axis explains 17.65% of the total variability (eigenvalue of 0.5). The specimens belonging to each group are represented with different symbols. TSS total suspended solids, SO4 sulphate, AT air temperature, SU the number of sunny hours, Dep depth, DO dissolved oxygen, ON organic nitrogen, SiDi sampling sites 1 to 7 in dry seasons (spring and summer), SiWi sampling sites in wet seasons (autumn and winter). (Color figure online)
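
The share of variability attributed to each discriminant axis is its eigenvalue divided by the sum of all eigenvalues. The short sketch below illustrates this arithmetic with the rounded eigenvalues given in the caption (2.4 and 0.5). The function name is hypothetical, and because the caption values are rounded, the result only approximates the reported 82.35% and 17.65%.

```python
from typing import List


def explained_share(eigenvalues: List[float]) -> List[float]:
    """Return the percentage of total variability described by each axis."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    return [100.0 * ev / total for ev in eigenvalues]


# Rounded eigenvalues as printed in the caption of Fig. 7.
shares = explained_share([2.4, 0.5])
print([round(s, 1) for s in shares])
```

With these rounded inputs the two axes account for about 82.8% and 17.2%, which is consistent with the reported percentages to within rounding of the eigenvalues.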

Discussion

The outcomes of the CTs and SVMs showed that both methods were able to provide a reliable prediction of the habitat preferences of A. filiculoides in the protected area. Several studies (Hoang et al. 2010; Zarkami et al. 2010; Zarkami 2011) have stated that models should reach at least CCI ≥ 70% and k ≥ 0.40 in order to provide a reasonable assessment for the target organism.
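
As an illustration of how the two performance criteria can be checked, the sketch below computes CCI and k from a confusion matrix. It assumes that CCI is the percentage of correctly classified instances and that k is Cohen's kappa; the function names and the three-class matrix are hypothetical and are not data from this study.

```python
from typing import List, Tuple


def cci_and_kappa(confusion: List[List[int]]) -> Tuple[float, float]:
    """Return (CCI in percent, Cohen's kappa) for a square confusion matrix.

    Rows are observed classes and columns are predicted classes.
    """
    n = sum(sum(row) for row in confusion)
    if n == 0:
        raise ValueError("confusion matrix is empty")
    k = len(confusion)
    # Observed agreement: proportion of instances on the diagonal.
    observed = sum(confusion[i][i] for i in range(k)) / n
    # Agreement expected by chance, from the row and column totals.
    row_tot = [sum(row) for row in confusion]
    col_tot = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    expected = sum(row_tot[i] * col_tot[i] for i in range(k)) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    kappa = (observed - expected) / (1.0 - expected)
    return 100.0 * observed, kappa


def meets_thresholds(cci: float, kappa: float,
                     min_cci: float = 70.0, min_kappa: float = 0.40) -> bool:
    """Check the criteria quoted in the text (CCI >= 70% and k >= 0.40)."""
    return cci >= min_cci and kappa >= min_kappa


# Hypothetical low/medium/high cover-class confusion matrix.
example = [[20, 3, 1],
           [4, 15, 5],
           [2, 4, 18]]
cci, kappa = cci_and_kappa(example)
print(round(cci, 1), round(kappa, 2), meets_thresholds(cci, kappa))
```

For this hypothetical matrix, CCI is about 73.6% and k about 0.60, so both criteria are met.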

The use of optimal tree pruning (the so-called PCF) in J48 reduced the complexity of the tree. As a result, the induced model became more transparent for assessing the habitat preferences of organisms (Zarkami 2011). Consequently, the constructed trees allowed a correct and easy ecological interpretation of the selected variables for the exotic fern in the study area. In addition, the use of attribute weights in SVMs could provide more reliable predictive results than CTs for analyzing the habitat requirements of organisms, suggesting that SVMs are more robust than CTs to noise in datasets (Hoang et al. 2010). However, for the management of wetland ecosystems, CTs are preferred over SVMs because wetland managers can more easily visualize the selected predictors.

All data-driven (CTs and SVMs) and multivariate methods (PCA and DA) showed that both physical-habitat and water quality variables might influence the habitat preferences of A. filiculoides in the Selkeh wetland. However, according to the decision rules generated by J48 and the attribute weights of SMO, the physical-habitat variables might have somewhat more impact than the water quality variables on the occurrence of the exotic fern in the wetland (Sadeghi et al. 2012). In contrast, the biplots of the multivariate techniques showed that the habitat preferences of the exotic fern in the wetland are more influenced by the water quality variables than by the physical-habitat ones.

According to both data-driven methods, air temperature, humidity, the number of sunny hours and the depth of the wetland were important predictors in the wetland. There is a close and positive relationship between air temperature and the high abundance of Azolla in tropical and subtropical climates (Hill 2003; Van Der Heide et al. 2006). This predictor is a main factor in stimulating the growth of some invasive aquatic plants such as water hyacinth (Eichhornia crassipes) and A. filiculoides in tropical regions (Kannaiyan and Somporn 1989). This implies that seasonality (the effect of different seasons) can have an important impact on the growth of A. filiculoides in the Anzali wetland. When the temperature rises above 30 °C or drops below −4 °C, the growth of this aquatic fern is highly restricted (Serag et al. 2000; Liu et al. 2008; Fernández-Zamudio et al. 2010). These ranges were largely in accordance with the mean range recorded in the Anzali wetland.

Another important predictor of the growth of Azolla at the Selkeh Wildlife Refuge was the depth of the wetland. Very deep or very shallow waters in the wetland can restrict the growth and biomass production of A. filiculoides (Biswas et al. 2005). This means that this aquatic fern requires a suitable depth for its optimal growth and biomass production. Apparently, many parts of the Anzali wetland offer a suitable depth for the establishment of A. filiculoides (JICA 2005). The optimal depth for A. filiculoides is found where emergent plants (e.g. cattails, Typha latifolia; common reed, Phragmites australis; and sedges, Carex spp.) become established at the edge of the ecosystem. According to the biplots of PCA and DA, the deeper parts of the wetland can restrict the prevalence of Azolla in the wet seasons (low prevalence of Azolla is mainly found in the wet seasons, when the depth of the wetland increases).

Other physical-habitat variables (except the number of freezing days) were also important for meeting the habitat needs of the exotic fern at the sampling sites. The number of freezing days contributed almost nothing to the evaluation of the occurrence of the exotic species in the Selkeh wetland. Because the number of days with temperatures below freezing is very small in the Anzali wetland, this predictor cannot be considered a determinant factor for the growth of A. filiculoides. Since humidity is sufficient in the northern part of Iran, the Anzali wetland is considered a suitable place for the survival of A. filiculoides, and humidity was accordingly identified as a driving predictor in the wetland. Seemingly, this invasive species cannot survive in other parts of Iran because relative humidity there is below 60%. In general, low humidity (<60%) weakens Azolla because its biomass increase depends partly on air humidity: at less than 60% relative humidity, the fern becomes dehydrated and fragile (Bocchi and Malgioglio 2010). It should be noted, however, that very high humidity might also inhibit the growth of Azolla. This is evident from the constructed trees, in which low cover percentage of Azolla coincides with very high humidity (>82%). The ordination techniques likewise attribute the low cover percentage of Azolla to the winter season, when humidity reaches its maximum.

In addition to the physical-habitat variables, some chemical predictors such as organic nitrogen, TSS, SO₄²⁻, DO and o-PO₄³⁻ concentrations (according to both data-driven methods), as well as a combination of these variables (based on the PCA and DA methods), can also enhance the growth and sporulation of A. filiculoides (Janes 1998). In particular, a lack of phosphorus might decrease or even stop Azolla growth (e.g. Watanabe and Espinas 1976). In contrast, a low nitrate concentration does not restrict the growth of this exotic fern (Sadeghi et al. 2012). This was well confirmed by the methods applied in the present study. The application of various types of fertilizers (in particular sulphate and phosphate) in rice fields is a main source of these nutrients in the wetland. High concentrations of various nutrients in the Selkeh wetland might result in a significant decrease in dissolved oxygen concentration, as confirmed by the attribute weights of the SVMs and by the PCA and DA ordinations for the dry seasons.

In contrast to the temporal patterns, no clear spatial patterns were found for the exotic fern across the sampling sites. However, some sites (e.g. 5 and 7) provided relatively suitable habitat conditions for Azolla in the dry seasons owing to sufficient nutrients (e.g. sulphate), adequate sunny hours and favourable air temperature. In addition, these sites were in the vicinity of emergent plants (such as Phragmites australis, Sparganium erectum and Typha latifolia). These plants can act as a windbreak that shelters Azolla in the wetland (JICA 2005).

Although all applied techniques identified the most important explanatory wetland characteristics for analyzing the habitat preferences of A. filiculoides in the wetland, the selected variables might not be the only ones relevant for assessing the habitat requirements of the target organism (Ambelu et al. 2010). In other words, the methods (based on the available information) give priority to the most important predictors. If the physical-habitat variables selected by the data-driven methods (particularly air temperature, humidity, depth…) were excluded from the models, other variables (particularly nutrients) would become more important. Some correlations in the data are ecologically meaningful, and some variables (e.g. nutrients) were strongly correlated. Although SVMs and CTs are generally less affected by correlated variables (Zarkami et al. 2012), very highly correlated variables might to some extent prevent the methods from selecting both variables, so that any r above 0.20 would make such data-driven models unstable (Goethals 2005).
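
The kind of pairwise correlation screening discussed above can be sketched as follows. This is a minimal illustration, assuming that r denotes the Pearson correlation coefficient; the function names, the variable values and the cut-off used in the example call are hypothetical and are not taken from the study. The cut-off is deliberately left as a parameter.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson_r(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)


def correlated_pairs(data: Dict[str, List[float]],
                     threshold: float) -> List[Tuple[str, str, float]]:
    """List variable pairs whose absolute correlation exceeds the threshold."""
    flagged = []
    for a, b in combinations(sorted(data), 2):
        r = pearson_r(data[a], data[b])
        if abs(r) > threshold:
            flagged.append((a, b, r))
    return flagged


# Hypothetical measurements for three predictors (not data from this study).
example = {
    "sulphate": [1.0, 2.0, 3.0, 4.0, 5.0],
    "phosphate": [2.0, 4.0, 6.0, 8.0, 10.0],
    "depth": [3.0, 1.0, 4.0, 1.0, 5.0],
}
pairs = correlated_pairs(example, threshold=0.9)  # illustrative cut-off only
print(pairs)
```

In this toy example only the two nutrient series are flagged, which mirrors the situation described above in which a model may select one of two strongly correlated nutrients but not both.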

Although gathering a large dataset might make the models more reliable, historical datasets will not necessarily improve the predictive performance of models if the additional series are noisy or unrelated to the target variable (Boivin and Ng 2006). Moreover, if environmental gradients do not vary greatly over a historical period, 1 year of data collection (provided that sampling is monthly) will be sufficient to enhance the predictive power of data-driven models (as was done for the CTs and SVMs in the present research).

Moreover, studying the habitat needs of invasive aquatic ferns is somewhat difficult because the biotic and abiotic variables that influence their growth are complex. In particular, in contrast to other Azolla species, assessing the habitat requirements of A. filiculoides is not easy because this fern can tolerate a variety of environmental conditions to some degree (Karatayev et al. 2009). In addition, invasive species are more tolerant than native ones to environmental pollutants (Devin and Beisel 2007).

Conclusions

Based on the outcomes of the present work, it is concluded that when datasets are limited, SVMs might yield more reliable outcomes than CTs for predicting the habitat requirements of exotic species. Nevertheless, CTs (owing to a better visualization of the outcomes for wetland managers) can be a more practical tool than SVMs for relating wetland characteristics to the habitat preferences of the exotic species. Historical data gathering would further improve the prediction accuracy of the models and thereby lead to better decisions on the habitat preferences of A. filiculoides in the wetland. According to the results of the multivariate techniques (PCA and DA), it is concluded that combining these ordination techniques with data-driven ones might yield better outcomes for deciding on the most important variables driving the habitat preferences of exotic ferns in wetlands. The results of the data-driven and multivariate techniques suggest that future monitoring should take both physical-habitat variables (e.g. depth and humidity) and water quality variables (e.g. orthophosphate) into account, while focusing more on the water quality characteristics of the wetland, since most of the physical-habitat variables are unmanageable.