Keywords

1 Motivation

Scarce resources need to be managed very carefully. Water quantity, water quality and soil quality certainly are scarce and valuable resources in most parts of the world, and are intimately intertwined. None of them could be replaced by any technical means at affordable prices and at the required scale. Water and soil resources management has to grasp the actual state, current and future threats and options, has to implement and perform measures, and, last but not least, has to evaluate the effects and possible harmful side-effects of the adopted measures as well as the efficiency of the management. Thus a well-designed monitoring scheme is indispensable for providing the required data with the necessary spatial and temporal resolution (Muller 2012).

Natural systems usually exhibit considerable heterogeneity in space and time which any monitoring has to account for. This requires a large number of replicates, yielding large data sets that are not easy to handle. Moreover, usually a multitude of different processes act in parallel and at different scales and have an impact on the observed variables. In return, single measured variables usually are prone to various processes that will be reflected by the measurement data, although to unknown quantities. Last but not least, the monitoring scheme itself needs to be evaluated and to be checked with respect to efficiency as well as to possible errors etc.

2 Example Data Set

Here two new techniques will be presented that have been developed recently and have proved successful in a variety of studies on soil and water quality. Both methods explicitly allow for non-linearities that are to the author’s experience more the rule rather than the exception for real world data. Statistical analyses and graphs have been performed using the R software environment (R Development Core Team 2006) which is freely available at http://www.r-project.org.

Both methods have been applied to a data set that was kindly provided by Dr. Uwe Schindler, Leibniz Centre for Agricultural Landscape Research, Müncheberg, Germany. It comprises data from 718 samples of shallow groundwater, sampled between March 2000 and May 2004 by piezometers at 2–4 m depth at seven different sites in the Quillow catchment, about 100 km north of Berlin, Germany. Land use was arable land mostly, but included grassland and forest sites as well. In total, 41 different pipes have been sampled. Number of temporal replicates per pipe varied between 1 and 38.

Eight variables have been determined: Concentration of NH4, NO3, P, Cl, and SO4, pH value (pH), redox potential, and electric conductivity (EC). The data were tested for plausibility prior to the analysis. Values below the level of quantification were replaced by half the value of quantification level. To compensate for different data ranges for different variables, the data were z-normalized for each variable separately (values divided by the respective mean value, and then divided by the standard deviation).

Most variables are only slightly correlated with each other. Only SO4 and Cl concentration are clearly correlated with electric conductivity and are closely related to each other (Table 1). This data set is rather small with respect to the number of samples as well as to the number of measured variables compared to some recent applications (Lischeid and Bittersohl 2008; Lischeid 2009; Schilli et al. 2010). Thus it is well suited to serve as an example. A good textbook that provides in-depth information about these methods and many more related approaches is given, e.g., by Lee and Verleysen (2007).

Table 1 Pearson correlation between the measured variables

3 Dimensionality Reduction and Process Identification

Measured variables of soil or water quality usually reflect effects of various processes. E.g., nitrate concentration is affected by fertilisation, uptake by plants, nitrogen release by decomposition of organic matter, oxidation of ammonium, leaching, denitrification, etc. On the other hand, single processes usually affect more than one variable. E.g., denitrification requires reduced oxygen availability and low redox potential, which often comes along with enhanced Fe(II) concentration or sulphate concentration due to oxidation of sulphides. This interplay generates a characteristic “fingerprint” that can be used in an inverse approach to identify this process based on observed measured variables. Other processes differ with respect to their “fingerprints” and thus can be differentiated in a large data set. This is in fact the basic idea of applying dimensionality reduction approaches.

The term “dimensionality reduction” reflects another aspect of these approaches: Often only a very small number of processes prevail in a given data set. Thus, a small number of “components” suffices to explain a large fraction of the variance of a data set with many variables, that is, a high-dimensional data set. Thus application of dimensionality reduction techniques facilitates interpretation and understanding the basic features of large, high-dimensional data sets.

A well known approach is the Principal Component Analysis (PCA). In a mathematical sense it performs an Eigen value decomposition of the correlation matrix of the measured variables. It is the most efficient approach as long as the data are Gaussian distributed and only linear relationships have to be accounted for. It has often been applied to large multivariate data sets of water quality (Haag and Westrich 2002; Thyne et al. 2004; Fernandes et al. 2006) as well as in soil science and geochemistry (Gupta et al. 2006; Zhang et al. 2007; Weyer et al. 2008; Langer and Rinklebe 2009).

However, the PCA is neither well suited for non-Gaussian distributed data nor for data sets that exhibit pronounced non-linear correlations. To that end a variety of non-linear methods of dimensionality reduction approaches have been developed (Lee and Verleysen 2007). Unlike for the linear case, there is no clear advice when which approach would be superior to others, and a trial-and-error procedure is suggested (Lee and Verleysen 2007). One of these approaches is presented here. The Isometric Feature Mapping (Isomap) has been introduced by Tenenbaum et al. (2000). In a certain sense it is a non-linear extension of the linear PCA approach (Gámez et al. 2004). The theory of that approach is not presented here. To that purpose, the reader is referred to Tenenbaum et al. (2000), Gámez et al. (2004), Mahecha et al. (2007) or Lee and Verleysen (2007).

Here the potential of that approach for soil and water quality monitoring is illustrated. Compared to the linear PCA, the first two Isomap components explain a substantially larger fraction of variance of the data set (82 % compared to 70 %) (Fig. 1). This provides some evidence that non-linear relationship play a role in the given data set. Including the third component yields about 90 % of the variance both for Isomap as for PCA. Additional components increase that fraction only marginally. It can thus be concluded that only three different processes play an important role for the given data set. In the following only the Isomap results will be considered.

Fig. 1
figure 1

Cumulative variance explained by eight principal components (PCA) and isometric feature mapping (Isomap) components

The next step aims at identifying these three processes. The Isomap analysis ascribes a value for each of the eight components to each of the 718 samples of the data set. These component values, called “scores”, can be considered a measure for the strength of the effect of the respective (yet unknown) process. The relationship of the measured variables with the Isomap scores reveals the “fingerprint” of the respective process. This relationship might be blurred due to the fact that other processes have an impact on the measured variables as well. Those effects can be subtracted from the measured data as follows: A multivariate linear regression of the measured values with all but one Isomap component gives the fraction of the variance of the measured data that is explained by these components. Then the residuals of that regression are compared with the scores of the remaining component.

Figure 2 gives the results for the first component which depicts 59 % of the variance of the data set. It is positively correlated with SO4 and Cl concentration, pH value and electric conductivity. Negative and partly non-linear correlations are observed for NH4, NO3 and P concentration and redox potential. Figure 3 shows that component scores are close to zero at most sites. Only four sites exhibit very large values and substantial temporal variance. These results were unexpected prior to the study. However, both the “fingerprint” as well as the spatial pattern provide strong evidence that upwelling saline groundwater plays a major role at these sites: very high SO4 and Cl concentration, pH value and electric conductivity are indicative for tertiary deep groundwater that is known to upwell at many sites in East Germany. Low redox potential, and hardly detectable agricultural contaminants like NH4, NO3 and P support this hypothesis. Only few piezometers at two different sites were affected. It has been found at other sites that plumes of upwelling saline groundwater can have in fact very small extent.

Fig. 2
figure 2

Scores of the first component versus the residuals of the measured variables

Fig. 3
figure 3

Scores of the first component for different piezometers. The boxes give the 25th, 50th and 75th percentiles, and the whiskers the range of the Isomap scores

The second component accounts for 23 % of the data set’s variance. The scores of that component increase with NO3 concentration and redox potential and decrease with NH4 and P concentration and with pH value. The remaining variables do not show any clear relationship with the second component (Fig. 4). This pattern is consistent with the well known sequence of redox processes. High scores reflect oxic conditions where NO3 is stable. Being the anion of a strong mineral acid, high NO3 concentration reduces the pH value. In contrast, low scores of the second component characterise hypoxic conditions where NO3 is partly denitrified, partly reduced to NH4, and P is released from sorption to Fe oxides and hydroxides, that become reduced and dissolved under hypoxic conditions. That interpretation is confirmed by the fact that very low scores have been found at the wetland site (MK6, MK28) (Fig. 5).

Fig. 4
figure 4

Scores of the second component versus the residuals of the measured variables

Fig. 5
figure 5

Scores of the second component for different piezometers. The boxes give the 25th, 50th and 75th percentiles, and the whiskers the range of the Isomap scores

This component exhibits consistent temporal patterns at many sites. Figure 6 gives time series of the component scores at selected sites. Many sites show very low values during winter 2000/2001, and in the second half of 2002. Precipitation in September 2001 was about four times that of mean monthly precipitation, and continued to exceed the mean until the end of 2002. Thus, soil saturation was higher as usual, which might have promoted oxygen consumption and a decrease of redox potential. It is remarkable that other components exhibited less clear and consistent temporal patterns in this study (not shown).

Fig. 6
figure 6

Time series of scores of the second component for selected sites

Another 8.5 % of the variance of the data set is explained by the third component. It is positively correlated with NO3 and P concentration and pH values, and negatively with NH4 concentration and redox potential (Fig. 7). Thus this component is determined by the same variables compared to the second component, although partly with inverted sign. The third component explains some scatter in the data that has not already depicted by the second component. As will be shown later (Fig. 10), this component mainly differentiates between samples with low scores of the second component, that is, reduced groundwater. Samples with high scores of the third component exhibit higher NO3 and P concentration and pH value, and lower NH4 concentration and redox potential than would be expected based on the scores of the second component only. This might be either due to kinetic constraints, or to site-specific peculiarities. In fact, temporarily high scores of the third component have been observed at three sites only (Fig. 8).

Fig. 7
figure 7

Scores of the third component versus the residuals of the measured variables

Fig. 8
figure 8

Scores of the third component for different suction cups. The boxes give the 25th, 50th and 75th percentiles, and the whiskers the range of the Isomap scores

Statistics cannot replace the expert’s knowledge about biogeochemical processes. The interpretation of the Isomap results might be matter of debate, or might need to be refined and proved by additional measurements. E.g., nitrate isotope data can be used as an independent assessment of denitrification processes. But the Isomap analysis does reveal relationships within the given data set that both challenge and deepen our understanding of the prevailing processes.

4 Visualization

Visualization means more than generating fancy graphs for presentations or papers. Visualization makes use of the most powerful interface between the data stored in a computer and the human brain. The latter is very good in extracting relevant information from large data sets. Within less than a second information from 250 million sensory cells in the man’s eyes, that is, a 250 million dimensional data set, is reduced to a handful of relevant objects. Without this skill man would not have been able to escape predators, enemies, or other road users, and to hunt prey or to find food.

Efficient visualization can help a lot in identifying clusters and outliers, in analysing multivariate trends, or to detect any peculiarities in large multivariate data sets. Here the SOM-SM approach is presented and is applied to the same data set as the Isometric Feature Mapping. It is a combination of Sammon’s Mapping (SM; Sammon 1969) and Self-Organizing Maps (SOM), which have been termed Kohonen Feature Mapping as well (Kohonen 2001). Again, the reader is referred to the literature (Sammon 1969; Kohonen 2001; Lischeid 2009) for further details. Here only the application to environmental monitoring data is demonstrated.

The SOM-SM approach yields a two-dimensional graph where every symbol denotes a single sample of the data set. The same graph is used in various ways, where colours indicate different features of the data (Fig. 9). In contrast, the location of the symbols remains the same. The symbols have been allocated in an iterative procedure to ensure that the distance between any two symbols in the graph is proportional to the similarity of the respective two samples with respect to all measured variables. In this study, the two-dimensional graph depicts 89 % of the variance of the data set with eight measured variables. Please note that the scaling at the axes is given for orientation only but has nothing to do with measured data, location of the sampling site in the Quillow catchment, or alike. It is an entirely arbitrary configuration, except for the above stated correlation between distance and similarity.

Fig. 9
figure 9

Measured values of the samples, represented by the SOM-SM

In a certain sense, any of the graphs given in Fig. 9 can be interpreted analogously to a topographical map where the colour indicates the altitude, with numerous mountain tops and valleys. Similar to a topographic map, a rather smooth “landscape” develops where neighbour points exhibit roughly the same values. However, in these graphs measured values for the samples are given rather than topographic height. Moreover, different “maps” are shown for different measured variables but for the same symbols, that is, the samples of the data set (Fig. 9).

Ammonium concentration is very low and close to the detection level for most samples (Fig. 9). Only samples that plot in the lower right corner exhibit higher NH4 concentration values. In contrast, NO3 concentration of these samples is very low, and the highest NO3 concentration is found in the upper right corner of the graph. Samples with enhanced P concentration plot close to those of high NH4 concentration, but are not identical. Similarly, Cl and SO4 concentration and electric conductivity exhibit similar patterns in the SOM-SM with the highest values on the left side (Fig. 9). The pattern is less clear for pH values. There is a slight tendency of increasing pH values from the top to the bottom of the graph. For the redox potential values increase from the left to the right.

These patterns reflect the correlation structure of the data set: Variables with similar patterns, that is, same directions of steepest increase of measured values, are positively correlated, like Cl, SO4 and EC (cf. Table 1). In contrast, the directions of steepest increase of P and Cl are approximately perpendicular to each other, that is, the correlation is close to zero (Fig. 9, cf. Table 1). The gradients for P and NO3 are inverse to each other, which is symptomatic for a negative correlation.

Similar to the measured variables the Isomap scores can be represented in the SOM-SM graph (Fig. 10). There is a more or less monotonic increase from the bottom right to the upper left corner for the first component, which corresponds roughly to that of SO4, Cl and EC which in fact are closely related to this component (Fig. 2). In contrast, the scores of the second component increase from the bottom to the upper right for the second component, which is parallel or inverse to those of the redox-dependent variables (Fig. 4). Please note that the Isomap scores have not been used for generating the SOM-SM. Rather, organisation of the samples within the SOM-SM makes use of the internal relationships of the data set, similar to that of the Isomap analysis.

Fig. 10
figure 10

Isomap scores of the samples, represented by the SOM-SM

The SOM-SM graph can be used for analysing spatial and temporal patterns as well. In Fig. 11 samples from single sites are colour-coded. Although there is some overlap, samples of different sites usually cluster in different sub-regions of the graph. E.g., the C6 samples are located in the upper part of the graph, those of site FeS more to the lower right, and those of K10 in the upper left corner. These patterns can be related to those of the measured variables (Fig. 8) or of the Isomap scores (Fig. 9). Moreover, the temporal variance can be investigated for single sites separately. For example, the samples from the MK6 site spread over a larger range compared to those of C6 (Fig. 11). In addition, the MK6 samples scatter more in a direction parallel to the gradient of the second Isomap component rather to that of the first component (cf. Fig. 9). Thus, for this site changes of the redox conditions play a larger role compared to the process which is represented by the first component.

Fig. 11
figure 11

Time series of soil solution quality at selected sites, represented by the SOM-SM. Samples of the respective site as indicated in the title are coloured according to sampling year, all remaining samples are given in grey

At the K8 and at the C6 site some samples plot far away from the centre of the respective clusters. These “outliers” should be checked for possible erroneous assignment to the sampling site. As an alternative, these deviations from the centre of the respective clusters could be due to extreme conditions during or immediately prior to sampling. This is a common phenomenon especially for stream and river water samples taken during major floods, where deviations from the respective clusters occur roughly in parallel at different sites.

Colour coding in Fig. 11 indicates the year of sampling. This allows investigating trends in the SOM-SM graph, that is, systematic shifts over time. Please note that the SOM-SM represents information about all measured variables. Thus any clear shift in the graph usually is due to a (possibly non-linear) trend of more than one variable. Figure 11 indicates trends only at some of the sites, and none of these is linear. At K8 and K9, the first samples plot close to the right border, those of subsequent years more to the upper left, and those of 2004 more close to the upper right corner. Thus, this non-linear trend firstly follows an increase of the scores of the first component, and then an increase of those of the second component (cf. Fig. 9), indicating less reduced conditions. In contrast, samples at C6 follow more a spiral from the lower right in anti-clockwise direction and with decreasing radius towards 2004. Correspondingly, the data could be tested for seasonal patterns etc. (not shown).

5 Conclusions

Managing water and soil resources requires sound data that can only be provided by systematic monitoring (Muller 2012). Advanced statistical approaches have been developed to make more efficient use of large data sets in spite of substantial spatial and temporal heterogeneities that are immanent to natural systems. Here only two out of a vast zoo of different approaches have been presented. Natural water resources management as well as science could and should benefit a lot from a variety of approaches that have been developed in other disciplines.

Dimensionality reduction techniques have been applied to time series of groundwater head, lake and river water levels, and stream discharge (Longuevergne et al. 2007; Lewandowski et al. 2009; Lischeid et al. 2010; Lischeid and Kalettka 2012; Thomas et al. 2012). They were successful in differentiating between different effects on the observed dynamics at minor data requirements. The author is firmly convinced that much more can be learned from data by advanced data analysis techniques and will be happy to assist in any related efforts.