Introduction

Over the last 30 years, chemometrical analysis of water quality data has become widespread [1]. In order to manage the water quality of river basins, a great variety of static and dynamic chemometrical procedures are used for data analysis, data interpolation and approximation of time series [2, 3, 4, 5, 6], for trend estimation of water quality indicators [7, 8, 9, 10, 11], as well as for process identification and modelling [12, 13, 14, 15, 16]. Observations and measurements of concentrations of chemical, physical and biological water quality indicators serve as pointers to changes in pollution loads and ecosystem development [17, 18]. In a holistic sense, freshwater ecosystems, or more generally environmental systems, may be considered from an information theory perspective to be complex dynamic communication networks or transfer systems with biotic (living) and abiotic (non-living) components and their interrelationships [19]. Therefore, system theory should be added to the interdisciplinary context of chemometrics [20]). Usually, water quality data records contain missing data or data based on different sampling intervals. When applying methods of system theory to model environmental problems two main questions arise: how do we handle environmental time series with missing data intended for modelling and decision making [21, 22], and how do we identify dynamic environmental processes by means of chemometrical methods [1]? In this paper, different interpolation procedures are compared for water quality time series from rivers with different hydraulic characteristics. Selected results from approximating the data using analytical functions are also discussed. Using equidistant daily data records of water temperature and orthophosphate phosphorus, one subprocess of shallow lake eutrophication was identified. In this case, the raw data were taken from observations of the River Havel.

Freshwater ecosystems as filters

From a holistic point of view it is clear that measured concentrations of chemical and bio-chemical indicators within water bodies result from various processes working in parallel (transport processes, internal chemical reactions and external environmental disturbances) on different spatial-temporal scales. The present mode of sampling and environmental data acquisition [23, 24, 25, 26] represents certain process dynamics. Nonlinear feedbacks within a water body cause changes in the characteristics of input signals and system states during signal transfer processes via modulation and/or by quantification. If we view freshwater ecosystems as controllable transfer systems, concepts of general system theory become useful [19]. Ecosystems act as filters where the input time series {X(t)} are transformed into the output time series {Y (t)} by means of a filter operator F (response operator, transfer operator): Y(t)=F (X(t)). Figure 1 shows this general system concept. F is a nonlinear operator and describes the transfer behaviour of the ecosystem under consideration. The operator F can be divided into the parts state transition operator and the output operator where its parameters are independent from input variables. The system information is then stored in output data sets. Generally, each filtering system is characterised by smoothing and redundancy. Smoothing can be thought of as signal dumping, while redundancy means that the structures of the signals are changed due to signal transfer [27]. Different filter characteristics are obtained depending to the type of ecosystem.

Long-term water quality time series often contain missing data and/or irregularly sampled data. To be able to model and simulate ecological processes, all data sets need to be based on a regular time grid. Any missing data need to be replaced. Some methods of data interpolation and approximation are reported in [28], while interpolation and estimation procedures for irregularly recorded time series are discussed in [29]. Another procedure that can be used in this situation is the so-called re-sampling method [30], which requires data interpolation and—in the case of noisy data—data estimation in order to place sampled data into a regular time grid. The goal of this method is to reconstruct the time series with small sampling intervals. The following tasks are equivalent to re-sampling:

  1. 1.

    Filling the gaps in irregularly-recorded data by interpolation.

  2. 2.

    Reconstruction of the data in irregularly-recorded time series by means of analytical functions or signal estimation.

  3. 3.

    Filling the gaps in irregularly-recorded data by measuring values of reference curves.

By applying interpolation procedures to irregularly-sampled raw data sets, we can obtain time series with equidistant sampling intervals. The application of approximation methods to these time series produces functions that we can use to fill in gaps in the data under stationary conditions.

Fig. 1
figure 1

General description of an ecosystem as a filtering (transfer) system

Interpolation

To describe time-varying water quality processes it is necessary to have data available for each day, and any missing data produces complex problems [22]. The efficiencies of known interpolation methods differ [8, 9, 10, 11, 15, 16, 31, 32, 33, 34, 35, 36, 37, 38]. Four interpolation methods that can be used to obtain water quality time series with daily sampling intervals from irregularly sampled raw data with missing values are compared in Table 1. They differ in terms of the orders of the polynomials and the degrees of continuity of the derivatives of the interpolation functions \( \tilde{x}(t) \) over the interval [t k , tk+1] (see the third column in Table 1). The suitability of each interpolation method is given by the standard error SE=s/√n between the interpolated and original time series, where s is the standard deviation and n is the number of data pairs. Because of randomly—but not normally—distributed raw data and nonlinear effects within the water quality processes observed, R2-like statistics is not used. Interpolation functions are not used to make predictions for noisy data.

Table 1 Methods of interpolation

Table 2 contains an overview of the raw data sets of the River Spree at Berlin we will use as an example.

Table 2 Sample raw data from the River Spree at Berlin taken at daily intervals

Missing data and stochastic spikes occur irregularly and in different seasons in noisy records. Daily, weekly, fortnightly and monthly data series were investigated in order to find suitable interpolation methods. The standard errors that were derived are given in Table 3 for a 14-day sampling interval, and in Table 4 for a monthly sampling interval, where the smallest standard errors are written in bold type for each case.

Table 3 Standard errors for interpolated time series, with fortnightly sampling intervals
Table 4 Standard errors for interpolated time series, with monthly sampling intervals

It is apparent that, for all of the years investigated, the linear interpolation method gives better results than the nonlinear interpolation methods. This can be explained by the interpolation time step of one day which compares with most ecological process rate constants. The standard errors of nonlinear interpolation methods are often equal, which is caused by the underlying third order polynomial. However, there are differences between the continuities of their derivatives. The Nearest-Neighbour method gives stepwise interpolations parallel to the axis. This method gives the largest standard errors, so it is not well suited to the interpolation of water quality data.

Water quality concentration levels depend strongly on the quantity of water and the hydraulics of the river. To check the suitabilities of the interpolation methods described above, the long-term data sets of the Rivers Elbe, Oder, Havel, Spree, and Dahme, and those of the Teltow Channel were also investigated. Table 5 contains an overview on the results for fortnightly sampling intervals.

Table 5 Interpolation methods applied on fortnightly-sampled data from various rivers

Data gaps within time series of inorganic nitrogen compounds, turbidity, dissolved organic carbon and dissolved oxygen with two-weekly sampling interval can be filled in by linear interpolation. Nonlinear interpolation methods are more suited for orthophosphate phosphorus. Time series of UV-absorption and conductivity are stronger influenced by hydraulic and morphometric parameters of the investigated rivers. Therefore, nonlinear procedures of interpolation are more suited than others. The morphometry of the Lower Elbe leads to better equalisations of chemical and biological concentrations. For the River Oder mostly linear interpolation method gives out best results.

As in the case of two-weekly sampling intervals linear interpolations are also well suited for monthly sampling intervals (Table 6) while nonlinear methods are preferred if data gaps within time series greater than 28 days. As a result it comes out that for modelling and simulation linear interpolation will be sufficient to fill in data gaps in water quality time series.

Table 6 Interpolation methods applied to data from different rivers sampled monthly

Approximation

The central concept behind the pre-processing of time series is their approximation by particular functions. Such functions can also be used to fill gaps in data records. Water quality time series contain large (essential) and small (non-essential) process nonlinearities that can be expressed by deviations. A simple way to compensate for missing data is to use smoothing algorithms as weighted averages. Based on a time series x0(t), x1(t),..., x n (t), a new time series will be produced by the transformation y k =∑ gj-k x j for k=l, l+1, ..., lr with ∑ g j =1. Different smoothing filters are obtained depending on the choice of weights g j . The exponential smoothing of water quality time series has been demonstrated in [17, 32, 39]. Other methods used to fill gaps in water quality time series have been adapted from econometrics [40, 41]). These procedures describe medium and long-term temporal and spatial changes by some simple computable and explicitly given linear or nonlinear functions. In its simplest case, a linear function y(t)=a0(t)+a1(t) x(t) describes the data record sufficiently. Least squares methods or maximum likelihood methods are mostly used for parameter estimation. The linear approximations can be evaluated by R2 performance indices. If an approximation is not sufficient the order of polynomial is increased by 1: y(t)=a0(t)+a1(tx(t)+a2(tx2 (t). A squared function is obtained with three free parameters. These parameters can be interpreted as a mean initial value (a0), a mean rate of change (a1), and a mean process acceleration (a2). For higher order polynomials a similar interpretation of the parameters is usually not possible. Complicated nonlinear functions will then be transformed into linear functions. Another way to describe a water quality time series is to us exponential functions of the type x(t)=x(0) ekt+E(t), where x(0) is the initial value, k the rate of change, and E(t) represents a stochastic part. x(0) and k can easily be determined by linearisation.

Different classes of models may be distinguished for dynamic signal approximations [42]. The basic idea behind all of these models is the assumption that stochastic signals x(t) can be extracted by applying filtering procedures to white noise ε(t). For equidistant values of x(t) and ε(t), only the stochastic difference equation a0x(t)+a1x(t−1)+...+a r x(tr)=b0ε(t)+...+bεs(ts) is valid. Using autoregressive (AR) models of order (r), and moving average (MA) models of order (s), it is possible to obtain autoregressive process of moving averages (ARMA) models of order (r, s) using the relationship x(tn)=x(t) qn and a shifting operator q−1. By adding a measurable input signal to an ARMA model, a so-called ARMAX model is obtained, where the external signal can be considered as an external control option. All of these models belong to the class of stationary signal models. Instationary models include the class of arithmetic integrated moving average (ARIMA) models, and all classes of combined models (including polynomial/AR models, periodic regression/AR models, and others). AR models are mostly used for signal extrapolations from historical data [43, 44]. In Tables 7 and 8, results for regression type approximations of the water quality time series of the Lower Havel River are presented. In Table 7, data records for a river-like channel were investigated. Second order polynomials we found to give sufficient approximations for these data. The crosses “+” in the right column indicate the significance of the approximation at the 95% confidence level. For the river-like ecosystem, only six significant approximation functions could be found. In the case of an acceptable significance figure, missing data can be filled in by approximation functions. Exponential functions were be found instead of polynomials for nitrogen compounds.

Table 7 Approximation functions for the water quality time series from the Sacrow-Paretz-Channel (data from 1990 to 1997)
Table 8 Approximations for the water quality time series from a shallow lake (data from 1990 to 1997)

Table 8 gives analogous results for time records from the River Lower Havel, which is characterised by a chain of shallow lakes. Comparing both tables, it can be seen that the number of positive significances for the River Lower Havel data is higher because of the higher retention times in the lakes. The application of approximation functions to the water quality time series to fill in missing data will be more suitable in this case than for the rivers. However, it is clear that none of these approximation functions is suitable for predicting the varying behaviour of the water quality in the future.

Another method used for water quality modelling is the reconstruction of the time series using multiple linear regression functions [7, 8, 17, 32, 37]. From the point of view of systems theory, a so-called MISO-system (multiple input, single output) considered for signal transfer. It is often the case that this gives very good statistical approximations for reconstructing water quality time series but no causal explanations of the basic processes expressed by the chosen indicators.

Process identification

Sediments play an important role in matter cycles of freshwater ecosystems. In contrast to pelagic water, components of phosphorus and nitrogen show much higher values in sediments. Direct interrelations exist between sediment and water quality, influenced by physical, chemical, biological, hydraulic, hydrological and geomorphological driving forces. Depending on these conditions, the sediment layer works either as a nutrient trap or as a nutrient source. Phosphate remobilisation from sediment can be considered to be the result of some contradictory processes of matter transformation [45]. In partuclar, phosphate remobilisation depends on a lot of different conditions, like dissolved oxygen conditions, the load of organic substances, water temperature, pH, bacterial activity, total phosphorus content of the sediment and interstitial waters, methane convection, wind stress and hydraulic conditions, and also zoobenthic organisms and fishes. The depth of the sediment layer varies between 1 mm and 15 cm [46, 47, 48]. In order to identify the phosphorus remobilisation process, interpolated data time series for water temperature and phosphate phosphorus from two different sampling points of the River Havel were investigated. Figure 2 compares phosphate values from the River Havel at Potsdam-Humboldt Bridge with those at the end of the shallow lake area at Ketzin.

Fig. 2
figure 2

Phosphate concentrations at two different sub-watersheds of the Lower Havel River

The levels are much higher at Ketzin, which can be explained by a higher secondary load caused by remobilisation of phosphate from sediments. In general, the active sediment layer of the River Havel is 2–6 cm thick. The phenomenon of phosphate remobilisation for shallow riverine lakes has been described by various authors [47, 49, 50], who reported on the temperature dependence of the phosphate remobilisation in shallow lakes in Northeast Germany, where decreasing water temperature causes increased phosphate remobilisation from sediment. Figure 3 shows this annual behaviour for both indicators.

Fig. 3
figure 3

Dynamic behaviour of water temperature and phosphate phosphorus

A temperature-controlled turnover of nutrients was first supposed, considering well-known chemical and microbial reactions. Schettler [50] stated that, if the temperature of the upper sediment layer is higher than the temperature of the pelagic water, the phosphate content of the overlying water increases, and vice versa. Therefore, correlations between water temperature and phosphate were studied by wavelet decomposition with a db3 wavelet at level 5 [51, 52]. A distinction could be made between the basic phosphate-phosphorus content of the water body and the remobilised portion of phosphate. For the sampling points investigated, the same results were obtained where the remobilised portion increased to 500% of the basic phosphorus content. Figures 4 and 5 show details of the Daubechies wavelet analysis between both indicators at levels 4 and 5. Heating and cooling of the lake water is accompanied by opposite phosphate phosphorus behaviour. This result was confirmed by residual cross correlations, where highest negative values could be found at lag 0.

Fig. 4
figure 4

Detail d4 of the wavelet analysis of water temperature and phosphate phosphorus

Fig. 5
figure 5

Detail d5 of the wavelet analysis of water temperature and phosphate phosphorus. Correlation results are given in Figs. 6 and 7

Fig. 6
figure 6

Residual cross-correlation function of detail d4

Fig. 7
figure 7

Residual cross-correlation function of detail d5

From the wavelet decomposition results, we can see that the control of phosphate storage in sediment and phosphate remobilisation from sediment depends on the value and direction of the water temperature gradient. Analysing the gradients, the control process can be divided into four phases (Fig. 8):

Phase 0::

Increase in water temperature: f′(WT)>0, f′′(WT)>0, storage of phosphate into sediment.

Phase 1::

Diminished increase in water temperature: f′(WT)>0, f′′(WT)<0, beginning of phosphate remobilisation.

Phase 2::

Decrease in water temperature: f′(WT)<0, f′′(WT)<0, phosphate remobilisation increases.

Phase 3::

Diminished decrease in water temperature: f′(WT)<0, f′′(WT)>0, phosphate remobilisation stops.

Fig. 8
figure 8

Phases of the temperature dependency of phosphate remobilisation

As water temperature increases, the control process starts over again at phase 0.

The results from this analysis mean that phosphate remobilisation is partially controlled by the temperature gradient, not by the water temperature itself. Phosphate will be remobilised from sediment only if the second derivative of the water temperature is negative.

Conclusions

If we view ecosystems as information systems, their transfer behaviour depends on internal and external feedbacks and signal transformations. Within freshwater ecosystems, signal hazards arise through time delays between internal sources and information sinks. This means that ecological processes will be changed by switching the processes of the input and state signals. Such time delays strongly influence the dynamics of communication within an ecosystem. Missing data in water quality time series cause problems when modelling and decision making in water quality management. For small sampling intervals and for stationary water quality processes, MAs or linear interpolation methods are well suited to filling in the gaps. No general interpolation and approximation methods exist for instationary processes. As has been shown in this work, linear interpolation methods are useful tools for filling in missing data. Such time series can be used for process identification. The advantage of interpolation is an estimation of the missing data using the time series itself, in contrast to the use of approximation methods, where a fixed or well-known function is used. In either case an error estimate will be obtained.