Introduction

The aims of the Clean Water Act (1948) and EC Water Framework Directive 2000/60/EC cannot be met without a proper water quality monitoring system. The most important criteria of monitoring systems are that they should be as representative in time and space and as cost-efficient as possible (Chilundo et al. 2008). The optimization of sampling networks—generally based on professional and economic considerations—may be a solution in achieving representativeness as the basic goal and requirement of monitoring. However, the optimization of a monitoring system can only be executed when the whole system (with all the sampling sites) is analyzed at once, because only then can the differences or similarities between the sampling locations (SL) be revealed.

In the case of monitoring system optimization, the most frequent approaches are either deterministic or stochastic. Additionally, geographic information system (GIS)-based tools exist as well, which can be combined with deterministic or stochastic ones. To give a view of the diversity of methods used in monitoring network optimization, in the following paragraphs, examples will be shown of (i) deterministic methods both in the planning of monitoring networks and the optimization of already existing ones and (ii) their combination with GIS-based methods. This will be followed by (iii) a selection of studies of the use of stochastic modeling, and finally (iv) the combination of stochastic and GIS-based methods will be discussed.

The first step when one wants to set up a monitoring system is precise planning. Telci et al. (2009) used deterministic methods in their study. First, flow dynamics were determined using numerical modeling, and then an optimization model was created, which was in fact a genetic algorithm and Pareto optimal analysis. The methods of Sharp (1971) and Sanders (Sanders and Adrian 1978; Sanders 1980; Sanders et al. 1983) are also frequently used in the planning of monitoring networks (Strobl and Robillard 2008). These approaches can be successfully applied in the case of larger riverine systems as well. Using Sharp’s method, the sampling locations are set out based on the locations of the tributaries’ mouths, while Sanders and his coauthors suggest a more general consideration of point sources. If it is supposed that the tributaries function as point sources, the two methods are very similar. However, if there are multiple branches which differ from each other to a great extent, these approaches will not provide a satisfactory result. Another problem which arises with the use of these approaches is that the sampling site locations are not determined precisely enough. Do et al. (2012) found a solution to this problem by combining the deterministic approach with data obtained from GIS. In this way, they were able to study the effects of the non-point sources (e.g. pollutants of anthropogenic origin arriving in surface waters, mainly due to agricultural land use) as well. In an earlier study along the same line, Do et al. (2011) dealt with the question of using non-point sources to design the sampling network. They combined a nutrient export coefficient model and Sharp’s method in designing a monitoring network. Another example of combining deterministic (genetic algorithm) and GIS methods is presented by Park et al. (2006), who managed to determine the effectiveness of a riverine monitoring network in the Nakdong River system, Korea, while, e.g. Preziosi et al. (2013) used solely GIS methods successfully in the design of a groundwater monitoring system in a pilot area in central Italy.

Besides planning new networks, there are many studies which deal with the optimization of already functioning monitoring networks, such as that of Chen et al. (2012). Here, a one-dimensional flow model, a water quality model, and a matter element analysis (MEA) were used to find similar sections of the 1890-km-long upper and middle reaches of the Heilongjiang River in northeast China. They state in their paper that “the definition of the reach of homogeneous water quality is not absolute and depends on the methodology used.” The detailed description of MEA can be found in the work of Wang (2001). Additionally, entropy theory is also a frequently applied approach, e.g. (i) Masoumi and Kerachian (2010) successfully applied entropy theory and the transinformation–distance curves to determine the efficiency of sampling sites and the optimal sampling frequency of a groundwater quality monitoring system, while (ii) Mahjouri and Kerachian (2011) used a micro-genetic algorithm-based optimization method to optimize the spatio-temporal functioning of the monitoring network of the Jajrood River (Iran). Icaga (2005) also used a genetic algorithm to optimize the monitoring network of the River Gediz’s watershed (Turkey). Lee et al. (2014) determined the optimal water quality monitoring locations in the Logan and Albert River network (USA) applying cost functions coupled with a genetic algorithm. Staying with the deterministic approaches, Kao et al. (2012) suggested the application of two different deterministic linear optimization models instead of the previously used simulated annealing method to optimize the monitoring network of the Derchi Reservoir watershed.

Naturally, besides deterministic approaches, stochastic ones are used as well when it comes to the optimization of monitoring networks. Many studies use trend analysis to explore the redundancy of a certain system (Naddeo et al. 2007, 2013; Scannapieco et al. 2012). Other frequently used methods, such as cluster analysis (CA), discriminant analysis (DA), principal component analysis (PCA), and factor analysis (FA), have been combined in the studies of Alberto et al. (2001) on the watershed of the Suqia River (Argentina), Varol et al. (2012) on the catchment of the Tigris River (Turkey), and Wang et al. (2012) in their investigation of the Xiangxi River (China) catchment. Combining only CA and PCA, Fan et al. (2010) and Razmkhah et al. (2010) sought the spatio-temporal variability between riverine sampling sites. Hatvani et al. (2011; 2014a) also used CA to group similar sampling sites in the Kis-Balaton Water Protection System (the mitigation wetland of Lake Balaton, Hungary) and the shallow groundwater system of the Seewinkel (Hatvani et al. 2014c), while Kovács et al. (2012a) used a special coded CA to determine the water quality areas of Lake Balaton.

Multivariate statistical methods are frequently combined with GIS-based methods. For example, Shen and Wu (2013) used kriging during the optimization of the monitoring in the Yangtze estuary. Naturally, kriging can be used alone as well (Karamouz et al. 2009). Júnez-Ferreira and Herrera (2013) also combined two multivariate statistical methods (static Kalman filter combined with a sequential optimization method) to optimize the Valle de Querétaro aquifer monitoring network (Mexico) spatio-temporally.

The aim of this study is to provide an alternative to the previously discussed methods when it comes to the optimization of monitoring networks in time and space. This alternative is called combined cluster and discriminant analysis (CCDA, Kovács et al. 2014). CCDA is capable of handling spatio-temporal data obtained from an entire system over the course of decades. It facilitates the frequently difficult decision of whether an obtained grouping should be further divided or not to find homogeneous groups. This capability distinguishes it from the previously mentioned methods, be they deterministic, stochastic, or even combined with GIS.

The detailed aims of this study are (i) to point out within-year similarities (hydrochemical seasons) and (ii) to find spatially homogeneous sampling sites on the River Tisza, with special attention paid to anthropogenic activity. These homogeneous groups can later on serve as the bases for the optimization of a monitoring system.

Materials and methods

Description of the study area

The River Tisza collects the waters of the Carpathian Basin’s eastern region; it is therefore a highly important ecological corridor (Zsuga et al. 2004). It stretches from its source in the Eastern Carpathians in Ukraine to its confluence with the Danube at Titel in Serbia. According to Lászlóffy (1982), the area of its watershed is 157,186 km2, almost one third of which is located in Hungary (approx. 47,000 km2). The average amount of water brought by the Tisza into the Danube is 25.4 billion m3 year−1 (Pécsi 1969). The main branch (966 km; Sakan et al. 2007) stretches through five countries (Ukraine, Romania, Hungary (594.5 km), Slovakia, and Serbia). Heading downstream in the river’s Hungarian section, its tributaries are the following: the Szamos, Bodrog, Sajó, Zagyva, Körös, and Maros, all of which—except for the Zagyva—come from abroad (Fig. 1). It becomes clear from the runoff values that the affluent having the strongest effect on the main flow is the Szamos (at its mouth, its average runoff exceeds half of the average runoff of the Tisza) and a considerable “changing effect” is expected from the Bodrog, Sajó, Zagyva, Körös, and Maros Rivers (Table 1).

Fig. 1 Hungarian catchment of the River Tisza

Table 1 Characteristics of the Hungarian section of the River Tisza

It has been documented that, besides the tributaries, other, mostly anthropogenic, factors such as water barrage systems (WBS; e.g. Tiszalök WBS; Fig. 1) or lakes (e.g. Lake Tisza; Fig. 1) can also have an effect on the water quality of the analyzed river section (Kentel and Alp 2013; Moreira and Poole 1993). Even the river ice regime may change due to the installation of WBSs (Takács et al. 2013; Takács and Kern 2015). Lake Tisza is an artificial reservoir on the river, built in 1973. It was planned to function as a part of a future WBS. Nowadays, it is a frequented recreation zone and nature reserve. The lake’s length is 27 km, its mean depth is 1.3 m, and it has a total area of 127 km2. Moreover, non-point source nutrient loads arriving from agricultural areas have to be accounted for as well (Mandera and Forsberg 2000). Regarding large cities, there are several along the river (e.g. Szolnok at SL-10 and Szeged at SL-13; Fig. 1) which could also have an environmental impact on the river’s water quality.

Dataset used

Regarding systematic measurements, the Hungarian section of the Tisza (594.5 rkm) fell under the jurisdiction of many different water authorities over the studied decades (1975–2005). These different authorities did not harmonize the timing of the sampling; hence, samples were taken on different days (∼2250 sampling dates) at different SLs. Fortunately, they did take the trouble to intercalibrate the sampling methodologies (details can be found in Hungarian Standard No. MSZ 12749:1993).

In the course of the analyses, time series of the following parameters from 14 sampling sites (Fig. 1) were examined for the period 1975–2005: runoff (m3 s−1), pH, dissolved O2, BOD-5, Ca2+, Mg2+, Na+, K+, Cl−, SO4 2−, HCO3− (mg l−1), NH4-N, NO2-N, NO3-N, and PO4-P (μg l−1). After 2005, the sampling frequency was reduced and the set of parameters changed. The total number of data analyzed added up to ∼33,500.

Data preparation was performed so that the dataset would meet some basic requirements of the applied method: (i) no missing data were allowed; all time points were discarded where there was missing data in any of the parameters; (ii) no mistyped extreme values were allowed. These were sought manually, because there were occasions when an “act of God” (e.g. flood) caused a certain parameter to behave differently from the general tendencies, although its measurements were probably accurate.
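Requirement (i) amounts to listwise deletion. A minimal Python sketch follows, assuming that each sampling record is a dictionary keyed by parameter name and that a missing measurement is stored as None. The record layout, the function name drop_incomplete_records, and the sample values are illustrative assumptions and not the code used in the study:

```python
def drop_incomplete_records(records, parameters):
    """Keep only the records that hold a value for every parameter."""
    return [
        rec for rec in records
        if all(rec.get(p) is not None for p in parameters)
    ]


# Hypothetical records; the values are made up for illustration only.
records = [
    {"date": "1993-01-11", "pH": 7.9, "DO": 11.2},
    {"date": "1993-02-08", "pH": None, "DO": 12.0},  # pH missing, so discarded
    {"date": "1993-03-15", "pH": 8.0, "DO": 10.4},
]
complete = drop_incomplete_records(records, ["pH", "DO"])
```

Requirement (ii) is deliberately not automated here, since the text states that suspicious extremes were checked manually so that genuine events such as floods would not be discarded.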

Data from the tributaries (rivers—Szamos, Bodrog, Sajó, Zagyva, Körös, and Maros) were acquired as well with the same set of variables as the ones used for the analysis of the Tisza. Regarding the tributaries, the data measured at the closest sampling location to the confluences on the tributary were used.

Methodology

A new method, combined cluster and discriminant analysis (CCDA; Kovács et al. 2014), formed the backbone of the research. The main aim of CCDA is to find homogeneous groups based on data with known origins, i.e. sampling locations in this case. It consists of three main steps (Fig. 2):

  (I) a basic grouping procedure, e.g. using hierarchical cluster analysis (HCA), to determine possible groupings;

  (II) a core cycle where the goodness of the groupings from step I and the goodness of random classifications are determined using linear discriminant analysis (LDA); these are then compared in the form of a “difference value”;

  (III) a final evaluation step, where a decision about iterative further investigation of sub-groups is taken.

Fig. 2 Flowchart of CCDA (after Kovács et al. 2014)

The main idea here is that once the ratio of correctly classified cases for a grouping is higher than at least 95 % of the ratios for the random classifications (i.e. the difference value is positive), then at the level of α = 0.05, the given classification is considered inhomogeneous, i.e. the groups it contains differ from each other. For a detailed description of the method, see Kovács et al. (2014) and the corresponding R package (“ccda”, http://cran.r-project.org/web/packages/ccda/) used for the computations. It has to be underlined that CCDA is generally applicable to many types of data, e.g. water chemistry data originating from surface (lake, wetland, river) or sub-surface water systems (e.g. the watershed of a steppe lake; Kovács et al. 2015).
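The decision rule can be illustrated with a short Python sketch. It assumes that the difference value is the ratio of correctly classified cases for the grouping minus the 95 % quantile of the ratios obtained for the random classifications, which matches the description above. The nearest-rank quantile rule, the function names, and the example ratios are assumptions made for illustration; the "ccda" R package may use a different quantile convention:

```python
def quantile_nearest_rank(values, percent=95):
    """Empirical quantile by the nearest-rank rule (one of several conventions)."""
    if not values:
        raise ValueError("at least one random classification ratio is needed")
    ordered = sorted(values)
    rank = (percent * len(ordered) + 99) // 100  # ceil(percent * n / 100)
    return ordered[max(rank, 1) - 1]


def difference_value(grouping_ratio, random_ratios, percent=95):
    """Ratio for the grouping minus the chosen quantile of the random ratios."""
    return grouping_ratio - quantile_nearest_rank(random_ratios, percent)


# Hypothetical ratios of correctly classified cases (0.50, 0.51, ..., 0.69).
random_ratios = [i / 100 for i in range(50, 70)]
d = difference_value(0.80, random_ratios)
grouping_is_inhomogeneous = d > 0  # positive value: the groups really differ
```

With these made-up numbers, the grouping classifies better than 95 % of the random classifications, so it would be investigated further in step III.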

HCA and CCDA were applied to the data in order to (Fig. ESM 1)

  • determine the time interval of interest for investigations (HCA based on annual averages),

  • determine the seasons (HCA for monthly averages),

  • determine the similarities between the consecutive months pairwise using CCDA (note that in order to ensure the comparability of pairwise difference values, an equal number of samples was taken for every month by resampling),

  • find homogeneous groups of sampling sites in space in the chosen time interval using CCDA to provide the basis for monitoring network recalibration (“overall picture”), as well as to

  • assess the monitoring network in space within the determined seasons (also using CCDA).
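The equal-sample-size condition mentioned for the pairwise comparison of months can be sketched in Python as follows. The text does not specify the resampling scheme, so drawing without replacement down to the size of the smallest month is an assumption, as are the function name resample_equal and the example data:

```python
import random


def resample_equal(samples_by_month, seed=0):
    """Draw the same number of samples from every month.

    Assumption: subsampling without replacement, with the common size set
    by the month that has the fewest samples.
    """
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    n = min(len(samples) for samples in samples_by_month.values())
    return {
        month: rng.sample(samples, n)
        for month, samples in samples_by_month.items()
    }


# Hypothetical sample identifiers per month.
samples_by_month = {"Jan": [1, 2, 3, 4], "Feb": [5, 6], "Mar": [7, 8, 9]}
balanced = resample_equal(samples_by_month)
```

Equal group sizes matter because the ratio of correctly classified cases, and hence the difference value, depends on the number of cases in each group.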

In every case where a cluster group is formed, the role of each parameter in determining the formation of the previously obtained cluster pattern can be analyzed. Using Wilks’ λ statistics (1932), a Wilks’ λ quotient is assigned to every parameter, where the quotient is

$$ \lambda = \frac{\sum_i \sum_j \left( x_{ij} - \bar{x}_i \right)^2}{\sum_i \sum_j \left( x_{ij} - \bar{x} \right)^2} $$

where \( x_{ij} \) is the jth element of the ith group, \( \bar{x}_i \) is the ith group’s mean, and \( \bar{x} \) is the total mean.

The value of λ is the ratio of the within-group sum of squares to the total sum of squares. It is a number between 0 and 1. If λ = 1, then the mean of the discriminant scores is the same in all groups and there is no inter-group variability, so, in this case, the parameter did not affect the formation of the cluster groups (Afifi et al. 2004). On the contrary, if λ = 0, then that particular parameter affected the formation of the cluster groups the most. The smaller the quotient is, the more it determines the formation of the cluster groups (Kovács et al. 2012b).
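As an illustration, the quotient for a single parameter can be computed directly from the definition, i.e. the within-group sum of squares divided by the total sum of squares. The Python sketch below uses made-up values; the function name wilks_lambda_quotient is an assumption and is not taken from any statistical package:

```python
def wilks_lambda_quotient(groups):
    """Within-group sum of squares divided by the total sum of squares.

    `groups` is a list of lists, one inner list of values per cluster group.
    """
    all_values = [x for group in groups for x in group]
    total_mean = sum(all_values) / len(all_values)
    total_ss = sum((x - total_mean) ** 2 for x in all_values)
    if total_ss == 0:
        raise ValueError("the parameter is constant, so the quotient is undefined")
    within_ss = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        within_ss += sum((x - group_mean) ** 2 for x in group)
    return within_ss / total_ss


# Made-up values: two clearly separated groups, then two identical groups.
separated = wilks_lambda_quotient([[1, 2, 3], [7, 8, 9]])
identical = wilks_lambda_quotient([[1, 2, 3], [1, 2, 3]])
```

For the separated groups the quotient is 4/58 (about 0.07), so the parameter strongly determines the grouping. For the identical groups it equals 1, so the parameter has no role in the grouping.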

Results

The first step in the analysis was to find the most appropriate time period for the detailed analyses. Based on the time series of the water quality variables, a breakpoint was suspected (e.g. Fig. ESM 2). Therefore, HCA was applied to all the annual averages of the variables together (1975–2005). It pointed out that the dataset consisted of two main time intervals: 1975–1992 and 1993–2005 (Fig. ESM 3). In further investigations, the time interval 1993–2005 was examined for the following reasons: (i) data from an additional sampling site were available in the latter period and (ii) the second period can be expected to provide a more accurate picture of the current status of the monitoring system.

As a next step, the intra-annual similarities were sought on a monthly scale. HCA pointed out that there are basically four seasons (Fig. 3).

Fig. 3 Results of HCA conducted on monthly averages to find seasons

CCDA also provided the opportunity to explore homogeneity within the seasons. It showed that even though the previously determined seasons consist of similarly behaving months, these seasons cannot be considered as homogeneous. Moreover, the months would even form 12 separate homogeneous groups, each month forming a “group” of its own. To point out the significant differences, CCDA was applied to the pairwise comparison of consecutive months (Fig. 4). To ensure the comparability of the resulting difference values, an equal number of samples for each month was selected using resampling.

Fig. 4 Pairwise differences between the months pointed out by CCDA

The next step was to find homogeneous spatial patterns, or, more specifically, those SLs which contain redundant information. For this, CCDA was used on all the available data for the time interval 1993–2005, resulting in an “overall picture” where 11 homogeneous groups were formed from the 14 SLs considered. Eight groups consisted of a single SL each. Although some of these are quite close to each other, such as SL-5 and SL-6 at a distance of 2 rkm, each of them provides distinct information and is therefore highly informative, even alone. Meanwhile, three groups consisting of two sampling sites each were obtained. The sites in each pair were always located next to each other (Fig. 5), although sometimes at quite a long distance, ∼70 km, from each other (SL-3 and SL-4).

Fig. 5 Overall picture with colored sampling locations forming homogeneous groups

The other approach to determining the spatial pattern in the sampling sites is to analyze their similarity not on a long but on a much shorter temporal scale, namely seasonally. This was considered to be an important step, since the hypothesis is that the sampling sites’ relationship to each other changes with the seasons. Therefore, the question was raised: to what extent does the “overall picture” coincide with the seasonal one?

As a first step, let us explore all the winter months for the time period between 1993 and 2005. At this time of the year, the 14 SLs can be divided into 10 homogeneous groups (Fig. 6a) using CCDA. Six SLs form groups alone while there are four groups consisting of two sampling sites each.

Fig. 6 Homogeneous groups of sampling sites (colored ones) on the Hungarian section of the River Tisza for each of the four seasons

In spring, there are only three SL pairs, because SL-10 and SL-11 separated into two distinct groups. Apart from this change, all the other SLs remained in the same group as observed in winter. Therefore, there were 11 homogeneous groups in the spring compared to the 10 in winter (Fig. 6b). This pattern is in accordance with the overall picture obtained from the data for the whole year.

Moving on to the last two seasons, a similar pattern was obtained. In summer and fall, SL-2 joined the group containing SL-3 and SL-4, unlike in winter or spring. This was the only group containing three SLs (Fig. 6c, d). As a result, the number of homogeneous groups decreased by one, from 11 to 10, in summer and in fall.

To summarize, concerning the similarities and differences of the SLs in space (downstream North to South) with regard to the “overall” and the seasonal picture, the following can be said:

  (i) SL-1 formed one group alone in the case of every approach;

  (ii) SL-2 in summer and fall connects to the group formed by SL-3 and SL-4: the latter two were together in all cases;

  (iii) SL-5 and SL-6 always remained separated;

  (iv) SL-7 and SL-8 always formed one group together;

  (v) SL-9 formed one group alone in the case of every approach;

  (vi) SL-10 and SL-11 in winter form one group;

  (vii) SL-12 and SL-13 always formed one group; and

  (viii) SL-14 formed one group alone in the case of every approach.

Discussion

Separation of the time period

As presented, the dataset was split at 1992/1993 into two similarly behaving time intervals. The reason for this separation was the change in the concentration of the parameters measuring the trophic conditions and saprobity at the turn of the decade (1990). For example, PO4-P, BOD-5, and NO3-N started decreasing at the beginning of the 1990s, while DO started to increase (Fig. ESM 2). The presence of these phenomena concurs with the findings of Mandera and Forsberg (2000), who pointed out that anthropogenic activity was responsible for the elevated concentrations of nutrients in the waters of the Tisza. According to the new political thinking at the end of the 1980s and the beginning of the 1990s, high state subsidies were withdrawn from artificial fertilizers, while high fertilizer prices and the unpredictable future of state farms and cooperatives resulted in a drastic reduction in fertilizer usage. Hungarian agriculture and industry were severely restructured after the collapse of the Soviet Union in 1991. There was a tenfold increase in fertilizer prices in the 1980s. The fall in fertilizer use was much more dramatic (between 10 and 20 times) than in the Western European countries (Csathó et al. 2007; Hatvani et al. 2014b).

These two facts played a great role in the decrease of PO4-P concentrations in inland waters located in agricultural areas in the last years of the twentieth century (Grimvall et al. 2000).

Separation of the seasons

Since Hungary is located in the continental climate zone, from a meteorological perspective four seasons exist. Huschke (1959), in his work entitled “The Glossary of Meteorology”, defines the seasons as follows: the warmest period of the year is summer everywhere in the world, except for a few tropical regions, while the coldest is winter. According to Trenberth (1983), spring and fall serve as transitional periods between these two. Trenberth also suggested that the seasons should be determined based on annual mean temperatures.

The seasonal pattern obtained did not concur with the general meteorological view of four seasons of equal length in the continental climate zone. Summer and winter were both 4 months long, while spring and fall were each 2 months long (Fig. 3). This observation is supported by the difference values obtained for consecutive months using CCDA (Fig. 4). Since the difference value between February and March (9 %) is much smaller than the one for March–April (20 %), it is reasonable that March joins the winter months. Similarly, September is closer to August than to October (differences of 8 and 10 %, respectively). The other bordering months between the seasons, December and June, show a similar behavior as well, justifying their assignment to the winter and the summer seasons, respectively. This finding that the seasons are of unequal length concurs with the research of Alpert et al. (2004) conducted on meteorological data. It is suspected that the reason behind this phenomenon lies in the fast transition of the characteristic processes of winter into summer in spring and the other way around in fall.
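The reasoning applied to the bordering months can be written as a small Python sketch: a month is attached to the neighbor with the smaller pairwise difference value. The function name closer_neighbor is an assumption made for illustration; the difference values are the ones quoted in the text:

```python
def closer_neighbor(diff_to_previous, diff_to_next):
    """Return which neighboring month is more similar (smaller difference value)."""
    return "previous" if diff_to_previous < diff_to_next else "next"


# Difference values (%) quoted in the text for consecutive months.
march = closer_neighbor(9, 20)      # February-March vs March-April
september = closer_neighbor(8, 10)  # August-September vs September-October
```

Both months attach to the preceding month, i.e. March to the winter group and September to the summer group, in line with the HCA result.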

To answer the question of which parameters were responsible for the separation of the consecutive months, their Wilks’ lambda statistics were determined (Table 2). Parameters closely related to water temperature (e.g. dissolved oxygen) and/or seasonal effects such as floods, composition, and decomposition processes (e.g. ammonium or nitrate; Table 2; Fig. ESM 4A) influenced the separation of the months the most and are likely to vary between the different parts of the year. The distribution of the less influential variables varied less across the months (Table 2; Fig. ESM 4B). Nitrate and dissolved oxygen are both periodic. Meanwhile, runoff reaches its maxima during green floods and consequently becomes an important grouping factor. However, not all of the changes in the parameters’ concentrations were connected to the transitional seasons or even the points of transition. Ammonium reaches its maximum in winter because of the dominating decomposition processes (Hatvani et al. 2011), which makes winter stand out from the other seasons and makes ammonium another considerable factor, while other (less determining) parameters are more or less stable during the year and, thus, do not play such an important role in the grouping procedure.

Table 2 Wilks’ lambda statistics of the parameters. The smaller the coefficient is, the more the parameter was responsible for the separation of the months

Separation of the sampling sites in space

The separation of the seasons was related to the intra-year fluctuations of the parameter values, while the spatial patterns were primarily caused by the tributaries and anthropogenic obstacles. Irrespective of the seasons, the tributaries separated the SLs on every occasion. This finding concurred with those of Sharp (1971) and Sanders (1980). The effect of the tributaries consisted of two factors: (i) the geochemical composition of the waters brought from their watershed, and (ii) their load of anthropogenic origin. The latter mainly consisted of nutrients. Downstream of the mouths of the tributaries, the concentrations of the parameters changed (increased or decreased). These phenomena were enough to separate two SLs into two distinct groups. For example, SL-13 and SL-14 were separated by the Maros River (T6) (Fig. 7a, b).

Fig. 7 Box-and-whiskers plots showing separating effects

Unlike in the studies of Sharp (1971) and Sanders (1980), in which only the tributaries were identified as being responsible for the separation of the sections of the river, here, other separating factors were identified as well. Besides the tributaries, man-made obstacles also played a role in the separation of the SLs into different groups. This finding concurs with the new wave of perception when it comes to recalibrating monitoring networks, i.e. taking anthropogenic effects into account (Do et al. 2011). SL-5 and SL-6 (up- and downstream of WBS) separated, supposedly because of the WBS. Although the riverbed morphology changed in the area resulting in slower water velocity, the concentration of the parameters related to halobity did not change, unlike the concentration of parameters related to biological activity. Their values changed significantly up- and downstream of the WBS; DO decreased between the two, while BOD-5 and NH4-N increased (Fig. 7c, d). It is suspected that the effect of the WBS could be decreased by fine-tuning its operation through integrated basin management e.g. using the water quantity and quality optimization model of Zhang et al. (2011).

A further separation between the SLs was caused by Lake Tisza. It buffers the nutrient loads arriving from upstream and stabilizes their concentrations (e.g. BOD-5; Fig. 7e, f). Therefore it separated the group consisting of SL-7 and SL-8 and the group formed by SL-9 alone into two groups, although they are relatively close to each other.

Conclusions and outlook

In this paper, homogeneous groups of sampling sites were sought. First, it was shown that there are four temporally similar sections for the period 1993–2005. The months within hydrochemical seasons, however, can only be regarded as similar and not as homogeneous. The most explicit changes could be observed in April–May and October–November. Hence, a higher temporal sampling frequency is suggested in spring and fall, compared to summer and winter months, to obtain a more precise picture of the underlying processes. Nevertheless, for similar cases of unequal season lengths, it should be kept in mind that many statistical methods require equidistant sampling, for which the harmonization of the temporal sampling frequency is essential. Thus, the temporal sampling frequency should be aligned according to the seasons with the highest difference values, i.e., spring and fall; this will comply with the needs of the whole monitoring as well.

In space, the borders between the homogeneous spatial groups tend to be located at the mouths of the tributaries, as shown by Sharp (1971), but not exclusively. Human-made obstacles such as the WBS can also cause such disturbances, leading to the separation of close neighboring sampling sites. As a result, out of the 14 SLs, only 11 are necessary.

The study shows a good example of the spatio-temporal optimization of a monitoring network using the generally applicable combined cluster and discriminant analysis. The authors are convinced that similar situations can be found all over the world; therefore, the conclusions, the method, and the message can be generalized.