Introduction

The Water Framework Directive (WFD) requires that European water bodies are classified according to their ecological status. Ecological classification systems for rivers are already proposed or in use by EU countries, based on, for example, macroinvertebrates (Hering et al. 2004; Verdonschot and Moog 2006) and fish (Degerman et al. 2007; Pont et al. 2007). For lakes, however, ecological classification systems are less developed. An important task for the EU-funded project REBECCA (http://www.environment.fi/syke/rebecca) was therefore to analyse relationships between chemical pressures and ecological responses in lakes. The aim of this project was to provide scientific support for the development of new ecological classification systems and for validation of existing systems. For this purpose, we have collated available monitoring data from all projects partners, as well as from external data providers. The data from all countries have been compiled into common databases for each major taxonomic group: phytoplankton, macrophytes, macroinvertebrates and fish. These taxonomic groups will be referred to as “biological quality elements” (BQEs), as defined by the WFD. Altogether there are more than 30,000 samples of these biological elements, representing more than 5,000 lakes in 20 countries (Table 1). Most of the biological samples are identified to species level. In addition there are >80,000 chlorophyll a samples, representing total phytoplankton biomass. Most of the samples are from the period between 1988 and 2003.

Table 1 Overview of contents of the REBECCA Lakes databases per country: number of lakes and samples for chlorophyll (as a proxy for phytoplankton abundance), and number of lakes, samples and taxa per biological quality element

An important motivation for developing the common databases was that the larger datasets would enable us to analyse pressure–response relationships for different lake types separately. A set of lake types based on geological and chemical properties (see Table 2) has been defined for five groups of countries within Europe (Geographical Intercalibration Groups; GIGs) by the pan-European WFD Common Implementation Strategy (European Commission 2003). These lake types are expected to have specific ecological reference conditions (i.e. community composition in non-disturbed conditions) and specific ecological responses to pressures. This lake typology will not serve as an optimal categorisation for all biological elements and for all pressure types (see e.g. Verdonschot 2006b), but we expect that type-specific analyses will at least reduce some of the unexplained variation in the ecological responses. In REBECCA, we have been able to characterise ecological relationships for all lake types separately: chlorophyll (Carvalho et al. 2008; Ptacnik et al. 2008a, b), phytoplankton (Ptacnik et al. 2008b); or for combined groups of lake types: macrophytes (Penning et al. 2008a, b) and macroinvertebrates (Schartau et al. 2008).

Table 2 Definition of lake types used in the REBECCA Lakes databases, based on the Intercalibration typology developed by ECOSTAT

Another motivation for the common database approach was to assist the Intercalibration process, i.e. the intercalibration of class boundaries of existing national ecological classification systems within the GIGs (European Commission 2005). Datasets compiled within the GIGs were provided to the REBECCA databases, and results of data analyses (or data tables formatted for analysis) were returned to the GIGs. This collaboration gave synergies for both parties: the REBECCA project obtained a considerably larger empirical foundation for characterisation of pressure–response relationships, and the GIGs obtained more precise results for the intercalibration of their classification systems (Lyche Solheim et al. 2008).

A database strategy was not actually planned from the beginning of the project, nor was a trained database manager involved, but the need for proper relational databases became obvious after the start of project. Two main factors necessitated the databases: the amount of data that we received was far greater than expected, and the data formats were more heterogeneous than foreseen. An explanation for this development is that the external interest in the REBECCA databases grew throughout the project: as preliminary results from the project were presented in meetings, we were offered more data, both from REBECCA partners and from institutions that were not project partners. In particular, the Intercalibration project contributed a substantial amount of phytoplankton data from large parts of Europe. Although a template was developed for data submission by the project partners, we eventually decided to accept data in any format from external data providers (and to some degree from partners). Therefore, the earlier versions of the databases had to be modified in order to accommodate this variety of data formats.

The strategy employed for REBECCA Lakes of holding data in common databases had obvious benefits compared with single-country analyses, but it also involved considerable efforts and challenges. This article does not attempt to describe the optimal way of constructing and operating ecological databases, but rather to share the experiences of researchers who were faced with the challenge of handling vast amounts of ecological data. Thus, the aims of this article are:

  1. 1.

    To give an overview of the contents of the REBECCA Lakes biological databases (Phytoplankton, Macrophytes, Macroinvertebrates and Fish). The purpose is both to provide more background information for the results presented in the other REBECCA articles of this special issue, and to inform about the availability of these data for future projects.

  2. 2.

    To share our experiences from the database processes, from data submission through standardisation to extraction for data analysis.

  3. 3.

    To discuss the cost and benefits with of the common database approach, and give recommendations for compilation of ecological monitoring data for future projects.

We believe it is likely that other European environmental research projects will run into similar problems, and we hope that our experiences regarding data compilation can become useful. Our experiences should be especially relevant for projects addressing the WFD and assessment of ecological status. Challenges regarding the analysis and interpretation of these large datasets will be addressed in the subsequent REBECCA articles in this special issue.

REBECCA database contents

Abundance data were collated for the biological quality elements phytoplankton, macrophytes, macroinvertebrates and fish, together with accompanying chemistry data and geo-referenced station information. Data on phytoplankton, macroinvertebrates and fish were compiled and managed at the Norwegian Institute for Water Research (NIVA). Macrophyte data were compiled and managed at the Centre for Ecology and Hydrology (CEH, UK). In addition, chlorophyll was used as a proxy for phytoplankton biomass, because the number of chlorophyll observations is almost an order of magnitude higher than the number of phytoplankton abundance observations (Table 1). The number of lakes with observations of the different BQEs is shown per country in Table 1, whereas Table 2 shows the number of different lake types per GIG. Reference lakes are by definition those with insignificant anthropogenic pressures, and the reference status of lakes is assigned by the data providers. Lakes belonging to the same lake type are assumed to have similar ecological reference conditions. The composition of lakes is further characterised by the range of each typology factor for each database (Table 3). A similar overview of the chemical determinands associated with three of the biological databases is given in the appendix (Table A1).

Table 3 Characterisation of lakes in the REBECCA databases: range of values of typology factors (25 percentile, median and 75 percentile)

Examples of multi-national ecological databases that have been compiled within other EU projects are given below. Most of these databases contain data from rivers or coastal zones, and they usually focus on one main taxonomic group only. To our knowledge, the REBECCA Lakes databases are currently the most extensive databases for biological data from lakes that are compiled at EU level.

  1. Data from eight river basins across Europe were collected within the EU FP5 project HarmoniRiB (http://workplace.wur.nl/harmonirib). Although some biological data are available, these databases contain mostly physical, chemical and hydrological data (see e.g. Refsgaard et al. 2007). This project is mainly focussed on quantifying and storing information on uncertainty associated with the data (Refsgaard et al. 2005).

  2. Several other projects have compiled large-scale data from catchments, but with a lesser focus on biological data than in REBECCA, for example EUROHARP (http://www.euroharp.org) and Euro-limpacs ( http://www.eurolimpacs.ucl.ac.uk).

  3. Large databases on marine phytoplankton from the Baltic Sea have been compiled within e.g. the project CHARM (http://www2.dmu.dk/1_Viden/2_Miljoe-tilstand/3_vand/4_Charm/charm_main.htm). Data from this database has also been used in the coastal part of the REBECCA project (Carstensen and Heiskanen 2007), and in the on-going project THRESHOLDS ( http://www.thresholds-eu.org).

  4. Data on macroinvertebrates in rivers were collected by the EU projects AQEM (http://www.aqem.de; Hering et al. 2004) and STAR (http://www.eu-star.at; Furse et al. 2006). The AQEM/STAR databases contain 1,660 samples representing 16 countries and 48 stream types. They contain data on both occurrence and autecology of species (Schmidt-Kloiber et al. 2006), as well as a software tool for ecological assessment of rivers.

  5. A dataset on benthic invertebrates in coastal waters has been compiled for intercalibration of coastal classification systems: 589 abundance samples from different locations in seven countries along the European Atlantic coasts (Borja et al. 2007).

  6. Existing data on fish in streams from 12 countries were compiled in the project FAME (http://fame.boku.ac.at/). These data have been used for correlating fish metrics used in the European Fish Index with environmental component scores (Beier et al. 2007).

  7. The Modelkey Database (http://www.modelkey.ufz.de) contains monitoring data including macroinvertebrates and fish from three river basins. The data have been used for identification of probable cause–effect relationships on the basis of data on chemical pollution, habitat, toxicity and biological inventories (Brack et al. 2005) and comparison of ecological assessments methods for environmental pollution (Ohe et al. 2007). Examples of use of monitoring data on environmental pollution and ecological responses can also be found in Schriever and Liess (2007) and Schafer et al. (2007).

Publications from other projects that are based on existing monitoring data often give a good overview of the database contents, but less information on the construction and use of the databases, and on the challenges and solutions (but see Beier et al. 2007).

REBECCA database structure

When developing a structure for the databases, we tried to meet two conflicting needs. On one hand, the database structure should be both detailed and flexible enough to accommodate the different data formats and the many updates and corrections. Moreover, since the aim of the project was to analyse biological responses to chemical pressures, an important requirement to the databases was to allow the linking of chemical and biological data in various ways. On the other hand, because the project did not have resources for a professional database manager who could extract tables for data analyses, it was desirable to have a relatively simple database structure so that at least some project partners were able to extract their own tables.

The phytoplankton database, being the largest and most frequently updated, needed to have a flexible construction (Fig. 1; see further description below). For consistency, the macroinvertebrate database was constructed in the same format. We chose to store the data in a form that was close to the original, to facilitate data updating and checking by the providers. However, this structure made it difficult to use for most project partners. The macrophyte database was initiated later in the project, when some lessons had already been learned from the work on phytoplankton and macroinvertebrates. For this database, the data were standardised as much as possible prior to import to the database. Data providers were requested to provide the data in a standard format (with partial success), and the remainder of the data was standardised by the database manager. The fish datasets, which were simpler (no taxonomic information) and more homogenous, were stored in Microsoft Excel.

Fig. 1
figure 1

Illustration of database structures: main tables and relationships between fields within tables. The example shows the macroinvertebrate database. The following properties were found to be particularly useful: (1) Two levels of station identity (lake and site within lakes); (2) separate sample tables for chemistry and biology; (3) separate sample identities (country code + entry no.); (4) tables for standardisation of determinands and taxa

We did not attempt to combine and harmonise the station lists for the databases among different biological quality elements, because this would be very time-consuming. Many datasets did not contain a unique station code, only station names, for which the spelling was not always consistent among datasets. It was therefore demanding enough to combine the stations for biological and chemical samples within the same database. Thus, for the time being, we were not able to analyse the combined responses of two or more different biological quality elements. Such a combined analysis might nevertheless be possible in a future project.

Phytoplankton and macroinvertebrate databases (NIVA)

In each database, the data were organised into five main tables (Fig. 1): station information, chemistry sample information, biology sample information, chemistry values (incl. pressure variables such as pH or phosphorus) and biology values (such as biomass or abundance per taxon). Chlorophyll values were stored in the chemistry table, even though it represents a biological element, because it is usually measured together with chemical parameters, and it does contain any taxonomic information. Separate tables for identifying chemistry samples and biology samples were necessary because the biology samples did not always have corresponding chemistry samples from exactly the same station and date. More chemistry samples than biology samples were provided (see Appendix 1). All unique combinations of original chemistry determinand names and units were stored in a separate table for standardisation. A total of 527 unique combinations of original names for determinands and units were reduced to 139 unique determinands with standardised names and units. Storing the data with their original units and values provided better traceability of the original data. It also allowed extraction of data into tables that had similar format as the original data supplied, and this facilitated data verification by the data providers. However, storing data with their original units also implied that the data must be linked to a standardisation table in order to harmonise the names and the units for each data extraction.

Uniqueness of records was determined by multiple fields both in the station table and in the sample table (Fig. 1). For example, the uniqueness of samples was defined by country code and a sample code that was unique within countries. Defining unique samples by these two fields facilitated the addition and numbering of new samples from countries that were already represented in the database. At the same time, this multiple-field definition of relationships between tables made it more difficult for other project partners to extract data from the database.

Macrophyte database (CEH)

The “Determinands” table was somewhat simpler than for the NIVA databases, as all physical/chemistry data were standardised before importation to the database. A table of “Sources” was kept with a source code and description. These sources related to the data provider and allowed traceability of individual records to their provider. The source information from this table was used in multiple tables, wherever it was possible to attribute a record to a single data provider. Macrophyte abundance data were stored using various provided abundance measures. These included the ECOFRAME scale (a categorical scale from 1 to 3), the DAFOR scale (another categorical scale from 1 to 5), Relative Point Frequency (a continuous scale between 0 and 1) and the Finnish Vegetation Index (a semi-continuous scale with values from 2 to 8,192). These data were stored in their original form, but later converted to common measures (further described in Penning et al. 2008b).

In contrast to the NIVA databases, uniqueness of records in all tables was often determined by an arbitrarily constructed field, e.g. for lake station codes, from the country code concatenated with the data provider’s own lake code. This structure was easier to use by project partners, but required more work in its construction. It also made data quality checking more difficult, as the data had been altered from their original form.

REBECCA database processes

The main steps from receiving data until the extraction of tables for use in analyses are summarised below. These steps were, in principle, similar for all of the databases presented here (see Fig. 2).

Fig. 2
figure 2

The data pathway in the REBECCA: from raw data via databases to reformatted tables for statistical analyses. Each step is described in detail in the section “REBECCA database processes”

Data cleaning

Checking and correction of data were usually required before the raw data could be used. A common problem was erroneous units, such as mg l−1 instead of μg l−1. (Note that if “μ” is typed as “m” with symbol font in one software, it may be changed into “m” in a different software). Moreover, both the comma “,” and the period “.” are used as decimal symbols in Europe; data with a decimal symbol that does not match the computer’s settings may be interpreted as text. Missing values were coded in many different ways in the raw data. Plotting of coordinates on maps revealed that longitude and latitude were sometimes mixed (e.g. when lakes appeared to be positioned in the Mediterranean Sea). There were variations in spellings of physical/chemical determinands and in the names of biological taxa. Numerous other irregularities were encountered. This process of data checking often revealed inconsistencies and errors the data providers themselves were not aware of. Despite our initial screening and correcting of irregularities before importing the data, new errors were often discovered by the data analysts, or by the data providers themselves when preliminary results were presented.

Data reorganisation

A template for data collation was initially developed and distributed to the partners. However, many partners experienced the reorganisation of their data into the specified format as a very time-consuming job. We therefore decided to accept data in any format also from project partners. The raw data were usually organised in so-called cross-tabular or pivot format, i.e. with samples arranged in rows and each physical, chemical and biological determinands arranged in separate columns. In a database, on the other hand, all determinand values for a particular type of data (chemical/physical or biological) are listed in the same column. This avoids empty cells, so the space is used more efficiently, and facilitates extraction of data into various table formats. Additional information such as flags to denote reliability of the measurement (e.g. “<”, meaning “below detection limit”) is also stored more appropriately in a separate field. In many cases text data, such as “less than 3”, was stored with numeric data. We used a Microsoft Excel™ macro for the reorganisation of data into database format (B. Bjerkeng, unpubl.), combined with extensive manual checking.

Import to Access database

Although the databases managed by NIVA and CEH differed in some aspect (cf. Fig. 1), the key aspects are common to both. Data were generally separated into location, chemistry and biology. The location data comprised the name of the lake (and sometimes sampling stations within the lake), reference status (reference lake or not), typology factors such as size, depth and altitude and geographical coordinates of the lake. The sample data (in the NIVA databases) contained sampling location and date, as well as information about sampling method, where available. The chemistry data, as well as containing chemical data such as concentrations of nutrients and pH, also included some physical determinands, such as Secchi depth (transparency), turbidity and temperature. “Chemistry” data also included chlorophyll concentrations (cf. explanation in the previous section). The biological data generally consisted of a list of taxa recorded at a site with some measure of abundance (count of cells/individuals and/or estimated biomass per unit volume, length of filamentous algae or some type of abundance class). Each database also included a species list, linked to the biological abundance data table. Data were added and adjusted to the main databases by a series of so-called append queries and update queries.

Typification (assignment of lake types)

Lakes were assigned to lakes types according to the Intercalibration Lake Typology, as used in the WFD Common Implementation Strategy process of intercalibration of assessment systems (see Table 2). Since analysis of lake-type-specific relationships was a highly prioritised issue in REBECCA, we aimed at typifying as many stations as possible. We therefore used not only the information on lake types given by the data providers, but also all available station and chemistry information. Nevertheless, a large proportion of the lakes could still not be typified due to lack of data. Eventually a request was sent to all data providers for expert judgment on the levels of typology factors, according to the categories agreed in the intercalibration process (Table 2).

The designation of each station to one or more lake types was thus a very elaborate process, because it was necessary to combine up to 20 fields of information from each data provider. For each of the main typology factors that were stored in the station table (altitude, mean depth, surface area, alkalinity (or calcium) and colour (or TOC, for reference lakes), we used primarily numeric values (if available), and secondarily the information on typology categories, as provided by expert judgement. In addition to this information from the station table, we used the available chemistry values for alkalinity (or calcium, if alkalinity values were not available) and colour (or, for reference lakes, TOC if colour was not available). Finally, in cases where there were not sufficient data for typification, we used information on IC types as given directly by the data providers (where this was available). Most countries belong to only one of the five current GIG regions (see Table 1), but four countries (Ireland, Italy, Romania, UK) belong to more than one region. For these countries, alternative IC types were also designated where possible.

Reference status was set solely by the data provider, according to either pressure criteria, impact criteria, expert judgement or a combination of these.

Taxonomy: standardisation and harmonisation

For taxonomy, we distinguish between the standardisation of taxonomic names, and the harmonisation of taxonomic levels. The standardisation of names was a crucial and time-consuming task. The phytoplankton taxonomy was elaborated in collaboration with data providers (P. Brettum, NIVA and L. Lepistö, SYKE, Finland). The number of taxa was thus reduced from 5,500 unique names (including various spellings) to 1,900 unique taxa codes. The macroinvertebrate taxonomy is based on the EU projects AQEM and STAR (Schmidt-Kloiber et al. 2006). The macrophyte taxonomy was based on a species list provided by for the Central GIG (J. Hanganu, DDNI, Hungary) and extended by species recorded from the Northern and Atlantic GIGs. For the phytoplankton and macroinvertebrate elements, all observations were stored with their original names, and linked to the standardised taxonomy tables by a unique species code, allowing us to look up the original names if needed. Macrophyte names were standardised before importation into the database, and stored as species codes, which were linked to a master species table. Details of the standardisation were kept separately for each dataset imported.

Harmonisation of taxonomic levels was necessary because different datasets could have different degree of taxonomic resolution (e.g. some identified to species level, others to genus or higher levels). This may result in spurious country-wise differences in the number of taxa observed. Moreover, a mix of taxonomic levels within samples may artificially increase the number of recorded taxa. For example, individuals of the species Baetis rhodani can be recorded as Baetis rhodani, Baetis sp., family Baetidae, and order Ephemperoptera—apparently four different taxa. Moreover, different taxonomic levels may lead to apparent absence of certain taxa in certain regions. For example, Cryptomonas (a cryptophyte alga) was not split into species in Sweden and Finland, but was analysed to species level in Norway. Thus, a comparison of number of cryptophyte taxa across these countries is not feasible on species level. The biological data were therefore coded at all possible taxonomical levels (species, genus, family and order), allowing the data users to perform analyses at the appropriate taxonomic levels.

Extraction of tables

In order to be used for statistical analyses, the data had to be rearranged into single files, usually as cross tables. Practically any kind of table can be extracted by a combination of so-called select queries and cross-table queries. The queries can be designed in a graphical interface, which are interchangeable with both datasheet format and SQL format (programming language). The main steps were as follows.

  1. Selection of data (e.g. reference lakes; Northern GIG; summer samples).

  2. Standardisation of units and taxonomy (link value tables to standardisation tables and multiply values by standardisation factors).

  3. Calculation of metrics per sample (e.g. proportion of cyanobacterial biomass or counts per sample).

  4. Aggregation of chemistry samples and biology samples at the same time unit (e.g. per season or month), so that chemical and biological data can be linked.

  5. Calculation of summary statistics for data (e.g. mean total phosphorus concentration, mean proportion of cyanobacteria biomass).

  6. Reorganisation into cross tables (if more than one chemistry determinand per station is wanted).

  7. Linking (aggregated) biology samples with corresponding chemistry samples, via the station table.

The database thus allowed reorganisation and aggregation of data in ways that would be virtually impossible with so-called flat files (such as Excel).

Costs and benefits of the REBECCA databases

The main cost of our common lakes database approach was the vast amounts of time required for data inspection, correction and standardisation. Requests to the different data providers for explanations and for missing information were an unavoidable and time-consuming task. A better planning in advance of the data compilation procedures would probably have saved much time. In particular, the database structures and functions should have been discussed with a professional database manager before the data templates were developed. Nevertheless, some modification of the database structure and instructions to data providers were unavoidable, since the data sources and the heterogeneity of data increased during the project.

Since we eventually accepted data in any format, it was necessary to standardise the station, chemistry and biology data. The standardisation of names and conversion to common units for physical and chemical determinands was relatively trivial. However, taxonomic standardisation and harmonisation of biological data were considerably more demanding.

For constructing the databases we chose the software Microsoft Access, which is commonly available to researchers and is relatively easy to use also for beginners. However, as the complexity of the databases grew (in order to accommodate the various formats of the raw data), it became increasingly difficult for the project partners to extract their own tables. Table extractions were also done by the database manager upon request from the data analyst, but this procedure required precise communication and could be inefficient. Hence, the more complicated table extraction was a significant additional cost of the increased data intake.

The benefits of our common-database approach should be reflected in several other REBECCA Lakes publications in this issue of Aquatic Ecology (e.g. Carvalho et al. 2008; G.-Tóth et al. 2008; O’Toole et al. 2008; Penning et al. 2008a, b; Phillips et al. 2008; Ptacnik et al. 2008b; Schartau et al. 2008). We have been able to analyse reference conditions and pressure–response relationships across a range of countries, which has made the results more representative within each GIG region. For some biological elements we have also been able to analyse lake type-specific relationships, particularly within the Northern GIG (chlorophyll—Phillips et al. 2008; phytoplankton composition—Ptacnik et al. 2008a, b; Schartau et al. 2008). For other elements, one or more typology factors have been used in the analyses, either as covariates in the model (macroinvertebrates—Schartau et al. 2008; macrophytes—Penning et al. 2008a, b), or to split the dataset into groups of lake types (chlorophyll—Carvalho et al. 2008). These type-specific results were essential information for the GIGs as a basis for boundary setting between the different ecological status classes. The assessment of type-specific reference conditions has also been made possible by these databases particularly for chlorophyll (European Commission 2003). Defining reference conditions is a critical first step for setting Ecological Quality Ratio values, and is thus also important for assessment of ecological status (Ptacnik et al. 2008b).

Another benefit of analysing combined datasets was that the data covered a larger range of the pressure gradient, and a more complete picture of the pressure–response relationship could be described. In fact, an apparent lack of significant relationships within a national dataset might be due to analysis on a too narrow pressure range. On the other hand, if different ranges of the pressure gradient are dominated by data from different countries, it may be difficult to separate the real effect of the pressure from spurious effects of, e.g., differences in national sampling methodology. For the fish analyses, however, the large-region analyses were usually not in conflict with the single-country analyses (T.O. Haugen pers. comm.).

As a side effect, analyses of the multi-national data also resulted in interesting discoveries that were not directly related to REBECCA. For example, analyses of phytoplankton data within the Northern GIG revealed hitherto unknown geographical trends in phytoplankton species richness (see Ptacnik et al. 2008a, b). There is a large potential for more interesting results from further analyses of these data within other projects, for example related to large-scale patterns in biodiversity.

The large number of observations should generally increase the precision of the estimates, and thus make the results more statistically reliable. For the fish data, for example, variance explained by the combined datasets decreased to about 50% of the variance explained at country level (T.O. Haugen pers. comm.). However, because of the vastness of data there is also greater heterogeneity, which can be difficult to disentangle or to reduce (discussed below). These uncertainties made it more difficult to interpret the large-scale results. For example, apparent geographical trends for macroinvertebrates metrics may have been blurred by country-specific factors (Schartau et al. 2008). More robust results may be obtained if the biological responses are analysed as presence/absence or proportions instead of absolute abundances (Schmidt-Kloiber and Nijboer 2004; Verdonschot 2006a; Moe et al. 2007). In some cases, local knowledge about lakes and their biota might be required for the interpretation of the results (E. Penning pers. comm.).

When taxonomic resolution varied among the datasets, we had to exclude the datasets with too low resolution, or to aggregate all data to the lowest common level (e.g. from species to family or order). Taxonomic aggregation apparently did not influence results for macrophytes (E. Penning pers. comm.), while the impact was more variable for phytoplankton (R. Ptacnik unpubl.). Other studies have demonstrated that certain metrics do not perform properly if family-level data are used instead of species level (Schartau et al. 2008).

The flexible structure of the databases implied that it was relatively easy to aggregate data in different ways, e.g. taxonomically, temporally or geographically. Thus, this structure enabled testing of metrics calculated for different taxonomic levels, or to aggregate chemistry and biology samples for different time periods (see also Borja et al. 2007).

The process of data compilation may have positive side effects beyond those originally intended. The construction of a database develops a feedback process between data providers, researchers and end-users (Beier et al. 2007). The process can contribute to organising large amounts of data that are otherwise not easily accessible. For example, the extensive Norwegian dataset on macroinvertebrates in lakes consisted of >200 Excel sheets of somewhat varying formats. The process of standardisation provides a mechanism for quality control, thus making each national dataset more valuable (Beier et al. 2007). In REBECCA, the analysis and plotting of national data in a larger context made it easier to identify errors and outliers in individual datasets.

There is a large potential for further use of the REBECCA databases in research projects. Most of the data providers have given consent to further use of the data after the end of the projects, although with some restrictions (e.g. requiring co-authorship). Data on phytoplankton, macrophytes and macroinvertebrates have already been used in the Intercalibration process within the Northern GIG, and will also be used in the next intercalibration exercise by this GIG. Although there exists a European intercalibration register, it has been recognised by both the European Commission and member states that additional data from non-intercalibration sites may be required to progress the intercalibration exercise (Refsgaard et al. 2007). Experiences from the REBECCA databases will also be used by the European Environment Agency, who intend to start compiling biological data from all EU member states for State of Environment information in WISE (Water Information System for Europe; http://water.europa.eu) (A. Künitzer pers. comm.). The databases provide an opportunity for analysing combined pressure–response relationships for two or more biological quality elements, provided that the station lists of the different biological databases are harmonised. More generally, the data should be valuable for research on large-scale patterns in biogeography and biodiversity. However, one should keep in mind that each dataset is originally collected for a specific purpose (e.g. presence of acid-sensitive macroinvertebrate taxa), and that it may not contain information that would be required for a different purpose (e.g. temporal trends in abundances).

Challenges with compilation of ecological data

There is a need for further data collection to fulfil the WFD requirements, since much of the characterisation and classification has been carried out based on expert knowledge (Refsgaard et al. 2007). However, since collection of data from the field is very resource-demanding, new environmental research projects will often be based on existing data. Environmental data are becoming more accessible, for example through EU initiatives such as WISE and INSPIRE, following the Aarhus Convention on access to information in environmental matters. Nevertheless, there are still large problems with regard to data access in practice. The main problem is not necessarily the data availability, but accessibility, quality, and relevant information about the data (including uncertainty). The constraints can be of different types: economic, political, data formats, fragmented databases or transboundary barriers (harmonising or exchanging data across national barriers) (Refsgaard et al. 2007). As reported by the HarmoniRiB project (Refsgaard et al. 2007): “In projects where existing data are used the data collection is often cumbersome and requires a lot of resources, because the data access is difficult with many practical and economic constraints.” “Often data collected in one research project is not used in many other projects due to lack of proper data documentation and dissemination after the termination of old projects. The same data are therefore often collected several times by different research projects. This is obviously non-optimal and requires a lot of research resources both in terms of costs and manpower that could have been utilised much better.” Moreover, scientists who produce data are often unwilling to share them, due to strong traditions, competition for funding or other circumstances (Beier et al. 2007). Other practical problems have been reported (Lorenz et al. 2004; Vlek et al. 2006): data are not always available digitally; different institutes have been collecting the same data without cooperation; closely related data are stored in different databases at different institutes or even in private companies.

Biological data, in particular, are often collected by different experts, and data for different taxonomic groups are stored separately by the researchers. For biological data, the practical problems regarding formats and standards are also likely to be even greater than for other environmental data, for several reasons. (1) Biota are usually heterogeneously distributed in the water body, both in space and time. (2) Sampling methods are more difficult to standardise. (3) Taxonomic systems are changing continuously, resulting in numerous synonyms. This causes problems when combining datasets from different researchers and/or countries. Important properties of the samples, such as number of species recorded, can be affected by sample size (Clarke and Hering 2006). (4) Methods for quantification of abundance are more variable and more imprecise (e.g. estimates of biomass, density of individuals or coverage of surface or semi-quantitative scales). (5) Additional sources of variation arise from sample processing and taxonomic identification error, and from effects of environmental stress on the biota (Refsgaard et al. 2005, 2007). This implies that detailed sampling information may be even more critical for biological data than for other environmental data.

All data have some degree of associated uncertainty, let alone biological data. Refsgaard et al. (Clarke and Hering 2006; Haase et al. 2006) recommend that information on data quality and uncertainty is stored as a part of the data documentation. This implies a need for modification in database structure, as standard databases today are not designed to enable storage of data uncertainty. However, the data quality may not have equal importance for all purposes. Exploratory analyses of large-scale patterns (e.g. comparing a pressure–response relationship for different lake types) may be more robust to data uncertainty than predictive modelling, which has higher demands for accuracy (e.g. trying to predict the amount of cyanobacteria for a given phosphorus concentration level). Thus, the results of the REBECCA analyses, which are mostly exploratory, may not be critically dependent on information on data uncertainty.

Recommendations

Based on our experiences within the REBECCA project, we would like to give some recommendations for data compilation within other research projects. Our recommendations are not meant to be general guidelines for database management; they apply to the particular challenges of compiling multi-national ecological data.

Planning and resource allocation

Sufficient time should be allowed within the project for the necessary data processing. A trained database manager should be allocated to data-handling tasks, especially with respect to the complexity and quantity of data that will be involved in the project. A combination of data-processing skills and ecological knowledge is required for this task. The database manager should be informed about the needs and planned uses of the database, and be involved in designing templates for data request, database structures, and data transfer tools in close collaboration with the project leaders. If there are not enough resources for having a data manager to provide extracts for the data users, then key users should be trained in basic database skills for extracting their own tables.

Organisation of files by data providers

Data providers should be given precise instructions on the required data format, or at least be informed on the most important aspects of data organisation. Data should be stored in raw format (not aggregated). Unique standardised codes should be used for all localities and sampling stations, and be used to identify both chemistry and biology samples. All dates should be recorded as day, month and year separately, because date formats in different softwares are not always compatible. There must be a unique identifier for missing values, to avoid artefacts by numeric codes like “-999”. For sampling information, as many details as possible should be requested. The data templates should also make room for additional, potentially relevant information at all levels. For biological data, the best solution may be that all data providers add a common taxonomic code to the observations (e.g. the AQEM/STAR code for macroinvertebrates, and the so-called REBECCA code for phytoplankton). In addition, a complete taxonomic list (with spell-checked names) should be provided. Data providers should be requested to check for suspicious values before submitting the data, for example by box-and-whisker plots or at least by checking minimum/maximum values. A high rate of errors in environmental data has been discovered in other projects (Beier et al. 2007) as well as in REBECCA.

Data submission and sharing

All REBECCA data were submitted by e-mail or by direct transfer from computer to computer. The REBECCA toolbox (http://www.rbm-toolbox.net/rebecca) was also used for returning data extracts to providers after the database compilation, but only as a tool for document sharing. We recommend the use of more efficient tools to facilitate data management and sharing. For example, Beier et al. (Refsgaard et al. 2005) report on the use of an input database and manual, automated quality control tools and a series of input and export queries developed using Data Transformation Services (DTS). Guidelines or templates should be developed for reporting of suspicious values and for providing corrected values or additional information. All updates and corrections in to the database should be logged.

Database construction

The database software Microsoft Access is relatively easy to use and mostly worked well for our purposes, except that the table size is limited to 256 columns, which can cause problems for extracting tables with either species or samples in columns. Although a database structure should be planned from the beginning of the project, it should also be possible to change the structure during the project if this turns out to be favourable. For example, when it turned out that our phytoplankton database had a large number of chemical samples that did not match with the dates of the biological data, we decided to split the common sample table into separate sample tables for chemistry and biology, which enabled a higher match of chemical and biological samples (after temporal aggregation). Information on data sources and data providers should be stored in the database. The database should include complete taxonomy for each biological observation, so that aggregation at any taxonomic level is possible. If possible, information on data quality and uncertainty should also be stored.

Data analysis and interpretation

Interpretation of results requires some special considerations when the data are compiled from many different countries. Some of the differences can be standardised as described above, while other differences are inherent to the data. A typical inherent problem is differences in geological, geographic and climatic conditions. Dividing the data into IC lake types may account for some of this variation, but including typology factors as continuous covariables might increase precision further. Another inherent problem is differences in methodology. For example, different mesh size for macroinvertebrate samples may result in different number of taxa as well as individuals. For macrophytes, different semi-quantitative abundance measures were used. A short-term solution might be to analyse responses within countries and compare results qualitatively. For example, one can check whether different abundance measures show breakpoints or abrupt changes in the same interval along the pressure gradient. Sampling information can also to some degree be used to standardise data. For example, coastal benthic invertebrate samples were standardised for sample area, sieve mesh size and sediment type in the Intercalibration (Borja et al. 2007). A longer-term solution would obviously be standardisation of sampling and analysis methods, as is being initiated by CEN (Comité Européen de Normalisation).