Introduction

In the early 1990s a countrywide monitoring network for groundwater and surface water quality was established on a legally regulated basis in Austria. Since then, about 2,000 groundwater and 240 river water sites have been tested regularly, and the water samples have been analyzed for numerous inorganic and organic parameters [1, 2]. Such a permanent and comprehensive measurement program requires reliable and accurate analytical results. A sophisticated quality assurance program therefore had to be implemented [3], and compulsory participation in interlaboratory comparisons was an integral part of this quality assurance program. Because suitable interlaboratory comparisons were organized only infrequently, the Federal Ministry for Forestry and Agriculture, Environment and Water Management decided to initiate a regional PT scheme. In 1995, the IFA-Tulln Centre for Analytical Chemistry started to develop a PT scheme, the “IFA-Test Systems PT Scheme”, for external quality assurance in water analysis. The scheme started with regular interlaboratory comparisons for inorganic parameters in water (major ions, heavy metals, and trace elements) and was successively extended to a multiplicity of inorganic and organic parameters (trace ions, herbicides, volatile halogenated hydrocarbons, organochlorine pesticides, polycyclic aromatic hydrocarbons (PAHs) and, recently, methyl tert-butyl ether (MTBE)). Depending on the parameter group, the number of series amounted to one to six per year. The operation of the scheme has closely followed the framework given in ISO/IEC Guide 43 [4]. The series reports provide data that allow evaluation of laboratories’ performance, but do not themselves contain a performance evaluation. Calculation of performance statistics using Z-scores and evaluation of performance will be implemented in the near future; this will make the PT scheme fully consistent with [4]. The series reports are published on the web site of the PT scheme.
The preparation of the samples, the uncertainties of the target values, and the behavior of the samples in the interlaboratory comparisons are described elsewhere [5, 6, 7, 8] or will be published in the near future. Participation in the PT scheme has always been open to all interested laboratories. Thus, after one year of operation, the number of voluntary participants already exceeded the number of compulsory participants; in 2002, the ratio of voluntary to compulsory participants was 4:1. Over 100 different Austrian and foreign laboratories have participated in the IFA-Test PT Scheme.

Course of PT rounds

Within the IFA-Test PT Scheme, every round is a complete small (5 to 50 participants) interlaboratory comparison study. Normally, two different synthetic water samples with well-defined target concentrations are sent out in a round. The samples are identified by a letter code for the parameter group and the series number, e.g., H36A and H36B for the two samples from the 36th PT round on herbicides. In total, 190 PT rounds had been conducted by June 2003: 56, 40, 19, 9, and 8 rounds on inorganic parameters, herbicides, volatile halogenated hydrocarbons, organochlorine pesticides, and PAHs, respectively. At the end of each round the laboratories’ measurement results are transmitted and collected in a database. The measurement results for a given parameter in one specified sample, together with their basic statistical evaluation, are called a data set in this paper. An example of a typical data set is given in Fig. 1. Not all substances of a parameter group are present in every sample, so the number of data sets available for each parameter may vary widely. Each PT round is completed by the release of a series report. The reports are designed to meet the requirements of the laboratories. They graphically illustrate all recoveries of each participating laboratory, which allows evaluation of the analytical performance of the laboratory. The reports also contain diagrams for each analytical parameter with a brief statistical evaluation: laboratory mean, standard deviation between the laboratories, and confidence intervals (Fig. 1). This allows identification of possible analytical problems for some parameters and yields a rough estimate of the accuracy that can be expected from an analytical laboratory. Thus, the series reports facilitate the comparison of the analytical performance of a selected laboratory with that of other laboratories.
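The basic statistical evaluation reported for a data set can be sketched as follows. This is an illustrative implementation only: the example results are hypothetical, and the confidence interval uses a simple normal approximation (the actual reports may use a t-based interval).

```python
import statistics

def basic_evaluation(results, confidence_factor=1.96):
    """Laboratory mean, between-laboratory standard deviation, and an
    approximate 95 % confidence interval for the mean (normal assumption)."""
    mean = statistics.mean(results)
    sd = statistics.stdev(results)  # standard deviation between laboratories
    half_width = confidence_factor * sd / len(results) ** 0.5
    return mean, sd, (mean - half_width, mean + half_width)

# Hypothetical nitrate results (mg/L) from one data set
results = [30.8, 31.2, 31.0, 30.9, 31.3, 31.1, 30.7, 31.4]
mean, sd, ci = basic_evaluation(results)
```

These three quantities correspond to the figures shown for each parameter in the series reports (cf. Fig. 1).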

Fig. 1
figure 1

Example data set: results for nitrate in water sample N52B with basic statistical evaluation. The target value was 31.1 mg/L nitrate

Evaluation of long-term performance

For both the end-users of the analytical results and the laboratories, the answers to the following questions may be of interest: How does the analytical performance of a selected (contract) laboratory develop over a longer period of time? Do significant deviations of results from the target values occur often or rarely? What deviations between a measurement result and the assigned reference concentration can be expected for each parameter, and which parameters or analytical techniques cause the most problems? The present project was initiated to answer these questions on the basis of the collected results of all previous PT rounds. It was also designed as a helpful tool for monitoring the performance of the contracted laboratories.

Aim of this study

The combination of scores, such as the sum of (squared) scores and running scores, is a common technique for the assessment of long-term trends [9]. This paper describes the determination of performance criteria (Z-score criteria σ) derived from the results of multiple PT rounds. The results of 190 PT rounds were evaluated, and coefficients of variation (CV) were calculated from every data set as estimates of the between-laboratory precision (“estimates of dispersion”). The CV were plotted for every parameter against the target concentrations and against the operation time of the PT scheme in order to check the influence of concentration and time. The CV of every analytical parameter were then combined into a single Z-score criterion σ representing the average laboratory performance over a given concentration range. These performance criteria are much more reliable than those calculated from single rounds and are highly suitable for the assessment of long-term trends via Z-scores and combination scores.

Realization

First, the laboratories’ results from all interlaboratory comparisons were transformed into a uniform format and collected in a single database. At this stage of the project, 130 different laboratories had participated in the scheme. The total number of analytical parameters was 67, and the database contained nearly 4,000 data sets comprising more than 60,000 analytical results from about 300 interlaboratory comparison samples. All data sets were checked by the Hampel outlier test, which is often referred to as the “Huber test”. This outlier test is based on robust statistics; Davies suggested using it for the statistical evaluation of interlaboratory tests in 1988 [10]. The Hampel test has been used in the IFA-Test PT Scheme since its first year of operation. Laboratory mean values, mean recoveries, standard deviations, and CV were calculated from both the complete data sets and the outlier-free data sets. The agreement of laboratory mean values and target concentrations is continuously monitored in the PT scheme. At this stage it could be shown that, even when parameters were evaluated over consecutive rounds, no statistically significant deviations between target concentrations and outlier-corrected laboratory mean values were found. This is a basic requirement for a reliable estimation of dispersion. Furthermore, robust estimates of central tendency µ’ (“robust average”) and dispersion σ’ (“robust standard deviation”) were calculated from all data sets according to “algorithm A” of the international standard ISO 5725-5 [11]. Both the Hampel test and algorithm A start with the median as “robust average” µ’0 and a “robust standard deviation” σ’0 that is calculated from the median absolute deviation (MAD) multiplied by a factor (1.483 for n → ∞). The Hampel test simply rejects all results that deviate by more than 3 σ’0 from µ’0.
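The Hampel test as described above can be sketched in a few lines; the example data are hypothetical:

```python
import statistics

def hampel_outliers(results):
    """Flag results deviating by more than 3 robust standard deviations
    from the median, with the scaled MAD as the scale estimate."""
    med = statistics.median(results)
    mad = statistics.median([abs(x - med) for x in results])
    s0 = 1.483 * mad  # robust standard deviation sigma'0
    return [x for x in results if abs(x - med) > 3 * s0]

# Hypothetical nitrate data set (mg/L) with one gross outlier
data = [30.8, 31.2, 31.0, 30.9, 31.3, 31.1, 45.0, 30.7, 31.4]
outliers = hampel_outliers(data)  # the value 45.0 is flagged
```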
Algorithm A progressively transforms the original data: results that deviate by more than 1.5 σ’ from µ’ are replaced by the corresponding limit values (µ’ ± 1.5 σ’), and µ’ and σ’ are then recalculated. The estimates are updated repeatedly until the process converges. This process is called winsorization [12]. The data presented in this paper were obtained by algorithm A. CV were calculated from the estimates µ’ and σ’ of all data sets with three or more measurement results. These CV were plotted against the target concentrations for each of the analytical parameters; two examples are shown in Figs. 2 and 3. In order to check whether there were visible changes during the operation time of the scheme, the CV were plotted against time as well. An example is shown in Fig. 4.
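A minimal sketch of algorithm A, following the iteration described above (the factor 1.134 is the correction applied to the standard deviation of the winsorized data in ISO 5725-5; the convergence tolerance and example data here are illustrative):

```python
import statistics

def algorithm_a(results, tol=1e-6, max_iter=100):
    """Robust average mu' and robust standard deviation sigma' in the
    spirit of ISO 5725-5 algorithm A: start from the median and the
    scaled MAD, then iteratively winsorize values beyond +/- 1.5 s*."""
    x_star = statistics.median(results)
    mad = statistics.median([abs(x - x_star) for x in results])
    s_star = 1.483 * mad
    for _ in range(max_iter):
        delta = 1.5 * s_star
        # replace deviating results by the limit values x* -/+ delta
        trimmed = [min(max(x, x_star - delta), x_star + delta) for x in results]
        new_x = statistics.mean(trimmed)
        new_s = 1.134 * statistics.stdev(trimmed, xbar=new_x)
        if abs(new_x - x_star) < tol and abs(new_s - s_star) < tol:
            break
        x_star, s_star = new_x, new_s
    return x_star, s_star

# Hypothetical nitrate data set (mg/L) with one gross outlier
data = [30.8, 31.2, 31.0, 30.9, 31.3, 31.1, 45.0, 30.7, 31.4]
mu, sigma = algorithm_a(data)
cv = 100 * sigma / mu  # between-laboratory CV in percent
```

Unlike the Hampel test, the outlier is not rejected but pulled to the limit value, so it still contributes a bounded amount to the dispersion estimate.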

Fig. 2
figure 2

Determination of metolachlor in 56 water samples sent out in 40 proficiency testing (PT) rounds. The diagram shows the coefficients of variation (CV) between laboratories calculated by algorithm A versus target concentrations. The line at 13% indicates the median of the CV

Fig. 3
figure 3

Determination of nitrate in 103 water samples sent out in 50 PT rounds. The diagram shows the CV between laboratories calculated by algorithm A versus target concentrations. The line at 3.4% indicates the median of the CV. The concentration range of application of the Z-score criteria was limited to >3 mg/L

Fig. 4
figure 4

The data of Fig. 3 versus operation time of the PT scheme. The diagram shows the CV between laboratories calculated by algorithm A versus months from July 1995. Each “+” represents a dispersion estimate for one water sample. The line at 3.4% indicates the median of the CV

In order to combine the CV for each parameter, the medians of the CV were taken as Z-score criteria. Thus, one Z-score criterion was defined for every parameter. In some cases the concentration range had to be limited at the lower end to make the criteria applicable, e.g., at 3 mg/L for nitrate (Fig. 3). The criteria and their concentration limits were added to the database. In parallel, a computer program (a viewer) was developed which allows convenient access to the database of laboratory results. The viewer creates Z-score diagrams and tables after selection of a laboratory, a parameter group, one or more parameters, and a time interval. As the viewer was designed to facilitate monitoring of the analytical performance of contracted laboratories, sub-databases containing exclusively the results of these laboratories were created for this purpose. An example of the viewer’s capabilities is given in Fig. 5. A corresponding table contains target concentrations, results of the laboratory, recoveries, and the Z-scores. The viewer program also supports the calculation of the sum of the Z-scores (SZ) [9].
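Since the Z-score criterion σ is a relative (CV-based) quantity, a plausible sketch of the scoring is z = (x − X)/(σ·X) for a result x against target X, with SZ as the plain sum over rounds. The numbers below are illustrative, not taken from an actual round:

```python
def z_score(result, target, cv_criterion):
    """Z-score with a CV-based criterion, i.e. sigma expressed as a
    fraction of the target concentration (e.g. 0.034 for 3.4 %)."""
    return (result - target) / (cv_criterion * target)

def sum_of_z_scores(pairs, cv_criterion):
    """Combine several rounds into the sum of Z-scores (SZ)."""
    return sum(z_score(r, t, cv_criterion) for r, t in pairs)

# Hypothetical nitrate results over three rounds: (result, target) in mg/L,
# evaluated with the 3.4 % criterion
rounds = [(31.5, 31.1), (28.9, 29.5), (40.2, 39.8)]
sz = sum_of_z_scores(rounds, 0.034)
```

Because positive and negative deviations cancel in SZ, a value near zero indicates the absence of a systematic bias rather than high precision.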

Fig. 5
figure 5

Z-score diagram of a selected laboratory which regularly participated in the PT scheme. The diagram combines the recoveries of metolachlor in samples from 40 PT rounds between September 1996 and May 2003. ● indicates that no Z-score could be calculated. Missing bars indicate that no results were submitted. “FN” refers to false negative results

Results and discussion

Estimates of dispersion

Our comparison of the Hampel test and algorithm A showed that both procedures yielded very similar estimates µ’ and σ’ when applied to the data sets of the interlaboratory comparisons. Thus, the Hampel test and algorithm A can be considered equivalent for this task. Algorithm A was chosen for calculating CV because of its “more official character”, as this procedure is recommended by ISO standards [11, 13].

Precision of interlaboratory comparisons

The procedure described above yielded a set of CV for every parameter (examples are shown in Figs. 2 and 3). From these sets, CV can again be calculated. These “CV of the CV” obtained from the different samples of each parameter were typically in the range of 40% to 50%. This is well above the theoretical value that can be expected for random samples from one population, which can be calculated by:

$$ \sqrt{\frac{1}{2n}} $$

where n is the number of results per sample [14]. For 20 results per sample, for example, this would be 16%. It was also surprising that 40%–50% was found in all parameter groups. As the average number of results ranged from 7 (herbicides) to 20 (major ions), it can be concluded that the observed “CV of the CV” did not depend on the actual number of results per sample. There may be several reasons for this: The field of participants in the IFA-Test PT Scheme is not homogeneous. The participating laboratories vary widely between PT rounds, as the frequency of the series (three or six times a year) is higher than the participation frequency of the laboratories. The observed CV are most probably not normally distributed. It may also be possible that the sample-to-sample deviation of the CV is partially caused by the estimation procedure for µ’ and σ’, algorithm A. This will be the subject of further investigations. In any case, it should be kept in mind that the accuracy of CV determined from the results of single interlaboratory comparisons may be low, even if the number of results is above 30, the number generally recommended for the determination of CV [15]. Thus, for the definition of Z-score criteria it might be most useful to combine the results of multiple (periodic) interlaboratory comparisons, as presented in this paper.
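The theoretical value √(1/2n) can be illustrated by a quick simulation, assuming (unlike the real PT data) normally distributed results from a single population; all figures below are simulated, not PT results:

```python
import random
import statistics

def cv(values):
    """Coefficient of variation of a sample."""
    return statistics.stdev(values) / statistics.mean(values)

random.seed(1)
n, n_samples = 20, 2000
# Simulated "data sets": n results around a target of 31.1 mg/L
# with a true between-laboratory CV of 5 %
cvs = [
    cv([random.gauss(31.1, 0.05 * 31.1) for _ in range(n)])
    for _ in range(n_samples)
]
cv_of_cv = statistics.stdev(cvs) / statistics.mean(cvs)
theory = (1 / (2 * n)) ** 0.5  # about 0.158 (16 %) for n = 20
```

For random normal samples the simulated "CV of the CV" stays close to the theoretical 16%, which underlines how far the observed 40%–50% lies above the ideal case.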

Combination of CV

In the present study, the between-laboratory dispersion estimates from many interlaboratory comparisons (6–106 data sets per parameter, 44 on average) were combined. As the distribution of the single CV was not clearly identified, the median was chosen to determine the Z-score criteria from the CV. Since there may be doubts about the usefulness of applying robust statistics consecutively to a data set, the mean CV was also calculated as the square root of the arithmetic mean of the variances. The latter method yielded slightly higher Z-score criteria. All results are summarized in Table 1.
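The two combination methods compared above can be sketched as follows; the per-sample CV values are hypothetical:

```python
import statistics

def combine_cvs(cvs):
    """Two ways of combining per-sample CVs into one Z-score criterion:
    the median (robust, used in this study) and the square root of the
    arithmetic mean of the variances (quadratic mean)."""
    median_cv = statistics.median(cvs)
    quadratic_mean_cv = (sum(c * c for c in cvs) / len(cvs)) ** 0.5
    return median_cv, quadratic_mean_cv

# Hypothetical per-sample CVs (percent) for one parameter
cvs = [2.8, 3.1, 3.4, 3.6, 4.0, 5.2, 2.9, 3.3]
med, qmean = combine_cvs(cvs)
```

Because squaring gives extra weight to the occasional large CV, the quadratic mean comes out above the median, consistent with the slightly higher criteria reported for that method.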

Table 1 Overview of the evaluated parameters. The medians were used as Z-score criteria: CV, coefficients of variation

Dependence of CV on concentration levels

In general, no significant dependence of the CV on the actual concentration of the parameter in the sample was observed. One reason may be the relatively wide spread of the CV (see above). Another reason is the narrow concentration ranges covered in the PT scheme: the target concentrations were within one order of magnitude for most parameters. For some parameters, clearly higher CV were found for low target values near the limits of detection demanded in the Austrian water monitoring program [16]. An example is given in Fig. 3. In this case, the concentration range in which the Z-score criterion can be applied was limited. The lower limits given in Table 1 refer to the applicability of the determined Z-score criteria; they take into account the legal limits of determination and the observed concentrations. In our experience, the introduction of variance functions as suggested, e.g., in [14], cannot be justified within a concentration range of one order of magnitude. Besides, single Z-score criteria are much more transparent to the participants.

Time dependence

No statistically significant trend toward lower CV over the operation time of the PT scheme could be observed. One reason may be the relatively wide spread of the CV (see above). However, the CV of “nitrate in water” (Fig. 4) were all below the median line in the last year. Most probably the laboratories are improving their analytical performance by various measures, such as modernization of instrumentation and improvement of their quality assurance systems, and this should become visible in the future development of the CV.

Conclusion and outlook

The collection of all measurement results in a single database, in combination with computer-program-controlled SQL queries, has proven to be a very powerful tool for the statistical evaluation of the collected data. The estimates of dispersion obtained by robust statistics have an unexpectedly low precision, with typical CV of 40%–50% between the data sets of a selected parameter. Thus, Z-score criteria should never be calculated from a single PT round, even if the number of participants is considerable. No clear influence of concentration or time on the dispersion estimates was observed; higher CV between laboratories were only found for concentrations close to the determination limits of the analytical methods. In our experience, a single Z-score criterion is applicable over a concentration range of one order of magnitude, and we do not recommend the use of variance functions. Performance criteria evaluated from a number of PT series are the most suitable and transparent basis for the assessment of long-term performance. The Z-score criteria of Table 1 represent the average performance of (mainly) Austrian laboratories for the specified parameters and concentration ranges between 1997 and 2003. This limits their applicability and should be considered if the data are used for calculating Z-scores. The performance criteria will be implemented in the scheme in the near future for calculating performance scores for all participants on a voluntary basis. Assuming that there is, in fact, progress in the laboratories’ average performance, the Z-score criteria will have to be adapted periodically.