Abstract
After 7 years of successful operation, the database of the proficiency testing (PT) scheme for water analysis organized by the Institute for Agrobiotechnology (IFA), Tulln contains nearly 4,000 data sets from over 300 interlaboratory comparison samples. About 70 analytical parameters (major ions, metals, trace ions, herbicides, volatile halogenated hydrocarbons, organochlorine pesticides and polycyclic aromatic hydrocarbons) were covered by the PT scheme. The data were evaluated using robust statistics in order to determine a set of coefficients of variation (CV) for each analytical parameter. Concentration and time dependence of the CV were checked. The CV were combined to obtain standard deviations for proficiency assessment (Z-score criteria). Furthermore, a viewer program was designed to facilitate monitoring of the analytical performance of participating laboratories.
Introduction
In the early 1990s a countrywide monitoring network for ground- and surface-water quality was established on a legally regulated basis in Austria. Since then about 2,000 groundwater and 240 river water sites have been tested regularly and the water samples have been analyzed for numerous inorganic and organic parameters [1, 2]. Such a permanent and comprehensive measurement program requires reliable and accurate analytical results. Thus, a sophisticated quality assurance program had to be implemented [3], and compulsory participation in interlaboratory comparisons was an integral part of this program. Because suitable interlaboratory comparisons were organized only infrequently, the Federal Ministry for Forestry and Agriculture, Environment and Water Management decided to initiate a regional PT scheme. In 1995, the IFA-Tulln Centre for Analytical Chemistry started to develop a PT scheme called the “IFA-Test Systems PT Scheme” for external quality assurance in water analysis. The scheme started with regular interlaboratory comparisons for inorganic parameters in water (major ions, heavy metals, and trace elements) and was successively extended to a multiplicity of inorganic and organic parameters (trace ions, herbicides, volatile halogenated hydrocarbons, organochlorine pesticides, polycyclic aromatic hydrocarbons (PAHs) and, recently, methyl tert-butyl ether (MTBE)). Depending on the parameter group, between one and six series per year were conducted for each group. The operation of the scheme has closely followed the framework given in ISO/IEC Guide 43 [4]. The series reports provide data that allow evaluation of laboratories’ performance, but do not contain any performance evaluation. Calculation of performance statistics using Z-scores and evaluation of performance will be implemented in the near future, making the PT scheme fully consistent with [4]. The series reports are published on the web site of the PT scheme.
Preparation of samples, uncertainties of the target values, and behavior of the samples for the interlaboratory comparisons are described elsewhere [5, 6, 7, 8] or will be published in the near future. Participation in the PT scheme has always been open to all interested laboratories. Thus, after 1 year of operation, the number of voluntarily participating laboratories exceeded the number of compulsory participants. In 2002, the ratio of voluntary to compulsory participants was 4:1. Over 100 different Austrian and foreign laboratories have participated in the IFA-Test PT Scheme.
Course of PT rounds
Within the IFA-Test PT Scheme every round is a complete small (5 to 50 participants) interlaboratory comparison study. Normally, two different synthetic water samples with well-defined target concentrations are sent out in a round. The samples are identified by a letter code for the parameter group and the series number, e.g., H36A and H36B for the two samples from the 36th PT round on herbicides. In total, 190 PT rounds were conducted by June 2003: 56, 40, 19, 9, and 8 rounds on inorganic parameters, herbicides, volatile halogenated hydrocarbons, organochlorine pesticides and PAHs, respectively. At the end of each round the laboratories’ measurement results are transmitted and collected in a database. The measurement results of a certain parameter in one specified sample, together with their basic statistical evaluation, are called a data set in this paper. An example of a typical data set is given in Fig. 1. Not all substances of a certain parameter group are present in every sample, so the number of data sets available for each parameter may vary widely. Each PT round is completed by the release of a series report. The reports are designed to meet the requirements of the laboratories. They illustrate graphically all recoveries of each participating laboratory, which allows evaluation of the analytical performance of the laboratory. The reports also contain diagrams for each analytical parameter with a brief statistical evaluation: laboratory mean, standard deviation between the laboratories, and confidence intervals (Fig. 1). This allows identification of possible analytical problems for some parameters and yields a rough estimate of the accuracy that can be expected from an analytical laboratory. Thus, the series reports facilitate the comparison of the analytical performance of a selected laboratory with the performance of other laboratories.
Evaluation of long-term performance
For both the end-users of the analytical results and the laboratories, the answers to the following questions may be of interest: How does the analytical performance of a selected (contract) laboratory behave over a longer period of time? Do significant deviations of results from the target values occur often or rarely? What deviations between a measurement result and the assigned reference concentrations can be expected for each parameter and which parameters or analytical techniques cause most problems? The present project was initiated to answer these questions on the basis of the collected results of all previous PT rounds. It was also designed to be a helpful tool for monitoring the performance of the contracted laboratories.
Aim of this study
The combination of scores, such as the sum of (squared) scores and running scores, is a common technique for the assessment of long-term trends [9]. This paper describes the determination of performance criteria (Z-score criterion σ) derived from the results of multiple PT rounds. The results of 190 PT rounds were evaluated and coefficients of variation (CV) were calculated from every data set as estimates of the between-laboratory precision (“estimates of dispersion”). The CV were visualized for every parameter over the target concentrations and the operation time of the PT scheme in order to check the influence of concentration and time. The CV of every analytical parameter were combined into a single Z-score criterion σ representing the average laboratory performance over a given concentration range. These performance criteria are much more reliable than those calculated from single rounds. They are highly suitable for assessment of long-term trends via Z-scores and combination scores.
Realization
First, the laboratories’ results of all interlaboratory comparisons were transformed into a uniform format and collected in a single database. At this stage of the project, 130 different laboratories had participated in the scheme. The total number of analytical parameters was 67, and the database contained nearly 4,000 data sets comprising more than 60,000 analytical results from about 300 interlaboratory comparison samples. All data sets were checked with the Hampel outlier test, which is often referred to as the “Huber test”. This outlier test is based on robust statistics; Davies suggested its use for the statistical evaluation of interlaboratory tests in 1988 [10], and it has been applied in the IFA-Test PT Scheme since its first year of operation. Laboratory mean values, mean recoveries, standard deviations, and CV were calculated from both the whole data sets and the outlier-free data sets. The agreement of laboratory mean values and target concentrations is continuously monitored in the PT scheme. At this stage it could be shown that, even when parameters were evaluated over consecutive rounds, no statistically significant deviations between target concentrations and outlier-corrected laboratory mean values occurred. This is a basic requirement for a reliable estimation of dispersion. Furthermore, robust estimates of central tendency µ’ (“robust average”) and dispersion σ’ (“robust standard deviation”) were calculated from all data sets according to “algorithm A” of the international standard ISO 5725-5 [11]. Both the Hampel test and algorithm A start with the median as “robust average” µ’0 and a “robust standard deviation” σ’0 that is calculated from the median absolute deviation (MAD) multiplied by a factor (1.483 for n → ∞). The Hampel test simply rejects all results that deviate by more than 3 σ’0 from µ’0.
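The Hampel test described above can be sketched in a few lines of Python (an illustrative sketch only; the function name and the sample values are ours, not part of the scheme’s software):

```python
import statistics

def hampel_filter(results, k=3.0):
    """Hampel outlier test: reject all results deviating by more
    than k robust standard deviations from the robust average.

    mu0 is the median; sigma0 is the median absolute deviation
    (MAD) scaled by 1.483, the consistency factor for a normal
    distribution as n approaches infinity."""
    mu0 = statistics.median(results)
    sigma0 = 1.483 * statistics.median([abs(x - mu0) for x in results])
    return [x for x in results if abs(x - mu0) <= k * sigma0]

# The gross outlier 15.0 is rejected; all other results survive.
print(hampel_filter([9.8, 10.0, 10.1, 10.2, 9.9, 15.0]))
```

Because the starting estimates are the median and the MAD, a single extreme result cannot inflate the rejection threshold, which is the point of using robust statistics here.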
Algorithm A progressively transforms the original data by replacing results that deviate by more than 1.5 σ’0 from µ’0 and recalculating µ’ and σ’. In this approach, the estimates are updated repeatedly until the process converges; this is called winsorization [12]. The data presented in this paper were obtained by algorithm A. CV were calculated from the estimates µ’ and σ’ of all data sets with three or more measurement results. These CV were plotted against the target concentrations for each of the analytical parameters; two examples are shown in Figs. 2 and 3. In order to check for visible changes during the operation time of the scheme, the CV were also plotted versus time. An example is shown in Fig. 4.
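The winsorization loop of algorithm A can be sketched as follows (a minimal sketch under the assumption that the ISO 5725-5 adjustment factor 1.134 is applied to the standard deviation of the winsorized data; function and parameter names are ours):

```python
import statistics

def algorithm_a(results, tol=1e-6, max_iter=100):
    """Robust mean and standard deviation following the scheme of
    ISO 5725-5 "algorithm A": start from the median and the
    MAD-based robust standard deviation, winsorize values beyond
    mu +/- 1.5 sigma, and recompute until convergence."""
    mu = statistics.median(results)
    sigma = 1.483 * statistics.median([abs(x - mu) for x in results])
    for _ in range(max_iter):
        delta = 1.5 * sigma
        # Winsorization: pull outlying values to the interval edges
        # instead of rejecting them outright.
        clipped = [min(max(x, mu - delta), mu + delta) for x in results]
        new_mu = statistics.mean(clipped)
        # 1.134 corrects the downward bias of the winsorized stdev.
        new_sigma = 1.134 * statistics.stdev(clipped)
        converged = abs(new_mu - mu) < tol and abs(new_sigma - sigma) < tol
        mu, sigma = new_mu, new_sigma
        if converged:
            break
    return mu, sigma
```

In contrast to the Hampel test, every result keeps contributing to the estimates, only with bounded influence; for well-behaved data sets the two procedures give very similar µ’ and σ’, as noted below.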
In order to combine the CV for each parameter, the median of the CV was taken as the Z-score criterion. Thus, one Z-score criterion was defined for every parameter. In some cases the concentration range had to be limited at the lower end to make the criterion applicable, e.g., at 3 mg/L for nitrate (Fig. 3). The criteria and their concentration limits were added to the database. In parallel, a computer program (a viewer) was developed which allows convenient access to the database of laboratory results. The viewer creates Z-score diagrams and tables after selection of laboratory, parameter group, one or more parameters, and a time interval. As the viewer was designed to facilitate monitoring of the analytical performance of contracted laboratories, sub-databases containing exclusively the results of these laboratories were created for this purpose. An example of the viewer’s capabilities is given in Fig. 5. A corresponding table contains target concentrations, results of the laboratory, recoveries, and Z-scores. The viewer program also supports the calculation of the sum of the Z-scores (SZ) [9].
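The score calculations performed by the viewer can be sketched as follows (a sketch, assuming the usual definition z = (x − X)/σ with σ taken as the combined CV times the target concentration X; the function names are ours):

```python
def z_score(result, target, cv):
    """Z-score of one laboratory result: its deviation from the
    assigned target concentration in units of the Z-score
    criterion sigma = cv * target (cv given as a fraction)."""
    return (result - target) / (cv * target)

def sum_of_z_scores(results, targets, cv):
    """Combination score SZ: the plain sum of a laboratory's
    Z-scores over several rounds, which is sensitive to a
    persistent bias in one direction."""
    return sum(z_score(r, t, cv) for r, t in zip(results, targets))
```

For example, with a criterion of 5% a result of 11.0 mg/L against a target of 10.0 mg/L scores z = 2; symmetric positive and negative deviations cancel in SZ, which is exactly why SZ highlights systematic rather than random error.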
Results and discussion
Estimates of dispersion
Our comparison of the Hampel test and algorithm A showed that both procedures yield very similar estimates µ’ and σ’ when applied to the data sets of the interlaboratory comparisons. Thus, the Hampel test and algorithm A can be considered equivalent for this task. Algorithm A was chosen for calculating CV because of its “more official character”, as this procedure is recommended by ISO standards [11, 13].
Precision of interlaboratory comparisons
The procedure described above yielded a set of CV for every parameter (examples are shown in Figs. 2 and 3). It is possible to calculate CV from these sets. These “CV of the CV” obtained from different samples for each parameter were typically in the range of 40% to 50%. This is well above the theoretical value that can be expected for random samples drawn from one population, which can be calculated by

CV(CV) ≈ 1/√(2(n − 1)) × 100%

where n is the number of results per sample [14]. This would be, e.g., 16% for 20 results per sample. It was also surprising that 40%–50% were found in all parameter groups. As the average number of results ranged from 7 (herbicides) to 20 (major ions), it can be concluded that the observed “CV of the CV” did not depend on the actual number of results per sample. There may be several reasons for this: The field of participants in the IFA-Test PT Scheme is not homogeneous. The laboratories vary widely between PT rounds, as the frequency of the series (three or six times a year) is higher than the participation frequency of the laboratories. The observed CV are most probably not normally distributed. It may also be possible that the sample-to-sample deviation of the CV is partially caused by the estimation procedure for µ’ and σ’, algorithm A. This will be the subject of further investigations. In any case, it should be widely recognized that the accuracy of CV determined from the results of single interlaboratory comparisons may be low, even if the number of results is above 30, the number generally recommended for determination of CV [15]. Thus, for the definition of Z-score criteria it is most useful to combine the results of multiple (periodic) interlaboratory comparisons, as presented in this paper.
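The theoretical value quoted above (16% for 20 results per sample) follows from the approximation CV(CV) ≈ 1/√(2(n − 1)) for random samples from a normal population, consistent with [14]; a short numerical check (the function name is ours):

```python
import math

def theoretical_cv_of_cv(n):
    """Expected relative dispersion (in percent) of CV estimates
    obtained from random samples of size n drawn from a single
    normal population: approximately 100 / sqrt(2 (n - 1))."""
    return 100.0 / math.sqrt(2 * (n - 1))

# 20 results per sample give about 16%, far below the 40-50%
# actually observed between the data sets of one parameter.
print(round(theoretical_cv_of_cv(20)))
```

Even at the lower end of the observed participation (7 results per sample) the approximation gives only about 29%, so the observed 40%–50% cannot be explained by sample size alone.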
Combination of CV
In the present study, the between-laboratory dispersion estimates from the results of many interlaboratory comparisons (6–106 data sets per parameter, 44 on average) were combined. As the distribution of the single CV was not clearly identified, the median was chosen to determine the Z-score criteria from the CV. Because there may be doubts about the usefulness of applying robust statistics consecutively to a data set, the mean CV was also calculated as the square root of the arithmetic mean of the variances. The latter method yielded slightly higher Z-score criteria. All results are summarized in Table 1.
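The two combination methods compared above can be sketched as follows (function names are ours; the CV are taken in percent):

```python
import math
import statistics

def combine_cv_median(cvs):
    """Z-score criterion as the median of the single-round CV,
    robust against the unknown (likely skewed) CV distribution."""
    return statistics.median(cvs)

def combine_cv_rms(cvs):
    """Alternative: square root of the arithmetic mean of the
    variances, i.e., the root mean square of the CV."""
    return math.sqrt(statistics.mean([c * c for c in cvs]))
```

Because the root mean square weights large CV quadratically, it exceeds the median whenever the CV distribution has a heavy right tail, which matches the observation that this method yielded slightly higher Z-score criteria.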
Dependence of CV on concentration levels
In general, no significant dependence of the CV on the actual concentration of the parameter in the sample was observed. One reason may be the relatively wide spread of the CV (see above). Another reason is the narrow concentration ranges covered in the PT scheme: the target concentrations were within one order of magnitude for most parameters. For some parameters, markedly higher CV were found for low target values near the limits of detection demanded in the Austrian water monitoring program [16]; an example is given in Fig. 3. In such cases the concentration range in which the Z-score criteria can be applied was limited. The lower limits given in Table 1 refer to the applicability of the determined Z-score criteria; they take into account the legal limits of determination and the observed concentrations. In our experience, the introduction of variance functions, as suggested, e.g., in [14], cannot be justified within a concentration range of one order of magnitude. Besides, single Z-score criteria are much more transparent to the participants.
Time dependence
No statistically significant trend toward lower CV over the operation time of the PT scheme could be observed. One reason may be the relatively wide spread of the CV (see above). However, the CV of “nitrate in water” (Fig. 4) were all below the median line in the last year. Most probably the laboratories are improving their analytical performance by measures such as modernization of instrumentation and improvement of their quality assurance systems, and this should become visible in the future development of the CV.
Conclusion and outlook
The collection of all measurement results in a single database, in combination with computer-program-controlled SQL queries, has proved to be a very powerful tool for the statistical evaluation of the collected data. The estimates of dispersion obtained by robust statistics have an unexpectedly low precision, with typical CV of 40%–50% between the data sets of a selected parameter. Thus, Z-score criteria should never be calculated from a single PT round, even when the number of participants is considerable. No clear influence of concentration or time on the dispersion estimates was observed. Higher CV between laboratories were found only for concentrations close to the determination limits of the analytical methods. In our experience, a single Z-score criterion is applicable over a concentration range of one order of magnitude, and we do not recommend the use of variance functions. Performance criteria evaluated from a number of PT series are the most suitable and transparent basis for the assessment of long-term performance. The Z-score criteria of Table 1 represent the average performance of (mainly) Austrian laboratories for the specified parameters and concentration ranges between 1997 and 2003. This limits their applicability and should be considered if the data are used for calculating Z-scores. The performance criteria will be implemented in the scheme in the near future for calculating performance scores for all participants on a voluntary basis. Assuming that there is, in fact, progress in the laboratories’ average performance, the Z-score criteria will have to be adapted periodically.
References
Philippitsch R, Schimon W (2003) Oesterr Wasser Abfallwirtsch 55(3–4):41–53
Kralik M, Grath J, Philippitsch R, Gruber D, Vogel W (1999) Groundwater quality in Austria. A unique groundwater monitoring system. In: Extended abstracts of international conference on quality, management and availability of data for hydrology and water resources management. Koblenz, pp 89–90
Philippitsch R, Wegscheider W (1999) Quality assurance program for the assessment of water quality in Austria. In: Extended abstracts of international conference on quality, management and availability of data for hydrology and water resources management, Koblenz, pp 115–118
ISO (1997) Proficiency testing by interlaboratory comparisons—Part 1: Development and operation of proficiency testing schemes, ISO/IEC Guide 43–1:1997. ISO, Geneva
Linsinger TPJ (1997) Development of quality assurance systems for the determination of trace components in water under consideration of natural constituents. Dissertation, Vienna University of Technology
Kandler W (1999) Aufbau und Betrieb eines Kontrollprobensystems zur Qualitätssicherung in der Wasseranalytik. Dissertation, Vienna University of Technology
Apfalter S, Krska R, Linsinger T, Oberhauser A, Kandler W, Grasserbauer M (1999) Fresenius J Anal Chem 364:660–665
Apfalter S, Kandler W, Krska R, Grasserbauer M (2001) Int J Environ Anal Chem 81:1–14
Analytical Methods Committee (1992) Analyst 117:97–104
Davies PL (1988) Fresenius Z Anal Chem 331:513–519
ISO (1998) Accuracy (trueness and precision) of measurement methods and results—Part 5: Alternative methods for the determination of the precision of a standard measurement method, ISO 5725–5:1998. ISO, Geneva
Analytical Methods Committee (1989) Analyst 114:1489–1495
ISO (2002) Statistical methods for use in proficiency testing by interlaboratory comparisons, Draft international standard ISO/DIS 13528. ISO, Geneva
DIN (2003) German standard methods for the examination of water, wastewater and sludge—Part 45: Interlaboratory comparisons for proficiency testing of laboratories (A45), DIN 38402–45. DIN, Berlin
Funk W, Damman W, Donnevert G (1992) Qualitätssicherung in der Analytischen Chemie. VCH, Weinheim
Bundesgesetzblatt für die Republik Österreich (1991) 338. Verordnung: Wassergüte-Erhebungsverordnung—WGEV, Vienna
Acknowledgements
The authors gratefully acknowledge the financial support of the Federal Ministry for Forestry and Agriculture, Environment and Water Management.
Kandler, W., Schuhmacher, R., Roch, S. et al. Evaluation of the long-term performance of water-analyzing laboratories. Accred Qual Assur 9, 82–89 (2004). https://doi.org/10.1007/s00769-003-0696-7