Introduction

The reliability of analytical results has always been a concern of laboratories, a fact highlighted more than 20 years ago by Valcarcel [1]. This concern has led to the development of new concepts and tools [2, 3], also stimulating the development and revision of many standards applicable to quality management systems for laboratories, such as ISO/IEC 17025 [4], GLP [5], and ISO 15189 [6].

Due to the increasing need to provide reliable results, different regulatory agencies require laboratories to implement quality management systems, which has promoted growth in the number of accreditations to different standards, such as ISO/IEC 17025 (Fig. 1).

Fig. 1

Accredited laboratories considering the accreditation bodies from approximately 114 countries that are ILAC members and associates. Data obtained from ILAC (ILAC is the international authority on laboratory and inspection body accreditation, with a membership consisting of accreditation bodies and stakeholder organizations worldwide) [7]

As shown in Fig. 1, the application of quality management in laboratories is a global trend, increasing the need to develop new knowledge in areas like validation, uncertainty estimation, control charts, production of certified reference materials (CRM), development of proficiency testing (PT), application of statistics, and laboratory management software, leading to the updating and development of technical standards and legislation supporting these concepts [8].

The analytical quality assurance cycle (AQAC) is a proposal for a specific quality tool for laboratories, involving three concepts considered essential: method validation, uncertainty estimation, and quality control (QC).

The AQAC [2] is a conceptual tool that correlates three requirements, generally applied in an ISO/IEC 17025 accredited laboratory, that are essential to provide reliable results. Since all three requirements rest on statistical concepts, there is an important correlation between them. The AQAC starts with validation, to demonstrate that the method is fit for purpose. The second step is the estimation of measurement uncertainty, to evaluate the dispersion of the quantity values attributed to a measurand (generally, the principal uncertainty sources come from validation, such as precision or linearity). The third step is quality control (QC), which provides ongoing validation of the method. Figure 2 shows the AQAC with the insertion of the CRM and PT concepts, indicating that they are closely linked to this tool and are necessary for this quality cycle.

Fig. 2

Analytical quality assurance cycle (AQAC) adapted with the CRM and PT concepts

Figure 2 highlights that PT is a tool correlated with all the concepts of this cycle. This way, the results of PT can be used:

  • to evaluate the performance of laboratories for specific tests or measurements;

  • to monitor laboratories’ continuing performance; and

  • to evaluate whether the measurement uncertainty reported with a result was under- or overestimated, by comparing the laboratory’s result and its uncertainty with the assigned value of the PT sample and its acceptance range.

The importance of participation in PT is also highlighted in quality management systems for laboratories. Periodic participation is mandatory for laboratories that want accreditation (formal recognition) to ISO/IEC 17025 by an accreditation body [9], which increases the demand for and development of new PT schemes and, in turn, the volume of publications on this subject.

The development and application of PT involve a series of steps which, despite being harmonized by standards such as ISO 13528 [10] and ISO/IEC 17043 [11], admit different approaches that must be chosen according to the matrix and analytes evaluated by the PT: assigning values by consensus of participants or by a reference value (which can be calculated by different strategies); different ways of evaluating performance (z-score, Zeta-score, etc.); different graphical methods (Youden plot, histograms, etc.); and different strategies for stability and homogeneity evaluation of PT items, among others.

Therefore, considering the increasing number of accredited laboratories and the corresponding rising demand for PT, combined with the different strategies for its development and application, this article presents a bibliographic review of published papers, focusing on food and environmental matrices. We emphasize the trends in PT development for chemical analysis, seeking to assist those who intend to develop new PT protocols or want to learn more about the theme and the main approaches currently applied.

Discussion

Search method

Three different searches were carried out to obtain an overview of PT publications, as illustrated in Fig. 3. The first search used the expression “proficiency testing” in the title and considered only non-repeated papers published in the last 10 years (2009–2019), yielding a total of 317 papers. Three databases were used: Web of Science, Scopus, and ScienceDirect.

Fig. 3

Method used for the search and exclusion of papers

Since Web of Science generated the most results, we performed a new search only in this database, considering all years with the expression “proficiency testing” contained in the title. This search resulted in 587 non-repeated papers published between 1977 and 2019.

To select a sample for evaluating the trends and statistical strategies in PT development for chemical analysis, a third search was carried out for the years 2018–2019, including all evaluated databases and focused on food and environmental matrices, considering that over 10 years (2010–2019) 75% of the PT papers were focused on these two areas of study. This search returned 23 non-repeated papers describing quantitative PTs. Each paper was evaluated considering the main steps of PT, and the results are shown in Table 1.

Table 1 Summarized results from papers on Proficiency Testing, published in the last two years (2018–2019), focusing on food and environmental matrices

Arguments concerning these steps are detailed in this paper.

In general, the papers were “theoretical papers” or “PT applications.” Papers classified as “theoretical” focus on issues such as the best way to calculate and interpret results and compare techniques. In most circumstances, theoretical PT papers promote discussion using data from PT performed during different time intervals with different matrices. Papers classified as “application” generally describe PT planning and organization and discuss participating laboratories' performance.

Observed trends

The analysis of the data obtained by the flowchart in Fig. 3 revealed several trends. Since the database that provided the most results was Web of Science, we decided to check the frequency of publications over the years in this database (Fig. 4). A significant increase in publications was noted since the 2000s.

Fig. 4

Frequency of publications with the expression “proficiency testing” contained in the title considering the Web of Science platform for the major areas of chemistry, physics and biology between 1977 and 2020

According to the study by Olivares et al. [35], interest in CRM research and development grew from 1976 to 2016, intensifying significantly after 1996 and mainly after 2006. The volume of PT studies followed the same behavior, as can be seen in Fig. 4. This correlation is explained by the requirement for participation in PT by laboratories accredited to ISO/IEC 17025, an international standard implemented in 2000. Another fact that could explain the numbers shown in Fig. 4 is that the accreditation of PT providers also started predominantly in 2000, with the publication of some ILAC documents, and the publication of ISO/IEC 17043 [11] in 2010 accelerated the accreditation of PT providers even further.

The 317 articles found between 2009 and 2019 were mostly about quantitative PTs, with very few papers about qualitative or interpretative results. These papers focused mainly on three research areas, physics, chemistry, and biology, and could thus be separated according to the nature of the PT (execution or theoretical) reported (Fig. 5). In these research areas, the matrices submitted to PT fell into five main categories: (1) food (meat, fruits, animal feed, beverages, cereals, dairy products, oils); (2) environmental (soil, waste, air, water); (3) biological material (blood, urine, serum, muscle tissue, hair, organs, microorganisms, saliva); (4) inorganic (metal alloys, polymers, radioactive materials, textiles, ceramics, oil); and (5) instrumentation associated with clinical trials (thermometers, manometers). Figure 6 shows a new division considering only the “execution” papers and the matrices evaluated. It shows that 75% of the PTs focused on two main areas of study: food and environmental. This trend is intuitive because these are precisely the areas considered strategic by most countries. For this reason, testing in these areas is under more significant pressure from government inspections and accreditation requirements, mainly regarding ISO/IEC 17025. To ensure the validity of results, the standard requires the laboratory to monitor its performance by participating in PT or any other type of interlaboratory comparison [4]. Because of this trend, PT papers involving food and environmental matrices are the focus of this study.

Fig. 5

World trends in publications about PT (period from 2009 to 2019 in the three databases: Web of Science, ScienceDirect, and Scopus)

Fig. 6
figure 6

Percentage distribution of PT papers classified as “execution” considering different matrices used

There are several types of PT providers: government agencies, universities, and private companies. Regardless of the origin, the discussions focus on the providers’ evaluation parameters, such as development of the PT items, stability, homogeneity, attribution of assigned values, performance evaluation, and several graphical methods for interpretation of results. From the data in Fig. 5, it can be noted that the chemical area had the most published papers, indicating the interest in developing PT in this area. The global impact can be explained by the increasing adherence of chemical testing laboratories to ISO/IEC 17025 accreditation (Fig. 1).

The European PT Information System (EPTIS) database gathers records of PT schemes regularly applied. It started in 2000 and currently there are 4,445 PT schemes registered in EPTIS by 52 collaborating countries [36]. It is possible to highlight some areas in this database, such as “Food and Drink” with 12.1 % of the total, and “Environmental (water, soil, and sludge)” with 10.6 %.

The concern with analytical results’ quality is growing, but perhaps still not enough. Monya Baker [37] pointed out a worrying trend: in a general context, the scientific community recognizes that currently there is a crisis of reproducibility of published experiments, particularly in the areas of chemistry, biology, physics, and engineering. Several factors were pointed out by the interviewed researchers as causes of this crisis, but we draw attention to these:

  1. Poor statistical analysis of the data;

  2. Original experiments performed with low repeatability;

  3. Poor experimental design; and

  4. Raw data unavailable from the original laboratory.

The adoption of the general requirements according to ISO/IEC 17025 forces the application of quality tools by testing laboratories, such as calibration, CRM application, validation of methods, and measurement uncertainty estimation, which can significantly improve the reproducibility of the published experiments. However, only interlaboratory comparisons (such as periodic participation in PT) can assess the performance of laboratories and the agreement among them [4], showing the capacity of the laboratory to reproduce its results.

Critical steps of PT planning and development

The planning to perform PT is based upon several stages, carried out in a logical sequence to achieve the specifications of each program. Among these stages, it is important initially to establish how the material will be produced; as an example, one can choose a naturally contaminated or a spiked sample. Different approaches can be used to establish the assigned value, such as characterization by a primary method in a single laboratory, by two different techniques, or by a consensus value derived from the participants’ results after adequate statistical evaluation. The homogeneity and stability must be evaluated to ensure comparability of all results. Based on the participants’ results, the evaluation needs to consider issues such as the normal distribution and standard deviation of the data and the graphical evaluation of results.

Considering ISO 13528, the PT provider has to ensure that all PT items are sufficiently stable and homogeneous. The Guide to the Expression of Uncertainty in Measurement (GUM), however, shows that uncertainty calculations are needed to demonstrate adequate stability and homogeneity. Owing to poor repeatability of measurements or an insufficient number of replicates, some homogeneity and stability studies fail to interpret the results properly [38]. The standards ISO/IEC 17043 and ISO Guide 35 [39] demand uncertainty calculations linked to homogeneity, stability, and characterization to reach a standard uncertainty for the PT item.

During the evaluation of PT, graphical methods also help interpret the testing results, enabling laboratories to implement corrective actions [40].

The generalized need for PT schemes to be accredited brought the need for variations on the z-score to encompass all the statistics of the tested PT items. Thus, it is essential to consider other parameters, described by ISO 13528, such as z′-scores, Zeta-scores, D-scores, and En scores. However, in most cases the z-score is sufficient for performance evaluation by a PT provider [41].

Considering the different statistical tools applied in different PT steps, we evaluated 23 papers published in the last two years (2018–2019), focusing on food and environmental matrices due to their importance and representativeness, as discussed above. The results are summarized in Table 1 and are discussed below.

Homogeneity

According to requirement 6.1 of ISO 13528:2015, “The PT provider shall ensure that batches of PT items are sufficiently homogeneous and stable for the PT testing scheme. The provider shall assess homogeneity and stability using criteria that ensure that inhomogeneity and instability of PT items do not adversely affect the evaluation of performance.”

The assessment of homogeneity and stability can use different approaches, as detailed in ISO 13528. Appraising the manuscripts chosen for this review, we observed different ways of assessing homogeneity and stability, which are discussed in the next sections.

PT providers have different ways to ensure the homogeneity of PT items subjected to testing. Some of them do not assess homogeneity, such as PTs of krill oil samples [12] and screening of antibiotic residues in milk [25], owing to the samples’ characteristics. The same was observed mainly in PT schemes focusing on radionuclide samples. Nakashima and Duran [17] requested a contractor to verify the uniform distribution of radioactivity in a filter; in such cases, the uncertainty due to homogeneity should be estimated during method validation, included, and disclosed. Other authors homogenized analytes manually after spiking test samples [14]. Even without performing a formal homogeneity test, gamma-spectrometric measurements along predefined lines were performed to check for homogeneity issues in quartz sand prepared in different grain size fractions; the homogeneity of the sample was studied only between containers, not within containers, and the final uncertainties of the respective sums of relative deviations were calculated using Gaussian propagation of uncertainties [16].

Kim et al. [26] evaluated the homogeneity of mercury in oyster samples by isotope dilution mass spectrometry. Twelve bottles of PT items were selected, and the 202Hg/200Hg isotope ratios, estimated from the intensity ratios of the two isotopes (I(202Hg)/I(200Hg)) measured by ICP-MS, were used. An isotope-enriched material and a standard solution with a certified mass fraction were used in the model equation to estimate the measurand. Laboratories’ ability to adequately homogenize samples under different conditions (good between-sample but poor within-sample homogeneity) was evaluated in a PT to measure metal contents in soil samples [29].

Yeltepe et al. [13] tested the within-bottle and between-bottle homogeneity by gamma-ray spectrometry, and target values of specific activities were determined for each radionuclide. One-way ANOVA has been applied to evaluate within-bottle and between-bottle homogeneity [14]. Another statistical procedure for homogeneity evaluation is the IUPAC Harmonized Protocol [40]. Visual inspection, Cochran’s test, and an estimate of analytical variance (ANOVA) have also been used to test whether the samples’ homogeneity was achieved [27].

Among the PT studies reporting the use of ISO 13528, Cordeiro et al. [19] evaluated official control laboratories’ performance in reliably quantifying fipronil in eggs. The homogeneity experiment consisted of duplicate analysis of 10 PT items randomly selected along the filling sequence, with the analyses performed in random order. The same procedure was employed to evaluate pesticide residues in olive oil [28] and total petroleum hydrocarbons in soil [33]. Furthermore, PT based on isotope dilution was applied to measure benzoic acid, methylparaben, and n-butylparaben in sweet soy sauce [24] and to measure electrolytic conductivity in a synthetic wastewater formulated for water quality monitoring [32].

Following ISO 13528, the criterion for the homogeneity check was a between-sample standard deviation not exceeding 0.3 times the standard deviation for proficiency assessment (s(sample) ≤ 0.3σPT), together with one-way ANOVA at the 95% confidence level. Fearn and Thompson [42] proposed an alternative statistical approach to correct errors in the test suggested in the Harmonized Protocol, arguing that some materials that are rejected are in fact satisfactory. To overcome this drawback, the authors presented a simple new experiment, namely a randomized, replicated experiment using ANOVA to estimate the sampling standard deviation (σsam) from the results. In the experiment, the homogenized sample units are analyzed in duplicate. If the analytical method is precise, its standard deviation (σan) is small, and σsam can then be reliably estimated. The authors suggested that the analytical precision of the homogeneity test method should satisfy σan/σpt < 0.5.

ISO Guide 35 was applied to evaluate homogeneity in a PT conducted to evaluate the detection of organochlorine pesticides in drinking water; one-way analysis of variance (ANOVA) was used to evaluate the water samples’ homogeneity with respect to pesticide concentrations [34]. It is worth emphasizing the similarity between the homogeneity and stability tests established in ISO Guide 35 and ISO 13528, although some points differ. ISO Guide 35 includes both within- and between-unit homogeneity, and statistical power analysis can assist in choosing a suitable number of units and replicates for the homogeneity study. Also, in ISO Guide 35 the corresponding uncertainty must be calculated, because any detected heterogeneity is included in the certified value. On the other hand, ISO 13528:2015 is less restrictive: usually only two replicates are performed, a comparison between them is made, and, if necessary, a statistical test (e.g., Cochran’s test) is applied for outlier exclusion. The between-sample standard deviation, ss, is compared with the standard deviation for proficiency assessment, σpt, and the PT items can be deemed adequately homogeneous if ss ≤ 0.3σpt.

The justification for the factor of 0.3 is that when this criterion is met, the between-sample standard deviation contributes less than 10 % of the variance in evaluating the performance. Hence, the performance evaluation is unlikely to be affected.
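For illustration, the duplicate-based homogeneity check described above (an ANOVA-type estimate of the between-sample standard deviation from duplicate results, followed by the ss ≤ 0.3σpt criterion of ISO 13528) could be sketched in Python as follows. This is a minimal sketch with hypothetical data and function names; it is not taken from any of the cited studies.

```python
import numpy as np

def homogeneity_check(duplicates, sigma_pt):
    """Duplicate-based homogeneity check in the style of ISO 13528.

    duplicates: array of shape (g, 2), duplicate results for g PT items.
    sigma_pt:   standard deviation for proficiency assessment.
    Returns (s_s, passed): the between-sample standard deviation and
    whether s_s <= 0.3 * sigma_pt.
    """
    d = np.asarray(duplicates, dtype=float)
    item_means = d.mean(axis=1)
    diffs = d[:, 0] - d[:, 1]
    s_w2 = np.sum(diffs ** 2) / (2 * len(d))   # within-sample (analytical) variance
    s_x2 = np.var(item_means, ddof=1)          # variance of the item means
    s_s2 = max(s_x2 - s_w2 / 2.0, 0.0)         # between-sample variance (ANOVA estimate)
    s_s = np.sqrt(s_s2)
    return s_s, s_s <= 0.3 * sigma_pt

# Illustrative example: 10 items analyzed in duplicate, sigma_pt = 0.5
rng = np.random.default_rng(1)
dup = 10.0 + rng.normal(0, 0.1, size=(10, 2))
print(homogeneity_check(dup, sigma_pt=0.5))
```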

Other authors did not refer to ISO standards. Instead, the use of CRMs for comparison in INAA and k0-INAA experiments using PT materials is one of the alternatives to the homogeneity test described in ISO 13528 [15].

Stability tests

As described in requirement 6.1.2 of ISO 13528, “for calibration proficiency testing schemes where multiple participants use the same artifact, the proficiency testing provider shall assure stability throughout the round, or have procedures to identify and account for instability through the progression of a round of the proficiency testing scheme. This should include consideration of tendencies for particular PT items and measurands, such as drift. Where appropriate, the assurance of stability should consider the effects of multiple shipments of the same artifact.”

Concerning the reviewed articles, we observed different options related to the stability assessment. Zailer et al. [12], in a PT to evaluate 31P NMR as the official method for measurement of phospholipids in krill oil, did not perform a stability test. The authors reported problems related to hydrolysis, in which PtdCho was gradually transformed into degradation products.

As proposed by ISO 13528, stability was evaluated in a three-year trial of isotope dilution mass spectrometric results in a PT for pesticide residue analysis [14].

Wiedner et al. [16] performed cursory stability measurements over a few weeks on a prepared technologically enhanced naturally occurring radioactive material ((TE)NORM) used in a PT of gamma-ray spectrometry laboratories, finding it to be sufficiently stable as a PT sample.

Two main strategies have been applied to evaluate stability. One of them is the isochronous experiment, in which the samples are stored under different temperature and humidity conditions in the laboratory and all analyses are performed simultaneously. This scheme was applied based on ISO 13528 to assess pesticide stability in eggs [19] and based on ISO Guide 35 to evaluate the stability of two reference materials of canned tuna containing histamine. In the latter study, short- and long-term stability were assessed. Short-term stability was satisfactory at the three evaluated temperatures, and the materials were considered stable during transport. This result was obtained as |b1| < t(0.95, n−2) · s(b1), where b1 is the slope of the regression line of concentration against time, s(b1) is its standard uncertainty, and t(0.95, n−2) is the Student’s t value at the 95% confidence level with n − 2 degrees of freedom. A similar experiment was performed for the long-term stability at 4 °C, with histamine determined after 12 months, and the results were included in the combined standard uncertainty (u) of the assigned reference values [27].
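A minimal Python sketch of this slope-based stability check, assuming the criterion reconstructed above (|b1| < t(0.95, n−2) · s(b1)) and using hypothetical data, is shown below.

```python
import numpy as np
from scipy import stats

def slope_stability_test(times, values, alpha=0.05):
    """Trend-based stability check (ISO Guide 35 style).

    Fits value = b0 + b1*time and declares the item stable if the slope is
    not significantly different from zero: |b1| < t(1-alpha/2, n-2) * s(b1).
    """
    t = np.asarray(times, dtype=float)
    y = np.asarray(values, dtype=float)
    res = stats.linregress(t, y)                 # slope, stderr of the slope, etc.
    t_crit = stats.t.ppf(1 - alpha / 2, len(t) - 2)
    stable = abs(res.slope) < t_crit * res.stderr
    return res.slope, res.stderr, stable

# Illustrative example: a measurand checked over 12 months (hypothetical values)
months = [0, 3, 6, 9, 12]
conc = [49.8, 50.1, 49.7, 50.0, 49.6]
print(slope_stability_test(months, conc))
```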

Stability has also been determined in PT studies by analyses before the PT items are distributed to the laboratories and at different subsequent times. Some PT providers analyze the PT items in the middle and at the end of the test period, such as in a 16-year study to evaluate the stability of six different pesticides added to a commercial olive oil, in which two bottles were randomly chosen and analyzed before shipping to the participants and on the day defined for reporting results [28]. However, most stability studies evaluate samples at different times and temperatures. Samples of oyster powder and solutions were stored in low-density polyethylene (LDPE) bottles to check the stability of Hg during the PT, and mercury was determined 1, 3, 6, and 12 months after certification [26]. Robust means (Algorithm A of ISO 13528) were applied to assess the stability of more than 100 pesticides in freshwater; the samples were analyzed 2, 4, and 9 days after preparation, and most of the pesticides were found to be stable during the test period [30].

The difference between means, |ȳ1 − ȳ2| ≤ 0.3σpt, where ȳ1 and ȳ2 are the means of the measurements made at the two points in time and σpt is the standard deviation for proficiency assessment, has also been applied to assess stability. This approach was used to determine the best formulation for a PT candidate material of synthetic wastewater: the samples were stored at two different temperatures (25 °C and 40 °C), and the conductivity was analyzed in the first and fourth weeks [10, 32]. The same strategy was used to determine nutrient stability in five different foods exported by the Philippines. Three randomly chosen samples were separated, stored at controlled temperature, and analyzed after 1.5, 2, and 3 months, with stability assessed from |ȳ1 − ȳ2| for each measurand [21].
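For the two-point criterion |ȳ1 − ȳ2| ≤ 0.3σpt, a short illustrative sketch (hypothetical values, not from the cited studies) could be:

```python
import numpy as np

def two_point_stability(before, after, sigma_pt):
    """ISO 13528-style stability check: |mean(before) - mean(after)| <= 0.3 * sigma_pt."""
    diff = abs(np.mean(before) - np.mean(after))
    return diff, diff <= 0.3 * sigma_pt

# Illustrative example: conductivity (mS/m) at preparation and after storage
print(two_point_stability([101.2, 100.8, 101.0], [100.5, 100.9, 100.7], sigma_pt=2.0))
```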

Assignment of the target value (xpt)

The PT target value is considered the best estimate of the true value, and its assignment is one of the most critical steps carried out by the provider. Sections 7.3 and 7.7 of ISO 13528 specify statistical procedures and models to determine the target value, but the choice among these procedures is the PT provider’s responsibility; nevertheless, their documentation and disclosure to participants are mandatory.

Sample formulation through standard addition was one of the strategies used to prepare PT items, with the target value defined by the provider as the average of the PT results after exclusion of outliers through the Grubbs test [12, 34] or as the robust average (x*) calculated by Algorithm A (ISO 13528—item C.3.1) [22, 28, 29].

Also considering PT items prepared through standard addition, Yarita et al. [14] and Ebarvia et al. [27] characterized the materials by applying a primary method, isotope dilution, a procedure used when the target value is defined by only one laboratory (ISO 13528—item 7.3.1.4).

We also observed that some of the providers utilized reference materials (RMs) as PT items. Wiedner et al. [16] produced a reference material (naturally occurring radioactive materials—NORM) constituted of quartz sand with high levels of 226Ra, which was characterized through gamma-ray spectrometry as a primary method by the National Metrology Institute of Austria, and this reference value was compared to the participants’ average.

Kim et al. [26] produced a CRM through standard addition, and the characterization was carried out using the primary method of isotope dilution, following the procedures of ISO Guide 35. This CRM value was employed as the target value and compared with the robust average obtained through Algorithm A (ISO 13528—item C.3.1).

Ziegler et al. [30] spiked water samples to evaluate pesticide stability through a PT, observing that 81 % of evaluated analytes were within 20 % of the theoretical spiked value.

Concerning the procedure that defines the target value from the consensus value of the PT participants, most of the studies mention Algorithm A (ISO 13528—item C.3.1) to obtain the robust average (x*) as a target value [22, 28, 29]. Becker et al. [33] utilized the Hampel robust estimator to determine the target value, as specified by ISO 13528—item C.5.3.

Based on the Wageningen Evaluating Programs for Analytical Laboratories (WEPAL) PT results, other authors evaluated instrumental neutron activation analysis (INAA) as a primary method. Its performance was evaluated by comparing the values reported by the laboratories, using the average of all laboratories as the target value and assuming a normal distribution [15]. Coleman et al. [31] used PT items from other providers and adopted the target values pre-established by those providers.

Calculating the standard deviation for the PT evaluation (σpt)

In general, the standard deviation for the PT evaluation delimits the acceptable difference between the value reported by the laboratory and the target value, so that a score can be assigned within a specific confidence interval.

In the same way as the target value, the standard deviation of the PT evaluation can be defined using a statistical model suggested in requirement 8 of ISO 13528. It is important to define whether the σpt value will be calculated from the results reported by the laboratories or defined independently.

Among the procedures that use classical statistics, Ziegler et al. [30] defined the σpt value from the estimated standard deviation of the integrated nuclear magnetic resonance signals; in this study, each sample spectrum was submitted to the Grubbs test at the α = 0.05 significance level for outlier exclusion. The target value and its standard deviation have also been used as a tolerance limit to evaluate each laboratory’s result together with its standard deviation [16].

As described in requirement 6.5.1 of ISO 13528, robust methods are recommended for PT providers to define the standard deviation, since they minimize the influence of outlying results. In this review we observed that some authors employed the normalized interquartile range (ISO 13528—item C.2.3), Algorithm A (ISO 13528—item C.3), and the Hampel estimator (ISO 13528—item C.5) as robust statistical techniques to estimate the PT standard deviation.

Stöckel et al. [18] determined the standard deviation for the PT from the Hampel robust estimate. Since this procedure reduced the impact of outliers on the results, the authors obtained better estimates of the reproducibility standard deviation, the evaluation of which was one of the goals of the study. Another robust technique used was the normalized interquartile range, applied to exclude outliers in the uncertainty calculation for laboratories’ performance evaluation through the Zeta-score [17].

Currently, Algorithm A is widely used for calculating the PT standard deviation because it is based on descriptive statistics. The average and the standard deviation are calculated iteratively, replacing the outlying values several times. Algorithm A transforms the original data through a process called winsorization to provide alternative average and standard deviation estimators for approximately normally distributed data, and it is most useful when the expected proportion of outliers is below 20 %. The winsorization process consists of replacing extreme values by specified cut-off values (in Algorithm A, x* ± 1.5s*) [43]. Convergence of the average and standard deviation can be assumed when there is no change from one iteration to the next in the third significant figure of the robust average and the robust standard deviation (ISO 13528—item C.3).
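A compact Python sketch of Algorithm A as summarized above (initial estimates from the median and scaled median absolute deviation, winsorization at x* ± 1.5s*, the 1.483 and 1.134 scaling factors, iteration to convergence); the example data are hypothetical.

```python
import numpy as np

def algorithm_a(results, tol=1e-8, max_iter=100):
    """Robust mean and standard deviation in the style of Algorithm A (ISO 13528, Annex C)."""
    x = np.sort(np.asarray(results, dtype=float))
    x_star = np.median(x)                          # initial robust mean
    s_star = 1.483 * np.median(np.abs(x - x_star)) # initial robust standard deviation
    for _ in range(max_iter):
        delta = 1.5 * s_star
        x_w = np.clip(x, x_star - delta, x_star + delta)  # winsorization step
        new_x = x_w.mean()
        new_s = 1.134 * x_w.std(ddof=1)
        converged = abs(new_x - x_star) < tol and abs(new_s - s_star) < tol
        x_star, s_star = new_x, new_s
        if converged:
            break
    return x_star, s_star

# Illustrative example: participant results with one outlying laboratory
results = [10.2, 9.8, 10.1, 10.0, 9.9, 10.3, 14.5]
print(algorithm_a(results))
```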

Generali et al. [28] and Stefanelli et al. [23] calculated the PT standard deviation (σpt) from the results submitted by the laboratories that participated in a round. The authors used Algorithm A to estimate the robust standard deviation (s*), iterating until it stabilized at the third significant figure. This robust standard deviation was adopted as the PT standard deviation and used for the participants’ performance evaluation.

As specified in ISO 13528—requirement 8.4, other authors used the Horwitz equation, which establishes an exponential relation between the reproducibility standard deviation and the mass fraction of the analyte. It is an empirical model for evaluating the reproducibility standard deviation, considered a general model to assign the standard deviation for the PT evaluation [44]. The equation is used to predict the reproducibility of chemical measurements in interlaboratory tests; it was proposed by Horwitz [45] and modified by Thompson et al. [40].
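As an illustration, the Horwitz equation with the Thompson modification can be written as a piecewise function of the dimensionless mass fraction c (the commonly cited form; the printed values below are illustrative only):

```python
def horwitz_sigma(c):
    """Predicted reproducibility standard deviation (as a mass fraction),
    Horwitz equation with the Thompson modification.

    c: analyte mass fraction as a dimensionless number (e.g., 1 mg/kg = 1e-6).
    """
    if c < 1.2e-7:
        return 0.22 * c               # low-concentration regime
    if c <= 0.138:
        return 0.02 * c ** 0.8495     # classical Horwitz range
    return 0.01 * c ** 0.5            # high-concentration regime

# Illustrative values across the three regimes
for c in (1e-8, 1e-6, 1e-3, 0.5):
    print(c, horwitz_sigma(c))
```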

Leyva-Morales et al. [34] and Kim et al. [26] used the Horwitz model to calculate the PT standard deviation, which was determined from the spiked samples’ target values for the analytes of interest. Kim et al. [26] found a high level of equivalence between the values reported by the participating laboratories and the reference (target) value when assigning the PT standard deviation from the Horwitz predictive model, which was satisfactory for drawing conclusions about the potential of the TDA-AAS method for measurement of mercury levels.

Likewise, Ebarvia et al. [27] assigned the standard deviation calculated from the Horwitz model. For the intended purpose, it was possible to observe a significant improvement in the evaluated laboratory performance. This confirmed that the Horwitz model is a good option for PT providers to assign the standard deviation of the PT evaluation.

Standard uncertainty calculation of the target value (u(xpt))

Generali et al. [28] and Stefanelli et al. [22] calculated the standard uncertainty of the target value (u(xpt)) according to the equation \(u\left({x}_{\mathrm{pt}}\right)=1.25\times \frac{{s}^{*}}{\sqrt{p}}\), where p is the number of results and s* is the robust standard deviation. The authors found that u(xpt) met the criterion \(u\left({x}_{\mathrm{pt}}\right)<0.3{\sigma }_{\mathrm{pt}}\), so it was considered insignificant and did not need to be taken into account when interpreting the results of the PT round. Following the same criterion, Dajay et al. [21] found nonsignificant standard uncertainties of the target value for protein, ash, and sodium, as specified in ISO 13528—item 9.2. For moisture, fat, iron, calcium, potassium, and zinc, however, the study presented significant uncertainties (u(xpt) > 0.3σpt), which were taken into account in the laboratories’ performance evaluation.

Applying a different approach, instead of considering the PT results, Aryana et al. [24] assigned the standard uncertainty of the target value by combining the standard uncertainty of characterization of the PT items (\({u}_{\mathrm{char}}\)), the standard uncertainty from the homogeneity study (\({u}_{\mathrm{hom}}\)), and the standard uncertainty due to instability (\({u}_{\mathrm{stab}}\)), according to the formula:

$$u\left({x}_{\mathrm{pt}}\right)=\sqrt{{u}_{\mathrm{char}}^{2}+{u}_{\mathrm{hom}}^{2}+{u}_{\mathrm{stab}}^{2}}$$
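A minimal Python sketch of the two approaches to u(xpt) described above and of the 0.3σpt criterion, using hypothetical values rather than data from the cited studies:

```python
import math

def u_xpt_consensus(s_star, p):
    """Standard uncertainty of a consensus target value: u(xpt) = 1.25 * s* / sqrt(p)."""
    return 1.25 * s_star / math.sqrt(p)

def u_xpt_combined(u_char, u_hom, u_stab):
    """Standard uncertainty of a characterized target value:
    u(xpt) = sqrt(u_char^2 + u_hom^2 + u_stab^2)."""
    return math.sqrt(u_char ** 2 + u_hom ** 2 + u_stab ** 2)

def u_xpt_is_negligible(u_xpt, sigma_pt):
    """ISO 13528 criterion: u(xpt) < 0.3 * sigma_pt."""
    return u_xpt < 0.3 * sigma_pt

# Illustrative example with hypothetical values
u1 = u_xpt_consensus(s_star=0.8, p=25)
u2 = u_xpt_combined(u_char=0.10, u_hom=0.05, u_stab=0.04)
print(u1, u_xpt_is_negligible(u1, sigma_pt=1.0))
print(u2, u_xpt_is_negligible(u2, sigma_pt=1.0))
```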

Performance evaluation

The statistical design applied by PT providers for performance evaluation must provide objective information that shows the quality of the participating laboratories’ results based on pre-established criteria for the interlaboratory comparison.

ISO 13528 requirement 8 proposes PT evaluation criteria that generally must be established to meet the following purposes: performance evaluation through comparison with external criteria; performance evaluation through interlaboratory comparison; and performance evaluation through comparison with the stated measurement uncertainty.

One of the procedures described in ISO 13528 compares each participant’s reported result with the target value, establishing a confidence interval through the PT standard deviation. In this case, the standardized performance statistic is usually the z-score or z′-score, which allows direct comparison of results from different PT items and different units: the score is not expressed in the measurand’s original unit but is normalized as the distance between the participating laboratory’s result and the target value, in units of standard deviation. The use of the z-score or z′-score assumes that the dataset follows a normal distribution with mean 0 (zero) and standard deviation 1 (one).
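An illustrative sketch of the z-score and z′-score described above (hypothetical values; the acceptance limits in the comments follow ISO 13528):

```python
import math

def z_score(x, x_pt, sigma_pt):
    """z = (x - xpt) / sigma_pt.
    |z| <= 2 acceptable; 2 < |z| < 3 warning signal; |z| >= 3 action signal."""
    return (x - x_pt) / sigma_pt

def z_prime_score(x, x_pt, sigma_pt, u_xpt):
    """z' = (x - xpt) / sqrt(sigma_pt^2 + u(xpt)^2);
    used when u(xpt) is not negligible (u(xpt) > 0.3 * sigma_pt)."""
    return (x - x_pt) / math.sqrt(sigma_pt ** 2 + u_xpt ** 2)

# Illustrative example with hypothetical values
print(z_score(10.8, 10.0, 0.5), z_prime_score(10.8, 10.0, 0.5, 0.2))
```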

Among the evaluated works, the z-score was used for performance evaluation with environmental matrices [15, 18, 31, 34] and food matrices [19, 26–28, 30]. The z′-score was used by Dajay et al. [21] when the standard uncertainty of the target value (\(u({x}_{\mathrm{pt}})\)) was greater than \(0.3{\sigma }_{\mathrm{pt}}\), according to the criterion established in ISO 13528.

Leyva-Morales et al. [34] identified the need to harmonize methods for analysis of the levels of organochlorine pesticides in potable water; Stöckel et al. [18] reported that the participants’ performance evaluation based on the z-score after logarithmic transformation of the results was satisfactory.

Generali et al. [28] concluded that the combined z-score must be used with caution and that the ideal is use of the individual z-score for performance evaluation.

The Zeta-score (ζ) is also used as a statistical criterion [13, 17, 24]. The Zeta-score is mainly applied when the PT goal is to evaluate laboratories’ performance in comparison with a target value associated with an assigned uncertainty. The performance criterion is interpreted in the same way as the z-score (ISO 13528 requirement 9.6). Aryana et al. [24] evaluated laboratories’ performance in detecting benzoic acid, methylparaben, and n-butylparaben, and the results (60 %, 69 %, and 54 %, respectively) were considered satisfactory.

Other performance estimators, like the normalized error (\({E}_{n}\)) and the laboratory bias estimated through the relative or absolute difference between the laboratory’s result and the target value, are also employed, depending on the proficiency test’s purpose. Some authors aimed mainly at method validation, particularly regarding repeatability and reproducibility [12, 22, 34].
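For completeness, an illustrative sketch of the Zeta-score and the En number mentioned above (hypothetical values; the Zeta-score uses standard uncertainties and the En formula uses expanded uncertainties):

```python
import math

def zeta_score(x, u_x, x_pt, u_xpt):
    """zeta = (x - xpt) / sqrt(u(x)^2 + u(xpt)^2), with standard uncertainties;
    interpreted like the z-score."""
    return (x - x_pt) / math.sqrt(u_x ** 2 + u_xpt ** 2)

def en_score(x, U_x, x_pt, U_xpt):
    """En = (x - xpt) / sqrt(U(x)^2 + U(xpt)^2), with expanded uncertainties (k = 2);
    |En| <= 1 is considered satisfactory."""
    return (x - x_pt) / math.sqrt(U_x ** 2 + U_xpt ** 2)

# Illustrative example with hypothetical values
print(zeta_score(10.8, 0.3, 10.0, 0.2), en_score(10.8, 0.6, 10.0, 0.4))
```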

Graphical methods

Requirement 10.1 of ISO 13528 defines the application of graphical methods. Considering this requirement, the PT provider can prepare graphs such as histograms of results or performance scores and kernel density plots. As in the PT report, the graphs can show the results or participants’ performance by code, enabling each participant to identify its own result. Besides being important for evaluating and comparing laboratory results, the graphs can help the PT provider determine whether the criteria used to evaluate performance need to be reviewed.

Histograms can be used to evaluate the stability or degradation products of PT items [12], the percentage of acceptable and unacceptable scores for each measurement in the results submitted by the laboratories participating in a PT study [13], and laboratory performance during a long-term PT study [17, 22]. Graphical methods for combining performance scores over several rounds of a PT scheme were found to be useful for detecting trends, such as the ratio between the uncertainty of the reference value and the assigned standard deviation (u(xpt)/σpt) employed in the calculation of z-scores in a 10-year PT of multiple pesticide residues in olive oil; the evaluation of trueness was also performed via recovery (%) versus concentration level [22].
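As an illustration of the histogram and kernel density plots suggested in requirement 10.1, a short matplotlib/SciPy sketch with simulated participant results (not data from any cited study) might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical participant results for one PT round, with two outlying laboratories
rng = np.random.default_rng(7)
results = np.concatenate([rng.normal(10.0, 0.5, 28), [12.4, 13.1]])

# Histogram overlaid with a kernel density estimate and the assigned value
kde = stats.gaussian_kde(results)
grid = np.linspace(results.min() - 1, results.max() + 1, 200)

fig, ax = plt.subplots()
ax.hist(results, bins=12, density=True, alpha=0.5, label="participant results")
ax.plot(grid, kde(grid), label="kernel density estimate")
ax.axvline(np.median(results), linestyle="--", label="assigned value (median)")
ax.set_xlabel("reported result")
ax.set_ylabel("density")
ax.legend()
plt.show()
```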

Another widely used graphical method is the S-shaped chart, which presents the results obtained by the participating laboratories together with their expanded uncertainties (k = 2). Plotting the results of all laboratories in the same chart makes it convenient to observe the participants’ performance [13, 19, 24].

The use of z-score graphs for individual samples and for performance evaluation (by the percentage of satisfactory results) was important for the PT provider to choose effective protocols for sample preparation, such as the irradiation of samples in a PT involving neutron activation analysis by the relative INAA and k0-INAA methods for 16 elements in soil and plant samples [15], to correlate different methods used by the PT participants [22], to analyze the production of new products [16], and to rank the performance of the participants [24, 26, 27, 34].

To assess global performance in critical evaluation of a long-term PT of pesticide residues in olive oil, Generali et al. [28] used the average of the squared z-scores (AZ2) to ascertain whether a laboratory had achieved the goal of detecting correctly at least 80% of the analytes of interest. This strategy is useful in the evaluation of methods to detect multiple pesticide residues in food.

The Youden plot is a very informative graphical method for studying the results when two similar PT items are tested in a round of a PT scheme. It can be useful for demonstrating the correlation (or independence) of results of different PT items and for guiding investigations into the reasons for action signals (ISO 13528). Cordeiro et al. [19] applied Youden plots to evaluate the distribution of fipronil results in egg samples; the results fell within the 95 % bivariate confidence limit as a closed group, forming an ellipse in the center of the figure.

Thompson et al. [20], in a comparison between reproducibility standard deviations (SDR) derived from PT and from collaborative trials, log-transformed the basic statistics (SDR and mass fractions) for regression and display purposes. The transformation yielded an approximately constant variance of SDR across mass fractions, and simple regression was used to characterize each dataset over a wide range of mass fractions.

Box and whisker plots were used to indicate the variation among the interquartile results in a PT scheme. The whiskers indicated the maximum and minimum reported result [29]. In an olfactometric PT, Stöckel and co-authors [18] drew box plots of the recoveries per component to assess measurements’ behavior. They also used recovery rates for each component per dosage, participant, and PT. The recovery results were converted into z-scores after logarithmic transformation to harmonize the participants’ results from different components and PTs. Log-transformation before calculating z-scores is useful to establish near-symmetric distributions that are sufficiently close to normal to justify interpretation based on the normal distribution.

Another opportunity to visualize quantitative results for all samples is to use a chemometric approach, such as principal component analysis (PCA), which was applied to multivariate analysis of quantitative data on phospholipid species to distinguish between five krill oil samples measured at different sites [12]. Becker et al. [33] used the Q score, calculated as Q = (xlab − xref)/xref and expressed in %, to evaluate participants’ performance over consecutive rounds of 15 years of PT of total petroleum hydrocarbon determination in soil.

Conclusion

This review shows that growing concern over the reliability of analytical results is increasing interest in PT and has thus increased the number of publications on this subject in the past few years. Although most of the papers found were about quantitative PTs, it is also important to discuss qualitative PTs, considering the different approaches to performance evaluation in that area.

Since different standards describe the statistical tools applied at each PT stage, the characteristics of each PT must be considered in order to choose the best strategies, for example, for assigning the target value (characterization by a primary method in a single laboratory, by two different techniques, or by a consensus value of participants’ results after adequate statistical evaluation); for evaluating participants’ results by different approaches (z-score, z′-score, Zeta-score); and for applying different graphical methods, among others.

Many PT studies discussed in this review were not found in the EPTIS database, so it was not possible to compare all PT protocols. However, these papers can support PT developments, even if some stages in some papers are not in precise conformity with ISO standards.

At present, due to the pandemic, accreditation bodies have adopted remote audits to evaluate compliance with the ISO/IEC 17025 standard. The technical requirements of that standard that involve in-person witnessing of laboratory tests and other procedures have therefore been hampered. This makes the results of PT stand out for their ability to demonstrate the technical competence of laboratories in support of remote audits.