Introduction

There are many applications for radionuclide measurements. Long-term monitoring programs may require many thousands of measurements to support the safe use of radioactivity and nuclear technology, while in other instances only a few measurements inform critical national security decisions. In all of these cases, measurement capabilities must be of the highest integrity [accuracy and precision] and reliability [reproducibility]. Independent performance testing, and more importantly, independent traceability testing, provides the link to the SI unit of radioactivity, the becquerel [Bq].

The National Institute of Standards and Technology [NIST] has been conducting the NIST Radiochemistry Intercomparison Program [NRIP], a traceability evaluation program whose blind testing results provide the participating laboratories with the information needed to determine whether their measurement processes are in long-term statistical control, to improve their capabilities, to verify that their quality control processes meet programmatic measurement quality objectives, and to assess their ability to respond to time-sensitive emergency situations.

A broad suite of fifteen radionuclides for gross radioactivity, alpha, beta, and gamma assay is quantitatively spiked into five replicate acidified water, synthetic urine, synthetic feces, glass fiber air filter, and soil samples for the participant laboratories’ measurements. The attached Supplemental Information is the NRIP Statement of Work [SOW] for 2006. This SOW details the potential certified nuclides that could be used to spike the test samples, the maximum Bq/sample for each radionuclide for both the 60 day turnaround and 8 h turnaround programs, and the traceability testing criteria. These NRIP program requirements were established during detailed discussions between the participating laboratories and NIST. Considerations for the agreed upon activity levels included the participating laboratories’ measurement sensitivity, the potential for laboratory contamination, and activity levels sufficient to permit measurements precise enough to meet the traceability testing criteria.

Since its beginning in 1997, more than twenty laboratories have participated in NRIP, and the program continues today. The routine monitoring portion of the program has a measurement turnaround time of two months, while the emergency response portion is limited to eight hours.

While providing radioactivity traceability is a mission unique to NIST, NIST does not provide regulatory pass/fail assessment criteria for traceability. NIST depends on the individual programs under which the participating laboratories operate to provide the necessary program-specific assessment criteria. In the absence of programmatically declared assessment criteria, NIST invokes, with the agreement of the participating laboratories, the consensus ANSI N42.22 and ANSI N13.30 testing criteria [1, 2].

Periodically, NIST conducts a global evaluation of the results received by NRIP [3,4,5]. This paper summarizes the 10 year evaluation of the routine monitoring part of the program. The laboratory performance issues evaluated were the difference [“bias”] from the NIST certified value, the variation, and the relative importance of each of the following factors:

  • Which matrix demonstrated best agreement with NIST certified values? Worst?

  • Which year demonstrated best agreement with NIST certified values? Worst?

  • Which radionuclide demonstrated best agreement with NIST certified values? Worst?

  • Were there any outliers affecting data?

  • Which laboratory demonstrated best agreement with NIST certified values? Worst?

Methods

The data from the routine monitoring section of NRIP were selected for assessment because they are less affected by the counting uncertainty components that can dominate measurement uncertainty under the emergency response time frame. All 1269 measurement results collected from the 24 participating laboratories across the 10 years [1997–2006], for all 5 matrices and 15 radionuclides, were compiled into an Access database. No “outliers” were excluded initially, so that their effects on the data distributions could be assessed. The Access database was then exported to Dataplot [6] for statistical evaluation, where the variations of the mean, median, and data distributions could be visualized.
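For readers who wish to reproduce this style of evaluation, the sketch below illustrates the computation in Python rather than Dataplot. It is a minimal illustration only, assuming the database has been exported to a CSV file with hypothetical column names (lab, year, matrix, nuclide, reported_Bq, certified_Bq); it is not the authors’ actual workflow.

```python
# A minimal sketch (not the Dataplot workflow used in the study) of the
# statistical evaluation: relative difference from the certified value,
# summarized by each programmatic factor. Column names are hypothetical.
import pandas as pd

df = pd.read_csv("nrip_results.csv")  # hypothetical export of the Access database

# Relative measurement difference (%) from the NIST certified value
df["rel_diff_pct"] = 100.0 * (df["reported_Bq"] - df["certified_Bq"]) / df["certified_Bq"]

# Unweighted mean and median differences for each factor of interest
for factor in ["lab", "year", "matrix", "nuclide"]:
    summary = df.groupby(factor)["rel_diff_pct"].agg(["count", "mean", "median"])
    print(f"\n--- {factor} ---\n{summary}")
```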

Results and discussion

The statistical evaluation of the measurement results collected by NRIP over ten years is summarized in Figs. 1, 2, 3, 4 and 5.

Fig. 1

What is the relative importance of each factor? The mean relative measurement difference from the NIST certified values [Y-axis] for the 24 participating laboratories, the years 1997 to 2006, the 5 test matrices, and the 15 radionuclides [X-axis] is presented

Fig. 2

Which matrix demonstrated best agreement with NIST certified values? Worst? a The effect of test matrix on variance performance. Y-axis = Relative difference for all 1269 measurements from the NIST certified value for each matrix. X-axis = Matrix where AW = Acidified Water; GF = Glass Fiber Air Filter; SF = Synthetic Fecal; SS = Spiked Soil; SU = Synthetic Urine. The solid horizontal line is the NIST certified value. Also noted are the number of measurement observations for each matrix. b The effect of test matrix on the mean measurement performance. Y-axis = Mean relative difference for all 1269 measurements from the NIST certified value for each matrix. X-axis = Matrix abbreviations as in a. The dashed horizontal line is the mean −1.06% measurement difference from the NIST certified value across all matrices for all measurements. Also noted are the number of measurement observations for each matrix. c The effect of test matrix on the median measurement performance. Y-axis = Relative difference from the NIST certified value for each matrix. X-axis = Matrix abbreviations as in a. The dashed horizontal line is the median −1.8% measurement difference from the NIST certified value across all matrices for all 1269 measurements. Also noted are the number of measurement observations for each matrix. d Outlier identification using Normal Probability Plots. Y-axis = Ranked % measurement difference from the NIST certified value for all measurements. X-axis = Standard deviation

Fig. 3

a Which year demonstrated best agreement with NIST certified values? Worst? The effect of year on the median measurement performance. Y-axis = Median relative measurement difference for all measurements from the NIST certified value, and X-axis = each of the 10 years. The dashed horizontal line is the median measurement difference from the NIST certified value across all years. b The effect of year on the mean measurement performance. Y-axis = Mean relative measurement difference for all measurements from the NIST certified value, and X-axis = each of the 10 years. The dashed horizontal line is the mean measurement difference from the NIST certified value across all years

Fig. 4

a Which radionuclide demonstrated best agreement with NIST certified values? Worst? The effect of test radionuclide on measurement performance. Y-axis = Relative difference of the 1269 measurement results for all 5 matrices from the NIST certified values, and X-axis = all 15 radionuclides. The solid horizontal line is 0 percent difference from the NIST certified value. Also noted are the number of measurement observations for each radionuclide. b The effect of test radionuclide on measurement performance for the acidified water matrix. Y-axis = Relative difference of the 299 measurement results from the NIST certified values, and X-axis = all 15 test radionuclides. The solid horizontal line is 0 percent difference from the NIST certified value. Also noted are the number of measurement observations for each radionuclide

Fig. 5

a Mean performance of the 24 participating laboratories. Y-axis = Mean relative difference from the NIST certified value for all measurements, and X-axis = each of the 24 participating laboratories. The solid horizontal line is 0 percent difference from the NIST certified value, and the dashed line represents the mean difference for the combined laboratory results. b Laboratory performance for individual radionuclide measurements. Y-axis = Relative difference of the 1269 measurements from the NIST certified value for all 15 radionuclides, and X-axis = each of the 24 participating laboratories. The solid horizontal line is 0 percent difference from the NIST certified value. Also noted are the number of measurement observations by each laboratory. c How do long-term participants compare? The solid line between dots reflects the relative mean measurement difference from the NIST certified value [Y-axis] across all radionuclides and matrices for lab 11, and the dashed line between dots represents the mean results for lab 22 from 1997 to 2006 [X-axis]. The upper horizontal dashed line is the mean result over the 1997–2006 timeframe for lab 22, and the lower horizontal dashed line is for lab 11’s performance over the same timeframe. The solid horizontal line at 0% difference represents the NIST certified values

What was the relative importance of each NIST programmatic factor?

Figure 1 depicts the variation of the relative mean difference from the NIST certified values as a function of laboratory, year, matrix, and radionuclide. The factors that showed the largest mean deviations from the NIST values were radionuclide and laboratory [up to 50%], while year and matrix showed little effect [maximum < 5%].

Which matrix showed the best agreement with NIST certified values? Worst?

Figure 2a–d presents the results sorted by test matrix. The data are displayed from left to right for acidified water [AW], glass fiber [GF] air filters, synthetic fecal material [SF], spiked soil [SS], and synthetic urine [SU]. Figure 2a shows that the bulk of the 1269 measurement results are within ~ 50% of the NIST values across all matrices, but there are deviations as large as ~ 340%, predominantly due to 90Sr measurements. Figure 2b and c show the unweighted mean and median differences among the matrices; because the measurement data were non-normally distributed, the median depiction is preferred. The maximum median deviation among the matrices was ~ 3.5%. Examination of Fig. 2c indicates that the best unweighted median results came from analyses of radionuclides on glass fiber air filters, acidified water, and synthetic urine, while the synthetic fecal and spiked soil matrices were more problematic.

Non-normal measurement result distributions were identified using Normal Probability Plots like that depicted in Fig. 2d, where the 5 points on the right side of the curve deviate significantly from a normal distribution. The non-normality of the measurement data distribution supports the preference for the median characterization of the data. Identification of these 5 “outliers” provides an opportunity for future “root-cause” study and analytical method adjustments where warranted.
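A minimal sketch of this screening step, using SciPy’s probability plot rather than Dataplot, is given below; the data frame and column names continue the hypothetical schema introduced in the Methods sketch.

```python
# Normal probability plot in the spirit of Fig. 2d: ordered % differences
# plotted against theoretical normal quantiles. Points that fall far off the
# fitted straight line (e.g., an extreme right tail) flag candidate "outliers".
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats

df = pd.read_csv("nrip_results.csv")  # hypothetical export, as in the earlier sketch
df["rel_diff_pct"] = 100.0 * (df["reported_Bq"] - df["certified_Bq"]) / df["certified_Bq"]

stats.probplot(df["rel_diff_pct"], dist="norm", plot=plt)
plt.ylabel("Ranked % difference from NIST certified value")
plt.xlabel("Theoretical quantiles [standard deviations]")
plt.show()
```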

Which year showed the best agreement with NIST certified values? Worst?

Figure 3a and b show the variation of the relative unweighted median and mean differences from the NIST certified values. The difference between Fig. 3a and b indicates that the mean and median evaluations disagree, which in turn indicates that the distribution of yearly unweighted means is non-normal and that the median is preferred. In Fig. 3a, the maximum deviation of the median values from the NIST certified values was only ~ 3%, indicating that time is not a major factor.
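The non-normality behind this mean-versus-median disagreement can also be checked with a formal test, which the paper does not use but which complements the visual comparison; the sketch below applies a Shapiro-Wilk test to each year’s differences under the same hypothetical schema.

```python
# Shapiro-Wilk normality test on each year's % differences; a small p-value
# indicates non-normality and supports preferring the median over the mean.
import pandas as pd
from scipy import stats

df = pd.read_csv("nrip_results.csv")  # hypothetical export, as in the earlier sketch
df["rel_diff_pct"] = 100.0 * (df["reported_Bq"] - df["certified_Bq"]) / df["certified_Bq"]

for year, diffs in df.groupby("year")["rel_diff_pct"]:
    w_stat, p_value = stats.shapiro(diffs)
    verdict = "non-normal" if p_value < 0.05 else "consistent with normal"
    print(f"{year}: W = {w_stat:.3f}, p = {p_value:.3g} -> {verdict}")
```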

Which radionuclide showed the best agreement with NIST certified values? Worst?

Figure 4a presents the results among the fifteen certified radionuclides [133Ba, 137Cs, 152Eu, 230Th, 232Th, 234U, 235U, 238Pu, 238U, 239Pu + 240Pu, 241Am, 54Mn, 60Co, 65Zn, 90Sr] that were gravimetrically spiked into all five NRIP matrices and verified by radiochemical assay. Globally, across all matrices, the most challenging radioanalysis for the participating laboratories was 90Sr. Even for acidified water samples, 90Sr measurements were challenging and had a wider dispersion of results than the other radionuclides, as shown in Fig. 4b. While radiochemical measurements of 238U, 234U, and 239Pu + 240Pu have the advantages of internal tracers and alpha energy spectroscopic measurement systems, the measurement of 90Sr is complicated by its complex radiochemistry and by the continuous energy spectrum of its beta particles, which necessitates non-spectroscopic detector measurement and calibration processes. Results for radionuclides determined by gamma spectrometry were fairly good, but care was needed to adequately calibrate the instruments for geometry and sample density and to correct for true coincidence summing.

Note that gross alpha/beta results are not included in the figures and will not be discussed in detail. As a general comment, however, the reported results varied widely from the NIST certified values [> 100%]. It is suspected that the discrepancy is mainly due to the difference between the average energy of the fifteen radionuclides in the NRIP test cocktail and the energy of the sources used for the participating laboratories’ instrument calibrations. As a general caution for gross alpha/beta field screening measurements, accuracy can be greatly improved by calibrating the measurement instruments with sources of appropriate energy, or by applying a correction factor that accounts for the difference between the calibration energy and that of real samples, as sketched below.
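The correction factor idea can be illustrated with a short sketch; the efficiencies and count rate below are purely illustrative numbers, not values from the NRIP data.

```python
# Gross alpha/beta screening: counts are converted to activity using the
# detector efficiency at the calibration source energy. If the sample's
# radionuclide mix has a different average energy (and hence a different
# efficiency), the result is rescaled by the ratio of the two efficiencies.
def corrected_activity_bq(net_count_rate_cps: float,
                          eff_at_cal_energy: float,
                          eff_for_sample_mix: float) -> float:
    uncorrected_bq = net_count_rate_cps / eff_at_cal_energy
    correction_factor = eff_at_cal_energy / eff_for_sample_mix
    return uncorrected_bq * correction_factor

# Hypothetical example: efficiency 0.45 at the calibration energy, but only
# 0.30 for the lower average energy of the sample cocktail.
print(f"{corrected_activity_bq(10.0, 0.45, 0.30):.1f} Bq")  # 33.3 Bq, not 22.2 Bq
```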

Were there any outliers affecting data?

As pointed out previously, “outliers” represented a very small fraction of all of the measurement results reported during this ten year study. However, the topic of discovering underlying bias will need further examination in the future.

In general, this study shows that only an occasional “outlier” was noted, at a rate within statistical expectations. Nine of the 1269 measurements were visually assessed as potential “outliers,” a rate of ~ 0.7%. An even better estimate comes from the 5 “outliers” identified in the Normal Probability Plot of Fig. 2d, which indicates an “outlier” rate of ~ 0.4%. It could be argued that the rate of “outliers” seen in this study confirms good measurement control among the participating laboratories.

Which laboratory showed the best agreement with NIST certified values? Worst?

Figure 5a presents the mean laboratory performance over time, matrices, and radionuclides. Across all participating laboratories, the combined average performance was within ~ 2% of the NIST certified values. However, two laboratories performed beyond 10% of the NIST values and two more in excess of 20%. The other 20 laboratories were well within 5% of the certified values.

Figure 5b presents the laboratories’ radionuclide measurement performance across all matrices and years. While the bulk of the results hover around “0” deviation from the certified values, there are ~ 9 apparent “outlier” entries, mostly due to 90Sr. Root-cause investigations of these “outliers” could help the participating laboratories improve their process control, reduce analytical sources of dark bias, and reduce their measurement uncertainties.

Figure 5c shows the historical performance of laboratories 11 and 22. Laboratory 22 demonstrated consistent results, with an overall deviation of ~ 3% over 10 years. By contrast, laboratory 11 began its first four years of evaluation with deviations from the certified values on the order of 12%, but improved its performance over the next 6 years to show capabilities approaching those of laboratory 22. Participation in the NRIP traceability evaluation program thus provides documentation of both continued high quality performance and improving performance.

Conclusions

The major conclusions from this assessment were: [1] no significant change in the global data over time, [2] only a small number of “outlier” measurement results over all the factors investigated, [3] global areas for improvement were identified, and [4] NRIP was important for laboratories to: [a] track performance, [b] improve performance, and [c] enhance quality assurance/control.

Further research identified by this assessment includes: [1] more granular inspection of the database on a laboratory specific basis, e.g., dark sources of analytical bias, [2] root-cause bases for method improvements, [3] attention to answering “How good is good enough?” to establish programmatic-based traceability pass/fail criteria, [4] emergency response, e.g., development of faster radiochemical procedures and more accurate measurement methods so that more time can be spent on counting to reduce its contribution to the combined uncertainty, and [5] advanced measurement methods for higher precision and accuracy. Additionally, the results from this study provide a platform for discussion of new radionuclides, matrices, and interferences to be included in future NRIP exercises.