Introduction

The certification of Standard Reference Materials (SRMs) by the National Institute of Standards and Technology (NIST) involves careful and accurate assessment of the measurand and its associated uncertainty for value assignment. NIST’s portfolio of 3100 series SRMs (single-element solutions) and 3180 series SRMs (anion solutions) supports the calibration of instrumental and classical methods of analytical chemistry.

These solutions, which are intended to be used as primary calibration standards, provide a clear and relatively short traceability chain linking measurements made in science, medicine, commerce, industry, and agriculture to the SI units of mass and amount of substance. In each of them, the certified value for the mass fraction of a single element or ion is obtained following the requirements described by Beauchamp et al. [1].

Of the 74 SRMs within the scope of these two programs, 67 are single-element solutions covering a majority of the elements in the periodic table, including but not limited to the transition metals, alkali metals, and alkaline earths. Most of the single-element solutions are made from high-purity elements dissolved in acid and diluted to a final nominal mass fraction of 10 mg/g. These solutions are generally stored either in borosilicate glass ampoules or in high-density polyethylene bottles that are sealed in aluminized polyester bags to promote the long-term stability of the analyte in solution.

The 3180 series of SRMs currently comprises seven reference materials with certified mass fractions of non-metal anions in solution (referred to as anion solutions) that can include both mono-atomic and poly-atomic ions: bromide, chloride, fluoride, iodide, nitrate, phosphate, and sulfate. The anion solutions are prepared from high-purity salts dissolved in water to a nominal mass fraction of 1 mg/g.

The current procedure for certifying SRMs mandates that the mass fraction of the measurand in each of the 3100 or 3180 series SRMs be measured using two different methods applied independently of one another [1], although both methods might not be used to assign the certified value.

For the single-element solutions, the methods are gravimetric preparation from carefully assayed source materials, and high-performance inductively coupled plasma optical emission spectroscopy (HP-ICP-OES) [2,3,4,5]. For the anion solutions, the methods are gravimetric preparation from carefully assayed source materials and ion chromatography (IC), the latter following a high-performance calibration protocol similar to the HP-ICP-OES method [6].

The uncertainties associated with the values measured using each method (IC, HP-ICP-OES, and gravimetric preparation) are evaluated in accordance with the Guide to the Expression of Uncertainty in Measurement (GUM) [8]. When prepared from a carefully assayed high-purity material, the final mass fractions of the solutions are determined by combining the results from the gravimetric preparation with the results obtained using an instrumental method, either IC or HP-ICP-OES.

Presently, value assignment and uncertainty evaluation for 3100 and 3180 series SRMs accounts for between-method uncertainty using the NIST Consensus Builder [9], in particular the DerSimonian-Laird procedure [10] as described by Koepke et al. [11]. This procedure captures and propagates dark uncertainty [12], which is the component of uncertainty manifest in the between-method difference in excess of what the uncertainty budgets of the individual measurement methods recognize.

This manner of combining gravimetric and instrumental results suffers from two kinds of shortcomings: those that pertain specifically to the DerSimonian-Laird procedure [13,14,15], and those that derive from evaluating the between-method uncertainty from a single difference (between the gravimetric and instrumental determinations). The main purpose of this contribution is to remedy the latter, while also addressing the former.

Weber et al. [16] have recommended recently that, for meta-analyses including only two studies, a Bayesian approach “using a weakly informative prior for the heterogeneity may help.” Since the value assignment and uncertainty evaluation for our single-element and anion solutions is in fact a form of meta-analysis, and in many cases there is significant heterogeneity (that is, the absolute value of the between-method difference significantly exceeds what should be expected in light of the method-specific uncertainties), this recommendation applies here.

The new procedure described in this contribution exploits the historical information from pairs of measurements (gravimetric and HP-ICP-OES) for 3100 series SRMs produced over the course of the past 14 years, and from pairs of measurements (gravimetric and IC) of the 3180 series SRMs produced over the course of the past 18 years, to characterize between-method differences, which will then be combined with the information about the between-method difference provided by the measurement results for a new SRM.

The improvements we describe in this contribution are these:

  (i) To capture and use the knowledge about between-method differences that has been acquired during the long history of development of SRMs in series 3100 and 3180, and to leverage it to enhance the reliability of the evaluation of the component of uncertainty attributable to between-method differences;

  (ii) To estimate the measurand (that is, to assign a value to the SRM), and to evaluate the associated uncertainty, taking into account method-specific uncertainties, between-method differences, and potential long-term instability, in an integrated and internally coherent way that also overcomes the aforementioned shortcomings of the DerSimonian-Laird procedure.

We emphasize three important facts:

  (a) The proposed new method is applied here to existing SRMs only for purposes of illustration, and to assess the impact that the new method has on the evaluation of the uncertainty resulting from blending the gravimetric and instrumental determinations. In the case of a new SRM, the blended value would become the certified value.

  (b) Neither assigned values nor uncertainties stated in currently valid certificates will change in consequence of this exercise. Therefore, the only authoritative value assignments to the existing SRMs in series 3100 and 3180, and associated uncertainty evaluations, continue to be those listed in their respective certificates as published by NIST, until they are replaced by new lots of the same SRM or reach the end of their periods of validity.

  (c) This proposed new procedure is applied retrospectively to existing SRM lots to demonstrate how the use of historical information enhances the quality of the evaluation of the uncertainty component attributable to differences between analytical methods. Such enhancement results from the use of historical data about between-method differences, and it does reduce the overall expanded uncertainty for the large majority of the SRM lots of both single-element and anion solutions.

The scope of this work includes only the uncertainty contributions from the gravimetric preparation (“Analytical methods-gravimetric preparation”), instrumental analysis (“Analytical methods-instrumental measurement”), long-term instability (“Long-term instability”), and between-method differences (“Model specification” and “Model fitting”). A reassessment of contributions from other, less influential sources of uncertainty, such as short-term (shipping) instability of the solutions, may be necessary as their relative contribution increases when the relative contribution from method differences decreases.

“Historical overview” provides an overview of the history of the development of the SRMs in series 3100 and 3180. “Materials and methods” reviews the analytical methods employed for the measurements that contribute to the assignment of value to these SRMs: gravimetric preparations for both series, and different instrumental methods for the single-element solutions and for the anion solutions. In “Long-term instability” we describe how we evaluate the component of uncertainty attributed to potential long-term instability of the solutions.

“Example of application” motivates the new procedure, introduces the underlying statistical model, and describes an example of application of the new procedure to an existing SRM, comparing the results with their counterparts in the corresponding certificate.

“Model specification” specifies the statistical model in detail, and “Model fitting” explains how the model is fitted to the gravimetric and instrumental measurement results for a new SRM, and how the resulting output is used for value assignment and uncertainty evaluation.

“Results” summarizes the results of applying the new procedure retrospectively to existing SRMs whose uncertainty component attributable to long-term instability has been evaluated separately from the uncertainty evaluations for the gravimetric and instrumental determinations of the measurand, and compares these results with the certified values and associated uncertainties for the same SRMs.

“Discussion” discusses the results, highlighting some SRMs that stand out in one way or another in the graphical summaries presented in “Results.” “Conclusion” summarizes our conclusions. The supplementary material associated with this article lists relevant computer codes and details practical matters about their use.

Historical overview

During the early stages of the 3100 series SRM program, in the 1980s and 1990s, the value was assigned to the measurand based on the gravimetric preparation alone. Another method was used to confirm this value but did not contribute to the certified value that was listed on the certificate of analysis. As the ability to perform highly accurate and precise measurements using instrumental methods increased, a second method started being used to assign the certified value, reducing the risk that some unknown bias might remain undetected.

The combination of results from two different methods of analysis is an instance of consensus building or meta-analysis [11, 17]. However, since two measurement methods typically yield different results, usually there will be a component of uncertainty attributable to between-method differences, which has to be evaluated and propagated to the final results.

This component of uncertainty was recognized soon after a second method started being used to assign a certified mass fraction to the 3100 and 3180 series SRMs. The approach employed to combine the results from two different methods and to incorporate the between-method uncertainty into the expanded uncertainty has evolved over time. Possolo et al. [18] describe the procedure that has been used most recently for the single-element solutions.

The earliest attempts at determining the uncertainty component attributable to between-method differences were rudimentary and generally conservative, aiming mostly to capture the impact of any potential bias. The methods that have been in use more recently to quantify the between-method uncertainty are more rigorous but are still amenable to improvement, which will produce more realistic uncertainty evaluations and hopefully will decrease the uncertainty of measurements made using the resulting SRMs as calibrants.

Frequently, the results from the gravimetric preparation and from the instrumental method are in very good agreement, and the corresponding estimate of dark uncertainty turns out to be 0 mg/g. Since the same two measurement methods are used for value assignment throughout the 3100 series and 3180 series SRMs, the typical between-method uncertainty ought not to be overridden by such exceptional, possibly fortuitous mutual agreement resting on a single difference. For this reason, the new procedure entertains the possibility of there being some significant dark uncertainty that remains undetected even in such cases of good mutual agreement between the measurement methods.

Since around 2006, the evaluation of the between-method uncertainty has been based on comparisons between corresponding gravimetric and HP-ICP-OES measurement results for 60 single-element solutions whose elemental mass fractions have been certified individually. This evaluation involves forming the ratios between corresponding gravimetric and HP-ICP-OES measured values, computing the relative standard deviation of these ratios, dividing it by \(\sqrt{2}\), and using the result as an estimate of the between-method uncertainty, which is then folded into the calculation of the combined standard uncertainty both for the gravimetric preparation and for the HP-ICP-OES determination.
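The arithmetic of this traditional evaluation can be sketched as follows. This is only an illustration: the paired values are hypothetical placeholders (not NIST data), the function name is invented here, and the direction of the ratio (gravimetric over instrumental) is an assumption, since the text does not state which value goes in the numerator.

```python
import math
import statistics

def traditional_between_method_u(grav, instr):
    """Relative standard deviation of the paired ratios, divided by sqrt(2)."""
    # Assumption: ratio taken as gravimetric / instrumental.
    ratios = [g / i for g, i in zip(grav, instr)]
    rsd = statistics.stdev(ratios) / statistics.mean(ratios)
    return rsd / math.sqrt(2)

# Hypothetical paired measured values (mg/g), for illustration only.
grav = [10.000, 9.998, 10.003, 10.001]
instr = [10.004, 9.995, 10.001, 10.006]

u_rel = traditional_between_method_u(grav, instr)
print(f"relative between-method uncertainty: {u_rel:.2e}")
```

The division by \(\sqrt{2}\) is what splits the dispersion of the ratios evenly between the two methods, which is the feature criticized in the first shortcoming listed next.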

This traditional method of incorporating the between-method uncertainty into the combined uncertainty has two significant shortcomings:

  (i) First, splitting the between-method uncertainty evenly between the gravimetric preparation and HP-ICP-OES, and then treating the two pieces as if they were method-specific, suggests that they could somehow counterbalance each other and neutralize the effect, when one knows that this is not how between-method differences impact the results;

  (ii) Second, the between-method uncertainty for the anion solutions, which are prepared gravimetrically and measured instrumentally using IC, was assumed to be the same as the method uncertainty for the single-element solutions, whose instrumental determination is made using HP-ICP-OES. This assumption was made because there was not sufficient data to evaluate the between-method uncertainty specific to the differences between the gravimetric and IC determinations for the anion solutions, and this approach was thought to provide the best estimate available.

The new procedure described in “Model specification” and “Model fitting” overcomes both shortcomings. However, it pools the historical information available for the single-element solutions and for the anion solutions because there are many lots of the former and few of the latter, and they appear to convey mutually consistent historical information about between-method differences. As additional lots of the anion SRMs are produced, if they convey historical information different from the information conveyed by the single-element solutions, then the same method described here will continue to apply, except that two different repositories of historical information will be used, one for the single-element solutions, another for the anion solutions.

When the proposed new procedure is used to characterize a new SRM, as illustrated in “Example of application” for a specific case, it blends historical information about between-method differences with the actually observed difference in the case under consideration, weighing one and the other optimally by application of Bayes’s rule [19].

Materials and methods

Analytical methods — gravimetric preparation

The NIST 3100 single-element solution SRMs are generally prepared from high-purity metals assayed for purity to establish an initial link to the SI. These assays are usually performed indirectly: all elements expected to be present, other than the element being assayed, are measured, and the overall purity is assigned by a mass balance approach [20, 21]. Although counter-intuitive, this approach leads to a much lower assay uncertainty than measuring the high-purity element directly [22], a critical aspect because the assay uncertainty will dictate the precision of any measurements down the traceability chain.

For some 3100 single-element solution SRMs, it is more practical to use a high-purity salt or oxide for the gravimetric preparation, reserving the high-purity metal for calibration solutions measured with HP-ICP-OES. Unless the high-purity salt is another SRM, only one method (HP-ICP-OES) is used to assign a certified value and uncertainty, although any method biases or additional uncertainty attributable to possible instability over time would also be incorporated into the measurement result. The SRM 3180 series anion solutions are all prepared from high-purity sodium or potassium salts of the anion of interest. The anion salts are assayed for mass fraction of the anion and the associated uncertainty is evaluated.

The 3100 series SRMs are prepared by first carefully determining the mass of an appropriate quantity of the source material, then dissolving the source material in a small amount of solvent, typically concentrated acid. Dissolution is followed by dilution in a large carboy, with an acidic aqueous solution, to the target analyte mass fraction of 10 mg/g; the acid promotes stability of the solution over the shelf life of the material. For the 3180 series anion solution SRMs, salts are dissolved in a small amount of high-purity water and then diluted to the target anion mass fraction of 1 mg/g with high-purity water.

For both 3100 and 3180 series SRMs, solutions are homogenized by mixing and then sealed into borosilicate glass ampoules or portioned into high-density polyethylene bottles. Through careful measurement of the masses of both the primary materials and the resulting solutions, a gravimetric value can be determined, and its associated uncertainty can be evaluated. The uncertainty of the gravimetric preparation is combined with the uncertainty of the purity of the source material, to evaluate the expanded uncertainty of the gravimetric method [18].
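A minimal sketch of how a gravimetric value and its standard uncertainty might be computed is given below. It assumes that the mass fraction is simply the assayed purity times the mass of source material divided by the mass of final solution, and that independent relative uncertainties combine in quadrature (first-order propagation in the sense of the GUM). Corrections such as air buoyancy are omitted, all numerical inputs are hypothetical, and the function names are invented for this illustration.

```python
import math

def gravimetric_mass_fraction(m_source_g, purity, m_solution_g):
    """Analyte mass fraction (mg/g) from source mass, purity, and solution mass."""
    return 1000.0 * purity * m_source_g / m_solution_g

def combine_relative_u(*rel_u):
    """Combine independent relative standard uncertainties in quadrature."""
    return math.sqrt(sum(u * u for u in rel_u))

# Hypothetical inputs, for illustration only.
m_source_g = 100.0      # mass of high-purity source material
purity = 0.9999         # assayed mass fraction of analyte in the source (g/g)
m_solution_g = 10000.0  # mass of the final solution

w_grav = gravimetric_mass_fraction(m_source_g, purity, m_solution_g)

# Hypothetical relative standard uncertainties: source mass, purity, solution mass.
u_rel = combine_relative_u(2e-5, 5e-5, 1e-5)
u_grav = w_grav * u_rel

print(f"w_G = {w_grav:.4f} mg/g, u(w_G) = {u_grav:.5f} mg/g")
```

For a salt or oxide source, the `purity` argument would stand for the assayed mass fraction of the analyte in the compound rather than the purity of a metal.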

As the stock of each of these SRMs is sold out or reaches the end of its period of validity, a new solution of the same element or anion is prepared and labeled with a new lot number for the same SRM. Thus, multiple lots of most of these SRMs have been prepared over time.

Analytical methods — instrumental measurement

HP-ICP-OES — procedure

Prior to analysis of the new lots of SRM solutions, a set of calibration standards is prepared using a high-purity material of the element of interest. The source materials for these calibration standards are assayed for purity and, if practical, are obtained from a source material containing the element or anion that is independent from the source material used to produce the respective SRM. The calibration solutions are prepared similarly to the preparation of the SRMs, but on a smaller scale and in multiple batches. Batches are validated for mass fraction against each other and against older lots to ensure continuity and stability, and to confirm that the new calibrants are fit for purpose.

Analysis of the SRM solutions by HP-ICP-OES follows an experimental protocol explicitly designed to give the most accurate and precise results possible. The details of the experimental design can be found elsewhere [2,3,4,5]. Briefly, the experimental protocol is as follows:

  (a) Working solutions are prepared from calibrants and the new lot of SRM using an exact matching protocol [5]. These working solutions have nearly identical amounts of analyte, internal standard, acid, and water, so that any non-linearity of the response of the HP-ICP-OES instrument is inconsequential. This is possible only because the new, “unknown” lot of SRM in fact is relatively well known, having already been carefully prepared gravimetrically.

  (b) The SRM solutions and calibration solutions are analyzed in a randomized order, and the intensities for the analyte and internal standard are extracted. The ratios of the signal of the analyte to the internal standard for each sample are calculated and corrected for instrumental drift using the method described by Salit and Turk [2].

  (c) The analyte mass fraction in the new lot of SRM is determined using the known masses of analyte in the calibration solutions.

  (d) Uncertainty of the determined HP-ICP-OES mass fraction is derived from the variability of the values measured for the analytical samples, variability associated with the instrument’s sensitivity coefficient, HP-ICP-OES method uncertainty, and the uncertainty of the calibration solutions. In most cases the relative uncertainty achieved is small because the critical measurements in the process are shifted from the HP-ICP-OES to the analytical balance. Therefore most, but not all, uncertainties present other than the between-method uncertainty are related to the ability to prepare solutions carefully using gravimetric methods.
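The core of steps (b) and (c) can be illustrated with a deliberately simplified sketch. It assumes that the drift-corrected signal ratios are already available, that the amount of internal standard is the same in every working solution, and that the response is proportional to the analyte mass fraction over the narrow range spanned by exactly matched solutions. The drift correction of Salit and Turk [2] is not reproduced, and all names and numbers below are hypothetical.

```python
from statistics import mean

def mean_sensitivity(cal_ratios, cal_mass_fractions):
    """Average signal ratio per unit analyte mass fraction in the calibrants."""
    return mean(r / c for r, c in zip(cal_ratios, cal_mass_fractions))

def srm_mass_fraction(sample_ratios, dilution_factor, cal_ratios, cal_mass_fractions):
    """Mass fraction in the undiluted SRM (same units as the calibrants).

    dilution_factor = mass of working solution / mass of SRM aliquot.
    """
    sensitivity = mean_sensitivity(cal_ratios, cal_mass_fractions)
    w_working = mean(sample_ratios) / sensitivity
    return w_working * dilution_factor

# Hypothetical drift-corrected ratios (analyte / internal standard).
cal_ratios = [2.001, 1.999, 2.000]
cal_mass_fractions = [0.1000, 0.1000, 0.1000]   # mg/g in the working solutions
sample_ratios = [2.004, 2.003, 2.005]

w_instr = srm_mass_fraction(sample_ratios, 100.0, cal_ratios, cal_mass_fractions)
print(f"w_I = {w_instr:.3f} mg/g")
```

Because the working solutions are nearly identical, the result depends mainly on the gravimetrically known quantities (calibrant mass fractions and dilution factor), which is the sense in which the critical measurements are shifted to the analytical balance.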

IC — procedure

Analysis of new lots of SRM solutions by IC uses the same general protocols and calculations as HP-ICP-OES, as described above. However, there are significant differences between the two methods and the related solution preparations. The anion SRM solutions and the dilutions for IC analysis are all prepared in water rather than in acid. For IC measurements, all the ions are separated before detection, thus limiting the matrix effects, which can present challenges for some HP-ICP-OES measurements. There are also differences between IC and HP-ICP-OES detectors and their respective sensitivities. For these reasons, the method uncertainty for IC may need to be determined independently of HP-ICP-OES as more measurement results are accumulated.

Long-term instability

As the solutions age, the mass fraction of the element or anion in solution may remain invariant, or it may change, either continuously over time (with or without a well-defined trend), or abruptly. In “Example of application” and “Model specification” we model the result of such change as a random drawing from a probability distribution that has mean 0 mg/g and standard deviation set equal to the standard uncertainty that quantifies the potential long-term instability.

Using a probability distribution to characterize the impact of potential long-term instability means that the true value of the mass fraction in solution, at any particular time during the period of validity of the certified value, is unpredictable, and the best one can do is to characterize its typical size, in the form of such standard deviation.

By and large, materials in both series 3100 and 3180 do remain stable during the periods of validity stated in their certificates, which are quite long. For example, the certified value for SRM 3118a (Lot No. 200511) Gadolinium (Gd) Standard Solution, whose certificate was issued in 2021, has an initial period of validity that ends in 2032; and the certified value for SRM 3185 (Lot No. 170309) Nitrate Anion (\(\text {NO}_{3}^{-}\)) Standard Solution, whose certificate was issued in 2017, has an initial period of validity that ends in 2029.

Any processes at work that may induce changes in the mass fraction in solution are very slow, therefore difficult to characterize. Rather than study these processes in detail, NIST’s choice is to produce a new lot of the material as soon as there is a mere suspicion that the mass fraction may be changing.

Some of these solutions are delivered in glass ampoules, others in polyethylene bottles. The mechanism whereby ampouled materials may become unstable is not well known, but long-term stability of bottled materials has been studied and the mechanism for their change is understood better. For bottled materials, the main cause of change for the mass fraction in solution is evaporative loss through transpiration. The rate at which solutions lose solvent depends on the acid type, acid concentration, and identity of the dissolved element.

The use of aluminized polyester bags, which are heat sealed around the bottle during production, makes the transpiration across the different bottled materials more uniform and predictable. Furthermore, losses are negligible while the bottle remains in the sealed bag. However, after the bag has been opened the bottled solutions may lose approximately 0.2 % of their mass per year.
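To give a sense of scale, the following sketch converts a mass loss into a change of mass fraction. It assumes that the lost mass is entirely solvent (so the mass of analyte is unchanged) and that the loss is a constant fraction of the initial solution mass per year; both are simplifying assumptions made here, and the function name is hypothetical.

```python
def mass_fraction_after_solvent_loss(w0, years, loss_per_year=0.002):
    """Mass fraction after solvent-only loss of a fixed fraction of initial mass per year."""
    remaining = 1.0 - loss_per_year * years
    if remaining <= 0.0:
        raise ValueError("all of the solution mass would have been lost")
    # Analyte mass is conserved, so the mass fraction scales as 1 / remaining mass.
    return w0 / remaining

w_after_one_year = mass_fraction_after_solvent_loss(10.0, years=1.0)
print(f"{w_after_one_year:.4f} mg/g")  # about 10.0200 mg/g
```

Under these assumptions, one year out of the bag raises a nominal 10 mg/g solution by about 0.02 mg/g, that is, by about 0.2 % in relative terms.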

The uncertainty component attributable to long-term instability was evaluated using the procedure outlined by Linsinger et al. [23], from stability data collected during the period 2004–2008 for the 3100 series SRMs, and during 2002–2020 for the 3180 series SRMs. This procedure uses a linear regression of the mass fraction on age, to describe the change in mass fraction values over time for bottled or ampouled SRMs. The slope of this regression is used to determine the “shelf life” of the solutions. Based on this shelf life the expected change in mass fraction is quantified as a standard deviation, and propagated to the uncertainty associated with the certified value.
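A regression-based evaluation of this kind can be sketched as below. The sketch assumes one common reading of the approach of Linsinger et al. [23], namely that the standard uncertainty for long-term instability is the standard error of the fitted slope multiplied by the shelf life; the exact computation used for these SRMs is not spelled out in the text, and the function name is hypothetical.

```python
import math

def long_term_instability(ages, values, shelf_life):
    """Least-squares slope, its standard error, and u_lts = se(slope) * shelf_life."""
    n = len(ages)
    t_bar = sum(ages) / n
    w_bar = sum(values) / n
    sxx = sum((t - t_bar) ** 2 for t in ages)
    slope = sum((t - t_bar) * (w - w_bar) for t, w in zip(ages, values)) / sxx
    intercept = w_bar - slope * t_bar
    # Residual sum of squares around the fitted line, with n - 2 degrees of freedom.
    rss = sum((w - intercept - slope * t) ** 2 for t, w in zip(ages, values))
    se_slope = math.sqrt(rss / (n - 2) / sxx)
    return slope, se_slope, se_slope * shelf_life

# Toy data chosen so the arithmetic can be checked by hand (not stability data).
slope, se_slope, u_lts = long_term_instability([0, 1, 2, 3], [0.0, 1.0, 0.0, 1.0], 10.0)
print(slope, se_slope, u_lts)
```

In practice `ages` would be the ages of the solutions at the times of measurement, `values` the corresponding measured mass fractions, and `shelf_life` the intended period of validity in the same time units.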

Note that this approach is conservative because it translates the expected change in mass fraction over the entire lifetime of the material into an uncertainty component that applies at all times until the expiration date. An alternative would be a variable uncertainty component that would increase over the lifetime of the material based on the time elapsed since the certified value was assigned. While this would result in lower uncertainties during most of the lifetime of the material, it would stand as an obstacle to the use of the material in practice.

Example of application

Table 1 lists the results from the gravimetric and instrumental determinations of the mass fraction of tin in SRM 3161a (Lot No. 140917) Tin (Sn) Standard Solution, the certified value, and associated expanded uncertainty, and their counterparts produced by the new approach that is described in detail in “Model specification” and “Model fitting.” All calculations described subsequently were done using all the digits available in the corresponding digital records of analysis, not only the digits listed in this table.

Interestingly, for this SRM at least, dark uncertainty makes the largest contribution to the combined uncertainty associated with the final result, and potential long-term instability makes the second largest. The results summarized in “Results” indicate that leveraging the historical information about between-method differences will reduce the contribution that between-method differences make to the uncertainty associated with the certified value of many SRMs.

Table 1 Measurement results and evaluation of the uncertainty component attributable to potential long-term instability for SRM 3161a (Lot No. 140917) Tin (Sn) Standard Solution, where w denotes the mass fraction, u(w) denotes the corresponding standard uncertainty, \(U_{95\,\%}(w)\) denotes the expanded uncertainty for 95 % coverage (all expressed in mg/g), and df denotes the number of degrees of freedom that u(w) is based on

Figure 1 suggests that the gravimetric and instrumental results are mutually inconsistent, which is confirmed both by Cochran’s Q test (p-value smaller than 3.5e-16) [24], and by Welch’s t test (p-value smaller than 5.5e-6) [25].
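For two methods, Cochran's Q reduces to the squared difference divided by the sum of the squared standard uncertainties, referred to a chi-squared distribution with one degree of freedom. The sketch below uses the two measured values quoted later in this section (10.0001 mg/g and 10.0217 mg/g), but the standard uncertainties are hypothetical placeholders, so the printed p-value is not the one reported above. Welch's t test is not sketched because it also needs the degrees of freedom listed in Table 1.

```python
import math

def cochran_q_two_methods(w_g, u_g, w_i, u_i):
    """Cochran's Q and its p-value for a pair of results (chi-squared, 1 df)."""
    q = (w_i - w_g) ** 2 / (u_g ** 2 + u_i ** 2)
    # For one degree of freedom, P(chi2 > q) = erfc(sqrt(q / 2)).
    p_value = math.erfc(math.sqrt(q / 2.0))
    return q, p_value

# Measured values from the text; standard uncertainties are HYPOTHETICAL.
q, p_value = cochran_q_two_methods(10.0001, 0.0010, 10.0217, 0.0020)
print(f"Q = {q:.1f}, p-value = {p_value:.1e}")
```

A large Q (small p-value) indicates that the difference between the two results exceeds what their stated uncertainties can explain.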

Fig. 1 The diamond outlines indicate the values measured gravimetrically (GRAV) and instrumentally (HP-ICP-OES) for SRM 3161a (Lot 140917) Tin (Sn) Standard Solution, and the solid diamonds indicate the certified value (C) and the corresponding value produced using the new procedure that uses historical information about between-method differences (H). The horizontal line segments represent 95 % coverage intervals centered at the measured values

The value assignment for these SRMs is currently being done using a random effects model that is able to detect mutual inconsistency between the gravimetric and instrumental results, and takes the corresponding “excess” dispersion into account [18], employing the DerSimonian-Laird procedure [10]. The corresponding statistical model represents the measured values as

$$\begin{aligned} w_{\text {G}} = \omega + \lambda _{\text {G}} + \varepsilon _{\text {G}}\quad \text {and} \quad w_{\text {I}} = \omega + \lambda _{\text {I}} + \varepsilon _{\text {I}}, \end{aligned}$$
(1)

where \(\omega\) denotes the true value of the measurand, the subscripts G and I refer to the gravimetric and instrumental determinations, \(\lambda _{\text {G}}\) and \(\lambda _{\text {I}}\) denote method effects, and \(\varepsilon _{\text {G}}\) and \(\varepsilon _{\text {I}}\) denote measurement errors.

The statistical model involves the additional assumptions that \(\lambda _{\text {G}}\) and \(\lambda _{\text {I}}\) are like two independent drawings from a Gaussian distribution with mean 0 and standard deviation \(\tau\), and \(\varepsilon _{\text {G}}\) and \(\varepsilon _{\text {I}}\) are like two independent drawings from Gaussian distributions both with mean 0 but possibly different standard deviations, \(\sigma _{\text {G}}\) and \(\sigma _{\text {I}}\). The assumption is also made that the effects attributable to differences between analytical methods, and the measurement errors, are, for all practical purposes, statistically mutually independent. The standard deviation \(\tau\) quantifies what is often called dark uncertainty [12, 26].
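One consequence of these assumptions is that the difference \(w_{\text {I}} - w_{\text {G}}\) has mean 0 and variance \(2\tau ^2 + \sigma _{\text {G}}^2 + \sigma _{\text {I}}^2\), which is why a single observed difference carries information about \(\tau\). The simulation below checks this numerically by drawing pairs from the model in Eq. (1); all parameter values are hypothetical and the function name is invented for this illustration.

```python
import random

def simulate_pairs(omega, tau, sigma_g, sigma_i, n, seed=1):
    """Draw n pairs (w_G, w_I) from the random effects model of Eq. (1)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        w_g = omega + rng.gauss(0.0, tau) + rng.gauss(0.0, sigma_g)
        w_i = omega + rng.gauss(0.0, tau) + rng.gauss(0.0, sigma_i)
        pairs.append((w_g, w_i))
    return pairs

# Hypothetical parameter values (mg/g), for illustration only.
omega, tau, sigma_g, sigma_i = 10.0, 0.015, 0.001, 0.002

diffs = [w_i - w_g for w_g, w_i in simulate_pairs(omega, tau, sigma_g, sigma_i, 50_000)]
mean_diff = sum(diffs) / len(diffs)
var_observed = sum((d - mean_diff) ** 2 for d in diffs) / (len(diffs) - 1)
var_expected = 2 * tau ** 2 + sigma_g ** 2 + sigma_i ** 2

print(f"observed {var_observed:.3e}, expected {var_expected:.3e}")
```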

Traditionally, the uncertainty contribution related to potential long-term instability is incorporated after the data reduction that produces an estimate, \(\widehat{\omega }\), of the true value of the measurand, and an evaluation of the associated uncertainty. For example, using the DerSimonian-Laird procedure, as implemented in the NIST Consensus Builder [9, 11], to fit the model in Eq. (1) to the measurement results listed in the upper part of Table 1, yields \(\widehat{\omega } = 10.011\) mg/g, \(u(\widehat{\omega }) = 0.011\) mg/g, and \(\widehat{\tau } = 0.0152\) mg/g.
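For orientation, the textbook DerSimonian-Laird moment estimator can be sketched in a few lines. This is not the NIST Consensus Builder implementation, which includes additional provisions, so it should not be expected to reproduce the values quoted above exactly; in addition, the standard uncertainties used in the example are hypothetical placeholders rather than the entries of Table 1.

```python
import math

def dersimonian_laird(values, uncertainties):
    """Textbook DerSimonian-Laird: consensus estimate, its standard uncertainty, tau."""
    k = len(values)
    w = [1.0 / u ** 2 for u in uncertainties]
    fixed_mean = sum(wi * x for wi, x in zip(w, values)) / sum(w)
    q = sum(wi * (x - fixed_mean) ** 2 for wi, x in zip(w, values))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)   # truncated at zero
    w_star = [1.0 / (u ** 2 + tau2) for u in uncertainties]
    estimate = sum(wi * x for wi, x in zip(w_star, values)) / sum(w_star)
    return estimate, math.sqrt(1.0 / sum(w_star)), math.sqrt(tau2)

# Measured values from the text; standard uncertainties are HYPOTHETICAL.
omega_hat, u_omega, tau_hat = dersimonian_laird([10.0001, 10.0217], [0.0010, 0.0020])
print(f"omega = {omega_hat:.4f}, u = {u_omega:.4f}, tau = {tau_hat:.4f} (mg/g)")
```

With only two results, the estimator simplifies to \(\widehat{\tau }^2 = \max \{0, [(w_{\text {I}} - w_{\text {G}})^2 - u^2(w_{\text {G}}) - u^2(w_{\text {I}})]/2\}\), which makes plain that it rests on a single difference and is set to zero whenever that difference is small.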

The main shortcoming of this approach is that the estimate of dark uncertainty, \(\widehat{\tau }\), is based on a single difference between measured values obtained using the two analytical methods used for certification. Even though the NIST Consensus Builder includes provisions that take this shortcoming into account, the fact remains that dark uncertainty is evaluated based on a single degree of freedom.

Furthermore, the DerSimonian-Laird procedure is more likely than the Bayesian procedure to conclude erroneously that \(\tau =0\) mg/g. Differently from classical treatments like DerSimonian-Laird’s and variance component estimation based on the analysis of variance [27], the Bayesian procedure expresses the knowledge of dark uncertainty in the form of a probability distribution over its conceivable values, and thus characterizes the state of knowledge about dark uncertainty thoroughly and comprehensively.

A more reliable estimate of \(\tau\) can be produced if one exploits the treasure trove of historical information that is available in the collection of paired gravimetric and instrumental measurements obtained for SRMs developed in the past, and combines it with the information that the difference \(w_{\text {I}} - w_{\text {G}}\) provides for a newly developed SRM. “Model specification” and “Model fitting” explain how this is accomplished using a Bayesian estimation procedure.

The following relationship, discussed in detail in “Model specification,” captures that historical information and expresses the size of the dark uncertainty relative to the gravimetric value as a function of the absolute value of the difference between the values measured gravimetrically and instrumentally relative to the gravimetric value:

$$\begin{aligned} \ln \left( \frac{\tau }{w_{\text {G}}}\right) = \alpha + \beta \ln \left( \frac{\mid w_{\text {I}}-w_{\text {G}} \mid }{w_{\text {G}}}\right) , \end{aligned}$$
(2)

where \(\ln\) denotes the natural logarithm (base e). The intercept \(\alpha\) and the slope \(\beta\) are estimated using historical data for all SRMs (both single-element and anion solutions) whose gravimetric and instrumental determinations yield mutually inconsistent results, hence a positive estimate of \(\tau\). One can then regard the value of \(\tau\) produced by Eq. (2) as an a priori estimate of \(\tau\) that can subsequently be updated considering the actual difference \(w_{\text {I}}-w_{\text {G}}\) observed for a new SRM.

The current estimates of the intercept and slope are \(\widehat{\alpha } = -0.06846\) and \(\widehat{\beta } = 1.05309\), as detailed in “Model specification.” Therefore, the a priori estimate of dark uncertainty for the data in Table 1 is (solving Eq. (2) for \(\tau\))

$$\begin{aligned} \tau _{\text {M}}=\, & (10.0001\, \text {mg/g})\;\times \\ & \exp \left\{ (-0.06846) + 1.05309 \times \ln \left( \frac{|10.0217-10.0001|}{10.0001}\right) \right\} \\ =\, & 0.0146\, \text {mg/g}. \end{aligned}$$
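The arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `tau_prior` is hypothetical, the default coefficients are the estimates quoted in the text, and the two mass fractions are those used in the worked example.

```python
import math

def tau_prior(w_g, w_i, alpha=-0.06846, beta=1.05309):
    """A priori dark uncertainty from Eq. (2), in the same unit as w_g.

    w_g, w_i: gravimetric and instrumental mass fractions (mg/g).
    alpha, beta: intercept and slope quoted in the text.
    """
    rel_diff = abs(w_i - w_g) / w_g
    if rel_diff == 0.0:
        # ln(0) is undefined; the power law tends to 0 as the difference vanishes.
        return 0.0
    return w_g * math.exp(alpha + beta * math.log(rel_diff))

# Values from the worked example in the text.
tau_m = tau_prior(w_g=10.0001, w_i=10.0217)
print(f"tau_M = {tau_m:.4f} mg/g")
```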

Equation (2) is similar to the renowned Horwitz equation, for example as reviewed by Horwitz and Albert [28] and by Meija [29, 30]. Applied to the certified value listed in Table 1, and following the suggestion from [28, Page 1100] about converting an estimate of between-laboratory reproducibility into an estimate of within-laboratory variability, the Horwitz equation (which expresses the relative standard deviation in percent) yields 10.011 mg/g \(\times\, 2 (10.011/1000)^{-0.15}\,\% / 2 = 0.2\) mg/g, which is more than 10 times larger than the corresponding value, 0.0146 mg/g, computed for \(\tau _{\text {M}}\) above.
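The Horwitz figure quoted above can be checked numerically. The sketch below assumes the form of the Horwitz equation used in the text (relative standard deviation, in percent, equal to \(2C^{-0.15}\) with \(C\) the dimensionless mass fraction) and the halving suggested in [28]; the function name is hypothetical.

```python
def horwitz_within_lab_sd(w_mg_per_g):
    """Within-laboratory standard deviation (mg/g) implied by the Horwitz equation."""
    c = w_mg_per_g / 1000.0              # dimensionless mass fraction (g/g)
    rsd_between_pct = 2.0 * c ** (-0.15)  # between-laboratory RSD, in percent
    rsd_within_pct = rsd_between_pct / 2.0  # halving suggested in [28]
    return w_mg_per_g * rsd_within_pct / 100.0

horwitz_sd = horwitz_within_lab_sd(10.011)
ratio = horwitz_sd / 0.0146  # compare with tau_M from the text
print(f"Horwitz: {horwitz_sd:.2f} mg/g, ratio to tau_M: {ratio:.1f}")
```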

This discrepancy calls into question neither the general usefulness of the Horwitz equation nor the relevance of Eq. (2) for the classes of SRMs under consideration here. It can be attributed primarily to the very close comparability of these solutions and to the tightly uniform control over their production, which has been carefully maintained throughout the history of these SRMs. By now, these SRMs cover a large proportion of the elements in the periodic table and most anions of greatest practical interest.

NIST has used the Horwitz curve previously and for a similar purpose [31]: to update certificates of old NIST reference materials, mostly metals, whose stocks have not been exhausted yet, that were developed prior to the adoption of the current practices for uncertainty evaluation. However, in that application the Horwitz equation was tuned in light of empirical data to make it more accurate and specifically relevant to a particular class of materials of much narrower scope than had been used originally to develop said curve [32].

The historical estimate of \(\tau\) from the foregoing Eq. (2) will be updated using the measurement results for the SRM of current interest, by application of the procedure described in “Model fitting,’’ which also propagates the updated estimate of \(\tau\) to the evaluation of the uncertainty associated with the estimate of the mass fraction of the measurand in the solution.

If the gravimetric and instrumental results appear to be mutually consistent, hence \(\widehat{\tau } = 0\) mg/g, the a priori estimate of \(\tau\) is set equal to \(\tau _{\text {M}}/10\). This choice is introduced in item (b) of the model specification (“Model specification’’), and it is discussed under Practical Matters in the supplementary material.

The model for the measured values in Eq. (1) is incomplete because it does not recognize the potential, long-term instability of the solution. To include the effect of instability, we add yet another effect, \(\kappa\), to the model:

$$\begin{aligned} w_{\text {G}} = \omega + \kappa + \lambda _{\text {G}} + \varepsilon _{\text {G}}, \quad \text {and} \quad w_{\text {I}} =\omega + \kappa + \lambda _{\text {I}} + \varepsilon _{\text {I}}. \end{aligned}$$
(3)

The non-observable quantity \(\kappa\) represents the contribution from long-term instability, so that one can interpret \(\omega + \kappa\) as the value of the mass fraction in solution at any future time between when the SRM was first certified, and when it reaches the end of its period of validity.

Since, in general, there is no compelling reason to expect \(\kappa\) to be positive or negative, we model it as the outcome of a non-observable random variable with mean 0 mg/g and with standard deviation \(u_{\text {S}}\) based on \(\nu _{\text {S}}\) degrees of freedom. The mean being 0 mg/g reflects our inability to state a priori whether long-term instability will cause the mass fraction to increase or decrease over time, while the uncertainty component \(u_{\text {S}}\) quantifies our estimate of the magnitude of the possible change, regardless of its direction. Using a random variable, \(\kappa\), to model long-term instability does not imply that long-term instability is “chancy” in the same sense that the number of pips shown by a casino die is chancy when it comes to rest after being rolled: it means simply that we do not know and cannot predict the value of any such change in advance.

The mass fraction of the element or anion in solution may change continuously over time, with or without a well-defined trend, or it may change abruptly: modeling it as a random drawing from a probability distribution simply means that its value at any particular moment in time is unpredictable, and the best one can do is characterize its typical size by specifying its standard deviation, \(u_{\text {S}}\).

We model \(\kappa\) as a non-observable Student’s t random variable centered at 0 mg/g rescaled to have standard deviation equal to the reported standard uncertainty \(u_{\text {S}}\) associated with long-term instability, which is based on the specified number \(\nu _{\text {S}}\) of degrees of freedom. Since neither the gravimetric nor the instrumental results provide any information about the value of \(\kappa\), its a priori mean value of 0 mg/g is also the mean value that it shall have at the end of the model fitting process described in “Model fitting.’’
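The rescaling mentioned above can be illustrated as follows. A Student’s t variable with \(\nu\) degrees of freedom has standard deviation \(\sqrt{\nu /(\nu -2)}\) when \(\nu > 2\), so dividing by that factor and multiplying by \(u_{\text {S}}\) gives the required standard deviation. This is a sketch using only the Python standard library; the values of `u_s` and `nu_s` are hypothetical, and the case \(\nu _{\text {S}} \le 2\) is not handled.

```python
import math
import random
import statistics

def sample_kappa(u_s, nu_s, n, seed=1):
    """Draw n values of kappa: Student's t, centered at 0, rescaled to SD u_s."""
    if nu_s <= 2:
        raise ValueError("the standard deviation of a t variable requires nu_s > 2")
    rng = random.Random(seed)
    scale = u_s / math.sqrt(nu_s / (nu_s - 2.0))
    draws = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        chi2 = rng.gammavariate(nu_s / 2.0, 2.0)  # chi-squared with nu_s degrees of freedom
        draws.append(scale * z / math.sqrt(chi2 / nu_s))
    return draws

# Hypothetical inputs: u_S = 0.005 mg/g based on 10 degrees of freedom.
kappa_draws = sample_kappa(u_s=0.005, nu_s=10, n=50000)
print(statistics.mean(kappa_draws), statistics.stdev(kappa_draws))
```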

The probability distribution assigned to \(\kappa\) merely describes one’s lack of knowledge about the future: about how much the mass fraction of the solute may change from when it was first certified, up until it reaches the end of its period of validity. Even though the measurement results obtained during certification provide no information about the value of \(\kappa\), its presence in the model affords a seamless propagation of the uncertainty component \(u_{\text {S}}\).

The last two rows of Table 1 show that, for this particular lot of SRM 3161a Tin (Sn) Standard Solution, both the certified value and its associated expanded uncertainty are very close to their counterparts obtained using the approach that exploits historical information. Such close agreement will prevail for most other SRMs, but the new procedure that exploits historical information will often yield smaller uncertainties than those listed in current certificates, as will become clear in “Results.”

Model specification

Table 2 summarizes the symbols used for quantities that appear repeatedly throughout this section and the next, and explains their roles succinctly, supporting the description of the statistical model underlying the procedure already illustrated in “Example of application,’’ and the explanation, detailed in “Model fitting,’’ of how this model is fitted to the measurement data.

Table 2 List of symbols used in the definition of the statistical model and in the description of how it is fitted to the measurement data. Note that \(\kappa\), \(\lambda _{\text {G}}\), \(\lambda _{\text {I}}\), \(\varepsilon _{\text {G}}\), and \(\varepsilon _{\text {I}}\) all can be positive or negative, and have unit mg/g. The magnitudes of \(\tau\) and \(\tau _{\text {REML}}\) are either zero or positive, and both have unit mg/g

The new approach is based on the relationship introduced in Eq. (2) and depicted in Fig. 2, between \(\ln ({\tau _{\text {REML}}}/{w_{\text {G}}})\) and \(\ln (|w_{\text {I}}-w_{\text {G}} |/w_{\text {G}})\), for those SRMs whose preliminary estimate of the dark uncertainty, \(\tau _{\text {REML}}\), is positive.

These preliminary estimates of dark uncertainty were derived from the comparison of gravimetric and instrumental measurement results in the context of the model in Eq. (1) fitted by restricted maximum likelihood (REML) estimation [27, 33], using R function metagen defined in package meta [34, 35]. This method of estimation has been found to perform well across a wide range of situations encountered in interlaboratory studies and meta-analyses [15, 36].
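For intuition about when \(\tau _{\text {REML}}\) is positive, consider the special case of exactly two results whose standard uncertainties are treated as known. The restricted likelihood then reduces to the likelihood of the single contrast \(w_{\text {I}}-w_{\text {G}}\), which has variance \(u^2(w_{\text {G}}) + u^2(w_{\text {I}}) + 2\tau ^2\), and the estimate has a closed form. The sketch below implements that special case only; it is not the algorithm used by metagen, and the numerical inputs are hypothetical.

```python
import math

def tau_reml_two_results(w_g, u_g, w_i, u_i):
    """REML estimate of tau for two results with known standard uncertainties.

    Maximizes the Gaussian likelihood of d = w_i - w_g, whose variance is
    u_g**2 + u_i**2 + 2*tau**2, subject to tau >= 0.
    """
    d = w_i - w_g
    tau_sq = (d * d - u_g ** 2 - u_i ** 2) / 2.0
    return math.sqrt(tau_sq) if tau_sq > 0.0 else 0.0

# Hypothetical inputs (mg/g): mutually inconsistent results give tau > 0 ...
tau_inconsistent = tau_reml_two_results(10.000, 0.002, 10.020, 0.005)
# ... while results that agree within their uncertainties give tau = 0.
tau_consistent = tau_reml_two_results(10.000, 0.002, 10.004, 0.005)
print(tau_inconsistent, tau_consistent)
```

In this special case the estimate is zero whenever the absolute difference does not exceed \(\sqrt{u^2(w_{\text {G}}) + u^2(w_{\text {I}})}\).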

Fig. 2

The large, blue open circles pertain to 25 lots of the single-element solution SRMs for which the REML estimate of dark uncertainty, \(\tau _{\text {REML}}\), is positive, and similarly for the 4 small red dots, which pertain to individual lots of anion solution SRMs. Taken together, they were used to calibrate the relationship in Eq. (4) that captures the historical information about between-method differences. The small triangles at the bottom indicate the values of \(\ln (\mid w_{\text {I}}-w_{\text {G}} \mid /w_{\text {G}})\) for those SRM lots (13 for single-element solutions in blue and 7 for anion solutions in red) with \(\tau _{\text {REML}} = 0\). The labels in larger font size designate SRMs, and the labels in smaller font size are lot numbers

Only 38 lots of the current single-element solution SRMs have been selected for this exercise because these 38 have independent evaluations of method uncertainty, which the new procedure propagates coherently and simultaneously with the propagation of the other recognized uncertainty components. Of these 38, only 25 (blue circles in Fig. 2) are used to calibrate the relationship in Eq. (4) because only for these is there a positive estimate of dark uncertainty, \(\tau _{\text {REML}}\). For only 4 (red dots in Fig. 2) of the 11 lots of anion solution SRMs is \(\tau _{\text {REML}} > 0\).

This relationship can be summarized quite accurately by the (green) sloping line also depicted in the same figure. The line was fitted to all the points represented by (blue) circles or (red) dots in Fig. 2 using the robust regression procedure implemented in R function lmrob of package robustbase [37]: its intercept is \(\alpha =\) \(-\)0.069 with standard uncertainty 0.215 (hence \(\alpha\) does not differ significantly from zero), and the slope is \(\beta =\) 1.053 with standard uncertainty 0.031.

The robust regression line automatically down-weights the three points that are labeled with the corresponding SRM number in Fig. 2 because they lie clearly off the trend that the regression line aims to capture. The three labeled points all pertain to single-element solutions.

There are only 4 lots of the anion solutions with \(\tau _{\text {REML}} > 0\) (indicated by red dots in the same figure), and the corresponding points are fairly well aligned with the bulk of the points (blue circles) that correspond to individual lots of the single-element solutions. Thus, for now, the same relationship will be used to express the historical information about between-method differences both for single-element solutions and for anion solutions. However, since different instrumental methods are used for the single-element solutions and for the anion solutions, it is conceivable that, as the collection of anion solutions will keep growing, two different relationships may become warranted to capture the historical information separately for the two series of SRMs.

Given the values of the gravimetric and instrumental determinations made for a new SRM, \(w_{\text {G}}\) and \(w_{\text {I}}\), the aforementioned robust regression line can then be used to compute an a priori estimate of method uncertainty based on the historical record, as follows:

$$\begin{aligned} \tau _{\text {M}} = w_{\text {G}} \exp \left\{ \alpha + \beta \ln \left( \frac{\mid w_{\text {I}}-w_{\text {G}} \mid }{w_{\text {G}}}\right) \right\} , \end{aligned}$$
(4)

where, the same as in Eq. (2), \(\ln\) denotes the natural logarithm (base e).
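Exponentiating both sides shows that Eq. (4) is a power law in the relative difference between the two determinations. With the rounded estimates quoted above (\(\alpha \approx -0.069\), \(\beta \approx 1.053\)), it reads approximately

$$\begin{aligned} \frac{\tau _{\text {M}}}{w_{\text {G}}} = e^{\alpha } \left( \frac{\mid w_{\text {I}}-w_{\text {G}} \mid }{w_{\text {G}}}\right) ^{\beta } \approx 0.93 \left( \frac{\mid w_{\text {I}}-w_{\text {G}} \mid }{w_{\text {G}}}\right) ^{1.05}. \end{aligned}$$

Because the exponent slightly exceeds 1 and the relative differences are far smaller than 1, \(\tau _{\text {M}}\) is somewhat smaller than the observed absolute difference: in the example of Table 1, 0.0146 mg/g versus 0.0216 mg/g.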

The value of \(\tau _{\text {M}}\) produced by Eq. (4) will be used as the mean of the prior distribution for the method uncertainty in the Bayesian data reduction described in “Model fitting,” which updates this initial assessment, \(\tau _{\text {M}}\), taking into account the actual gravimetric and instrumental determinations made for the new SRM being certified, and their associated uncertainties.

The model in Eq. (3) becomes a hierarchical (or multilevel) Bayesian model [38] because we envision its being assembled incrementally: starting by fixing \(\omega\), then adding the method-specific effects \(\lambda _{\text {G}}\) and \(\lambda _{\text {I}}\), next the method-specific measurement errors \(\varepsilon _{\text {G}}\) and \(\varepsilon _{\text {I}}\), and finally the future expression of the degradation of the solution, \(\kappa\).

Consistently with the Bayesian paradigm, all the unknown quantities in the model — so-called parameters (including \(\omega\), \(\kappa\), \(\tau\)) — are treated as non-observable random variables, whose (prior) probability distributions need to be specified, reflecting knowledge about their values before obtaining the gravimetric and instrumental measurement results for a new SRM.

The measurement results, \((w_{\text {G}}, u(w_{\text {G}}))\) and \((w_{\text {I}}, u(w_{\text {I}}))\), are modeled as observed values of random variables whose probability distributions also have to be specified and depend on the values of those parameters.

We are not interested in the measurement errors \(\varepsilon _{\text {G}}\) and \(\varepsilon _{\text {I}}\) themselves, only in their standard deviations, \(\sigma _{\text {G}}\) and \(\sigma _{\text {I}}\), which impact the uncertainty of the value assigned to the SRM. The standard uncertainties \(u(w_{\text {G}})\) and \(u(w_{\text {I}})\) are estimates of \(\sigma _{\text {G}}\) and \(\sigma _{\text {I}}\), which are parameters in the model.

Also, rather than estimating the method effects, \(\lambda _{\text {G}}\) and \(\lambda _{\text {I}}\), we focus on \(\mu _{\text {G}} = \omega + \lambda _{\text {G}}\) and \(\mu _{\text {I}} = \omega + \lambda _{\text {I}}\), which are the conditional expectations of \(w_{\text {G}}\) and \(w_{\text {I}}\) given \(\omega\), \(\lambda _{\text {G}}\), and \(\lambda _{\text {I}}\). The method effects capture possible biases inherent to the gravimetric and instrumental methods, while the goal is to estimate \(\omega\) unencumbered by any such biases. The modeling assumption made for \(\lambda _{\text {G}}\) or \(\lambda _{\text {I}}\) ensures that both \(w_{\text {G}}\) and \(w_{\text {I}}\) have \(\omega\) as their common (unconditional) expected value.

The Bayesian specification of the model in Eq. (3) involves the following choices:

(a) \(\omega\) has a Gaussian prior distribution with mean equal to the REML estimate of \(\omega\) based on the gravimetric and instrumental results and computed using the aforementioned R function metagen, and standard deviation that is twice the standard deviation of the REML estimate of \(\omega\).

(b) \(\tau\) has a prior distribution that is either gamma with mean \(\tau _{\text {M}}\) and coefficient of variation equal to 0.3 when \(\tau _{\text {REML}} > 0\), or exponential with median \(\tau _{\text {M}}/10\) when \(\tau _{\text {REML}} = 0\).

    The choice of 0.3 reflects the fact that, for most of the points supporting the trend line in Fig. 2, the actual value of \(\tau _{\text {REML}}\) lies within 60 % of the corresponding value of \(\tau _{\text {M}}\). This choice and the choice of prior distribution for \(\tau\) when \(\tau _{\text {REML}} = 0\) are discussed in section Practical Matters of the supplementary material.

(c) \(\mu _{\text {G}} = \omega + \lambda _{\text {G}}\) and \(\mu _{\text {I}} = \omega + \lambda _{\text {I}}\) both have Gaussian distributions with mean \(\omega\) and standard deviation \(\tau\).

(d) \(\kappa\) has a prior Student’s t distribution with \(\nu _{\text {S}}\) degrees of freedom, rescaled to have standard deviation \(u_{\text {S}}\).

    Since the measurement results provide no information about \(\kappa\), both its prior and posterior means are 0 mg/g (up to the fluctuations resulting from the vagaries of Markov Chain Monte Carlo sampling described in “Model fitting”), and the sole consequence of including this term in the model is the effective propagation of the uncertainty associated with potential long-term instability of the solution.

(e) \(\sigma _{\text {G}}\) and \(\sigma _{\text {I}}\) have prior gamma distributions with means \(u(w_{\text {G}})\) and \(u(w_{\text {I}})\) and coefficients of variation both equal to 0.1.

    The choice of 0.1 expresses the belief that the method-specific uncertainty evaluations are not grossly incorrect, in the sense that \(\sigma _{\text {G}}\) and \(\sigma _{\text {I}}\) should be within 20 % of the corresponding evaluations, \(u(w_{\text {G}})\) and \(u(w_{\text {I}})\). Other choices may also be reasonable, depending on how the mass fraction is measured for a particular solution and on how accurate the corresponding uncertainty budgets are believed to be.

(f) \(u^{2}(w_{\text {G}})\) has a gamma distribution with shape \(\nu _{\text {G}}/2\) and rate \(\nu _{\text {G}}/(2\sigma ^{2}_{\text {G}})\), and similarly for \(u^{2}(w_{\text {I}})\).

    This part of the model expresses the established fact that, under the Gaussian model, sample variances have rescaled chi-squared sampling distributions.

(g) \(w_{\text {G}}\) has a Gaussian distribution with mean \(\mu _{\text {G}}\) and standard deviation \(\sigma _{\text {G}}\), and similarly for \(w_{\text {I}}\).
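Several of the priors listed above are specified through a mean and a coefficient of variation (gamma) or through a median (exponential), whereas software usually expects shape and rate parameters. The conversion can be sketched as follows, using the facts that a gamma distribution has mean shape/rate and coefficient of variation \(1/\sqrt{\text {shape}}\), and that an exponential distribution with a given rate has median \(\ln 2\) divided by that rate. The function names are hypothetical, and the value of \(\tau _{\text {M}}\) is the one computed for the data in Table 1.

```python
import math
import random
import statistics

def gamma_shape_rate(mean, cv):
    """Shape and rate of the gamma distribution with the given mean and CV."""
    shape = 1.0 / cv ** 2
    rate = shape / mean
    return shape, rate

def exponential_rate_from_median(median):
    """Rate of the exponential distribution with the given median."""
    return math.log(2.0) / median

tau_m = 0.0146  # mg/g, from the worked example in the text

# Prior for tau when tau_REML > 0: gamma with mean tau_M and CV 0.3.
shape, rate = gamma_shape_rate(tau_m, 0.3)

# Prior for tau when tau_REML = 0: exponential with median tau_M / 10.
exp_rate = exponential_rate_from_median(tau_m / 10.0)

# Draws from the gamma prior; random.gammavariate takes shape and SCALE (1/rate).
rng = random.Random(7)
tau_draws = [rng.gammavariate(shape, 1.0 / rate) for _ in range(20000)]
print(shape, rate, exp_rate, statistics.mean(tau_draws))
```

The same conversion with a coefficient of variation of 0.1 gives the gamma priors for \(\sigma _{\text {G}}\) and \(\sigma _{\text {I}}\) in item (e).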

Model fitting

The immediate inputs for value assignment to a new SRM are these two triplets: \((w_{\text {G}}, u(w_{\text {G}}), \nu _{\text {G}})\) and \((w_{\text {I}}, u(w_{\text {I}}), \nu _{\text {I}})\). The Bayesian model specified in Eq. (3) and in “Model specification’’ is fitted to these data to obtain an estimate of the true value of the measurand, and an evaluation of the associated uncertainty.

The uncertainty component attributable to differences between analytical methods (gravimetric and instrumental) expresses both historical information about such differences, which is encapsulated in the prior distribution for \(\tau\) specified in item (b) of “Model specification,’’ and the difference \(w_{\text {I}}-w_{\text {G}}\) actually observed for this new SRM. These two pieces of information are merged using Bayes’s rule [19] during the process of fitting the model to the new measurement results.

The model fitting process explores the set of possible values for the parameters to gather information about where their true values are likely to be. The estimates of the quantities of interest are then derived based on the resulting “map” of probabilities for the location of the true value of \(\theta = (\omega , \kappa , \tau , \mu _{\text {G}}, \mu _{\text {I}}, \sigma _{\text {G}}, \sigma _{\text {I}})\).

Rather than applying Bayes’s rule directly, which would be impracticable for a model as complex as described in “Model specification,’’ we employ an established procedure that produces a large sample drawn from the conditional probability distribution of \(\theta\) given the measurement results for the new SRM. This procedure is called Markov Chain Monte Carlo (MCMC) sampling [39].

The (multivariate) probability distribution of \(\theta\) that MCMC samples is the so-called posterior distribution of \(\theta\), which results from updating the prior information specified in (a)–(g) of “Model specification” with the fresh information provided by the measurement results obtained for the certification of the new SRM. MCMC is implemented in R [35] and Stan [40] codes listed in the supplementary material associated with this article.
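To convey how MCMC sampling works, the following is a deliberately simplified random-walk Metropolis sampler written in Python. It is not the R and Stan code of the supplementary material, and it departs from the full model in ways that are assumptions of this sketch: \(\sigma _{\text {G}}\) and \(\sigma _{\text {I}}\) are fixed at the reported standard uncertainties, \(\mu _{\text {G}}\) and \(\mu _{\text {I}}\) are integrated out analytically (so each measured value is Gaussian with mean \(\omega\) and variance \(u^2 + \tau ^2\)), and \(\kappa\) is omitted because the data carry no information about it. The standard uncertainties, the prior mean and standard deviation of \(\omega\), and the proposal step sizes are hypothetical.

```python
import math
import random
import statistics

def log_posterior(omega, eta, data, prior):
    """Unnormalized log posterior of (omega, eta), where eta = ln(tau)."""
    tau = math.exp(eta)
    # Gaussian prior for omega.
    lp = -0.5 * ((omega - prior["omega_mean"]) / prior["omega_sd"]) ** 2
    # Gamma prior for tau, plus the Jacobian term of the change of variable to eta.
    lp += prior["tau_shape"] * eta - prior["tau_rate"] * tau
    # Marginal Gaussian likelihood of each measured value.
    for w, u in data:
        v = u * u + tau * tau
        lp += -0.5 * math.log(v) - 0.5 * (w - omega) ** 2 / v
    return lp

def metropolis(data, prior, n_iter=20000, burn_in=2000, seed=11):
    """Random-walk Metropolis on (omega, ln tau); returns draws after burn-in."""
    rng = random.Random(seed)
    omega = prior["omega_mean"]
    eta = math.log(prior["tau_shape"] / prior["tau_rate"])  # start at the prior mean of tau
    lp = log_posterior(omega, eta, data, prior)
    omegas, taus = [], []
    for it in range(n_iter):
        omega_new = omega + rng.gauss(0.0, 0.015)  # hypothetical step sizes
        eta_new = eta + rng.gauss(0.0, 0.3)
        lp_new = log_posterior(omega_new, eta_new, data, prior)
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            omega, eta, lp = omega_new, eta_new, lp_new
        if it >= burn_in:
            omegas.append(omega)
            taus.append(math.exp(eta))
    return omegas, taus

# Measured values from the worked example; standard uncertainties are hypothetical.
data = [(10.0001, 0.003), (10.0217, 0.008)]
prior = {
    "omega_mean": 10.0096, "omega_sd": 0.0214,  # hypothetical stand-ins for the REML-based prior
    "tau_shape": 1.0 / 0.3 ** 2,                # gamma prior with CV 0.3 ...
    "tau_rate": (1.0 / 0.3 ** 2) / 0.0146,      # ... and mean tau_M = 0.0146 mg/g
}
omega_draws, tau_draws = metropolis(data, prior)
print(statistics.mean(omega_draws), statistics.mean(tau_draws))
```

Sampling \(\ln \tau\) rather than \(\tau\) keeps every proposal positive; the extra term in the log posterior accounts for that change of variable.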

The output of primary interest issuing from the MCMC procedure is a large sample of values of \(\omega + \kappa\) that reflect all the prior information about all the parameters in the model, as well as the fresh measurement results. The effect of potential long-term instability, \(\kappa\), is present in the model as a virtual carrier of the uncertainty associated with such instability, whose expected value is zero both a priori and a posteriori, but whose standard deviation is both recognized and suitably propagated.

The estimate of the measurand is the average of the MCMC sample of values of \(\omega\), which is the same as the average of the MCMC sample values of \(\omega +\kappa\) because \(\kappa\) has (both prior and posterior) mean 0 mg/g. The associated standard uncertainty is the standard deviation of the sample of values of \(\omega +\kappa\) (this standard deviation typically is larger than the standard deviation of the sample of values of \(\omega\) because it includes the contribution from potential long-term instability).

A 95 % credible interval for \(\omega\), the true value of the measurand, can be constructed in many different ways: any interval that encompasses 95 % of the sample of values of \(\omega +\kappa\) qualifies. Typically, one chooses the interval that is centered at the estimate of the measurand, so that half of the interval’s length can be used as the associated expanded uncertainty for 95 % coverage. This “direct” construction of an expanded uncertainty was employed in “Example of application’’ for SRM 3161a (Lot No. 140917) Tin (Sn) Standard Solution.
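The summaries just described can be computed from the MCMC output along the following lines. This is a sketch only: the function name is hypothetical, and the draws used to exercise it are synthetic Gaussian stand-ins for actual MCMC samples of \(\omega\) and \(\kappa\).

```python
import math
import random
import statistics

def summarize(omega_draws, kappa_draws, coverage=0.95):
    """Estimate, standard uncertainty, and expanded uncertainty from MCMC draws."""
    total = [o + k for o, k in zip(omega_draws, kappa_draws)]
    estimate = statistics.mean(omega_draws)   # average of the draws of omega
    std_unc = statistics.stdev(total)         # includes long-term instability
    # Half-width of the interval centered at the estimate that contains
    # (at least) the requested fraction of the draws of omega + kappa.
    deviations = sorted(abs(t - estimate) for t in total)
    expanded = deviations[math.ceil(coverage * len(deviations)) - 1]
    return estimate, std_unc, expanded

# Synthetic stand-ins for MCMC output (hypothetical values, mg/g).
rng = random.Random(3)
omega_draws = [rng.gauss(10.011, 0.010) for _ in range(20000)]
kappa_draws = [rng.gauss(0.0, 0.005) for _ in range(20000)]

estimate, std_unc, expanded = summarize(omega_draws, kappa_draws)
print(f"{estimate:.4f} mg/g, u = {std_unc:.4f} mg/g, U95 = {expanded:.4f} mg/g")
```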

If the posterior distribution of \(\omega +\kappa\) is markedly skewed (that is, asymmetrical), then one may opt for the shortest interval that encompasses 95 % of the sample of values of \(\omega +\kappa\). In such a case (because the distribution that expresses the uncertainty surrounding \(\omega\) is asymmetrical), the concept of expanded uncertainty is not particularly meaningful. However, in many such cases, \(\widehat{\omega } \pm 2 u(\widehat{\omega })\) is an approximate 95 % credible interval [41].
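One common way of finding the shortest interval from a sample is to sort the draws and slide a window containing the required number of them, keeping the narrowest window. The sketch below does this; the function name is hypothetical and the skewed sample is synthetic, used only to exercise the function.

```python
import math
import random

def shortest_interval(draws, coverage=0.95):
    """Shortest interval containing at least `coverage` of the draws."""
    xs = sorted(draws)
    n = len(xs)
    m = math.ceil(coverage * n)  # number of draws the interval must contain
    start = min(range(n - m + 1), key=lambda i: xs[i + m - 1] - xs[i])
    return xs[start], xs[start + m - 1]

# Synthetic, right-skewed sample (gamma with shape 2 and scale 1).
rng = random.Random(5)
skewed_draws = [rng.gammavariate(2.0, 1.0) for _ in range(20000)]
lo, hi = shortest_interval(skewed_draws)
print(lo, hi)
```

For a right-skewed sample such as this one, the shortest interval sits closer to the lower end of the distribution than the interval that leaves equal probability in both tails.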

Results

Figure 3 compares the expanded uncertainties produced by the new procedure with their counterparts that qualify the certified values, separately for the single-element solutions and for the anion solutions. The SRM lots depicted (38 for single-element solutions and 11 for anion solutions) are those for which there are separate evaluations of the uncertainty component attributable to potential long-term instability.

For the majority of the lots of both the single-element and anion solutions, the new approach produces smaller expanded uncertainties, \(U_{95\,\%}(w_{\text {HIST}})\), than those that are listed in the corresponding certificates, \(U_{95\,\%}(w_{\text {CERT}})\).

For SRM 3103a (As) the new procedure increases the expanded uncertainty appreciably, relative to the expanded uncertainty associated with the certified value: this is one of the SRMs for which the REML procedure suggested that there is no between-method difference. For two of the anion solutions, SRM 3182 (Lot 060925) Chloride Anion (\(\text {Cl}^{-}\)) Standard Solution and SRM 3186 (Lot 170606) Phosphate Anion (\(\text {PO}_{4}^{3-}\)) Standard Solution, \(U_{95\,\%}(w_{\text {CERT}})\) appreciably exceeds \(U_{95\,\%}(w_{\text {HIST}})\). The labeled SRMs are discussed in detail in “Noteworthy SRMs.’’

Fig. 3

Expanded uncertainties produced by the new approach, \(U_{95\,\%}(w_{\text {HIST}})\), versus the expanded uncertainties associated with the certified values, \(U_{95\,\%}(w_{\text {CERT}})\), separately for the single-element solutions (left panel) and for the anion solutions (right panel). The (green) lines have slope 1 and intercept 0. For the large majority of the single-element solutions, the new approach tends to produce smaller expanded uncertainties than are listed in the corresponding certificates. The labels designate specific SRMs, with the lot number given below them in a smaller font size

Figure 4 depicts both the expanded uncertainties (those listed in the certificates of these SRMs and the corresponding ones produced by the new approach) and the differences between the estimates of mass fraction produced by the new approach and the corresponding certified values. The results for the single-element solutions and the anion solutions are displayed in the left panel and right panel, respectively.

Fig. 4

Expanded uncertainties produced by the new approach and the corresponding ones listed in the certificates of these SRMs, \(U_{95\,\%}(w_{\text {HIST}})\) and \(U_{95\,\%}(w_{\text {CERT}})\), versus the differences between the estimates of mass fraction produced by the new approach and the corresponding certified values, \(w_{\text {HIST}} - w_{\text {CERT}}\). The single-element solutions are shown in the left panel and the anion solutions are shown in the right panel. Each pair of symbols linked by a vertical (gray) line segment pertains to the same SRM. The blue circles represent results of the new approach (HIST), and the red diamonds represent results listed in the certificates (CERT). The horizontal coordinate is the difference between the value assigned by the new procedure, \(w_{\text {HIST}}\), and the certified value, \(w_{\text {CERT}}\): all of these differences are negligibly small. The vertical coordinates of the circle and corresponding diamond are the expanded uncertainties for 95 % coverage, \(U_{95\,\%}(w_{\text {HIST}})\) and \(U_{95\,\%}(w_{\text {CERT}})\). Both expanded uncertainties include the contribution from long-term instability, but evaluated and propagated differently for HIST and for CERT. The tiny (green) diamonds inside some circles indicate those SRMs for which the REML estimate of dark uncertainty is 0 mg/g. The labels in larger font size designate SRMs, and the labels in smaller font size are lot numbers

Discussion

Table 3 lists the SRMs in series 3100 and 3180 with largest absolute difference between \(U_{95\,\%}(w_{\text {CERT}})\) and \(U_{95\,\%}(w_{\text {HIST}})\). These differences are depicted graphically in Fig. 4. The table also lists the corresponding values of the ratio

$$\begin{aligned} R = \frac{U_{95\,\%}(w_{\text {HIST}}) / w_{\text {HIST}}}{U_{95\,\%}(w_{\text {CERT}}) / w_{\text {CERT}}} \end{aligned}$$
(5)

between the relative historical and certified expanded uncertainties. For example, \(R = 0.6\) for SRM 3101a Aluminum (Al) Standard Solution Lot No. 140903: this means that the proposed procedure for uncertainty evaluation would have produced a relative expanded uncertainty 0.6 times as large as (that is, 40 % smaller than) the relative expanded uncertainty listed in the corresponding certificate.
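As a small illustration of Eq. (5), the ratio and the corresponding percentage change in relative expanded uncertainty can be computed as follows. The function name and the numerical inputs are hypothetical and do not correspond to any particular SRM.

```python
def uncertainty_ratio(U_hist, w_hist, U_cert, w_cert):
    """Ratio R of Eq. (5) between relative expanded uncertainties."""
    return (U_hist / w_hist) / (U_cert / w_cert)

# Hypothetical values, all in mg/g.
R = uncertainty_ratio(U_hist=0.015, w_hist=10.0, U_cert=0.020, w_cert=10.0)
reduction_pct = 100.0 * (1.0 - R)  # positive when the new procedure gives a smaller value
print(f"R = {R:.2f}, reduction = {reduction_pct:.0f} %")
```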

Table 3 SRM lots with largest absolute difference between the relative expanded uncertainties \(U_{95\,\%}(w_{\text {CERT}})/w_{\text {CERT}}\) and \(U_{95\,\%}(w_{\text {HIST}})/w_{\text {HIST}}\). R is the ratio of the relative historical expanded uncertainty to the relative certified expanded uncertainty. \(R < 1\) indicates that the proposed procedure, using historical information, produces a smaller relative expanded uncertainty than the procedure used during certification. Three of the four noteworthy SRMs discussed in “Noteworthy SRMs” are on this list

Figure 3 shows that for some SRM lots the proposed procedure would have reduced the reported uncertainty, while for others it would have increased it. This retrospective study reveals that, overall, the new procedure would have reduced the relative expanded uncertainty for 97 % of the lots of the single-element solution SRMs, and for 82 % of the lots of the anion solution SRMs. Also, the median of the ratios of expanded uncertainties corresponding to the proposed procedure and to the current procedure, across all lots of all SRMs, for the single-element solutions and for the anion solutions, was 0.8.

We carry out these comparisons retrospectively only for purposes of illustration of the new method, not to suggest that the certified values or their associated uncertainties should be revised. Next, “Noteworthy SRMs’’ provides details about the SRMs whose representative points are labeled in Figs. 3 and 4.

Noteworthy SRMs

The following examples show that the proposed procedure neither replaces knowledge about the chemical nature of these systems nor dispenses with the need to exercise professional judgment about how best to estimate the measurand and to characterize measurement uncertainty. Instead, the new procedure improves the quality of the uncertainty evaluations by drawing on the wealth of accumulated, relevant historical information and by blending it with the information provided by the measurement results specific to each SRM.

Future preparations of these and other SRMs may require the development of customized versions of the general approach described in the previous sections to fold in the unique information about them while still using the accumulated historical information about the between-method differences, to improve the quality of the uncertainty evaluations for them.

SRM 3103a (Lot No. 100818) Arsenic

SRM 3103a (Lot No. 100818) Arsenic (As) Standard Solution is markedly different from all the other single-element solutions in the collection under consideration. Arsenic is unusual in that its valence state in solution can make quite a difference during measurement using ICP-OES.

Yu et al. [42] demonstrated that solutions used during development of this SRM contained a mixture of trivalent and pentavalent arsenic, and that ICP-OES was approximately 8 % more sensitive to pentavalent than to trivalent arsenic. Narukawa et al. [43] confirmed the qualitative difference, even if they estimated the differential relative sensitivity to be only 4 %, and Narukawa et al. [44] studied this phenomenon further.

For this reason, when preparing arsenic solutions for certification, an oxidizing agent is added to ensure that all the arsenic ions present are in their higher valence state, which makes the measurement of these solutions markedly different from how the solutions of the other elements are measured.

SRM 3156 (Lot No. 140830) Tellurium

SRM 3156 (Lot No. 140830) Tellurium (Te) Standard Solution stands out because \(U_{95\,\%}(w_{\text {CERT}})\) is almost four times larger than \(U_{95\,\%}(w_{\text {HIST}})\). For this element, too, \(\tau _{\text {REML}} = 0\) mg/g. However, during certification, the measured mass fraction of tellurium in solution varied appreciably from day to day while it was being measured repeatedly by HP-ICP-OES.

This day-to-day variability was recognized and expressed by first obtaining a consensus value and associated uncertainty based only on the replicated instrumental determinations. This consensus value was then combined with the gravimetric result to produce a final consensus value for assignment to the material.

The end result was that \(u(w_{\text {I}})\) was about 50 times larger than \(u(w_{\text {G}})\) for SRM 3156. Fewer than 5 % of the SRMs in this series have a ratio \(u(w_{\text {I}})/u(w_{\text {G}})\) this large or larger. However, since our model does not exploit the historical relationship between \(u(w_{\text {I}})\) and \(u(w_{\text {G}})\), the peculiarity just mentioned does not explain the large difference between \(U_{95\,\%}(w_{\text {HIST}})\) and \(U_{95\,\%}(w_{\text {CERT}})\).

This difference may be attributable to the fact that our model uses the reported standard uncertainties differently from how they were used during certification, to weight the gravimetric and instrumental determinations as they are combined for value assignment. Since the difference between \(u(w_{\text {I}})\) and \(u(w_{\text {G}})\) is so pronounced in this case, the impact of the different weighting schemes upon the evaluation of the uncertainty associated with the assigned value may be large enough to explain why \(U_{95\,\%}(w_{\text {CERT}})\) should be almost four times as large as \(U_{95\,\%}(w_{\text {HIST}})\).

SRM 3182 (Lot No. 060925) Chloride

The certificate of SRM 3182 (Lot No. 060925) Chloride Anion (\(\text {Cl}^{-}\)) Standard Solution lists a value for the expanded uncertainty that is 1.8 times larger than its counterpart, \(U_{95\,\%}(w_{\text {HIST}})\), produced by the new approach. This is noteworthy but not surprising because both the certified value and the associated \(U_{95\,\%}(w_{\text {CERT}})\) are based on measurement results different from those that we have used in the calculation of both \(w_{\text {HIST}}\) and \(U_{95\,\%}(w_{\text {HIST}})\).

This material was originally certified in 2007, its certification period was extended in 2011, and it expired in 2017, when it was replaced by a new lot. The original certification used X-ray fluorescence (XRF) as the instrumental method of analysis, rather than IC. The instrumental result we have used to obtain \(w_{\text {HIST}}\) and \(U_{95\,\%}(w_{\text {HIST}})\) is not that XRF result: it is the first measurement of this lot that was made using IC, during a stability study supporting the 2011 extension of the certification period.

The original gravimetric and XRF results, which were markedly mutually inconsistent, were combined in 2006 using a statistical procedure proposed by Levenson et al. [45], which came to be known by its acronym, BOB (“Type B On Bias”). NIST deprecated this procedure in 2009, and replaced it with procedures based on conventional random effects models, either classical [46, 47] or Bayesian [48].
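To make the idea of a classical random effects combination concrete, the following is a minimal sketch of the DerSimonian-Laird estimator, a widely used classical procedure of this kind. Whether this is the exact variant described in [46, 47] is an assumption on our part; the function name and the input values are hypothetical.

```python
import math

def dersimonian_laird(values, uncertainties):
    """Random effects consensus via the DerSimonian-Laird moment estimator.

    Requires at least two results. Returns the consensus value, its
    standard uncertainty, and the estimated between-method standard
    deviation (tau).
    """
    w = [1.0 / u**2 for u in uncertainties]
    sw = sum(w)
    # Fixed-effects (inverse-variance) mean, used to compute Cochran's Q.
    mu_fixed = sum(wi * x for wi, x in zip(w, values)) / sw
    q = sum(wi * (x - mu_fixed) ** 2 for wi, x in zip(w, values))
    k = len(values)
    denom = sw - sum(wi**2 for wi in w) / sw
    # Between-method variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / denom)
    # Re-weight with the between-method variance added to each input.
    w_star = [1.0 / (u**2 + tau2) for u in uncertainties]
    sws = sum(w_star)
    mu = sum(wi * x for wi, x in zip(w_star, values)) / sws
    return mu, math.sqrt(1.0 / sws), math.sqrt(tau2)

# Hypothetical, mutually inconsistent results (arbitrary units).
mu, u_mu, tau = dersimonian_laird([1.000, 1.010], [0.001, 0.001])
print(f"consensus = {mu:.4f}, u = {u_mu:.4f}, tau = {tau:.4f}")
```

When the two results disagree by much more than their stated uncertainties, as in this made-up example, the estimated between-method variance dominates and the uncertainty of the consensus value grows accordingly.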

SRM 3186 (Lot No. 170606) Phosphate

The certificate of SRM 3186 (Lot No. 170606) Phosphate Anion (\(\text {PO}_{4}^{3-}\)) Standard Solution lists an expanded uncertainty that is 2 times larger than its counterpart from the new procedure, which makes this material stand out in the right panels of Figs. 3 and 4.

The certified value and the associated uncertainty for this material include a second IC measurement using an independent phosphate source material to account for a potential bias in the phosphate assay of the source material. The inclusion of this second IC measurement makes for a relative uncertainty larger than for the other anion solution SRMs.

Since the new procedure would have required a modification to be able to take the additional measurement into account, it neglected the second IC measurement, which impacted the estimate of the measurand substantially (apparent in the right panel of Fig. 4) and yielded a smaller expanded uncertainty than is associated with the certified values (Figs. 3 and 4).

Conclusions

A new procedure was presented for evaluating the uncertainty component attributable to the between-method differences between the gravimetric preparation and the instrumental method of analysis, HP-ICP-OES for the single-element solutions and IC for the anion solutions. This procedure blends historical information about such differences with the difference observed when measuring a new solution.

The procedure was illustrated using the methods of gravimetric preparation and instrumental measurement of series 3100 and series 3180 SRMs. Since the new procedure exploits the rich history of production of these materials, the resulting uncertainty evaluations are believed to be more reliable than those obtained neglecting such history.

The retrospective evaluation of the new procedure suggests that it is likely to yield appreciable reductions in expanded uncertainty for future lots of both the single-element solution SRMs and the anion solution SRMs. In fact, the new procedure reduced the relative expanded uncertainty for 97 % of the single-element solution lots and for 82 % of the anion solution lots, by comparison with the original procedure.

As already pointed out toward the end of “Discussion”, the median of the ratios of the expanded uncertainties produced by the proposed procedure to those produced by the current procedure, across all lots of all SRMs, was 0.8 both for the single-element solutions and for the anion solutions. This means that the new procedure tends to yield expanded uncertainties that are about 20 % smaller than those of the current procedure. What enables it to do so is the treasure trove of information accumulated in the long history of development of these reference materials at NIST.
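The arithmetic behind these summary statistics can be sketched as follows. The ratios listed are hypothetical placeholders, not values from this study; they only show how a median ratio translates into a typical percentage reduction, and how the share of lots with a reduced uncertainty would be counted.

```python
from statistics import median

# Hypothetical ratios U_new / U_current for a handful of lots
# (made-up values, for illustration only).
ratios = [0.62, 0.75, 0.80, 0.84, 0.91, 1.10, 0.70]

median_ratio = median(ratios)
# A median ratio r corresponds to a typical reduction of 100 * (1 - r) percent.
typical_reduction_percent = 100 * (1 - median_ratio)
# Share of lots for which the new procedure gives a smaller expanded uncertainty.
share_reduced = sum(r < 1 for r in ratios) / len(ratios)

print(f"median ratio = {median_ratio:.2f}")
print(f"typical reduction = {typical_reduction_percent:.0f} %")
print(f"share of lots with reduced uncertainty = {share_reduced:.0%}")
```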

The new procedure that we have described and illustrated in applications improves the quality of the uncertainty evaluations and enhances their value as links in the traceability chain for measurements made by users of two flagship series of SRMs. It does so because it draws on the wealth of relevant historical information accumulated in the course of decades of development and measurement of these materials [49], and blends it with the information provided by the measurement results specific to each SRM.

As the collections of SRM lots in these two series continue to grow, the value added by the blending procedure will continue to increase accordingly, provided the measurement methods used remain stable and are well-characterized.

We have illustrated the application of the new method retrospectively, using data for each SRM lot as if it were a new lot, building on the historical information provided by all the other lots of the same or other SRMs. However, we did it only for purposes of illustration, not to suggest that the certified values or their associated uncertainties should be revised.