Introduction

Vitamin C, a composite of L-ascorbic acid (AA) and dehydro-L-ascorbic acid (DHAA), is a micronutrient with many biological activities of potential clinical interest in addition to the prevention of scurvy [1, 2]. As with other measurement processes used to inform decisions, those used for vitamin C need to reliably provide metrologically valid results largely independent of when and where the measurements are made [3,4,5].

In 1984, what is now the National Institute of Standards and Technology (NIST) started investigating how to help improve the within- and between-laboratory comparability of vitamin C measurements in human plasma and serum. This work was conducted as part of a National Cancer Institute (NCI)-sponsored program charged with developing analytical methods, reference materials, and interlaboratory studies (ILS) for micronutrients hypothesized to have cancer chemopreventive activity [6]. Informed by a series of exploratory ILS, a chromatographic reference method for AA and DHAA was developed, and various approaches to providing stable reference materials were explored [7,8,9]. While vitamin C test samples were distributed in several early ILS [10] conducted under the auspices of what became known as the Micronutrients Measurement Quality Assurance Program (MMQAP), only in 1993 was it established that the vitamin C-related measurand best suited for ILS study was the sum of AA and DHAA (total L-ascorbic acid, TAA) [11, 12]. Also in 1993, NIST established that the TAA concentration ([TAA]) of serum diluted 1 + 1 (volume fraction) with 10% mass concentration aqueous metaphosphoric acid (MPA) 1) was stable for at least 5 years when stored at −80 °C and 2) could be successfully determined by all of the measurement methods then available. The relative reproducibility precision of the results provided by these methods was ≈ 15% [13].

The NCI funding for MMQAP expired in 1997. NIST elected to continue development of a certified reference material for vitamin C in human serum, using participation fees to partially defray direct ILS expenses. Standard Reference Material® (SRM®) 970 Ascorbic Acid in Frozen Human Serum was released for sale in 2000 [14]. SRM 970 provided two 1 + 1 mixtures of human serum and 10% MPA: the high-normal Level I and the low-normal Level II. The vitamin C ILS continued until 2015. The stability of the SRM 970 solutions was monitored and approaches to reducing between-participant measurement variability were tested. The MMQAP vitamin C program was ended due to declining participation, exhaustion of the supply of SRM 970, and the availability of international proficiency tests for vitamin C initiated in 2010 by the Royal College of Pathologists of Australia Quality Assurance Programs (RCPAQAP) [5] and in 2014 by the German INSTAND e.v. External Quality Assurance Study (EQAS) program [15].

Forty MMQAP vitamin C ILS that exclusively used MPA-stabilized serum test samples were conducted from 1993 through 2015. These samples contained no other stabilizers. Most of the 23 test materials studied were distributed multiple times over this 22-year period. Several lessons learned from these studies may be relevant to current measurement practice, particularly as preservation with MPA continues to be a widely used approach to stabilizing the vitamin C content of plasma and serum samples [16,17,18].

Methods and materials

MMQAP vitamin C interlaboratory comparability studies

Table 1 lists the number of participants, test samples, and control samples distributed in the MMQAP vitamin C ILS from 1993 through 2015. These studies focused on measurements of [TAA] in MPA-stabilized human serum. Table 1 also displays the number of aqueous ascorbic acid calibrants that participants were requested to prepare.

Table 1 Number of Participants, Test Samples, Control Samples, and Calibration Solutions per Study

Calibrants were solutions of high-purity AA in serum-free 5% mass concentration aqueous MPA, thus matching the MPA concentration in the test and control samples. NIST supplied the solid AA along with the test and control samples. Participants were asked to prepare the solutions according to a protocol provided by NIST, calculate their expected [TAA], measure the [TAA] of the solutions as if they were test samples, and then compare calculated and experimental results. In the 1997 to 2000 ILS, participants were asked to prepare a single solution having nominal [TAA] of 50 μmol/L; later studies also requested solutions of nominal concentration {25, 12.5, and 0} μmol/L. To confirm their preparation, participants were asked to determine the absorbance maximum of the 50 μmol/L solution and calculate the mass attenuation coefficient (E1% extinction coefficient). The results provided by the analysis of these solutions were intended to help participants evaluate the performance of their measurement systems. Participants were instructed not to analyze the control or test samples until they were satisfied that their measurement processes were performing properly.
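
The comparison of calculated and experimental calibrant values described above can be sketched as follows. This is a minimal illustration, not the NIST protocol: the function names, the single-step preparation, the 1 cm path length, and the example numbers are all assumptions made here. The only fixed quantities are the molar mass of L-ascorbic acid (176.12 g/mol) and the definition of E1% as the absorbance of a 1 g per 100 mL solution per centimetre of path.

```python
AA_MOLAR_MASS = 176.12  # g/mol for L-ascorbic acid, C6H8O6


def calculated_conc_umol_per_l(mass_mg, volume_ml, purity=1.0):
    """Expected concentration of a calibrant made by dissolving mass_mg of AA
    in volume_ml of solvent (hypothetical single-step preparation)."""
    grams_per_litre = purity * mass_mg / volume_ml  # mg/mL is numerically g/L
    return grams_per_litre / AA_MOLAR_MASS * 1e6


def e1pct(absorbance, conc_umol_per_l, path_cm=1.0):
    """E1% coefficient: absorbance divided by (concentration in g/100 mL
    times path length in cm)."""
    grams_per_100_ml = conc_umol_per_l * 1e-6 * AA_MOLAR_MASS / 10.0
    return absorbance / (grams_per_100_ml * path_cm)


def recovery(measured, calculated):
    """Ratio of the measured to the calculated concentration."""
    return measured / calculated


# Illustrative numbers only; they are not values from the MMQAP studies.
expected = calculated_conc_umol_per_l(mass_mg=8.806, volume_ml=1000.0)
print(round(expected, 3))
print(round(recovery(measured=48.5, calculated=expected), 3))
print(round(e1pct(absorbance=0.4403, conc_umol_per_l=expected), 1))
```

A recovery ratio far from 1, or an E1% far from the value given in the protocol, would point to a preparation or measurement problem before any serum samples are analyzed.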

Control samples were serum-based materials that had been previously distributed as test samples. Participants were informed of the target [TAA] of these materials and were asked to confirm that their measurement processes yielded results that were within about 10% of the target before analyzing the test samples.

The protocols for each study are documented in the 7880 series of NIST Internal/Interagency Reports (NISTIRs); these documents and associated data are publicly available free of charge. Table S1 in the Electronic Supplementary Material (ESM) provides links to these reports.

Test and control samples

During the 1993 to 2015 period, 23 MPA-stabilized sera were distributed as test samples; 6 of these materials were also distributed as control samples. These sample materials were prepared at various times over the course of the MMQAP vitamin C program, using ten different serum pools. To enable use with colorimetric methods, none of these samples contained dithiothreitol, a strongly colored antioxidant that had been included in some test materials prepared prior to 1993 [19]. To provide sample materials that were similar to the participants’ “real” samples, all samples were delivered as (frozen) liquids (stored at −80 °C and shipped on dry ice) rather than as lyophilized (and relatively temperature insensitive) solids. ESM Table S2 describes these materials.

The test and control samples delivered to study participants were coded with the study name and index; the sample materials themselves were assigned arbitrary names at the time of production. ESM Table S3 provides the association between the sample names and the identifiers used in the ILS.

Methods of analysis

Participants in every ILS were asked to report their analysis methods. The methods used included a variety of commercial autoanalyzers (≈20% of all methods), colorimetry (≈13%), liquid chromatography (LC) with electrochemical (EC) detection (≈32%), LC with absorbance (Abs) detection (≈26%), LC with fluorescence (Fluor) detection (≈8%), and LC with tandem mass spectrometric (MS/MS) detection (≈2%). ESM Table S4 lists the proportion of methods used during three seven-year periods: 1993 to 2001, 2001 to the first 2008 ILS, and the second 2008 study through 2015. While the proportions did not change dramatically over time, the use of autoanalyzer, LC-Abs, and LC-MS/MS methods increased while the use of colorimetric and LC-EC methods decreased.

Reporting units

In the vitamin C studies before 1993, confusion was observed concerning the units used to report AA, DHAA, and TAA concentrations. Some participants reported results in terms of what was in the test materials (concentration in the final 1 + 1 solution) and others in terms of the implied [TAA] of the undiluted serum (i.e., twice the [TAA] in solution). Confusion was also apparent between reporting in mass concentration (μg/mL) or molar concentration (μmol/L). Since the majority of participants preferred μmol/L, from 1993 on results were recorded and analyzed in terms of micromoles of TAA per liter solution.

To facilitate comparison of the MMQAP vitamin C results with serum vitamin C levels as reported in the literature, the recorded results are here multiplied by a factor of 2 as a transformation into micromoles per liter serum.
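
The two unit conventions discussed above can be made explicit with a small sketch. The function names are hypothetical, and the mass-to-molar conversion assumes that results are expressed as L-ascorbic acid (molar mass 176.12 g/mol); DHAA has a slightly different molar mass, so this is an approximation for TAA reported in mass units.

```python
AA_MOLAR_MASS = 176.12  # g/mol, L-ascorbic acid (assumed basis for TAA)


def ug_per_ml_to_umol_per_l(conc_ug_per_ml):
    """Convert a mass concentration (ug/mL) to a molar concentration (umol/L)."""
    return conc_ug_per_ml * 1000.0 / AA_MOLAR_MASS


def solution_to_serum(conc_in_solution, dilution_factor=2.0):
    """Convert a concentration in the 1 + 1 serum/MPA solution to the implied
    concentration in undiluted serum."""
    return conc_in_solution * dilution_factor


# 25 umol/L in the 1 + 1 solution corresponds to 50 umol/L in serum.
print(solution_to_serum(25.0))
print(round(ug_per_ml_to_umol_per_l(10.0), 1))
```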

Interlaboratory study robust summary statistics

Interlaboratory study data often contain “outliers” – results that are not concordant with the majority of the data. Summarizing the performance of the majority is typically accomplished by identifying and excluding the discordant results before summarizing as the mean and standard deviation or by using robust estimators that are relatively insensitive to the presence of discordant values [20]. Explicit identification can create tension between study coordinators and participants; use of robust statistics is both simpler and less susceptible to misunderstandings.

The MMQAP studies used the median as the robust estimator of “average” values. While not the most sophisticated statistic, the median is robust, reasonably efficient, familiar, and easily explained [21]. The MMQAP reports summarized the spread of the concordant results using the adjusted median absolute deviation from the median (MADE) [22]. However, this estimator tends to underestimate the dispersion of truly normally distributed data (poor statistical efficiency), particularly when the suspect values cannot be assumed to be symmetrically distributed about the median [23]. In this report we replace the MADE estimates with those from the robust, efficient, and symmetry-insensitive Qn estimator [23, 24].
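
For readers unfamiliar with these estimators, the following sketch computes the median, the MADE, and Qn for one set of results. It is an illustration under stated assumptions, not the code used for the MMQAP reports: it uses the large-sample normal-consistency constants (1.4826 for MADE and 2.2219 for Qn), omits the small-sample correction factors that a production implementation would normally apply, and uses a simple O(n²) pairwise calculation.

```python
from itertools import combinations
from statistics import median


def made(values):
    """Adjusted median absolute deviation from the median."""
    centre = median(values)
    return 1.4826 * median([abs(v - centre) for v in values])


def qn(values):
    """Qn scale estimate: a scaled order statistic of all pairwise absolute
    differences, so no central value and no symmetry assumption is needed."""
    n = len(values)
    if n < 2:
        raise ValueError("Qn needs at least two values")
    h = n // 2 + 1
    k = h * (h - 1) // 2  # the k-th smallest pairwise difference is used
    diffs = sorted(abs(a - b) for a, b in combinations(values, 2))
    return 2.2219 * diffs[k - 1]


# Hypothetical participant results with one discordant low value.
results = [41.0, 43.5, 44.0, 45.2, 46.1, 47.0, 12.0]
print(median(results), round(made(results), 2), round(qn(results), 2))
```

Both estimators ignore the single discordant value in this example; they differ in how they respond when the discordant results fall mostly on one side of the median.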

Results and discussion

MPA-stabilized total ascorbic acid stored at −80 °C

Figure 1 displays in a dot-and-bar format the consensus [TAA] value and the between-participant variability for every sample in each of the ILS. The consensus values (the “dots”) are the medians of the reported results for the given ILS; the variabilities (the “bars”) are robust standard deviations. Fig. 1 also summarizes the combined results for all distributions of each test material with 95% level of confidence intervals estimated from: the number of times the material was distributed as a test sample, the mean of the median results from each of those distributions (xi), and the standard deviation of those medians.

Fig. 1

Summary Results for Test Materials Over Time. Each symbol represents the median participant total ascorbic acid concentration, [TAA], for one test material in one study; bars represent robust Qn standard deviation estimates. Dashed horizontal lines bound approximate 95% level of confidence intervals on the medians for all studies that used the material. The vertical lines mark the seven-year intervals. A) Materials used as test materials (solid symbols) and as controls (open symbols). B) Test materials prepared in 2001 at NIST (squares) and commercially in 2009 (circles). C) Test materials prepared at NIST from 1989 to 1995

Figure 1A displays results for all of the sample materials distributed as both test and control materials. No appreciable difference was observed in the medians between the two uses. Fig. 1B displays results for the “U0” to “U4” series that was produced at NIST and the “Red” to “Purple” series that was produced commercially using NIST-supplied materials and protocol. No performance differences were attributable to the producer of the materials. Fig. 1C displays results for the legacy samples produced at NIST before 1996. Little or no evidence of [TAA] decline in any of the samples was observed.

The “104” sample was produced in 1989 and evaluated in a small ILS in 1990. The NIST and ILS mean values and their standard deviations from that time were (23.0 ± 0.7) μmol/L and (22.1 ± 0.7) μmol/L, respectively [9]. Since relatively few samples of this material were produced, it was not distributed in the MMQAP studies until the number of participants per ILS had declined sufficiently to ensure that samples damaged in shipment could be replaced. The mean and its standard uncertainty for the four distributions of the test sample are (23.7 ± 0.7) μmol/L, suggesting that the [TAA] in this early MPA-preserved serum did not degrade over 24 years.

Changes in performance over time

Figure 2 displays estimates of within-participant (technical) and between-participant (reproducibility) precision as functions of [TAA] for the three seven-year periods from 1993 to 2015.

Fig. 2

Test Material Within- and Between-Participant Precision Estimates Over Time. Each panel reports within- and between-participant precision estimates for total ascorbic acid concentration, [TAA], over one of three seven-year periods: A) 1993 to 2000 (RR4 to RR13), B) 2001 to 2008 (RR14 to RR28), and C) 2008 to 2015 (RR29 to RR43). Each symbol represents results for one test material, combining summary results from all studies that used the material during the period. Solid triangles represent technical precision estimated from the two results per sample reported for each study. The dotted line represents the proportionality of technical precision to [TAA]; the slope of this line defines the coefficient of variation, CVtech. Solid circles represent reproducibility precision estimated by the Qn robust standard deviations of participant results for test samples. The solid line represents the proportionality of reproducibility precision to [TAA]; the slope of this line defines the coefficient of variation, CVrepr. Open squares represent Qn reproducibility precision estimates for control samples

Technical precision

In all of the MMQAP vitamin C ILS, participants were asked to report results for two replicate analyses of each calibrant, control sample, and test sample. This design was intended to enable assessing within-participant variability while keeping the analytical burden on the participants within reasonable limits. Participants were free to interpret this request in light of their own policies. Some participants likely reported results for replicate injections of one sample preparation (instrumental precision conditions) while others reported results from independent preparations (repeatability precision conditions). In consequence, this mixture of within-participant variability is summarized as “technical” precision. For each material i distributed as a control or test sample in distribution j and reported in replicate by participant k, sijk is the standard deviation of those replicates, sij is the median of the sijk, and si is the pooled standard deviation (square root of the mean \( {s}_{ij}^2 \)) for all the distributions within a given time period. These results are then combined by regression using the model si = βtech xi. The technical precision for the period can be expressed as a percent relative standard deviation (coefficient of variation, CV): CVtech = 100βtech.
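
The chain of calculations in this paragraph can be sketched as follows. The data layout and function names are hypothetical, and the regression is taken to be an unweighted least-squares fit through the origin, which is an assumption; the text does not state whether weights were used.

```python
from math import sqrt
from statistics import fmean, median, stdev


def pooled_technical_sd(distributions):
    """s_i for one material. `distributions` holds one list per distribution j,
    each containing the replicate results reported by each participant k."""
    s_ij = [median([stdev(reps) for reps in dist]) for dist in distributions]
    return sqrt(fmean([s * s for s in s_ij]))  # root mean square of the s_ij


def slope_through_origin(x, s):
    """Least-squares slope for the proportional model s_i = beta * x_i."""
    return sum(xi * si for xi, si in zip(x, s)) / sum(xi * xi for xi in x)


# Hypothetical duplicate results for one material in two distributions.
material = [
    [(20.1, 20.5), (19.0, 19.2), (21.3, 21.9)],
    [(20.4, 20.6), (18.8, 19.4)],
]
print(round(pooled_technical_sd(material), 3))

# Hypothetical (x_i, s_i) pairs for three materials in one period.
beta_tech = slope_through_origin([10.0, 20.0, 40.0], [0.2, 0.4, 0.8])
print(round(100 * beta_tech, 2))  # CVtech in percent
```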

The composite CVtech estimates improve over time, declining from 2.1% to 0.9%. These values are compatible with the 2% to 3% repeatability estimates reported for several methods [25,26,27]. However, whether the improvement reflects improved control of within-laboratory measurements or a greater proportion of participants reporting instrumental rather than repeatability precision cannot be determined. In any case, within-participant variability does not contribute much to the between-participant variability.

Reproducibility

The relative ILS reproducibility precision can, in principle, be estimated similarly: calculate the mean result for each material i in distribution j reported by each of the k participants, the standard deviation of those means (σij), the pooled standard deviation for all the distributions within the period (σi), and combine these values using the regression model σi = βrepr xi. The reproducibility for the period can be expressed as CVrepr = 100βrepr.

As described above, the MMQAP used the MADE to estimate σij. This estimator is not only less statistically efficient than the Qn but also assumes that the non-concordant values are symmetrically distributed about the median. The histograms for the U1 to U4 series of test samples shown in Fig. 3 are representative of the general tendency for the as-reported vitamin C probability distributions to favor smaller values over larger (left-skew). We therefore here estimate the σij using the Qn estimator [23, 24]. For the 168 data sets with at least 5 results and xi > 0, the MADE estimates are on average 18% smaller than Qn while the standard deviation is on average nearly twice Qn.

Fig. 3

Representative Probability Distributions for Test Materials. The thick-line histograms are empirical unit-area probability distributions for the reported total ascorbic acid concentration, [TAA], of four test sample materials (designated U1, U2, U3, and U4) prepared at NIST in 2001. The light vertical lines mark the histogram bin boundaries. The light smooth curves represent normal distributions for the samples parameterized with the ILS estimates

The CVrepr improved over time. We hope that this improvement reflects experience gained through participation in the MMQAP but recognize that it may just reflect changes in the number and nature of the ILS participants as well as differences in the analytical methods used. Evidence from vitamin C studies coordinated by other organizations is difficult to compare since as-reported data are not available to non-participants and the data analysis methods are not described in detail. However, the INSTAND e.v. EQAS program makes vitamin C summary information available for 44 single-distribution lyophilized samples in their Oct-2014 to June-2020 programs [15]. The CVrepr estimated from their “Overview” information is a roughly constant 11% over this six-year period (ESM Fig. S1 reports our analysis of these data). The Royal College of Pathologists of Australia Quality Assurance Program (RCPAQAP) does not make summary information available, but a recent publication states that CVrepr for their 2010 initial vitamin C program was 13% and decreased to 7% in 2019 [5].

Except for the two 1995 ILS, the CVrepr for materials used as control samples is about the same as when used as test samples. This suggests that participants used the same measurement and reporting processes for the test and control materials regardless of having knowledge of the target [TAA] in the controls.

Preparation and evaluation of calibration solutions

A secondary goal of the MMQAP vitamin C program was assessment and enhancement of the participants’ ability to prepare metrologically traceable calibration solutions. ESM Fig. S2 shows that the probability density functions for the ratio between the measured and calculated [TAA] and the calculated E1% narrowed over time. The final, fully-developed protocols provided to participants for preparing and evaluating the calibrants are contained in Appendix E of [28].

Normalization

Figure 4 displays the impact of three normalization methods on the technical and reproducibility precision of the test and control sample results from the 2008 to 2015 period. These methods are all linear transformations:

$$ {x}_{ijk}^{\prime }=\frac{x_{ijk}-\alpha }{\beta },\kern0.5em {s}_{ijk}^{\prime }=\frac{s_{ijk}}{\beta } $$

where i indexes the material, j the ILS, k the participant, α is an offset parameter and β sets the scale.

Fig. 4

Test Material Within- and Between-Participant Precision Estimates Following Normalization. Each panel reports within- and between-participant precision estimates for total ascorbic acid concentration, [TAA], for three approaches to normalizing between-participant variability: A) scaled to the participant-prepared calibration solutions, B) scaled to the ratio of the control sample (or to the average ratio of two control samples), and C) interpolated between results for two control materials. Format as in Fig. 2

Figure 4A reports the effect of normalizing to the calibration solutions prepared by each participant. Here, i is the nominal 50 μmol/L solution, the offset parameter α is set equal to zero, and the scale parameter β is defined as the ratio between the measured and calculated values:

$$ \beta =\frac{x_{ijk}\left(\mathrm{measured}\right)}{x_{ijk}\left(\mathrm{calculated}\right)}. $$

This method only marginally improves the estimated CVrepr for this period, from (8.8 ± 0.5) % to (8.3 ± 0.6) %. Using regression to estimate both offset and scale parameters from the four participant-prepared calibration solutions is even less effective, leaving the CVrepr essentially unchanged at (8.7 ± 0.6) %.

Figure 4B reports the effect of normalizing to the reference value of a control material, setting the offset to zero and defining the scale parameter as

$$ \beta =\frac{x_{1 jk}\left(\mathrm{measured}\right)}{x_1\left(\mathrm{reference}\right)}\ \mathrm{or}\ \left(\frac{x_{1 jk}\left(\mathrm{measured}\right)}{x_1\left(\mathrm{reference}\right)}+\frac{x_{2 jk}\left(\mathrm{measured}\right)}{x_2\left(\mathrm{reference}\right)}\right)/2 $$

depending on whether one or two controls were distributed. This reduces the CVrepr to (6.7 ± 0.5) %.

Figure 4C reports the effect of linearly interpolating between two controls, where the parameters are defined as

$$ \beta =\frac{x_{2 jk}\left(\mathrm{measured}\right)-{x}_{1 jk}\left(\mathrm{measured}\right)}{x_2\left(\mathrm{reference}\right)-{x}_1\left(\mathrm{reference}\right)},\kern0.5em \alpha ={x}_{1 jk}\left(\mathrm{measured}\right)-\beta {x}_1\left(\mathrm{reference}\right). $$

This reduces the CVrepr to (3.8 ± 0.9) %, although the available data are about as well described as having a [TAA]-independent standard deviation of (2.8 ± 0.9) μmol/L.
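
The control-based normalizations of Fig. 4B and Fig. 4C can be sketched with a few short functions that follow the α and β definitions given above. The function names and all numerical values are hypothetical and chosen only to show the arithmetic; they are not MMQAP data.

```python
def normalize(x, alpha, beta):
    """Apply the linear transformation x' = (x - alpha) / beta."""
    return (x - alpha) / beta


def normalize_sd(s, beta):
    """Replicate standard deviations are rescaled only: s' = s / beta."""
    return s / beta


def scale_parameters(measured, reference):
    """Fig. 4B style: alpha = 0, beta = mean measured/reference ratio over
    the one or two controls that were distributed."""
    ratios = [m / r for m, r in zip(measured, reference)]
    return 0.0, sum(ratios) / len(ratios)


def interpolation_parameters(m1, m2, r1, r2):
    """Fig. 4C style: the line through two controls defines beta and alpha."""
    beta = (m2 - m1) / (r2 - r1)
    alpha = m1 - beta * r1
    return alpha, beta


# Hypothetical participant whose results follow measured = 1.1 * true + 1,
# with controls having reference values of 20 and 60 umol/L.
alpha, beta = interpolation_parameters(m1=23.0, m2=67.0, r1=20.0, r2=60.0)
print(round(normalize(45.0, alpha, beta), 2))         # two-control result
a1, b1 = scale_parameters([23.0], [20.0])
print(round(normalize(45.0, a1, b1), 2))              # one-control result
```

In this constructed case the two-control interpolation recovers the underlying value exactly because it removes both the offset and the scale error, while scaling to a single control leaves a residual bias from the uncorrected offset.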

The failure of calibrant normalization to improve reproducibility demonstrates that the between-participant variability is not driven by differences among the participants’ routine calibration materials or strategies. The improvement provided by normalizing to one control sample indicates that, whatever drives the between-participant differences, its effect is systematic across the MPA-preserved samples. The considerable improvement provided by normalizing to two controls suggests that concentration-independent as well as concentration-dependent differences exist.

ESM Fig. S3 displays the changes in the probability density functions induced by normalization. The predominant effect is to reduce the width of the distributions, but without much reduction in the left-skew asymmetry.

Figure 5 displays the effect of normalization on the performance of four exemplar participants, two of whom participated in all 40 MMQAP Vitamin C ILS. For each ILS, a relative mean bias for each participant can be estimated as:

$$ {B}_{jk}=100\left(\sum \limits_i^{n_j}\frac{x_{ijk}-{x}_i}{\sigma_i}\right)/{n}_j $$

where nj is the number of test samples distributed in the jth ILS. Assuming that σi is equal to 0.1 xi (i.e., CVrepr is 10% for all test samples in all ILS) facilitates visualization, although it underestimates the variability during 1993 to 2001 and slightly overestimates it for 2008 to 2015.

Fig. 5

Participant Performance Over Time and Normalization Approaches. Each panel reports mean relative test sample measurement bias for four exemplar laboratories that participated in the final MMQAP Vitamin C interlaboratory studies. Each symbol denotes the mean bias in one ILS for one participant, calculated from results: A) as reported, B) scaled to the ratio of the control sample (or to the average ratio of two control samples), and C) projected onto the line defined by two control materials. The curves are quadratic-smoothed representations of the performance of these four participants over time. Dashed horizontal lines bound an approximate 95% level of confidence interval on the relative bias, assuming a constant coefficient of variation of 10% for all test materials in all studies. The vertical lines mark the seven-year intervals. The open circles and squares identify the participants whose mean biases are much reduced by normalization

Figure 5A displays the mean bias for the as-reported results. The Bjk for the two long-term participants stabilized around zero bias in the early 2000s, a few years after the number of test samples distributed per ILS was increased. We believe that the two-year-long increase in bias starting in 2011 for one of these participants reflects changes in analytical methodology. However, the Bjk bias for a participant who joined the program in 2005 is fairly consistently +20% and that for a participant who joined in 2011 is consistently −10%.

While control samples became a routine component of the program only in 2004 (see Table 1), Fig. 5B reveals that, when control data are available, scaling the sample results to one control greatly reduces the average bias. Fig. 5C demonstrates that projecting onto the line defined by two controls typically reduces the variability of the bias estimates. Note, however, that normalization can amplify bias if the participants’ control sample results are not representative of the test sample results (e.g., because of matrix mismatch, analysis at different times or with different reagents, or measurement error).

Conclusion

NIST’s Micronutrients Measurement Quality Assurance Program (MMQAP) for vitamin C established that the total ascorbic acid concentration ([TAA]) of human serum in a 1 + 1 volumetric combination with an aqueous solution containing a mass fraction of 10% metaphosphoric acid (MPA) is stable for at least 20 years when stored at −80 °C.

Calibration with participant-prepared and evaluated AA aqueous calibration solutions did not improve measurement comparability whereas calibration with one or two MPA-stabilized human serum control materials reduced CVrepr from 9% to ≈7% and ≈4%, respectively. This indicates that the various [TAA] measurement methods used by the MMQAP participants are differently (but self-consistently) sensitive to components of the serum-based samples that are not present in the aqueous calibrants.

The typical within-participant measurement coefficient of variation (CVtech) improved from ≈2% to less than 1% over the 40 interlaboratory studies from 1993 through 2015. The between-participant relative reproducibility precision (CVrepr) improved from ≈16% to ≈9% over this period. We believe this improvement supports the contention that reference materials such as SRM 970 and regular participation in quality assurance and/or proficiency test programs like the MMQAP, RCPAQAP, and INSTAND help improve among-laboratory measurement comparability.

While the availability of certified reference materials and participation in interlaboratory comparisons may contribute to controlling and potentially improving the reproducibility of [TAA] measurements, normalization to serum-based control materials having a matrix that is well matched to that of the samples of interest can ensure it. Multicenter studies of [TAA] in preserved plasma or serum that require the best achievable measurement comparability should consider calibration with matrix-matched control materials.