Introduction

The first measurements of oxygen consumption (\(\dot{V}{\text{O}}_{2}\)) during exercise date back to the 1910s, and the concept of maximal oxygen consumption (\(\dot{V}{\text{O}}_{2\hbox{max} }\)) is one of the traditional cornerstones of exercise physiology (Hodges et al. 2005). \(\dot{V}{\text{O}}_{2\hbox{max} }\) can be defined as the maximal rate at which the body is able to utilize oxygen during exercise, and is thought to be the best measure of cardiorespiratory fitness (Meyer et al. 2005). It is also believed that \(\dot{V}{\text{O}}_{2\hbox{max} }\) is reached when the \(\dot{V}{\text{O}}_{2}\) “plateaus” despite a continually increasing workload (Taylor et al. 1955).

These earlier studies reporting the occurrence of the plateau phenomenon used discontinuous exercise protocols, often performed over multiple days, and the exercise protocols required large changes in \(\dot{V}{\text{O}}_{2}\) (~300 mL O2) between subsequent workloads (Taylor et al. 1955). During these trials, expired gas was collected in meteorological balloons (now termed the Douglas Bag) and subsequently analyzed with very laborious and time-consuming methods (Mitchell et al. 1958). Over time, the \(\dot{V}{\text{O}}_{2\hbox{max} }\) test protocol evolved (from a discontinuous to a continuous test with shorter stages) and the Douglas Bag method (DBM) was replaced by automated gas analyzers. However, the DBM is still considered the gold standard for \(\dot{V}{\text{O}}_{2}\) measurement. The use of higher resolution gas analyzers combined with continuous ramp-like protocols has significantly decreased the incidence of the plateau phenomenon during maximal exercise tests (Astorino 2009; Doherty et al. 2003).

The ability to correctly detect a plateau in \(\dot{V}{\text{O}}_{2}\) during a single incremental test with short stages and small increments in workload largely depends on the capacity of the equipment used to measure the different gas exchange components of \(\dot{V}{\text{O}}_{2}\). While numerous studies have validated gas analyzers from different manufacturers (Bassett et al. 2001; Crouter et al. 2006; Duffield et al. 2004; La Mere et al. 1993; Medbo et al. 2002; Rietjens et al. 2001; Rosdahl et al. 2013), these studies are often performed at low to moderate exercise intensities (Bassett et al. 2001; Macfarlane and Wu 2013) and with long averaging intervals (Bassett et al. 2001; Crouter et al. 2006; Macfarlane and Wu 2013). This differs from the way in which the analyzers are used in practice. For example, these systems may be unable to accurately measure pulmonary ventilation (\(\dot{V}_{\text{E}}\)) at high rates and to match those high rates with rapidly changing fractions of expired O2 and CO2 (Gore et al. 2003). Other problems with validation studies include the lack of consensus as to what is acceptable in terms of measurement error for \(\dot{V}{\text{O}}_{2}\) (Macfarlane 2001) and the use of inadequate statistics to evaluate agreement between systems (Atkinson et al. 2005; Atkinson and Nevill 1998). For example, Bassett et al. (2001) found significant differences in \(\dot{V}{\text{O}}_{2}\) readings between a gas analysis system and the DBM (18 mL min−1) along with very small 95 % limits of agreement, while others (Auchincloss et al. 1971) found no significant difference in \(\dot{V}{\text{O}}_{2}\) readings between two different sets of equipment when the mean ± standard deviation of the difference between methods was 1 ± 9 %, which would produce very wide limits of agreement. Although the t test results could at first suggest that the latter study showed better agreement, in fact the Parvo Medics equipment used by Bassett et al. (2001) had much better accuracy, as shown by the very small systematic bias and narrow limits of agreement. Narrow limits of agreement allow even very small systematic differences to be detected, while wide limits of agreement may mask even large systematic differences as “random variation”. When only indicators of systematic bias such as t tests or analysis of variance (La Mere et al. 1993; Wilmore et al. 1976) are reported, little is known about the range of errors between systems. Therefore, the accuracy of gas analyzer systems should always be tested including some measurement of error, such as the Technical Error of Measurement or limits of agreement. The latter are now widely reported in validation studies, though the technique still causes some confusion. A method can be considered valid if the range determined by the limits of agreement is of no clinical significance (Bland and Altman 1999), not if 95 % of all the data fall within the limits of agreement, as it has sometimes been interpreted (Duffield et al. 2004).

Therefore, part of the inconsistency in observations of the plateau phenomenon may result not only from different biological responses (Doherty et al. 2003) and the manner in which the respiratory data are analyzed (Midgley and Carroll 2009), but also from the inability of automated gas analyzers to detect small changes in \(\dot{V}{\text{O}}_{2}\) at very high \(\dot{V}_{\text{E}}\) rates. To test this possibility, this investigation compared respiratory data obtained using an automated gas analyzer with those obtained simultaneously using the DBM as a criterion method during a short-stage incremental test to exhaustion, and evaluated the practical implications of the measurement errors for the calculation of stage-to-stage differences in \(\dot{V}{\text{O}}_{2}\).

Methods

Subjects

Twelve trained male runners (age 25.5 ± 8.3 years, height 179.9 ± 6.6 cm, body mass 75.1 ± 5.4 kg) were recruited for this investigation. All participants were injury free at the time of the study and were able to exercise to their personal best on a treadmill. A running training volume of at least 40 km per week during the 3 months prior to the study was an inclusion criterion. The subjects were fully informed, verbally and in print, of the risks, benefits and requirements of participation, and were required to sign an informed consent form before taking part in the study, which was approved by the Research and Ethics Committee of the Faculty of Health Sciences of the University of Cape Town, in accordance with the Declaration of Helsinki. All participants were also clearly informed that they could withdraw from the study at any time.

Study design

Participants visited the laboratory on two different occasions, separated by 2–3 days. On day 1, after signing the consent form, participants underwent anthropometric measurements (weight and height) and subsequently performed a maximal incremental exercise test on the treadmill (see description below). On day 2, participants again performed an incremental exercise test on a treadmill. During both days, gas exchange variables were continuously monitored throughout the test using an automated gas analyzer (MOXUS Modular Metabolic System, AEI Technologies, IL, USA). During a 2-min segment of the test expired gas was also collected into Plastic Bags (Douglas Bags, DBs) for later analysis of gas volume and composition. The DBM system was attached in series at the exit port of the Moxus System mixing chamber, so that simultaneous analysis of gas exchange parameters could be performed. This method/system does not introduce problems with increased dead space, as may be the case for other setups (Medbo et al. 2012).

Exercise test

On both days, the participants performed an incremental exercise test to exhaustion on a motor-driven treadmill (PPM 55, Woodway GmbH, Germany). Prior to the test, participants were requested to warm up by running three 5-min stages at 8, 10 and 12 km h−1, respectively. After the third 5-min stage, the participants rested for 3 min before beginning the maximal test. For the first stage of the test, subjects ran at 12 km h−1 for 1 min. Thereafter, the speed was increased by 0.5 km h−1 every 30 s until subjects reached volitional exhaustion. The treadmill grade was kept at 5 or 10 % during the entire test, depending on the fitness level of each participant.
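The speed schedule of the maximal test can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and it assumes that each 0.5 km h−1 increment is applied at the instant a new 30-s stage begins, which the protocol description does not state explicitly.

```python
def treadmill_speed_kmh(elapsed_s: float) -> float:
    """Treadmill speed (km/h) at a given time (s) into the maximal test.

    Assumed schedule: 12 km/h for the first 60 s, then +0.5 km/h at the
    start of every subsequent 30-s stage.
    """
    if elapsed_s < 0:
        raise ValueError("elapsed time must be non-negative")
    if elapsed_s < 60:
        return 12.0
    # Number of 30-s stages started after the initial 1-min stage
    stages_started = int((elapsed_s - 60) // 30) + 1
    return 12.0 + 0.5 * stages_started


# Speeds at the start of the first few stages
for t in (0, 60, 90, 120):
    print(t, treadmill_speed_kmh(t))
```

Under these assumptions a runner who lasts 5 min reaches 16.5 km h−1 at the start of the final stage.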

Data collection and analysis

The automated gas analyzer, the Moxus Modular Metabolic System, includes a mask incorporating a turbine for determination of ventilation volume. The turbine is attached on the inspiratory side of the valve, thus avoiding problems related to condensation of water vapor and accumulation of saliva (Bassett et al. 2001). The mask covered the subject’s nose and mouth, and subjects inhaled room air. All the expired gas traveled through a tube from the mask into a 4.2 L mixing chamber. Samples were constantly drawn from the mixing chamber (flow regulated to 250 mL min−1), through a dual-stage Nafion dryer and into the O2 and CO2 analyzers (S-3A/I and CD-3A, respectively, AEI Technologies, IL, USA). The analyzers use the zirconia and non-dispersive infrared methods for O2 and CO2 analysis, respectively. Data from the Moxus system were averaged for each 30-s stage. Prior to the start of each testing session, the automated gas analyzer system was calibrated for gas volume and composition according to the manufacturers’ instructions. Volume calibration was performed using a high-precision 3 L calibration syringe (Hans-Rudolph, Kansas City, MO, USA) and gas calibration was made using a two-point calibration curve, consisting of room air and a gas of known concentration (15.00 % O2, 6.00 % CO2, AGA Norgas, Norway). The DBM system (Cranlea, UK) was attached in series with this equipment. This method has been used previously with good results (Medbo et al. 2012), and avoids the technical problems of increased resistance to breathing (Crouter et al. 2006; Duffield et al. 2004) and increased dead space (Medbo et al. 2002) associated with connecting multiple breathing valves close to the mouth. All expired air for each 30-s stage was collected into a separate bag and stored for subsequent analysis, which was done immediately after completion of the exercise test. Collection time was clocked to the closest 0.1 s by an experienced researcher, and the bags were always opened and closed at the end of an expiration, so that they contained only full breathing cycles. Gas volumes in the DBM were measured with a Digital Dry Gas Meter (Harvard Apparatus, MA, US). When determining the volume of expired gas collected into each bag, the small volume of air diverted to the automated gas analyzer (approximately 125 mL per 30 s) was accounted for. Fractions of O2 and CO2 in the expired gas collected into the bags were analyzed using the analyzers from the Moxus system, with the air flowing through the analyzers’ inlet using a closed-loop method, so that the analyzed air returned into the bags.

While the Moxus system could continuously monitor gas exchange variables, the DBM could only collect four bags at a time. Therefore, on the first day the DBM was used during the initial stages of the exercise test, and on the second day the point at which gas sample collection began was determined from each subject’s test duration on day 1, so as to collect data from the final 2 min of the test.

Statistics

All analyzed data are presented as mean ± standard deviation, and respiratory variables are expressed in STPD units. Fractions of expired O2 (FeO2) and CO2 (FeCO2) are expressed as percentage-points L−1 to avoid confusion. Regression analyses between variables measured by the two different systems and residual analyses were performed using Prism 3.0 (GraphPad Software, La Jolla, CA, USA). Values measured by the two systems were compared by dependent t tests (Statistica 10; Statsoft, OK, USA). The range of recorded values was divided into thirds (lower, mid and higher range), and the differences between systems for the different thirds were compared using a one-way ANOVA with Tukey post hoc tests when necessary (Prism 3.0). Statistical significance was accepted when p < 0.05.

A variety of statistical procedures was used to facilitate comparison between our data and other studies, as has been recommended (Atkinson and Nevill 1998; Hodges et al. 2005; Macfarlane and Wong 2012; Macfarlane and Wu 2013). The Technical Error of Measurement was defined as \(\sqrt{\sum D^{2}/(2N)}\), where D is the difference in readings between methods and N is the number of paired comparisons (Macfarlane and Wong 2012), and expressed as a percent of the mean. Absolute percentage error was calculated as the mean of absolute differences in readings between methods divided by the DBM readings times 100 (Macfarlane and Wu 2013). Mean individual coefficient of variation was expressed as the mean of the individual standard deviations between measurements divided by their mean times 100 (Macfarlane and Wu 2013). Effect sizes (Cohen’s d) were calculated as the standard error of estimate from the regression between the DBM and Moxus readings divided by the standard deviation of the DBM readings, and interpreted according to a modified scale for Cohen’s d (Hopkins 2012; Macfarlane and Wu 2013). The agreement between the two systems was further evaluated using Bland–Altman plots (Bland and Altman 1999; Hodges et al. 2005). Since heteroscedastic errors are the norm when analyzing physiological parameters (Nevill and Atkinson 1997), data were also log transformed for calculation of limits of agreement as a ratio between measurements (Bland and Altman 2007; Nevill and Atkinson 1997; Rietjens et al. 2001). Although the Bland–Altman analysis can be adjusted for multiple observations per subject, this has been shown to be of little or no significance when the number of comparison pairs is much greater than the number of observations per subject, as in our study (Bland and Altman 2007). Calculations were performed using Excel (Microsoft, WA, US) spreadsheets.
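The agreement statistics defined above can be illustrated with a short Python sketch. It is not the spreadsheet used in the study. The function name and the example readings are hypothetical. Two points are assumptions: the Technical Error of Measurement is expressed relative to the grand mean of both methods, and the absolute percentage error is averaged pair by pair.

```python
import math
from statistics import mean, stdev


def agreement_stats(criterion, test):
    """Agreement statistics for paired readings (criterion = DBM, test = analyzer)."""
    diffs = [t - c for c, t in zip(criterion, test)]
    n = len(diffs)

    # Technical Error of Measurement: sqrt(sum(D^2) / 2N)
    tem = math.sqrt(sum(d * d for d in diffs) / (2 * n))
    # Assumption: "percent of the mean" refers to the grand mean of both methods
    grand_mean = mean(list(criterion) + list(test))
    tem_pct = 100 * tem / grand_mean

    # Absolute percentage error relative to the criterion (assumed per pair)
    ape = 100 * mean(abs(d) / c for d, c in zip(diffs, criterion))

    # Bland-Altman mean bias and 95 % limits of agreement
    bias = mean(diffs)
    sd = stdev(diffs)
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)

    # Ratio limits of agreement from log-transformed readings
    log_diffs = [math.log(t / c) for c, t in zip(criterion, test)]
    log_bias, log_sd = mean(log_diffs), stdev(log_diffs)
    ratio_loa = (math.exp(log_bias - 1.96 * log_sd),
                 math.exp(log_bias + 1.96 * log_sd))

    return {"tem": tem, "tem_pct": tem_pct, "ape": ape,
            "bias": bias, "loa": loa, "ratio_loa": ratio_loa}


# Hypothetical VO2 readings in L/min (not data from this study)
dbm = [3.60, 4.10, 4.55, 5.00]
moxus = [3.70, 4.15, 4.70, 5.05]
for name, value in agreement_stats(dbm, moxus).items():
    print(name, value)
```

The ratio limits are obtained by back-transforming the limits computed on the log scale, so they describe the Moxus reading as a proportion of the DBM reading.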

Data from three participants tested on the same days produced apparently unreliable values for \(\dot{V}{\text{O}}_{2}\) and respiratory exchange ratio (RER). Upon investigation, it was discovered that the drying chamber from the Moxus system required maintenance, and therefore the data from these three participants were excluded from the study. A separate analysis showed that inclusion of these data would not significantly impact our results (not shown) (Salier et al. 2012), but the data were excluded nonetheless (Hodges et al. 2005).

Results

Comparison between the data measured simultaneously by the Moxus and DBM is presented in Table 1. All variables showed high correlation coefficients between the DBM and the Moxus (Figs. 1, 2, 3, 4, 5), ranging from 0.93 (\(\dot{V}{\text{O}}_{2}\)) to 0.99 (\(\dot{V}_{\text{E}}\)). Overall, the Moxus produced significantly higher readings for \(\dot{V}_{\text{E}}\), \(\dot{V}{\text{O}}_{2}\), FeCO2, CO2 excretion (\(\dot{V}{\text{CO}}_{2}\)) and RER, but not for FeO2. FeO2 also displayed the smallest mean coefficient of variation (Table 2) and mean absolute percentage error between the two methods (0.3 and 0.4 %, respectively), whereas FeCO2 showed the largest variation between systems (mean coefficient of variation 2.75 %, absolute percentage error 3.9 %). Although the mean individual coefficient of variation for \(\dot{V}{\text{O}}_{2}\) was 2.5 %, individual values ranged from 0 to 7.43 %. The effect sizes of the standard errors of estimate were deemed “trivial” (0–0.20) according to the Cohen scale, except for \(\dot{V}{\text{O}}_{2}\), RER and FeCO2, which were deemed “small” (0.21–0.60) (Hopkins, 2012b).

Table 1 Regression equation between two systems simultaneously measuring gas exchange, Pearson’s coefficient of correlation (r), error of regression (S y|x), mean ± SD physiological values for the Douglas Bag (DB) and Moxus systems, absolute percentage error (APE) and effect size
Fig. 1

(Left) Fraction of expired O2 measured by the Moxus versus the DB System. The solid line represents the regression line, the dotted line represents the 95 % confidence interval of regression and the dashed line represents the identity line. (Right) Bland–Altman plot of the error scores. Dashed line represents the mean bias and dotted lines represent the 95 % limits of agreement

Fig. 2

(Left) Fraction of expired CO2 measured by the Moxus versus the DB System. The solid line represents the regression line, the dotted line represents the 95 % confidence interval of regression and the dashed line represents the identity line. (Right) Bland–Altman plot of the error scores. Dashed line represents the mean bias and dotted lines represent the 95 % limits of agreement

Fig. 3

(Left) Pulmonary ventilation measured by the Moxus versus the DB System. The solid line represents the regression line, the dotted line represents the 95 % confidence interval of regression and the dashed line represents the identity line. (Right) Bland–Altman plot of the error scores. Dashed line represents the mean bias and dotted lines represent the 95 % limits of agreement

Fig. 4

(Left) Oxygen uptake measured by the Moxus versus the DB System. The solid line represents the regression line, the dotted line represents the 95 % confidence interval of regression and the dashed line represents the identity line. (Right) Bland–Altman plot of the error scores. Dashed line represents the mean bias and dotted lines represent the 95 % limits of agreement

Fig. 5

(Left) Respiratory exchange ratio (RER) measured by the Moxus versus the Douglas Bag (DB) System. The solid line represents the regression line, the dotted line represents the 95 % confidence interval of regression and the dashed line represents the identity line. (Right) Bland–Altman plot of the error scores. Dashed line represents the mean bias and dotted lines represent the 95 % limits of agreement

Table 2 Mean ± SD bias, 95 % CI of mean bias, 95 % limits of agreement (95 % LOA), technical error of measurement (TEM) and mean individual coefficient of variation (CV) between two systems measuring gas exchange simultaneously

The analysis of variance performed to compare the differences in readings over the lower, middle and higher thirds of recorded values showed no significant differences for RER, \(\dot{V}{\text{O}}_{2}\) and FeO2, indicating that bias did not increase as values increased. Ventilation readings for the higher third of the analyzed range showed larger differences between the two systems than both the lower and middle ranges of values (5.4 ± 5.4 vs. 1.3 ± 2.5 vs. 2.0 ± 3.3 L min−1 higher for Moxus than DBM, respectively, p < 0.05). Similar patterns were seen for FeCO2 and \(\dot{V}{\text{CO}}_{2}\) (0.4 ± 0.3 vs. 0.1 ± 0.2 vs. 0.1 ± 0.2 L min−1 for the higher, middle and lower range of values, respectively, p < 0.01).

The mean bias and 95 % limits of agreement for the respiratory values are shown in Table 2 and also in Figs. 1, 2, 3, 4, 5. While most variables showed very small systematic bias between systems, some of the respiratory parameters showed considerably wide 95 % limits of agreement. The technical error of measurement was larger for \(\dot{V}{\text{O}}_{2}\) than for all other variables, which is expected since \(\dot{V}{\text{O}}_{2}\) accumulates the measurement errors from \(\dot{V}_{\text{E}}\), FeCO2 and FeO2.

Although there were no significant differences between systems in the measured \(\dot{V}{\text{O}}_{2}\) difference between two consecutive 30-s intervals (0.12 ± 0.11 vs. 0.09 ± 0.19 L min−1 for the Moxus and DBM, respectively, p = 0.26), the standard deviation of the difference between the systems was several times greater than the difference itself (0.03 ± 0.20 L min−1). As shown in Fig. 6a, b, the level of agreement between the two systems when measuring stage-to-stage differences in \(\dot{V}{\text{O}}_{2}\) was very low (r = 0.25, p = 0.12), and the slope of the regression line between the Moxus and the DBM was not different from zero, further indicating low agreement.

Fig. 6

(Left) Difference in O2 uptake between two consecutive 30-s intervals measured by the Moxus versus the Douglas Bag (DB) System. The parameters for the regression line are as follows: \(Y = -0.042x + 0.072\); \(s_{b} = 0.334\); \(S_{Y|x} = 0.109\) L min−1; r = 0.253 (p = 0.124); n = 38. The solid line represents the regression line, the dotted line represents the 95 % confidence interval of regression and the dashed line represents the identity line. (Right) Bland–Altman plot of the error scores. Dashed line represents the mean bias and dotted lines represent the 95 % limits of agreement

Discussion

To our knowledge, this is one of few studies testing the accuracy of a modern gas analyzer in more practical and challenging conditions (Meyer et al. 2005). The first important finding of this investigation is that the Moxus performed similarly to several other commercially available systems and within recommended standards for validity of gas analysis systems. The second main finding is that even though the Moxus system performed similarly to other commercially available systems, the agreement between the two methods when comparing \(\Delta \dot{V}{\text{O}}_{2}\) for two consecutive 30-s intervals was very low.

Although there is no consensus about what constitutes an acceptable error in measurement for automated gas analyzers, some standards have been proposed (Atkinson et al. 2005; Hodges et al. 2005; Macfarlane 2001). Atkinson et al. (2005) suggest that there is no practical impact on \(\dot{V}{\text{O}}_{2}\) measurements if two methods disagree systematically by no more than ±2 mL kg−1 min−1, as judged by the 95 % confidence interval (95 % CI) of the mean bias. This was achieved in our study (−1.8 to −0.3 mL kg−1 min−1). Macfarlane (2001) recommends that the technical error of measurement should be smaller than 3 % for \(\dot{V}{\text{O}}_{2}\) and 5 % for \(\dot{V}_{\text{E}}\), a recommendation also supported by Hodges et al. (2005). Our data for \(\dot{V}_{\text{E}}\) were within the recommended range (2.69 %), but slightly above it for \(\dot{V}{\text{O}}_{2}\) (3.33 %).

When comparing different methods to measure a given physiological variable, the total error in measurement can be split into technological errors and biological variation, which is the normal fluctuation of a variable over the time between measurements (Hopkins 2012). The technological error of measurement can be separated from the biological variations in \(\dot{V}{\text{O}}_{2}\) by testing two systems working simultaneously, and this should be used whenever possible (Atkinson et al. 2005; Hopkins 2012; Macfarlane 2001). Since the measurement errors should be much smaller when systems work at the same time (Macfarlane and Wu 2013), we will focus our discussion on studies that used similar methodology to ours. More detailed reviews on the performance of gas analyzers tested sequentially or over multiple days can be found elsewhere (Atkinson et al. 2005; Macfarlane 2001).

Validity studies comparing gas analyzers to the DBM have reported results similar to ours. Rietjens et al. (2001) reported wider limits of agreement than those in this study for \(\dot{V}{\text{O}}_{2}\) (ratio of DBM over Oxycon between 0.88 and 1.12 vs. 0.94 and 1.10 in our study) when validating the Oxycon Pro. Jakovljevic et al. (2008) found wider limits of agreement for the Innocor system against the DBM than in our investigation for both \(\dot{V}{\text{O}}_{2}\) and \(\dot{V}_{\text{E}}\) (−0.52 to 0.55 L min−1 for \(\dot{V}{\text{O}}_{2}\) and −8.74 to 10.66 L min−1 for \(\dot{V}_{\text{E}}\)), though the authors interpreted their values as too wide to be acceptable for clinical testing. Reybrouck et al. (1992) presented limits of agreement for \(\dot{V}{\text{O}}_{2}\) in the range of ±3.5 and ±3.0 L min−1 for \(\dot{V}_{\text{E}}\) in a study validating an automated system, which are narrower values than those reported in the present investigation. It is important to note, however, that their rate of \(\dot{V}_{\text{E}}\) did not exceed 40 L min−1, which reflects a very different range of exercise intensities from that of the present investigation.

Crouter et al. (2006) reported 95 % limits of agreement for \(\dot{V}{\text{O}}_{2}\) of ±200 mL min−1 and for \(\dot{V}_{\text{E}}\) of ±2.3 L min−1, but, as in the former study, the range of ventilation was limited (minute ventilation rarely exceeded 80 L min−1), and data were averaged over 2-min intervals. The study from Gore et al. (2003) clearly demonstrates how differences between methods often grow proportionally to the size of the mean of measurements, which further highlights the importance of testing equipment over the full spectrum of physiological values normally measured. The presence of heteroscedastic errors in physiological variables has already been examined (Atkinson and Nevill 1998) and is further shown in the present study by the larger difference between systems for the higher third of measurements of \(\dot{V}_{\text{E}}\), \(\dot{V}{\text{CO}}_{2}\) and FeCO2. In our study, \(\dot{V}_{\text{E}}\) values ranged from ~70 to ~190 L min−1, while \(\dot{V}{\text{O}}_{2}\) values ranged between ~3.3 and 5.5 L min−1, values much higher than often reported (Bassett et al. 2001; Macfarlane and Wu 2013; Reybrouck et al. 1992) in validation studies. A much larger study on the Oxycon Pro revealed that almost all data points that fell outside the 95 % limits of agreement range were measured at \(\dot{V}_{\text{E}}\) above 120 L min−1 and \(\dot{V}{\text{O}}_{2}\) above 3.8 L min−1, intensities that are seldom tested in validity studies.

Minute ventilation is often considered the main source of error when measuring gas exchange parameters, especially since it directly influences \(\dot{V}{\text{O}}_{2}\) (that is, a 1 % change in \(\dot{V}_{\text{E}}\) will change \(\dot{V}{\text{O}}_{2}\) by 1 %) (Carter and Jeukendrup 2002; Hodges et al. 2005; Medbo et al. 2012; Rosdahl et al. 2013). Although the DBM is considered the gold standard for measuring gas exchange parameters, it also has potential for error, especially when measuring \(\dot{V}_{\text{E}}\). Carter and Jeukendrup (2002) used a dry gas meter similar to that used by us, and concluded that the equipment was slightly inaccurate, since the DBM consistently resulted in lower \(\dot{V}_{\text{E}}\) values than an Oxycon Pro, an Oxycon Alpha and, more importantly, a metabolic simulator. This conclusion is shared by others (Rietjens et al. 2001). Conversely, Macfarlane and Wong (2012) also used the same dry gas meter system and found similar \(\dot{V}_{\text{E}}\) values for the Oxycon Pro and the DBM. Another potential problem when measuring \(\dot{V}{\text{O}}_{2}\) and \(\dot{V}_{\text{E}}\) with the DBM is the timing for opening and closing the bags (Rietjens et al. 2001; Wilmore et al. 1976). In this regard, automated gas analyzers can measure the start and end of breathing cycles more accurately. This timing error could have been attenuated in our study by using longer sampling intervals; however, this would have been against the purpose of this investigation (Bassett et al. 2001), which was to test the accuracy of the analyzer using short intervals at high intensities, a situation that is common in clinical exercise testing but rare in validation studies (Hodges et al. 2005). Another concern when working with the DBM is that the bags are made of PVC and can be marginally permeable to gas (Carter and Jeukendrup 2002). In our study, the bags were emptied within ~5 min of test completion so that gas leakage was of little consequence (Hodges et al. 2005). Finally, the turbine flowmeter used by the Moxus to determine \(\dot{V}_{\text{E}}\) is no longer commercially available, and the Moxus system now uses a pneumotachometer sensor (Rosdahl et al. 2013). These two systems, however, have been recently compared and showed very similar degrees of accuracy and day-to-day reliability (Rosdahl et al. 2013).

While errors in ventilation receive significant attention, errors in the measurement of gas fractions are often not considered in validation studies (Carter and Jeukendrup 2002; Macfarlane and Wong 2012; Reybrouck et al. 1992). These errors, particularly in FeO2, will also have substantial effects on the calculation of \(\dot{V}{\text{O}}_{2}\) and should be reported (Hodges et al. 2005). When measuring FeO2 in the present investigation, the Moxus performance was better than some analyzers (Crouter et al. 2006; Gore et al. 2003; Jakovljevic et al. 2008), but poorer than others (Bassett et al. 2001; Foss and Hallen 2004). The Moxus readings for FeO2 showed the best agreement with the DBM measurements of all variables, with a mean absolute percentage error of only 0.4 % and mean bias of 0.01 percent-points L min−1. Still, since each 0.01 percent-point error in FeO2 amounts to 0.13 mL of O2 for each liter of \(\dot{V}_{\text{E}}\), and considering our limits of agreement for FeO2 (±0.18 percent-points L min−1), FeO2 errors alone could explain differences of 350 mL in \(\dot{V}{\text{O}}_{2}\) at a \(\dot{V}_{\text{E}}\) of 150 L min−1.
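The arithmetic behind this estimate can be reproduced with a short Python sketch. It assumes that \(\dot{V}{\text{O}}_{2}\) is computed from expired ventilation with the Haldane transformation, and that inspired dry room air contains 20.93 % O2 and 79.04 % N2. Neither assumption is stated in the text above, and the function name is hypothetical. Under these assumptions, an error in FeO2 propagates to \(\dot{V}{\text{O}}_{2}\) with a sensitivity of 1 + FiO2/FiN2 per unit of \(\dot{V}_{\text{E}}\).

```python
# Assumed inspired fractions for dry room air (not given in the text)
FI_O2 = 0.2093
FI_N2 = 0.7904


def vo2_error_ml_min(feo2_error_pp: float, ve_l_min: float) -> float:
    """Magnitude of the VO2 error (mL/min) caused by an FeO2 error.

    With the Haldane transformation,
        VO2 = VE * (FI_O2 * (1 - FeO2 - FeCO2) / FI_N2 - FeO2),
    so |dVO2/dFeO2| = VE * (1 + FI_O2 / FI_N2).
    feo2_error_pp is the FeO2 error in percentage points.
    """
    fraction_error = abs(feo2_error_pp) / 100.0
    sensitivity = 1.0 + FI_O2 / FI_N2
    return fraction_error * sensitivity * ve_l_min * 1000.0


# Error per litre of ventilation for a 0.01 percentage-point FeO2 error
print(round(vo2_error_ml_min(0.01, 1.0), 3))   # about 0.126 mL

# Error at the FeO2 limits of agreement (0.18 points) and VE = 150 L/min
print(round(vo2_error_ml_min(0.18, 150.0)))    # about 341 mL/min
```

With these inputs the sketch gives roughly 0.13 mL of O2 per liter of ventilation and about 340 mL min−1 at 150 L min−1, in line with the rounded figure of 350 mL quoted above.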

Errors in the measurement of gas exchange parameters can substantially affect the capacity to correctly detect small changes in \(\dot{V}{\text{O}}_{2}\), as required in many criteria for determining the occurrence of a plateau during incremental tests. The original plateau criterion (Taylor et al. 1955) was essentially an application of the limits of agreement, where values up to two standard deviations smaller than the mean stage-to-stage difference in \(\dot{V}{\text{O}}_{2}\) were considered normal, so that only lower values would be considered a plateau. Nowadays, however, the search for \(\dot{V}{\text{O}}_{2\hbox{max} }\) using automated gas analyzers has distorted the original criterion, and multiple plateau criteria have emerged (Noakes and St Clair 2004). While the use of ramp incremental tests with small load increments over time to measure \(\dot{V}{\text{O}}_{2\hbox{max} }\) and detect small changes in \(\dot{V}{\text{O}}_{2}\) (a plateau) is widely promoted (Meyer et al. 2005), our results support the idea that this might be beyond the accuracy of modern analyzers (Howley et al. 1995), or perhaps even the accuracy of the DBM system, given that the possible sources of error with the current “gold-standard” method to evaluate oxygen consumption also increase as the collection interval decreases.

In the present study, comparisons of \(\dot{V}{\text{O}}_{2}\) between the Moxus and the DBM during high-intensity running resulted in limits of agreement similar to those for the \(\Delta \dot{V}{\text{O}}_{2}\) over two consecutive 30-s intervals measured by both systems (±402 mL min−1 O2 vs. ±386 mL min−1). Thus, in this case, the level of uncertainty around the difference between two consecutive measurements was not greater than the uncertainty around each measurement alone (Howley et al. 1995). Nonetheless, these data indicate that if our exercise protocol had been designed to test the incidence of the plateau phenomenon, larger increases in workload should have been used together with longer collection intervals (Howley et al. 1995). Both proposed changes tend to minimize errors in measurements and allow the expected changes in \(\dot{V}{\text{O}}_{2}\) to be greater than the limits of agreement for the system accuracy in a particular condition. It has been demonstrated, however, that the time averaging of gas exchange parameters does not affect the reproducibility of \(\dot{V}{\text{O}}_{2\hbox{max} }\) itself (Midgley et al. 2007). Another possibility is the use of supramaximal testing to confirm the attainment of \(\dot{V}{\text{O}}_{2\hbox{max} }\) (Midgley and Carroll 2009), which has convincingly demonstrated that the occurrence of a plateau is not a pre-requisite for measuring \(\dot{V}{\text{O}}_{2\hbox{max} }\) (Howley 2007).

In our study, \(\dot{V}_{\text{E}}\), \(\dot{V}{\text{O}}_{2}\) and \(\dot{V}{\text{CO}}_{2}\) were lower for the DBM than for the Moxus system. Howley et al. (1995) have already identified this possibility during high-intensity exercise with short sampling intervals, but to our knowledge it had not been evaluated before. Our Moxus system has already been tested against a different set of DBs and Dry Gas Meter and under more stable conditions (Medbo et al. 2012), with closer agreement between methods than reported here, in spite of equal systematic bias for \(\dot{V}{\text{O}}_{2}\) (~80 mL min−1). These two studies were performed by the same researchers and at the same time of the year, the only differences being the DBM equipment used, the averaging interval and the exercise protocols evaluated. Also, our Moxus system has been shown to be extremely reliable when measuring \(\dot{V}{\text{O}}_{2\hbox{max} }\) over different days (Beltrami et al. 2012; Medbo et al. 2012). Taken together, these observations suggest that our dry gas meter may have been slightly inaccurate (Carter and Jeukendrup 2002), which is a limitation of the present investigation, and points to the uncertainty of the true value of measurements even when using “gold standard” techniques, as suggested by Bland and Altman (1995), but criticized by others (Medbo et al. 2002). Unfortunately, the equipment (Carter and Jeukendrup 2002) to test the accuracy of the dry gas meter was not available, and testing it by pumping air from the calibration syringe multiple times into the bags is not ideal. However, it must be noted that even metabolic simulators have been suggested to be problematic when used for validating gas analyzers during breath-by-breath and perhaps even time-averaged measurements (Kim and Robergs 2012). This has led to the suggestion that no “true” gold standard method exists for determining oxygen uptake, even though the DBM is still widely used as a criterion method for validating automated gas analyzers (Hodges et al. 2005). Another limitation of the current investigation is that although the Moxus system has a flow velocity controller for the air coming from the mixing chamber, this does not apply to the gas coming from the calibration bottles. O2 analyzers are known to be highly dependent on pressure, or flow velocity (Macfarlane 2001), and to counteract this problem all calibrations were done by the same investigator. Finally, both systems used the same gas analyzers. Although this can affect the level of certainty about the values, the S-3A/I and CD-3A analyzers have been on the market for many years, and are often part of the DBM setup for validation studies (Bassett et al. 2001; Duffield et al. 2004; Macfarlane and Wong 2012).

Conclusion

In summary, our data show that despite good agreement between the DBM and the Moxus system, within recommended ranges of accuracy, there are large discrepancies in \(\Delta \dot{V}{\text{O}}_{2}\) measured over two consecutive 30-s intervals by the two sets of equipment. Researchers should be aware of the accuracy of their equipment before establishing a protocol for an incremental test, and choose load increments and sampling intervals that are large enough to minimize random errors and so allow small changes in \(\dot{V}{\text{O}}_{2}\) to be correctly detected. Further testing of gas analyzers during high-intensity exercise is also encouraged to determine the true ability of modern equipment to detect whether or not the plateau phenomenon occurs during maximal incremental exercise tests.