1 Introduction

Untargeted metabolomics based on mass spectrometry (MS) can provide insight into human health and disease that may not be apparent using nucleic acid or protein-based analytical approaches (Babu & Snyder, 2023). While metabolomic studies primarily focus on aqueous metabolites in samples such as blood, urine, and feces, breath is a rich and diverse matrix containing thousands of different volatile organic compounds (VOCs) (Costello et al., 2014; Haworth et al., 2022). The non-invasive nature of breath sampling makes it particularly attractive for clinical applications, such as early diagnosis and ongoing longitudinal monitoring.

However, the validation of clinically useful breath biomarkers remains limited. This is likely due, at least in part, to the lack of consistent methodologies and quality controls across the breath research literature (Issitt et al., 2022; Jia et al., 2019). To advance the field of breath analysis, there is an urgent need to develop a robust platform that can accurately identify the VOCs that genuinely originate from the breath (comprising endogenous VOCs derived from metabolic processes and exogenous VOCs, such as microbiome-derived or dietary compounds). These breath-borne VOCs need to be distinguished from background VOCs, which arise from the sampling equipment and from surrounding air inhaled immediately before sampling and are unrelated to underlying physiology. Establishing an accurate and repeatable methodology will expedite the identification and validation of VOC biomarkers of disease in future studies.

There are multiple approaches for collecting and analyzing breath, each with different advantages, limitations, and challenges (Haworth et al., 2022). Untargeted breath biomarker discovery workflows are the most common. These often produce data on unknown VOCs, or on VOCs tentatively identified by comparison with publicly available standard libraries, such as those provided by the National Institute of Standards and Technology (NIST). Relying on library matches increases the risk of misidentification due to differences in methodology and instrumentation between untargeted datasets and reference libraries, impeding the replication and validation of findings. Accurate identification of the VOCs in a breath sample requires a comparison to purified chemical standards analyzed using the same instrumentation and methods (Fiehn et al., 2007; Sumner et al., 2007). At least one unique chemical standard is required for every VOC to be identified, but as this is costly and time-consuming, many studies forgo this critical process.

Another unmet need in the breath field is standardized methods for background correction (Herbig & Beauchamp, 2014). Many reports have noted the significance of background contributions to the VOCs observed in breath samples, which can originate from multiple potential sources such as ambient air or breath sample collection equipment (Di Gilio et al., 2020; Westphal et al., 2022). Common background correction techniques in breath analysis include the calculation of an alveolar gradient to identify VOCs that are more abundant in breath (Phillips, 1997), or using a lung washout with synthetic air to identify VOCs likely to be contributed by inhaled background (Hewitt et al., 2022; Maurer et al., 2014; Schubert et al., 2005; Španěl et al., 2013; Westhoff et al., 2022). However, the success of background correction relies heavily on the quality of the background measurement. One commonly used method for background measurement is to take a sample of ambient air in the same location where breath sampling is being performed, which neglects compounds originating from the sampling equipment. Since VOCs are ubiquitous in the environment and can therefore be introduced through multiple components and points throughout the analytical process (Pham et al., 2023), it is important to ensure comparable collection and handling of breath and background samples. A method for doing so has previously been described for the ReCIVA® Breath Sampler (Di Gilio et al., 2020; Doran et al., 2017).

Attempts have also been made to develop compendia of breath biomarkers (Drabińska et al., 2021; Kuo et al., 2020), which are useful assemblies and distillations of the important literature in this field. However, they share the limitations of the underlying literature. A general lack of standardization in sampling, analysis, and identification of VOCs means that it is difficult to quickly assign confidence to any single observation without reviewing the underlying literature. In this study, we present a novel methodology that combines robust breath and background collection, analytical distinction of breath VOCs from background contamination, and VOC identification against chemical standards. We demonstrate the capability of this method by presenting a list of high-confidence breath VOCs identified from a heterogeneous human population.

2 Methods

2.1 Study design and subjects

This observational study was approved by the Reading Independent Ethics Committee (RIEC: 290620-1), and all participants provided written informed consent. All recruited adults (≥ 18 years; Cambridge, UK) met the inclusion criteria, were free of active respiratory infection symptoms or diagnoses (including COVID-19), and fasted for at least two hours prior to breath sampling. Volunteers with various chronic diseases were also included to account for normal variation in the population and to ensure breadth of VOC detection. All subjects were treated as a single cohort in statistical analysis, as the study’s intention was not to compare differences between disease and control. Breath samples were collected from 99 adult volunteers between January and February 2022. Nine samples were excluded due to saliva contamination (determined by observation of saliva/bubbles within the tube) (n = 8) or incomplete collection volume (n = 1). The final analysis consisted of 90 breath samples and 90 paired system backgrounds (Table 1). Of the 90 adult subjects, 24 had some type of chronic disease, including type 2 diabetes, high blood pressure, arthritis, and irritable bowel syndrome.

Table 1 Study cohort demographics. A breakdown of the total sample sizes (n) is shown along the top row, split into columns by the overall cohort, and then divided by sex (male and female). The mean age, BMI, and standard deviation (SD) for the cohort are shown in the top two rows. The total n and different percentages (%) of the cohort for smoking status and ethnicity are shown in subsequent rows, with finer breakdowns outlined

2.2 Breath sampling and analysis

The methodology used in this study is known as the Owlstone Medical Novel Insights (OMNI) method and is described in this section. Breath samples were collected using Owlstone Medical’s ReCIVA® Breath Sampler. The ReCIVA Breath Sampler pre-concentrates breath samples onto adsorbent tubes, enabling collection of a larger volume of air and offering the potential for greater sensitivity in detecting low-abundance compounds. Subjects breathed normally into the ReCIVA mouthpiece while wearing a nose plug (Supplementary Fig. 1). Approximately 1.25 L of breath was collected onto each of four sorbent tubes: two tubes (2.5 L) were used for analysis and two were kept as backup. A total of 5 L of breath was therefore collected per participant, which required approximately 12–15 min of normal tidal breathing into the device.

Ambient contamination was minimized using the CASPER® Portable Air Supply during breath sampling, which filters ambient air into the ReCIVA (Supplementary Fig. 1). The CASPER is a portable air supply that takes in room air, filters it to remove VOCs and particulates, and supplies it directly into the ReCIVA Breath Sampler for a subject to breathe into. The CASPER removes VOCs using a replaceable air filter pack that is filled with activated carbon; VOCs from the ambient air are adsorbed to the surface of the carbon. These tools combined were used to collect VOCs produced from the subject and eliminate VOCs that are re-breathed directly from the air in the room. Equal volumes of matched system background samples were collected immediately before each breath sample. Using internal, fast response pressure sensors, the ReCIVA Breath Sampler can monitor patient breathing patterns in real-time (see Supplementary Fig. 1). Alongside the paired software, these sensors estimate when the end-tidal fraction of breath is being exhaled, and the sampling pumps are automatically turned on and off at the necessary time to collect that breath fraction. For background samples, all collection hardware was configured as if to collect a breath sample, but with the mouthpiece opening sealed and the software configured to sample continuously (no selection for specific fractions of breath). The exact settings are detailed in Supplementary Table 1.

Breath and background samples were analyzed using the Breath Biopsy OMNI settings (see Supplementary Materials for more details). The tubes were purged with a TD-100 (Markes International Ltd., Llantrisant, UK) and stored at a temperature of 4–8 °C for no more than 27 days before analysis. Breath samples and their paired background samples were liquid-injected with a mix of eight deuterated internal standard compounds (Supplementary Table 2) solubilized in methanol and analyzed in the same sequence using TD-GC-MS. A series of straight-chain alkanes (C5–C16) was spiked onto a separate tube (50 ng per alkane) and analyzed within each analytical sequence to enable the calculation of retention indices. Analysis was conducted on the TD (Markes) – Q Exactive Orbitrap (Thermo Fisher Scientific) high-resolution accurate mass spectrometry platform, using the settings specified in Supplementary Table 3.
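
The alkane ladder allows each peak's retention time to be converted into a retention index. The formula used is not stated in this section, so the following is a minimal sketch that assumes the linear (van den Dool and Kratz) retention index commonly applied to temperature-programmed GC. The function name, argument layout, and retention times are hypothetical and for illustration only.

```python
from bisect import bisect_right


def linear_retention_index(rt, ladder):
    """Linear retention index of a peak from an n-alkane ladder.

    rt     -- retention time of the peak (same units as the ladder)
    ladder -- dict mapping alkane carbon number -> retention time
    Assumption: the linear (van den Dool and Kratz) formula applies.
    """
    carbons = sorted(ladder)
    times = [ladder[c] for c in carbons]
    if not times[0] <= rt <= times[-1]:
        raise ValueError("peak elutes outside the alkane ladder")
    # Index of the alkane eluting at or just before the peak.
    i = min(bisect_right(times, rt) - 1, len(times) - 2)
    n, t_n, t_next = carbons[i], times[i], times[i + 1]
    # Interpolate linearly between the two bracketing alkanes.
    return 100 * (n + (carbons[i + 1] - n) * (rt - t_n) / (t_next - t_n))


# Illustrative retention times in minutes (not measured values).
example_ladder = {5: 2.0, 6: 4.0, 7: 6.5, 8: 9.0}
print(linear_retention_index(5.25, example_ladder))  # 650.0
```

A peak eluting halfway between hexane and heptane is assigned an index of 650 under this scheme, which is the value later compared between breath features and reference standards.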

ReCIVA and CASPER breath collection, paired background sampling, TD-GC-MS analysis, and the feature extraction method are collectively known as the ‘OMNI’ method.

2.3 Feature extraction and data normalization

The resulting breath and background chromatograms (an example of a breath chromatogram is shown in Supplementary Fig. 2) were batch processed (spectral deconvolution, feature group clustering, and library matching to NIST17) utilizing the OMNI untargeted feature extraction method in Compound Discoverer (ver. 3.2, Thermo Scientific™), detailed in Supplementary Table 4. After feature extraction, all features were normalized using the measured peak area intensity response of spiked internal standard (IS) compounds to reduce analytical variability associated with TD-GC-MS. A hybrid correlation-retention time normalization method was applied, where the Pearson correlation coefficients between each feature in each sample and every IS compound were calculated. Features that had a correlation coefficient ≥ 0.8 with any IS compound were normalized using that IS compound’s response. When correlations for a given feature were below 0.8 for all ISs, the mean peak area response of the three IS compounds with the closest retention times to the feature was used for normalization.
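
The hybrid normalization rule can be sketched as follows. This is an illustration under stated assumptions, not the OMNI implementation: correlations are computed across samples between a feature's peak areas and each IS compound's peak areas, the best-correlated IS is used when more than one reaches 0.8, and normalization is taken to mean dividing the feature's peak area by the reference response. All names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant signal: treat as uncorrelated
    return sxy / sqrt(sxx * syy)


def normalize_feature(areas, rt, is_areas, is_rts, r_min=0.8, n_nearest=3):
    """Normalize one feature's peak areas across samples.

    areas    -- feature peak area per sample
    rt       -- feature retention time
    is_areas -- dict: IS name -> peak area per sample (same sample order)
    is_rts   -- dict: IS name -> retention time
    """
    r = {name: pearson(areas, vals) for name, vals in is_areas.items()}
    best = max(r, key=r.get)
    if r[best] >= r_min:
        # Correlation branch: use the best-correlated IS response.
        reference = is_areas[best]
    else:
        # Retention-time branch: per-sample mean of the nearest IS compounds.
        nearest = sorted(is_rts, key=lambda name: abs(is_rts[name] - rt))[:n_nearest]
        reference = [sum(is_areas[name][i] for name in nearest) / len(nearest)
                     for i in range(len(areas))]
    # Assumption: normalization is a ratio of feature area to reference area.
    return [a / ref for a, ref in zip(areas, reference)]


# Illustrative data: four IS compounds measured in three samples.
is_areas = {"A": [1.0, 2.0, 4.0], "B": [4.0, 2.0, 1.0],
            "C": [2.0, 2.0, 2.0], "D": [10.0, 10.0, 10.0]}
is_rts = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 20.0}
print(normalize_feature([2.0, 4.0, 8.0], 10.0, is_areas, is_rts))
```

In the example, the feature tracks IS compound "A" closely across samples, so it is normalized by "A" alone even though other IS compounds elute nearer to it.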

2.4 Calculations comparing breath and paired system background samples

Three metrics were used to compare VOCs in breath samples and paired system background samples:

  1. The standard deviation (SD) metric: A VOC was considered on-breath if its signal exceeded the mean of the system background signal plus 3 SDs in at least 50% of the breath samples from the cohort. Additionally, a feature observed in breath samples was automatically regarded as on-breath if its signal was observed in fewer than 4 system background samples.

  2. The paired t-test metric: A VOC was regarded as on-breath if the paired breath/system background samples were associated with a fold difference ≥ 2 and a one-tailed paired t-test p-value ≤ 0.05.

  3. The receiver operating characteristic area under the curve (ROC-AUC) metric: A VOC was considered on-breath if the fold difference between breath and background was > 1 and the calculated ROC-AUC value was ≥ 0.8.

Each on-breath metric queries the VOC signal detected in breath samples, with respect to the system background, in a different way. Confidence in a VOC’s assignment as “on-breath” therefore increases when multiple metrics classify it as such.
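
The three metrics can be sketched in a few lines of standard-library Python. Several details are assumptions rather than statements from the text: the fold difference is taken as the ratio of mean signals, the SD is the sample SD of the cohort's background values, and the one-tailed paired p-value uses a normal approximation to the t distribution (an exact test is available as `scipy.stats.ttest_rel` with `alternative="greater"`). Function names and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def fold_difference(breath, background):
    # Assumption: ratio of mean signals; the text does not define the statistic.
    return mean(breath) / mean(background)


def metric_sd(breath, background, n_sd=3, min_fraction=0.5, min_backgrounds=4):
    """Metric 1: breath exceeds background mean + 3 SD in >= 50% of samples.

    `background` holds only the backgrounds in which the VOC was observed.
    """
    if len(background) < min_backgrounds:
        return True  # observed in fewer than 4 backgrounds: on-breath by rule
    cutoff = mean(background) + n_sd * stdev(background)
    above = sum(1 for x in breath if x > cutoff)
    return above / len(breath) >= min_fraction


def metric_paired_t(breath, background, min_fold=2, alpha=0.05):
    """Metric 2: fold difference >= 2 and one-tailed paired p-value <= 0.05."""
    diffs = [x - y for x, y in zip(breath, background)]
    sd = stdev(diffs)
    if sd == 0:
        p = 0.0 if mean(diffs) > 0 else 1.0
    else:
        t = mean(diffs) / (sd / sqrt(len(diffs)))
        # Assumption: normal approximation to the t distribution.
        p = 1 - NormalDist().cdf(t)
    return fold_difference(breath, background) >= min_fold and p <= alpha


def metric_roc_auc(breath, background, min_auc=0.8):
    """Metric 3: fold difference > 1 and ROC-AUC >= 0.8."""
    # AUC as the probability that a breath value exceeds a background value.
    wins = sum((x > y) + 0.5 * (x == y) for x in breath for y in background)
    auc = wins / (len(breath) * len(background))
    return fold_difference(breath, background) > 1 and auc >= min_auc


# Illustrative paired signals for one VOC (not study data).
breath = [10.0, 12.0, 9.0, 11.0, 13.0, 10.0]
background = [2.0, 3.0, 2.5, 3.5, 2.0, 3.0]
print(metric_sd(breath, background),
      metric_paired_t(breath, background),
      metric_roc_auc(breath, background))  # True True True
```

Keeping each metric as a separate function returning a boolean makes it straightforward to report, per VOC, which metrics it passed, in the same TRUE/FALSE form used for the identified compounds.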

2.5 VOC identification using chemical standards

The candidate identities of on-breath VOCs were determined by matching the breath data against the NIST library and cross-checking against the Human Metabolome Database (HMDB) (Westhoff et al., 2022). All NIST matches with a similarity index (SI) Match Factor ≥ 500 were considered for confirmation using purified standards. This threshold was chosen to reduce the risk of missing true matches due to spectral differences between the NIST spectra and the experimental data. These differences can arise from variations in the analytical methods, instruments, and deconvolution parameters used to generate the spectra. Specifically, the in-house library data was generated using Orbitrap-MS, whereas the NIST spectra are based on quadrupole mass analyzer data. The VOCs with the highest SI score and present in the HMDB were prioritized. Additionally, a list of commonly reported VOCs with hypothesized biological relevance was compiled from a literature search and added to the candidate list.

A certified reference standard (minimum 95% purity) was sourced for each candidate compound and analyzed to generate spectra for matching against on-breath VOCs. Reference standards were dissolved in methanol, due to its suitability for GC-MS analysis in terms of expansion coefficient and solubility for each candidate compound. Two to sixteen standards were grouped into each mix using NIST17-reported retention index values to minimize co-elution risk. The prepared chemical mixes were then liquid injected onto Tenax TA-Carbograph-5TD sorbent tubes resulting in 50 ng on-tube mass per chemical and analyzed using the OMNI analytical method (see Supplementary) alongside a C5 to C16 straight chain alkane RI ladder.

Spectra for individual reference standards were identified by deconvolution (using the Thermo Scientific GC Deconvolution plugin with the peak detection settings in Supplementary Table 3) followed by cross-referencing with the NIST library (NIST 17 mainlib and replib). A background tube loaded with methanol was examined to ensure the peaks of interest were not derived from contamination during analytical processing. Mass spectral cleaning was performed to retain only the high-resolution accurate mass fragments suspected to derive from the reference standards. The final spectrum of each reference standard was used to confirm the on-breath VOC identities. Confirmation required a successful match in the three breath chromatograms with the highest normalized peak area intensity (forward and reverse similarity index (SI and RSI) above 800, retention index within ± 2 units). Undetected candidate standards were re-analyzed at higher concentrations (150 ng on-tube mass) to increase detection probability.
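
The confirmation rule can be expressed as a short check. This is a sketch under assumptions: the record layout and function name are hypothetical, "above 800" is read as a strict inequality, and the retention index tolerance is read as inclusive.

```python
def is_confirmed(matches, min_score=800, ri_tolerance=2, n_top=3):
    """Check the identification criteria on the most intense chromatograms.

    matches -- list of dicts, one per breath chromatogram, with keys
               'area' (normalized peak area), 'si', 'rsi',
               'ri_sample' and 'ri_standard'
    """
    # Keep the n_top chromatograms with the highest normalized peak area.
    top = sorted(matches, key=lambda m: m["area"], reverse=True)[:n_top]
    return len(top) == n_top and all(
        m["si"] > min_score
        and m["rsi"] > min_score
        and abs(m["ri_sample"] - m["ri_standard"]) <= ri_tolerance
        for m in top
    )


# Illustrative records (not study data); the weakest peak is ignored.
example = [
    {"area": 9.1, "si": 870, "rsi": 905, "ri_sample": 651.0, "ri_standard": 650.0},
    {"area": 8.4, "si": 845, "rsi": 880, "ri_sample": 649.2, "ri_standard": 650.0},
    {"area": 7.9, "si": 812, "rsi": 860, "ri_sample": 650.5, "ri_standard": 650.0},
    {"area": 1.2, "si": 640, "rsi": 700, "ri_sample": 655.0, "ri_standard": 650.0},
]
print(is_confirmed(example))  # True
```

Only the three most intense chromatograms are evaluated, so a poor spectral match in a low-abundance sample does not by itself prevent confirmation.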

3 Results

3.1 Distinguishing on-breath VOCs from background

Following the analysis of the 90 adult breath samples and paired system backgrounds using the OMNI method, 1471 unique features were present in ≥ 80% of breath samples. Three metrics, detailed in the Methods section, were applied to identify the subset of VOCs present in the breath at levels significantly above those in the system background (henceforth referred to as “on-breath”).

Figure 1A shows the total number of on-breath VOCs that were calculated using each of the three metrics. There is substantial overlap in the subsets of on-breath features classified by each metric. A total of 585 VOCs were identified as on-breath using any metric, and, of these, the majority (328/585 = 56%) were on-breath by all metrics. Metric 1 was the most stringent, with most (328/346 = 95%) of the features identified as on-breath using metric 1 also on-breath by the other two metrics.

Metric 1 includes a flexible cut-off for the frequency of a VOC’s appearance on-breath at levels 3 SDs above background (Fig. 1B). In this analysis, a 50% frequency threshold was applied, restricting the subset of on-breath VOCs to 346 out of the total 1471 (23.5%) features (Fig. 1B). This threshold was chosen to emphasize the VOCs that are on-breath in the majority of samples but could be adjusted to accommodate other analyses. For example, if a more stringent threshold were deemed appropriate, fewer VOCs would be considered on-breath.

The three metrics were chosen to provide complementary insights into the potential composition of on-breath VOCs. Equally, metrics of differing stringencies, when combined into a panel, may give higher confidence in an on-breath identification. For example, being on-breath by multiple metrics at once, or by metrics with lower odds of a false positive, may provide higher confidence that a VOC is indeed on-breath, while still ensuring that a wide range of potentially on-breath VOCs is captured by at least one metric.

Fig. 1 A - Venn diagram showing the numbers of VOCs classified as on-breath by each metric, along with the number of those VOCs that have been identified, in brackets. B - Bar chart showing the frequency with which individual VOCs are classified as on-breath across all samples using metric 1. The dotted line indicates the 50% threshold, restricting the number of on-breath VOCs above this cut-off to 346 of the 1471 total

3.2 Identified VOCs: chemical characteristics

A total of 148 VOCs (25% of the 585 VOCs on-breath by any metric) were assigned identities based on comparison with reference standards analyzed using the same analytical method as this dataset. A total of 825 purified chemical standards were run to achieve this; 37% of NIST matches with SI scores over 800 were found to be the true identity of the on-breath VOC. Factors impeding the identification of the remaining on-breath features include poor matches against the NIST library due to differences in the analytical methodologies used (as discussed above), logistical considerations (such as lack of standard availability, safety considerations, and/or prohibitive cost), and potential spectral issues during deconvolution (such as co-elution or peak splitting), whereby the resulting VOC spectra may not be an accurate representation of a true compound. Possible avenues of further work to overcome these limitations include custom synthesis of standards not readily available off the shelf and tailored spectral deconvolution settings for the peak-rich regions of the breath sample chromatograms.

Of the 148 identified VOCs, 102 are on-breath by all three metrics, a single identified VOC is on-breath by metric 1 alone, three identified VOCs are on-breath by metric 2 alone, and nine identified VOCs are on-breath by metric 3 alone (Fig. 1A). A substantial portion of the on-breath VOCs that have been assigned formal identifications (29/148 = 19.6%) were not classified as on-breath by metric 1, but they were on-breath by both metric 2 and metric 3 (Fig. 1A). While the three metrics have substantial overlap in the VOCs they determine to be on-breath, each metric contributed unique entries to the final pool of identified on-breath VOCs and may be appropriate for different VOCs or study designs. Additionally, to include VOCs that are on-breath in only a subset of the population, due to their uniqueness to a particular demographic, a separate analysis was carried out whereby features’ on-breath status was calculated per collected demographic variable, by considering only the system background samples relevant to the specific sub-population. This resulted in 3 additional on-breath features: two unique to the age 70 + group (one of which was successfully identified as 1,3-Dimethylcyclohexane), and one unique to the 30 + BMI group.

A full list of the identities of the 148 on-breath VOCs is presented in Table 2.

Table 2 The table shows all 148 on-breath VOCs that were confidently chemically identified in this study, along with their InChI key identifier and chemical classification. The criteria they passed (TRUE) and failed (FALSE) to be classified as on-breath by the three metrics are also shown. The compounds are sorted by alphabetical order of chemical class

4 Discussion

The challenges in studying breath VOCs are well-known in the research community. Healthy human breath profiles have been developed to understand how physiological conditions, including age, gender, and circadian rhythms, can influence breath profiles (Sasiene et al., 2024). However, differentiation of on-breath compounds from background, and confirmation of their identities, remain major obstacles to advancing their applications in diagnostics and clinical settings. In this study, we present a list of 148 breath-associated (on-breath) chemically identified VOCs. The integrity of these data relies on stringent criteria for two key aspects: distinguishing VOCs from background contaminants and confirming their chemical identity. The on-breath VOCs presented in this study have been confidently chemically identified using Metabolomics Standards Initiative (MSI) standards and were distinguishable from background contaminants through a robust methodology in a heterogeneous human population (spanning a range of ages, BMIs, and ethnicities). On-breath VOCs span 45 chemical classes, indicating that they comprise a diverse pool of chemical entities, and 62% have been previously reported in the literature in different biological matrices such as blood, urine, and fecal matter. However, the identified VOCs may not necessarily be consistent with those of other studies in the literature, due to differences in the populations and analytical methodologies used previously.

It is also important to emphasize that on-breath VOCs can include both endogenously and exogenously generated VOCs (as exogenous VOCs can be very strongly breath-associated, especially those generated from internal sources such as the gut microbiome). Although not produced by the body, exogenous on-breath VOCs can interact widely with the host metabolome and demonstrate powerful utility in bridging the gap between research tools and breath-based clinical applications. One such example is indole, which is generated by the catabolism of tryptophan, mediated by the human gut microbiome; elevated levels of indole observed in cirrhosis could be explained by impaired hepatic clearance, providing a plausible mechanistic relationship between a microbiome-related VOC and a clinical diagnosis (Ferrandino et al., 2023). In terms of well-established clinical applications, the gold standard diagnostic test used in gut health clinics for small intestinal bacterial overgrowth (SIBO) involves the ingestion of an exogenous substrate (lactulose). If SIBO is present, this synthetic sugar is metabolized by bacteria in the small intestine to produce molecular hydrogen, which is detectable on breath. Similarly, limonene, a VOC arising from dietary exposure, was observed to be elevated in the breath of subjects with cirrhosis compared to controls in multiple studies (Dadamio et al., 2012; Fernández del Río et al., 2015), suggesting that reduced liver function and impaired hepatic perfusion induce limonene accumulation in the body, resulting in elevated levels in breath. These alterations make limonene a candidate biomarker for non-invasive cirrhosis detection using a breath test (Ferrandino et al., 2023).

Acetone, isoprene, and indole can be used as a starting point to assess the consistency of the results of this study with breath compositions reported elsewhere, as these are some of the most abundant and commonly identified breath VOCs (Drabińska et al., 2021). All three of these compounds were frequently found to be on-breath within this population, supporting the replicability of this study’s results. Isoprene has previously been associated with a broad range of disease states. However, there are doubts over how useful a biomarker breath isoprene currently is, owing to its lack of specificity to particular disease states and its sensitivity to individual breathing patterns and movement (Mochalski et al., 2023). Breath isoprene has recently been mechanistically associated with skeletal muscle metabolic activity using a multi-omic approach (Sukul et al., 2023; Mochalski et al., 2024), demonstrating endogenous origin. The majority of breath isoprene is produced through the IDI2 protein, which is present only within skeletal-myocellular peroxisomes (Sukul et al., 2023); this supports the observation that breath isoprene abundance increases after exercise (Chou et al., 2024; Pugliese et al., 2022). This understanding of the compound’s mechanistic origin in the body has helped to associate isoprene with a specific physiological process and could help establish what clinically useful information could be gained from the use of breath isoprene as a biomarker.

As the current study aimed only to characterize the composition of normal human breath, the identified on-breath VOCs cannot currently suggest metabolic pathway changes. However, certain valuable insights can be gained. For example, acetic acid and propionic acid were both found to be on-breath and are two well-characterized short-chain fatty acids (SCFAs) associated with the gut microbiome. SCFAs are considered exogenous VOCs because they are produced by microbial fermentation of dietary fiber and are thought to diffuse into local blood vessels of the gastrointestinal tract, travel via the blood, and enter the breath through alveolar exchange. The microbially formed gas hydrogen produced in the gastrointestinal tract is rapidly detectable in the breath through this mechanism, and is therefore currently used in the clinic to diagnose conditions such as small intestinal bacterial overgrowth (Pitcher et al., 2022; Read et al., 1985; Sachdev & Pimentel, 2013). The abundance of SCFAs has been implicated in multiple health contexts, including cancer, neurodegenerative disease, and inflammatory bowel disease (Duizer & de Zoete, 2023; Majumdar et al., 2023; Ney et al., n.d.; Parada Venegas et al., 2019; van Vorstenbosch et al., 2023; Wang et al., 2023). In addition, their multiple signaling roles have become increasingly appreciated for their potential impacts on human health (Louis & Flint, 2017; Miller & Wolin, 1996). Therefore, SCFAs could in the future serve as breath biomarkers of disease, much like the hydrogen and methane breath tests. Their characterization on-breath and the development of reference ranges in a healthy population are essential for this development, for which this study provides a useful starting point.

Isoprene, acetic acid, and propionic acid are examples of the connections between the identified on-breath compounds and the literature, but many more of these compounds have been associated with physiological processes in the literature, such as carbon disulfide, dimethyl sulfide, and dimethyl disulfide (Carrión et al., 2015; Di Cagno et al., 2011; Grabowska-Polanowska et al., 2017; Preter et al., 2015). While the mechanisms behind the appearance of certain on-breath VOCs in this dataset remain unknown, their identities have, crucially, been verified. This suggests that novel pathways and a new understanding of physiological processes could be discovered if the levels of these VOCs are found to change in disease cohorts.

In this study, the system background samples were collected from the entire equipment flow path of air to capture all possible sources of VOC background in the sampling process. Moisture levels may differ between background samples and breath. Given that humidity and temperature are key variables for VOC capture, mitigations have been implemented to reduce this differential: hydrophobic sorbents are used in our TD tubes and all samples are dry purged prior to analysis. We acknowledge that this minimizes the carryover risk, but not the instantaneous risk. Despite our best efforts, breath and blank samples are ultimately different sample types and there may be small analytical differences introduced as a result. We do not consider this to have a major impact on the conclusions drawn in this article.

The analytical methodology utilized offers high mass accuracy and resolution capability, resulting in precise mass measurement of ions and separation of closely spaced mass peaks. These functionalities provided high specificity by accurately determining molecular formulas and distinguishing between different VOCs with similar masses, respectively, thereby increasing the accuracy of VOC identification. For background correction, this study built on the alveolar gradient approach and utilized three metrics to compare breath and background signals, which can account for the variability of VOCs in the samples while also being practical for field implementation. Stringent processes were applied to the identification of VOCs using reference standards run on the same analytical method and instrument, accurate mass (within 5 ppm tolerance) to compare fragmentation spectra, and an alkane ladder to enable a strict RI match. VOCs were only identified if they could be characterized as “on-breath” by pre-defined metrics used to compare breath to representative system background samples. These metrics are intended to capture the widest range of on-breath VOCs, while providing sufficient stringency. In the future, they can be used alone or in combination, depending on the study design and VOCs of interest. The multiple metrics not only expand our knowledge of common on-breath VOCs but also allow differentiation from background, increasing confidence when comparing VOCs across different study cohorts and methodologies.

Despite the stringent methodology used to exclude background contaminants, it is still likely that certain on-breath compounds arise from processes unrelated to underlying physiology. It is also possible that some breath VOCs are currently below our effective detection limit due to the presence and variation of VOCs in the background, and systematically reducing background contamination could improve this. The list of on-breath VOCs provides a starting point for this work, as targeted analyses can help identify where background contamination may be reduced and which specific chemicals, or groups of chemicals, are likely to be affected. Future work will expand the capabilities of detection via sampling onto different sorbent beds and using different GC-column chemistries, broadening the range of VOCs covered. This study included a heterogeneous cohort, enabling a comprehensive characterization of breath composition. It is likely that some VOCs will be significantly elevated in diseased cohorts, such that they may be characterized as on-breath in diseased populations alone. Therefore, future efforts should focus on identifying and quantifying on-breath VOCs in both heterogeneous and diseased populations to build a platform for cross-population comparison.

The on-breath VOCs identified in this study can offer utility to the breath field in numerous ways: they can serve as targets for optimizing breath measurement platforms, enabling efficient and accurate identification of potential biomarkers. With appropriate controls, these VOCs may also facilitate comparison of data across studies for cross-validation of results. Furthermore, the list of VOCs can be utilized to optimize precise and accurate measurement of breath via informed selection of patient preparation, breath collection, sample storage, and sample analysis. This optimization process will preserve biological variability while minimizing technical variability, thereby advancing the reliability and reproducibility of breath-based biomarker research. This list of chemically confirmed on-breath VOCs distinguished from background contaminants lays the foundation for the development of the Breath Biopsy VOC Atlas®, an ongoing project to develop a database of chemically identified breath VOCs complete with on-breath status, and quantified reference ranges across different cohorts, including different disease states. Biological interpretation of VOCs in the breath will significantly help to confidently assign on-breath VOC status, and therefore adding mechanistic understanding of breath VOCs in the literature is important future work. Work is ongoing to build the VOC Atlas as a reference database on confirmed compounds in the breath alongside their scientific context in the literature, utilizing the robust OMNI method and the on-breath VOC list presented in this work.

5 Conclusion

Through the development of a robust methodology, this study collected and compared breath and background samples of a heterogeneous human population to identify on-breath VOCs. These on-breath VOCs can serve as a reference for breath researchers to improve confidence that their results are capturing truly on-breath VOCs, and the list will continue to expand as additional VOCs are identified using reference standards. Future work will expand the list to include a broad range of populations and physiologies to capture the diversity of on-breath VOCs. By continuing to compare background samples collected and analyzed in the same manner as their breath samples, VOCs confidently identified as being on-breath can be the basis for future biomarker investigations.