1 Introduction

Artificial neural networks and deep learning have contributed to major breakthroughs in applications such as image classification, segmentation, image and text generation, and natural language processing (NLP). These new frameworks outperform conventional machine learning algorithms in many disciplines [1,2,3,4,5].

However, they cannot be trivially applied to regression tasks in the natural sciences. In contrast to most conventional data analysis methods, hyperparameter optimization often requires complete retraining of the neural network and is therefore associated with high computational effort. Error estimation of the output is a research field of its own [6], and the model prediction can react chaotically to tiny deviations of the input [7]. Deep learning applications often differ from regression tasks with respect to their goals, constraints and requirements [2].

Nonetheless, several studies have successfully applied neural networks to spectroscopy tasks and reported performance increases compared to conventional approaches. Applications include classification of data obtained from Raman and other spectroscopic techniques [8, 9], speeding up expensive calculations via surrogate models for nonlinear tomography [10, 11] and spectrum prediction [12], and concentration estimation [13,14,15]. Nicely et al. [13] used a shallow neural network for fringe reduction in direct absorption spectroscopy using simulated data. Tian et al. [15] report good linearity of a direct neural-network fit for high-SNR input spectra. In our recent study, we observed good performance of a neural-network-based noise reduction scheme for a specific noise structure where all tested conventional methods fell short [16].

This study will focus on absorption spectra obtained via wavelength modulation spectroscopy, but the main considerations should be transferable to direct detection methods or other data acquisition schemes as well. The main noise sources of wavelength modulation spectroscopy instruments can be split into relative and absolute contributions. The relative contributions cause disturbances proportional to the concentration of the measured species in the cell, e.g., variations in pressure, temperature, laser power and detector sensitivity. The main absolute limitation is often caused by the notorious etalon fringe patterns that emerge from reflective surfaces of the optics [17, 18]. Other possible noise sources include interference from absorption by other molecules, laser and detector instabilities, or the signal processing electronics. This study will, however, focus mainly on the reduction of noise resulting from interference fringe patterns. The procedures described may also be able to remove other noise types, as long as the requirements in Sect. 2.3 are fulfilled.

1.1 Neural networks overview

The basic architecture of an artificial neural network (ANN) consists of iterated layers \(y_k\): linear transformations implemented via matrix multiplications \(W_k\), each followed by a non-linearity \(\sigma\) [2]:

$$\begin{aligned} y_{k+1} = \sigma \left( W_k y_k\right) \end{aligned}$$
(1)

Theoretically, this architecture can approximate any continuous function using only a single intermediate ("hidden") layer [19,20,21]. Advanced architectures try to optimize the ability of the model to learn and to generalize while being efficient in time and memory.
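For illustration, a minimal numpy sketch of the layer recursion in Eq. (1); the layer widths, the ReLU non-linearity and the random initialization are arbitrary assumptions, not choices made in this study:

```python
import numpy as np

def relu(x):
    # One common choice for the non-linearity sigma.
    return np.maximum(x, 0.0)

# Arbitrary layer widths for illustration: 512 -> 128 -> 1.
rng = np.random.default_rng(0)
weights = [rng.normal(0.0, 0.05, size=(128, 512)),
           rng.normal(0.0, 0.05, size=(1, 128))]

def forward(y0, weights):
    # Iterate y_{k+1} = sigma(W_k y_k); the last layer is left linear.
    y = y0
    for W in weights[:-1]:
        y = relu(W @ y)
    return weights[-1] @ y

y0 = rng.normal(size=512)          # dummy input spectrum
print(forward(y0, weights).shape)  # -> (1,)
```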

A first breakthrough in image recognition tasks was achieved via convolutional neural networks (CNN) [22]. The matrix multiplications are replaced with convolutions of small filters whose weights are shared between all positions in the image [23]. This operation also ensures translation equivariance of the model output with regard to the input image [2]. Together with normalization schemes to overcome internal covariate shift [24], shortcut paths to counteract vanishing gradients [25] and randomization patterns to lower the chance of overfitting [26], models were constructed that outperformed human predictions for image classification tasks.

Vaswani et al. [27] introduced the transformer, a new architecture for NLP that builds on a mechanism called attention, where a query sequence gates the input sequence to focus on important parts. This method led to other similar architectures that have become state-of-the-art in NLP, such as Bidirectional Encoder Representations from Transformers (BERT) [5]. Others adjusted transformer-based architectures for image classification and achieved results comparable to state-of-the-art convolutional architectures [28, 29].

The field of deep learning is fast evolving, and the state-of-the-art candidates are constantly changing. Researchers seeking to apply deep learning to their discipline may not need to design whole neural network topologies themselves, but rather adjust already tested state-of-the-art networks to their needs. Results may benefit from focusing on selecting an appropriate, already available state-of-the-art model. A drawback of this approach, though, is the vast size of state-of-the-art models.

1.2 Opportunities for neural networks

These fringe patterns can interfere with the frequency region of the signal. In our previous study, we reported different behavior and performance of several noise reduction methods depending on this interference. Many conventional methods are based on the frequency separation of signal and background and therefore only show great performance if this interference is small; otherwise, they produce a large bias. We reported, however, an improved performance of a neural-network-based approach in a region of high interference [16].

This behavior can be motivated with an example: suppose an absorption spectrum is obscured by a purely sinusoidal background modulation and some white noise. When fitted, the resulting spectrum will show high variations depending on the phase of the sine background. Conventional numerical noise reduction schemes that act on the frequency domain will not be able to reconstruct the original signal but will most likely dampen the result towards zero. However, if the structure of the background is known a priori, noise reduction is very efficient. In this example, the structure of the background can be extracted from the edges of the spectrum by a curve fit, and the obtained sine can be subtracted from the noisy spectrum. Of course, such a simple background can simply be removed by established experimental techniques, e.g., lock-in amplification. However, the background structure in a real experiment is often more complex: it cannot be easily removed via experimental techniques, and its parameters cannot be easily fitted. A machine learning algorithm, though, could learn the distribution of background structures and infer their interference with the signal. This approach requires the background structure to follow a sparse distribution and to be stable over the whole absorption spectrum.
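This toy example can be sketched as follows; all signal shapes, amplitudes, the fringe period and the 64-point edge width are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import curve_fit

N = 512
x = np.arange(N)
signal = np.exp(-0.5 * ((x - N / 2) / 12.0) ** 2)        # toy absorption line
fringe = 0.8 * np.sin(2 * np.pi * x / 90.0 + 1.3)        # sinusoidal background
rng = np.random.default_rng(1)
spectrum = signal + fringe + rng.normal(0.0, 0.05, N)    # "measured" spectrum

def sine(x, a, period, phase):
    return a * np.sin(2 * np.pi * x / period + phase)

# Fit the sine on the absorption-free edges only; a rough period estimate
# (e.g. from an FFT of the edges) is assumed known for the starting values.
edge = np.r_[0:64, N - 64:N]
popt, _ = curve_fit(sine, x[edge], spectrum[edge], p0=(1.0, 90.0, 0.0))

denoised = spectrum - sine(x, *popt)   # background-subtracted spectrum
```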

Another conventional approach to reduce the impact of etalon fringe patterns is the regular determination of background spectra combined with an interpolation scheme [30, 31]. If the background structure changes point-wise and slowly enough, this approach gives precise estimates of the underlying background structure of a measured spectrum. The interpolated spectrum can be subtracted from the measurement and yields substantial noise reduction. A second example illustrates a problem of this approach: assume the background structure does not change point-wise but along the frequency axis. If, e.g., the phase changes on the order of half a cycle between two background measurements, the result of the interpolation will differ vastly from the true background structure. Given a priori knowledge about the speed of the phase shift, the background can again be reconstructed with high precision. In practice, obtaining this a priori knowledge can be very hard, as a realistic background has a much more complex structure and may consist of several phase-shifting periodic structures or beat interferences. The true shape of the fringe pattern could also be obscured by aliasing effects [32]. Again, a machine learning algorithm could learn the distribution of these structures and yield a better interpolation scheme to reconstruct the background.
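The failure mode of point-wise interpolation under a phase shift can be made concrete in a few lines; the fringe period and sampling are again illustrative assumptions:

```python
import numpy as np

N = 512
x = np.arange(N)

def fringe(phase):
    return np.sin(2 * np.pi * x / 90.0 + phase)

bgd_before = fringe(0.0)        # background measurement at t = 0
bgd_after = fringe(np.pi)       # next measurement, phase shifted half a cycle

# Point-wise linear interpolation halfway between the two measurements
# nearly cancels the fringe entirely ...
interpolated = 0.5 * (bgd_before + bgd_after)

# ... while the true background halfway through (assuming a linear phase
# drift) still has full fringe amplitude.
true_background = fringe(np.pi / 2)

print(np.abs(interpolated).max())                    # ~ 0
print(np.abs(true_background - interpolated).max())  # ~ 1
```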

In Sect. 2, possible interpretations of the data are discussed in order to motivate appropriate model architectures. In Sect. 3, the spectrometer is described in more detail, and the acquisition and processing of the training data are presented. In Sect. 4, performance is discussed for the training process, validation set evaluation, test set evaluation and further transfer applications.

2 Neural network architectures

As motivated in the introduction, two different machine learning objectives will be considered: concentration fitting (2) and temporal interpolation (3), with spectrum size N, number of spectra per measurement C and sequence interval T. The fitter F directly fits an absorption spectrum to obtain the trace gas concentration. The interpolator I receives a temporal sequence that contains regular background measurements and reconstructs the evolution of the background in the intermediate region.

$$\begin{aligned} F:&\ Z \longrightarrow A, \quad Z \in R^{N \times C},\ A \in R^{1} \end{aligned}$$
(2)
$$\begin{aligned} I:&\ X \longrightarrow Y, \quad X \in R^{T \times N \times C},\ Y \in R^{T \times N \times C} \end{aligned}$$
(3)

2.1 Possible fitting architectures

Although the input to the fitting task is one-dimensional (disregarding the feature axis), increasing the dimension to two by a redundant linear transformation might be beneficial. This enables the application of models that have proven themselves on popular image classification benchmarks. In this study, the continuous wavelet transform (CWT) with the DOG2 wavelet (second derivative of a Gaussian) is used to scale up the spectra to \(R^{N \times J \times C}\) with \(J=128\). The continuous wavelet transform has been used very successfully in the environmental sciences [33, 34], and similar behavior can be expected for other linear transformations. An example plot of a CWT-transformed spectrum can be found in Fig. 1.
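A minimal sketch of this transformation using PyWavelets, where the Mexican hat wavelet 'mexh' equals the DOG2 wavelet up to normalization; the geometric scale grid is an assumption, as the exact scales used in the study are not specified:

```python
import numpy as np
import pywt

N, J = 512, 128
spectrum = np.random.default_rng(2).normal(size=N)   # placeholder spectrum

scales = np.geomspace(1.0, 64.0, J)                  # assumed scale grid
coeffs, _ = pywt.cwt(spectrum, scales, 'mexh')       # shape (J, N)

cwt_image = coeffs.T                                 # (N, J): spectral x scaling
print(cwt_image.shape)                               # -> (512, 128)
```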

Fig. 1 Example plot of a CWT-transformed spectrum, taken from the validation set. Values smaller than zero are colored red, values larger than zero are colored blue. The plot suggests how to interpret the data as image-like. The anisotropy of the image is visible: high-frequency noise at the lower end of the scaling axis; signal at the higher end of the scaling axis, left-right-centered along the spectral axis; and obstructing fringes in the same scaling range but isotropic along the spectral axis. The SNR of the underlying spectrum is 3.2

Now, the data can be interpreted in different ways: as an \(N \times J\) image of the whole wavelet spectrum, or as an N-dimensional sequence of J-dimensional vectors.

If the data is considered image-like, a CNN architecture can be used. A recent family of efficient and effective CNNs is EfficientNetV2 [35], from which the smallest variant, EfficientNetV2B0 (EffV2), will be used for performance reasons. A CNN architecture iteratively extracts features \(y_{k}\) of the image and scales down using convolutional (Conv) blocks that contain pooling operations and strided convolutions.

$$\begin{aligned} y_{k+1} = \textrm{Conv}\left( y_k\right) \end{aligned}$$
(4)

In classification tasks, the final layer is globally reduced and fed through a shallow fully connected neural network (FCNN). The result is then passed through a softmax to retrieve the probability of each class. In this case, though, the output needs to represent the concentration of the trace gas; therefore, the final output is not passed through an activation layer. This way the output can be any real number.
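A hedged Keras sketch of such a regression head on the EffV2 backbone; the head width and optimizer are assumptions, and the input shape follows the \(N \times J \times C\) dimensions from the text:

```python
import tensorflow as tf

N, J, C = 512, 128, 2

# EfficientNetV2B0 without the classification top; weights=None permits
# the non-RGB channel count C = 2.
backbone = tf.keras.applications.EfficientNetV2B0(
    include_top=False, weights=None, input_shape=(N, J, C), pooling="avg")

fitter = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(64, activation="relu"),   # shallow FCNN (width assumed)
    tf.keras.layers.Dense(1, activation=None),      # no final activation, so the
])                                                  # output can be any real number

fitter.compile(optimizer="adam", loss="mse")
```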

However, as mentioned earlier, a CNN is translation invariant with respect to its input: in image processing, the algorithm should not depend on the global position of the object. In the data described there is an anisotropy: the signal is left-right-centered and occupies a distinct part of the wavelet scaling, while noise is isotropic and can appear differently throughout the scaling dimension. The vision transformer (ViT) [29] decomposes an image of shape \(R^{N\times J \times C}\) into patches of shape \(R^{P\times P}\) and flattens the pixels to obtain data of shape \(R^{N/P \cdot J/P \times P^2\cdot C}\). These patches are embedded via a linear transformation. The patch embedding is then added to a learned positional encoding that represents the position of the patches with regard to the complete image. Afterwards, the architecture is very similar to BERT [5] and consists of iterated transformer (Trans) blocks containing attention layers:

$$\begin{aligned} y_{k+1} = \textrm{Trans}\left( y_k\right) \end{aligned}$$
(5)

With the attention mechanism, the model can learn global dependencies and structure of the data due to a higher receptive field [28]. The final layer of the transformer can again be globally reduced and fed into the FCNN.
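The patch decomposition and embedding described above can be sketched in numpy; the patch size P and embedding dimension D are assumptions, and the embedding and positional encoding are randomly initialized stand-ins for learned parameters:

```python
import numpy as np

N, J, C, P, D = 512, 128, 2, 16, 192
image = np.random.default_rng(3).normal(size=(N, J, C))

# Split the image into P x P patches and flatten each patch.
patches = image.reshape(N // P, P, J // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)        # (N/P * J/P, P^2 * C)

# Linear patch embedding plus a learned positional encoding.
rng = np.random.default_rng(4)
W_embed = rng.normal(0.0, 0.02, size=(P * P * C, D))
pos_encoding = rng.normal(0.0, 0.02, size=(patches.shape[0], D))

tokens = patches @ W_embed + pos_encoding       # input to the transformer blocks
print(tokens.shape)                             # -> (256, 192)
```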

Additionally, a hybrid architecture will be considered, composed of an EffV2 backbone consisting of the first five blocks of the model, whose output is linearly transformed and fed directly to a ViT as patch embeddings. Dosovitskiy et al. reported similar performance for both approaches and found that the first transformer layers function similarly to the CNN backbone. The backbone extracts local features of the image that can then be processed globally by the transformer [29].

If the data is considered sequence-like, the transformer architecture can be applied directly. In this case, the data is simply reshaped to \(R^{N \times (J \cdot C)}\) and fed into a BERT-style architecture [5]. Table 1 summarizes these approaches and gives a small overview of the chosen model sizes.

Table 1 Summary of utilized model architectures and number of parameters

2.2 Possible interpolation architectures

The interpolator also receives 2D inputs: temporal sequences of spectra. The spectra at the beginning and the end, and also at regular distances throughout the sequence, are considered pure background and are point-wise interpolated to fill the intermediate temporal region. The spectral edges of this intermediate region are also kept, as no absorption signal is to be expected there. An example can be found in Fig. 2. Again, the data can be interpreted as image-like or sequence-like, but this time with a different objective.

Fig. 2 Example plot of a temporal background spectrum sequence (top) and its estimation via linear interpolation along the time axis (bottom). The black boxes mark interpolated regions. The temporal interpolation anchors and spectral edges are left unchanged. Values smaller than zero are colored red, values larger than zero are colored blue. The plot suggests interpreting the task similarly to image reconstruction: the objective is to reconstruct the original sequence (top) from the linear interpolation (bottom)

If the data is considered image-like, a U-Net variant can be used; this architecture has been utilized for segmentation tasks [3] and image noise reduction [36]. The down-scaling part of the U-Net is constructed from a subset of the EffV2, and the up-scaling part is the inverse of the initial EffV2. In the inverse part, every downscaling operation is replaced by an upscaling operation to obtain outputs with the original dimensions.
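One plausible reading of such an inverse block, sketched in Keras; the exact layer composition of the inverted EffV2 blocks is not specified in the text, so the transposed convolution, filter count and activation are assumptions:

```python
import tensorflow as tf

def upscale_block(x, filters):
    # Inverse of a strided-convolution downscaling step: a transposed
    # convolution doubles the spatial resolution again.
    x = tf.keras.layers.Conv2DTranspose(filters, 3, strides=2, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.Activation("swish")(x)

# Usage within a functional model: x = upscale_block(x, 64)
```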

Another approach is again a ViT-style patch decomposition followed by a transformer architecture. The final patches can be linearly transformed and concatenated back to the original shape of the image. A hybrid architecture is not considered here, since the U-Net architecture already requires a lot of memory.

If the data is again considered sequence-like, a linear transformation can be applied to each \(N \times C\) vector. In contrast to the CWT, a learnable linear transformation is used here, since the dimension does not need to be increased in this case. This linear embedding can then be input to a BERT-style model. The output is again fed to a linear transformation and is reshaped to match the desired output. In the original BERT paper, the authors introduced a masked language model (MLM) in which words were randomly replaced by a missing token and the model was pre-trained in an unsupervised fashion to learn natural language structure [5]. In the task given here, most of the data is missing, but the target output can be slowly transformed into the desired input by increasing the number of interpolated spectra in the intermediate region, which speeds up the initial learning period (see the sketch below). This procedure is also applied to the image-like representations.
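The ease-in schedule can be sketched as follows, based on the description in Sect. 4.1 (a random portion of interpolated spectra is replaced by the true spectra, with the fraction decreasing linearly to zero over the first epochs); the shapes and the per-time-step masking are assumptions:

```python
import numpy as np

def curriculum_input(lin_interp, target, epoch, warmup_epochs=10, rng=None):
    """lin_interp, target: sequences of shape (T, N, C). During the first
    epochs, a random portion of the interpolated spectra is replaced by the
    corresponding target spectra; the portion decreases linearly to zero."""
    rng = rng or np.random.default_rng()
    reveal_fraction = max(0.0, 1.0 - epoch / warmup_epochs)
    mask = rng.random(lin_interp.shape[0]) < reveal_fraction
    x = lin_interp.copy()
    x[mask] = target[mask]      # reveal ground-truth spectra at masked steps
    return x
```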

2.3 Remarks

In summary, several ways to interpret the data have been considered, and appropriate established neural network architectures have been chosen for each interpretation. It is important to emphasize that different performances of these models do not indicate advantages of one model architecture or data interpretation over another. No proof or evidence that a specific model type is to be favored can be given, as these are only single random examples. The performance depends strongly on the choice of hyper-parameters, such as the optimizer setup, the learning rate value and schedule, or the number of parameters. Taking several interpretations of the data into account may lead to good performance without sensitive hyper-parameter variations and the biases that can emerge as a consequence.

All procedures motivated in this section and in the introduction require several characteristics of the underlying background structure for the neural network approach to work properly. All of these requirements are fulfilled for the specific spectrometer setup used in this study with regard to interference fringe patterns as the main noise source. The procedures might show similar performance on other noise sources if these requirements are also met there:

  • Sparse noise distribution. All possible noise shapes must follow a sparse probability distribution compared to independent white noise. Otherwise no prior information about noise structure can be extracted from background measurements.

  • Local stability. The noise structure needs to be locally stable so reconstruction is possible from the absorption-free parts of the data.

  • Global stability. The noise structure needs to be stable over time; otherwise, the prior information needed for reconstruction changes and the network needs to be retrained. If the structure changes faster than retraining can be performed, desirable performance cannot be achieved.

3 Experimentation and network training

The experiment conducted for this study is based on the instrument TRISTAR reported in [30, 37]. It is driven by a room-temperature quantum cascade laser [38] from Alpes Lasers operated near the formaldehyde (HCHO) transition at 1759.72 cm\(^{-1}\) [39]. The laser beam is guided into a 50 cm long White cell [40], where it is reflected 128 times to yield an effective path length of 64 m. The beam exits the cell and is split by a beam splitter into two separate paths. One beam is guided through a reference cell filled with the substance of interest at high concentration. Both beams are focused on identical infrared detectors from VIGO Systems. The laser is modulated by a slow triangle wave that scans through the absorption spectrum and a fast sine wave that is demodulated at twice the modulation frequency by the data acquisition FPGA. The resulting spectrum is similar to the second derivative of the absorption profile, depending on the modulation depth. In the experiment, each data point consists of \(C=2\) spectra, one from the increasing part of the triangle wave and one from the decreasing part, with \(N=512\) points each.
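For readers unfamiliar with 2f detection, a conceptual numpy sketch of lock-in demodulation at twice the modulation frequency; the sampling rate, modulation frequency and low-pass filter are arbitrary assumptions, and the actual instrument performs this step on the FPGA:

```python
import numpy as np

fs = 1.0e6                        # sample rate (assumed)
f_mod = 10.0e3                    # fast sine modulation frequency (assumed)
t = np.arange(100_000) / fs
detector = np.random.default_rng(5).normal(size=t.size)  # placeholder signal

# Mix with a reference at 2 * f_mod, then low-pass filter (here a simple
# moving average) to retrieve the 2f component.
reference = np.cos(2.0 * np.pi * 2.0 * f_mod * t)
mixed = detector * reference
kernel = np.ones(1000) / 1000.0
demodulated = np.convolve(mixed, kernel, mode="same")
```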

Experimental data from the instrument were gathered over 14 days. The instrument inlet is connected to an air purifier to remove the substance of interest, in this case HCHO. The reference cell is filled with a high concentration of HCHO. The detector channel without the reference cell detects absorption-free spectra that consist only of the interference fringe background structure. Spectra acquired at this detector will be called BGD. The detector channel with the inserted reference cell can be assumed fringe-free due to the high-concentration signal. Spectra acquired at this detector will be called REF. Only the last 82% of these data are used for training.

Training data for the interpolator are created using sequences of BGD. A sequence X is input to the network: it contains 4 s of spectra every 63 s as anchors, and the intermediate region is filled by point-wise linear interpolation. The interpolation anchors and 64 points at both edges of the spectrum are left unchanged. This array of shape \(R^{T \times N \times C}\) is then normalized for each subarray in C, while mean and standard deviation are stored. The prepared matrix is fed through the interpolator, and the output is rescaled by the stored standard deviation and translated according to the stored mean. The loss is calculated as the mean-squared-error (MSE) loss between this final matrix Y and the input sequence X.
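A numpy sketch of this preparation, assuming one spectrum per second so that "4 s every 63 s" maps to array indices; the helper name and the per-channel normalization over the whole sequence are assumptions:

```python
import numpy as np

def prepare_interpolator_input(x, edge=64, step=63, width=4):
    """x: background sequence of shape (T, N, C)."""
    T, N, C = x.shape

    # Mark anchors: `width` spectra every `step` spectra, plus the sequence end.
    anchors = np.zeros(T, dtype=bool)
    for start in range(0, T, step):
        anchors[start:start + width] = True
    anchors[-width:] = True

    # Point-wise linear interpolation along the time axis; the spectral
    # edges are left unchanged (np.interp passes through the anchors).
    inp = x.copy()
    idx = np.flatnonzero(anchors)
    t = np.arange(T)
    for n in range(edge, N - edge):
        for c in range(C):
            inp[:, n, c] = np.interp(t, idx, x[idx, n, c])

    # Normalize each channel; mean and std are stored so the interpolator
    # output can be rescaled before computing the MSE loss against x.
    mean = inp.mean(axis=(0, 1), keepdims=True)
    std = inp.std(axis=(0, 1), keepdims=True)
    return (inp - mean) / std, mean, std
```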

Training data for the fitter are created using random pairs of BGD and REF. Two log-uniformly distributed values A and S are drawn for the concentration and the signal-to-noise ratio (SNR). The input to the network, Z, is a linear combination of BGD and REF:

$$\begin{aligned} Z = A\ \textrm{REF} + \frac{A}{S}\ \textrm{BGD} \end{aligned}$$
(6)

First, this input is normalized and the standard deviation is stored. Then the CWT is applied, transforming the data to \(R^{N \times J \times C}\). The CWT result is fed through the fitter to obtain a single value, which is rescaled by the stored standard deviation. The loss is calculated as the MSE loss between this rescaled output value and the value A. For additional investigation, two instances of fitters are trained on different SNR ranges.
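A sketch of generating one fitter training sample, combining Eq. (6) with the normalization and CWT steps above; the log-uniform ranges, scale grid and helper name are assumptions:

```python
import numpy as np
import pywt

def make_fitter_sample(ref, bgd, rng, amp_range=(0.1, 10.0), snr_range=(0.1, 10.0)):
    """ref, bgd: spectra of shape (N, C). Returns the CWT input, the target
    amplitude A and the stored standard deviation."""
    A = np.exp(rng.uniform(*np.log(amp_range)))     # log-uniform amplitude
    S = np.exp(rng.uniform(*np.log(snr_range)))     # log-uniform SNR
    Z = A * ref + (A / S) * bgd                     # Eq. (6)

    std = Z.std()                                   # normalize, store std
    Z_norm = Z / std

    scales = np.geomspace(1.0, 64.0, 128)           # assumed scale grid
    cwt = np.stack([pywt.cwt(Z_norm[:, c], scales, 'mexh')[0].T
                    for c in range(Z.shape[1])], axis=-1)   # (N, J, C)
    # Training minimizes the MSE between std * fitter(cwt) and A.
    return cwt, A, std
```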

For validation, sequences of BGD, and pairs of BGD and REF along with pre-determined random values, are taken from the first 9–18% of the measurement data. The best iteration of each interpolator or fitter, respectively, is then applied to the test set. The test set consists of a sequence of BGD and injections of calibration gas into the measurement cell from the first 9% of the measurement time. The variation and point-wise accuracy of background and calibration signal can then be determined and compared to a conventional approach.

Each network is trained on an HPC cluster hosting several NVIDIA V100 GPUs. The training process is parallelized through the distributed learning scheme Horovod [41].

4 Results and discussion

In this section the performance of the described model architectures will be discussed.

4.1 Training performance

Fig. 3 Training metric (upward triangles) and validation metric (downward triangles) of interpolator models based on the BERT (red), ViT (yellow) and U-Net (blue) architectures. Top: loss (mean squared error). Bottom: MAE (mean absolute error)

As already mentioned in Sect. 2, the interpolator was trained on an easier task during the first 10 epochs for a more efficient training process: a random portion of the spectra in the input sequence was exchanged with the corresponding spectra in the target sequence. This procedure reinterprets the task as the MLM from BERT or as a denoising autoencoder and enables faster learning of the underlying distribution. The fraction of exchanged spectra was decreased linearly towards zero at epoch 10. All model architectures show decreasing training loss and error metrics, while the validation metrics show no major indications of overfitting. Figure 3 shows the losses and mean absolute errors of the interpolator models. Additional plots can be found in the appendix. To further investigate robustness, several ablation studies [29] were performed in which the number of parameters of each model was reduced by scaling down the feature axes. The scaled-down models converged faster while reaching similar final metrics. Example plots can also be found in the appendix.

4.2 Validation set performance

Fig. 4 Evaluation of fitter models trained on the low SNR range and comparison to the linear fit. Red boxes indicate the 25% quantile, median and 75% quantile; red whiskers indicate the 10% and 90% quantiles. Blue diamonds show the mean, blue whiskers (if provided) show one standard deviation. Top: absolute squared error between prediction and true value; maxima are of the order of \(10^4\). Bottom: predicted amplitude of pure background spectra

Although different performances during training might suggest the most suitable model choice for each application, these metrics only provide guidance during the training process. Model performance should always be derived from experiments that more closely resemble real applications. Thus, the models are evaluated in more detail using the validation sets.

The fitter models are evaluated by calculating the absolute squared error between model prediction and true signal amplitude for each spectrum in the validation set. To investigate the small-signal behavior, the pure background spectra are also input to the models to obtain the predicted zero amplitudes.

Fig. 5 Fitter squared error as a function of input SNR. Linear fit in red, prediction of the low-SNR-trained BERT variant in blue. Boxes indicate the 25% quantile, median and 75% quantile; whiskers indicate the 10% and 90% quantiles. Diamonds and lines show the mean. For better visualization, the BERT plot is slightly offset along the x-axis

Figure 4 shows an overview of the evaluation results for the fitters trained on the low SNR range. The BERT-style variant shows the smallest combination of bias and variance for the zero amplitude while performing similarly to the EffV2 and hybrid variants overall. The mean squared error over the validation set is reduced by 99.5% and the mean squared error on pure background spectra by 97.4%. The absolute squared error can be investigated further as a function of input SNR, as depicted in Fig. 5. The BERT variant outperforms linear fitting in the low SNR regime by 1–2 orders of magnitude but falls short in the high SNR range.

The fitters trained on the high SNR range fail to outperform linear fitting and perform poorly for SNR values below their training range. A large bias is introduced when fitting pure background spectra, due to the lack of low-signal examples in their training distribution. Detailed plots similar to those shown for the low SNR example can be found in the appendix.

Fig. 6 Point-wise squared error between prediction and true spectra of the interpolator models and comparison to linear interpolation. Red boxes indicate the 25% quantile, median and 75% quantile; red whiskers indicate the 10% and 90% quantiles. Blue diamonds show the mean, blue whiskers show one standard deviation

Fig. 7 Point-wise squared error as a function of distance to the nearest interpolation anchor. Linear interpolation in red, U-Net variant prediction in blue. Boxes indicate the 25% quantile, median and 75% quantile; whiskers indicate the 10% and 90% quantiles. Diamonds and lines show the mean. For better visualization, the U-Net plot is slightly offset along the x-axis

The interpolator models are evaluated by calculating point-wise squared errors between predicted and measured background spectra. The evaluation result is shown in Fig. 6. Here, the U-Net variant clearly shows the best performance: the mean squared error is reduced by 93.2%. This does not necessarily indicate an advantage of this exact model architecture over a transformer type, since further hyper-parameter optimization could increase training efficiency and model performance. The behavior of the U-Net variant is analyzed in more detail by calculating the point-wise squared errors as a function of the distance of the spectrum from the nearest interpolation anchor. This relation is shown in Fig. 7. For a stable, slowly changing background, linear interpolation would give the best estimates near the anchors, which is the case in this dataset. The model prediction shows no dependence on the distance to the nearest anchor and reduces the error evenly over the sequence.

4.3 Denoising behavior

Fig. 8 Denoising behavior of the fitter model. Estimated amplitude against true value for the linear fit (red) and the low-SNR fitter model output using the BERT-style variant (blue) for a low SNR example batch with 256 spectra. The black solid line indicates a perfect fit, \(y=x\). The MSE values for this example batch are 121 for the linear fit and 59 for the fitter model. Despite the improvement in MSE, the fitter model produces a large bias, which makes the estimation less meaningful

Despite the substantial mean squared error reduction achieved by the fitter model, its denoising behavior shows undesirable properties that introduce disadvantages compared to classical linear fitting. In the low SNR limit, the variance of the linear method is very large compared to the signal amplitude. Figure 8 shows that the fitter model introduces a bias while substantially reducing the variance, which leads to an effective reduction of the MSE. However, the bias is comparable to the signal amplitude, which renders the method almost useless in applications.

Fig. 9 Training metric (red upward triangles) and validation metric (blue downward triangles) of the classifier model based on the EffV2 architecture. Top: loss (binary cross-entropy). Bottom: accuracy

A possible explanation for this behavior is a poorly chosen loss function: while the loss is reduced, the actual objective is not reached. Alternative explanations are a non-optimal model and optimization scheme, or an impossible task. To investigate the loss function, a variation of the experiment is conducted: instead of estimating the absolute concentration, the final layer is activated with the sigmoid function and the model is trained to distinguish pure background spectra from absorption spectra with an SNR of 0.01. Figure 9 shows the training performance of this model. The accuracy on the validation set shows no significant increase compared to random guessing. Thus, either the combination of model architecture and optimization scheme is not optimal, or the distinction between background and low SNR signal is not possible. The latter is the case if the data do not match the requirements given in Sect. 2.3; more specifically, the noise structure might not be locally stable or not sparse enough for the fitter task.

Fig. 10 Denoising behavior of the interpolator model. Difference between the example temporal background spectrum sequence shown in Fig. 2 and the linear interpolation along the time axis (left), and the interpolator model output using the U-Net variant (right)

The interpolator shows desirable denoising behavior. Figure 10 depicts the deviation from the true value for both approaches, using the example from Sect. 2. While coarse structures are visible in the difference between linear interpolation and original spectra, the residual structure of the model interpolation is much less correlated. This indicates effective learning of the underlying distribution and successful reconstruction.

For the rest of this section, only the interpolator will be considered.

4.4 Test set performance

Fig. 11 Experimental performance metrics for the different interpolation techniques. Reprod: reproducibility of calibrations; DetLim (BGD): detection limit (2\(\sigma\)) evaluated on spectra marked as background; DetLim (AMB): detection limit (2\(\sigma\)) evaluated on spectra marked as ambient

The test set is evaluated using experimental performance metrics instead of the model loss. This allows further quantification of the performance and generalization ability of each method. In absorption spectroscopy experiments, key parameters that limit instrument performance are the reproducibility of calibrations under temporal drifts, the relative precision, and the detection limit. The reproducibility of calibrations is obtained by first averaging individual calibration intervals and then calculating the standard deviation of these averages; it is a measure of long-term drifts and stability. The precision is the relative standard deviation of all calibration amplitudes after correction of the long-term drifts. The detection limit is two times the standard deviation of the background amplitudes. Since only absorption-free gas was measured during the experiment, the detection limit can be evaluated using spectra marked as background as well as spectra marked as ambient.
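These three metrics can be computed directly from fitted amplitudes; a sketch, where the drift correction (dividing each calibration interval by its own mean) is one plausible reading of the text and the function name is hypothetical:

```python
import numpy as np

def performance_metrics(cal_amplitudes, cal_interval_ids, bgd_amplitudes):
    """cal_amplitudes: fitted calibration amplitudes; cal_interval_ids:
    integer label of the calibration interval each amplitude belongs to;
    bgd_amplitudes: fitted amplitudes of background spectra."""
    ids = np.unique(cal_interval_ids)
    interval_means = np.array([cal_amplitudes[cal_interval_ids == i].mean()
                               for i in ids])

    # Reproducibility: standard deviation of the interval averages.
    reproducibility = interval_means.std()

    # Precision: relative standard deviation after drift correction.
    corrected = np.concatenate([cal_amplitudes[cal_interval_ids == i] / m
                                for i, m in zip(ids, interval_means)])
    precision = corrected.std()

    # Detection limit: two standard deviations of the background amplitudes.
    detection_limit = 2.0 * bgd_amplitudes.std()
    return reproducibility, precision, detection_limit
```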

Figure 11 gives an overview of the resulting experimental performance metrics. Similar to the validation set results, the model-based approach achieves a better detection limit than the linear approach. Applying the interpolator model together with a linear fit achieves a very robust reduction of the detection limit, as both the absolute and the relative errors are reduced.

Example time series of background and calibration amplitudes are shown in Fig. 12. Again, very robust behavior can be observed for the combination of interpolator model and linear fit. This provides evidence that the proposed interpolator architecture delivers robust noise reduction.

Fig. 12 Examples of amplitude results for the different interpolation techniques. Top: example result during zero gas measurement; grey regions indicate spectra declared as background. Bottom: example result during calibration gas injection

4.5 Transferability

An important question left unanswered is the transferability of the networks. What kind of output is created if a different spectrometer is used? Is a similar performance increase possible? Will the result contain chaotic artifacts due to out-of-distribution problems?

To answer these questions, the trained model is applied to a different QCL absorption spectrometer, which utilizes a 20 cm long Herriott cell configuration with 182 passes [42, 43] and is operated at the carbon monoxide (CO) transition at 2190.02 cm\(^{-1}\) [44]. Data acquisition and detectors are similar, but the experiment is driven at different modulation and sweeping frequencies.

Fig. 13 Performance comparison of the linear approach and the interpolator model transferred to the alternative spectrometer. Point-wise squared error between background spectra and interpolation reconstruction for linear interpolation, the pre-trained variant and the fine-tuned variant. Red boxes indicate the 25% quantile, median and 75% quantile; red whiskers indicate the 10% and 90% quantiles. Blue diamonds show the mean, blue whiskers show one standard deviation

Evaluation data for the interpolator were gathered by flooding the absorption cell with nitrogen; zero gas was measured for 70 h. Directly applying the pre-trained version resulted in performance similar to linear interpolation. This indicates that the transferred model may require additional fine-tuning to achieve good performance, while it does not produce chaotic artifacts. For fine-tuning, the first 7 h of the nitrogen measurement were taken as new training and validation data, and the pre-trained interpolator model was trained for 10 further epochs, corresponding to 0.8% of the pre-training iterations. A comparison of the evaluation is shown in Fig. 13.

5 Summary and conclusion

In this study, several possible applications of neural networks to absorption spectroscopy experiments were tested by interpreting the data structure in ways that allow several state-of-the-art neural network architectures to achieve good performance. These architectures were chosen from image classification and natural language processing tasks. A model for the interpolation of background spectra and a model for gas concentration fitting of absorption spectra were created. Each neural network was trained on measured data and showed good generalization performance. The best performing instance of each task was further evaluated using test data and data from a different type of absorption spectrometer.

Fitters trained on the high SNR range did not outperform linear fitting. The best performing fitter trained on the low SNR range was of the BERT type. It decreased the mean squared error on the validation set by 99.5% and the mean squared error on pure background spectra by 97.4%. However, undesirable denoising behavior was observed that rendered the method unusable. Training a classifier with the same architecture showed that this behavior was not caused by a poor choice of loss function but is caused either by an insufficient architecture and optimization scheme or by an impossible objective. Considering the human-level performance of the chosen architectures in image recognition and natural language processing tasks, the objective might not be achievable due to the strong interference of signal and background. However, given the large search space and choice of hyper-parameters, this remains speculation.

The best performing interpolator was of the U-Net type and reduced the mean squared error on the validation set by 93.2%. It showed less dependence on the distance from the nearest interpolation anchor than linear interpolation. The combination of model interpolation and linear fitting showed very robust behavior and decreased the relative error by 8.2% and the detection limit by 52.4% on the test set.

It was shown that the interpolator model can be transferred to a different spectrometer without chaotic out-of-distribution effects. However, the performance of the pre-trained model on the different setup does not match the performance on the original spectrometer setup and may fall below that of conventional approaches. The performance can be enhanced by fine-tuning on new data. Using just 0.8% fine-tuning iterations relative to the initial training iterations, the interpolator mean squared error was reduced by 36.3% compared to the conventional approach.

In conclusion, using state-of-the-art architectures is no guarantee of obtaining a well-performing neural network if the task is not appropriate. However, interpreting the task in multiple ways, so that many state-of-the-art architectures can be included, can make the application less sensitive to the specific properties of a chosen network and speed up the architecture design significantly.

In this study, only 2f-wavelength modulation spectroscopy was considered, but the concept should also work for other absorption spectroscopy data acquisition schemes due to the similarities of the spectral features and the dominant noise sources. More fine-tuning may be required in that case.