Introduction

Realizing stable plasma operation is a key issue for magnetically confined fusion. In experiments, tokamaks are often subject to various magnetohydrodynamic (MHD) activities, which may destabilize or even terminate the plasma operation. It is therefore critical to develop an efficient technique for real-time monitoring and control of MHD instabilities, which requires an algorithm capable of resolving the dynamics of fast-evolving (~ms time scale) mode structures. MHD instabilities inside fusion plasmas are rather complicated, and the evolution of the two-dimensional (2D) mode structures may be key to understanding their dynamics [1]. Measurement of soft X-ray (SXR) signals can provide important information on MHD instabilities. HL-2A is equipped with an SXR diagnostic system [2], which has a number of detectors measuring line-integrated emission along the viewing chords at a sampling frequency high enough to detect MHD instabilities such as sawtooth precursors, tearing modes, fishbones, etc. To resolve the MHD mode structure, a tomography method needs to be developed for 2D SXR reconstruction by inverting the line-integrated data.

To date, a variety of tomography techniques have been developed, including the 2D peeling-away algorithm [3], the Fourier–Bessel analytical method [4], the Bayesian-based non-stationary Gaussian processes tomography (NSGPT) method [5, 6], etc. However, most traditional tomography methods are incapable of real-time application due to the heavy time cost incurred by iterative algorithms. For instance, a discharge lasting a few seconds generally takes hours to days of computation for high-fidelity reconstructions. In this context, a common approach is to replace the high-fidelity reconstruction codes with deep learning surrogate models in real-time scenarios. This makes it possible to accomplish the same task with significantly reduced computation time while preserving a reasonable level of accuracy.

Over the last decade, deep learning has been facilitating a wide range of data processing tasks in the fusion community, either as a shortcut for computationally expensive tasks or as a powerful tool for solving a variety of problems. For example, JET applies the inverse of a convolutional network to bolometer (ranging from ultraviolet to SXR) tomography [7], and Wendelstein 7-X uses neural network regression to reconstruct magnetic configuration properties from heat load patterns on the plasma-facing components [8, 9]. For EAST, a neural network-based soft X-ray proxy model has also been constructed [10]. However, current plasma tomography surrogate models predominantly reconstruct individual discharge time points, emphasizing the description and analysis of plasma states at isolated moments rather than treating reconstruction as a sequential problem. Consequently, existing models do not fully exploit the dynamic attributes inherent in time series data and therefore fail to produce holistic, continuous reconstructions. This paper showcases two distinct modeling strategies: the first is the classical single-time-point modeling technique, and the second is a sequential modeling method based on time series data. Both models can leverage hardware accelerators such as Graphics Processing Units (GPUs) to achieve enhanced precision and speed. Through a comparative analysis of these two models and a discussion of their practical application scenarios, we can attain a deeper appreciation of the advantages and value of sequential modeling. To construct the dataset required for the neural networks, we use the results generated by the NSGPT method as target values. NSGPT has proven successful in reconstructing SXR radiation profiles across multiple devices and implements local adaptive smoothness regularization, thereby significantly enhancing the accuracy of the reconstructions. The rest of the article is arranged as follows: Sect. 2 presents essential background on the SXR diagnostics and the NSGPT approach. Section 3 details the training of our networks on NSGPT-generated data, illustrating how such models enable accurate predictions from experimental SXR data. Section 4 evaluates the performance of these networks, highlighting their time efficiency, reliability, and ability to generalize. Finally, Sect. 5 offers a forward-looking discussion on potential developments in this field.

Background

The use of tomography diagnostics on HL-2A dates back to 2016, when the SXR system was installed [2]. In this section, the composition and distribution of the SXR diagnostic system on HL-2A, as well as the theory, advantages, and limitations of the NSGPT algorithm, will be introduced.

Soft X-ray Diagnostics on HL-2A

The SXR diagnostic on HL-2A consists of 5 pinhole cameras with 100 chords in total [2]. In this study, a combined total of 40 viewing chords from cameras No. 3 and No. 4, each equipped with 20 Si-PIN photodiode detectors, were employed to test the reconstruction method [2]. The experimental setup of the SXR diagnostics is shown in Fig. 1 [6]. Beryllium foils 25 μm thick are installed in front of the detectors to filter out the unwanted energy range, giving a response energy range from 1 keV to 10 keV. The temporal and spatial resolutions of this system are 10 µs and 3 cm, respectively. Regarding the measurement of the SXR diagnostics, the datum \(d_i\) from a single detector indexed by i is obtained by a line integral of a 2D emissivity distribution f(x, y), which can be described by the following physical model:

$$d_i = \mathrm{R}(f) + \epsilon$$
(1)

where R stands for the forward process from the 2D emission to the line-integrated data, calculated by taking into account the starting positions, end positions, and beam widths of the lines of sight (LOS); ε denotes an error term accounting for the systematic and statistical errors encountered in the actual experiments. The reconstruction region considered for the tomographic inversion is discretized into many pixels covering the whole plasma cross-section [6].
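In the discretized setting, the forward operator R reduces to a geometry matrix whose entries weight the contribution of each pixel to each chord. The following minimal sketch illustrates this relationship; the array names, grid size, and placeholder values are purely illustrative and do not reproduce the actual HL-2A geometry calculation.

```python
import numpy as np

# Minimal sketch of the discretized forward model d = R f + eps,
# assuming a precomputed geometry matrix; names and values are illustrative.
N_CHORDS = 40          # viewing chords from cameras No. 3 and No. 4
N_PIXELS = 36 * 32     # reconstruction grid covering the plasma cross-section

# R[i, j] ~ weighted path length of chord i through pixel j (placeholder values)
R = np.random.rand(N_CHORDS, N_PIXELS)
f = np.random.rand(N_PIXELS)                 # hypothetical 2D emissivity, flattened
eps = np.random.normal(0.0, 0.01, N_CHORDS)  # systematic + statistical error term

d = R @ f + eps   # line-integrated measurements, one value per detector
```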

Fig. 1
figure 1

The experimental setup for SXR diagnostics on the HL-2A tokamak involves the use of multiple cameras, each equipped with 20 viewing chords to achieve full coverage of the poloidal plasma cross-section

Bayesian-Based Non-Stationary Gaussian Processes Tomography (NSGPT) Method

Tomographic reconstruction is in essence an ill-posed problem [11, 12]. The Bayesian framework is flexible and general, enabling it to successfully address such inverse problems. Bayesian methods excel at handling uncertainties and provide a coherent approach to solving inverse problems by integrating experimental data with a priori knowledge. In the Bayesian formula, all major variables are expressed in probabilistic form:

$$p(f \mid d_i, \theta) = \frac{p(d_i \mid f, \theta)\, p(f \mid \theta)}{p(d_i \mid \theta)}$$
(2)

where \(\theta\) represents the hyper-parameters involved in building the probabilistic model; \(p(f \mid \theta)\) is the prior probability over the physical quantity to be inferred, mainly used to impose regularization on that quantity; the likelihood \(p(d_i \mid f, \theta)\) is the conditional probability of the measured data, mainly used to achieve a reasonable fit to the measurements; \(p(f \mid d_i, \theta)\) is the posterior probability, i.e., the probability of a hypothesis obtained by combining prior knowledge with the measured data.

Selecting an appropriate probability model for the prior is a key component of Bayesian inference. Owing to its tractability and its ability to accurately capture the distribution of real-world data, the Gaussian process is an excellent choice as a prior distribution in Bayesian inference. The radiation levels at the discrete points can be modeled as a multivariate normal distribution:

$$p\left(\bar{f} \mid \bar{\mu}_f, \Sigma_f\right) = \frac{1}{(2\pi)^{N/2}\,\left|\Sigma_f\right|^{1/2}}\exp\!\left(-\frac{1}{2}\left(\bar{f}-\bar{\mu}_f\right)^{T}\Sigma_f^{-1}\left(\bar{f}-\bar{\mu}_f\right)\right)$$
(3)

where \(\Sigma_f\) is the covariance matrix, which determines the correlations between the individual variables in the vector \(\bar{f}\). The length scale l is an important hyper-parameter of the covariance function: it determines the smoothness of the random process. For a stationary Gaussian process, l is a constant, implying that the reconstructed radiation distribution has uniform smoothness at all positions. In reality, however, the SXR radiation may have different levels of smoothness in the plasma edge and core regions. To address this varying smoothness of the emissivity distribution, the following non-stationary extension of the squared exponential covariance function has been developed [6] and has produced promising results. The non-stationary covariance can be written as:

$$\begin{aligned}
k^{NS}\left(\bar{r},\bar{r}'\right) &= \sigma^{2}\,\left|\Sigma(\bar{r})\right|^{1/4}\left|\Sigma(\bar{r}')\right|^{1/4}\left|\frac{\Sigma(\bar{r})+\Sigma(\bar{r}')}{2}\right|^{-1/2}\exp\!\left(-\bar{d}^{\,T}\left(\frac{\Sigma(\bar{r})+\Sigma(\bar{r}')}{2}\right)^{-1}\bar{d}\right),\\
\Sigma(\bar{r}) &= \begin{bmatrix} l^{2}(\bar{r}) & 0 \\ 0 & l^{2}(\bar{r}) \end{bmatrix},\qquad \bar{d} = \bar{r}-\bar{r}'
\end{aligned}$$
(4)

where \(\Sigma(\bar{r})\) is a 2D matrix describing the local length scales and possible local correlations of the function at location \(\bar{r}\). We assessed the NSGPT method on both steady and perturbed discharges. The output of the NSGPT method comprises a series of 2D SXR emissivity profiles. These profiles have been compared with the equilibrium magnetic flux surfaces, showing satisfactory agreement in both shape and position. In addition, singular value decomposition (SVD) has been applied to the analysis of the SXR reconstructions. The NSGPT method is found to preserve the fine structure of perturbations embedded in the reconstructions, which can then be easily extracted by SVD. This is beneficial for obtaining the spatial and temporal characteristics of the MHD activities.
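As a concrete illustration of Eq. (4), the snippet below evaluates the non-stationary squared-exponential covariance for a pair of points, assuming an isotropic local length-scale function; the actual length-scale profile and hyper-parameter optimization used by NSGPT are not reproduced here.

```python
import numpy as np

def k_ns(r, r_prime, sigma, length_scale):
    """Non-stationary squared-exponential covariance of Eq. (4).

    r, r_prime   : 2D positions (x, y)
    length_scale : callable returning the local length scale l(r);
                   the profile actually used by NSGPT is not reproduced here.
    """
    l1, l2 = length_scale(r), length_scale(r_prime)
    S1 = l1**2 * np.eye(2)          # Sigma(r)  with isotropic local scale
    S2 = l2**2 * np.eye(2)          # Sigma(r')
    S_avg = 0.5 * (S1 + S2)
    d = np.asarray(r) - np.asarray(r_prime)
    prefactor = (np.linalg.det(S1)**0.25 * np.linalg.det(S2)**0.25
                 / np.sqrt(np.linalg.det(S_avg)))
    return sigma**2 * prefactor * np.exp(-d @ np.linalg.inv(S_avg) @ d)
```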

However, the NSGPT method requires a significant amount of computation time: it can take several minutes to produce a single-time reconstruction. Given that the SXR system is usually operated at a high sampling rate, from a few kilohertz to several hundred kilohertz, it could take several months to compute all the tomographic reconstructions for a single pulse. Moreover, during the reconstruction process NSGPT relies not only on the diagnostic signals but also on supplementary data such as boundary conditions computed by the EFIT equilibrium code, among other auxiliary inputs. Built on the results obtained with NSGPT, the deep learning neural networks in this work are expected to provide a faster and more independent alternative, capable of accurate tomography using only the original SXR measurements, without additional input information or time-consuming iterative computations.

AI Methods

In this section, the techniques for dataset construction and the architecture of the employed neural network models are presented. Section 3.1 gives a comprehensive description of the dataset used, ensuring its suitability for model training. Section 3.2 introduces a basic convolutional neural network (CNN) serving as a benchmark for understanding and comparison. Section 3.3 is divided into two sub-sections: Sect. 3.3.1 presents and explains the backbone of the time series reconstruction model, and Sect. 3.3.2 focuses on the loss function selected for training.

Datasets and Construction Methods

Building a robust and representative training database is crucial for deep learning models, as it sets the foundation for model learning and generalization. Data from HL-2A campaigns are analyzed in this work. The HL-2A campaigns exhibit notable differences in operating conditions, diagnostic equipment, system calibration, and even fuel composition. To ensure data consistency, 120 plasma shots of steady and MHD-perturbed discharges (sawteeth, fishbones) from the SXR system are chosen, with a sampling rate of 100 kHz.

In this work, two different dataset construction methods have been adopted. The first treats the data at each time point as an independent sample, without considering the temporal relationship between samples. In this approach, the input data, consisting of measurements from 40 viewing chords, are represented as a 1D vector of length 40, while the target value is a 1D vector of length 1152 derived from the SXR emissivity profiles provided by the NSGPT code. By reshaping the 1 × 1152 model output onto a 36 × 32 grid, a 2D profile image is obtained. This enables the representation of the SXR emissivity profiles in a two-dimensional format, allowing a visual and spatial understanding of the data. The 120 pulses yield 1,471,000 single-time-point samples: 1,186,700 for training, 275,900 for validation, and 8,400 for testing. With the input data as a one-dimensional vector, the dataset is simple and intuitive; the neural network does not need to handle correlations between multi-dimensional features. This simplification of the network's input layer and feature extraction helps with fast learning and updating of the weight parameters.
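A minimal sketch of this single-time-point layout is given below; the array names are illustrative, and the NSGPT targets are assumed to be precomputed and flattened to length 1152.

```python
import numpy as np

# Sketch of the single-time-point dataset layout; array names are illustrative.
# sxr_signals : (n_times, 40)   line-integrated measurements per time point
# nsgpt_maps  : (n_times, 1152) NSGPT emissivity profiles used as targets
def make_pointwise_dataset(sxr_signals, nsgpt_maps):
    assert sxr_signals.shape[1] == 40 and nsgpt_maps.shape[1] == 1152
    return sxr_signals.astype(np.float32), nsgpt_maps.astype(np.float32)

# A predicted 1 x 1152 vector maps back to a 36 x 32 image for visualization.
def to_image(flat_profile):
    return np.asarray(flat_profile).reshape(36, 32)
```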

The other method slices the SXR data into sequences of length n, which allows better observation and analysis of the periodic behavior of the MHD activities. The length of the time window can be treated as a hyper-parameter and adjusted according to the sampling rate. In our case, the SXR time series data are segmented into fixed-length windows of 100 data points. The input data for both the deep learning and NSGPT models therefore comprise a 100\(\times\)40 matrix, and the target value is the corresponding 100\(\times\)1152 NSGPT result. The 120 pulses are divided into 14,710 temporal samples: 11,867 for training, 2759 for validation, and 84 for testing.
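The windowing step can be sketched as follows, assuming non-overlapping windows (whether the actual dataset uses overlapping windows is not specified here); array names are illustrative.

```python
import numpy as np

def slice_windows(signals, targets, window=100):
    """Cut one pulse into fixed-length, non-overlapping windows.

    signals : (n_times, 40)   SXR measurements of one pulse
    targets : (n_times, 1152) NSGPT emissivity profiles of the same pulse
    Returns arrays of shape (n_windows, window, 40) and (n_windows, window, 1152).
    """
    n_windows = signals.shape[0] // window
    x = signals[:n_windows * window].reshape(n_windows, window, -1)
    y = targets[:n_windows * window].reshape(n_windows, window, -1)
    return x, y
```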

Baseline Model

A baseline model has been developed as a reference for the single-time-point data. For ill-posed problems where the input size is smaller than the desired output, the role of the baseline model is to achieve upsampling by enlarging the input. The deconvolutional network [13,14,15], also referred to as the inverse CNN, is an excellent choice for mapping a low-dimensional input to a high-dimensional output while preserving texture information during the expansion. Figure 2 shows the deconvolutional network developed for emissivity tomography. It receives the SXR measurements from the 40 lines of sight of the two cameras and produces a reconstruction of the emissivity profile. After the network's input, there are two dense layers with 4096 and 7500 nodes, respectively. Their outputs are reshaped into 20 feature maps of size 15\(\times\)25. A transposed convolution brings the maps up to a size of 30\(\times\)50. A flattening operation then converts the output feature maps into a one-dimensional vector, which is passed to the subsequent fully connected layers. The network was trained to minimize the mean squared error (MSE) [16] between its output and the sample tomograms provided for training.
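The description above can be sketched in PyTorch as follows. The kernel size of the transposed convolution, the number of output channels, and the final fully connected head are assumptions not fixed by the text, so this is an illustrative sketch rather than the exact network of Fig. 2.

```python
import torch
import torch.nn as nn

class BaselineDeconvNet(nn.Module):
    """Sketch of the up-convolutional baseline; kernel size, channel count after
    the transposed convolution, and the output head are assumptions."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(40, 4096)
        self.fc2 = nn.Linear(4096, 7500)            # reshaped to 20 x 15 x 25
        self.deconv = nn.ConvTranspose2d(20, 20, kernel_size=2, stride=2)  # -> 20 x 30 x 50
        self.head = nn.Linear(20 * 30 * 50, 1152)   # flattened maps -> emissivity profile
        self.act = nn.ReLU()

    def forward(self, x):                 # x: (batch, 40)
        x = self.act(self.fc1(x))
        x = self.act(self.fc2(x))
        x = x.view(-1, 20, 15, 25)
        x = self.act(self.deconv(x))
        x = torch.flatten(x, start_dim=1)
        return self.head(x)               # (batch, 1152), one 36 x 32 profile per sample
```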

Fig. 2
figure 2

Architecture of the up-convolutional network

Time Series Reconstruction Model

Network Architecture

The architecture of the time series reconstruction model is shown in Fig. 3. It contains an encoder module with eight blocks, each consisting of two 1-D convolutional layers with a residual connection. In traditional 1-D convolutions, the filter scans the input sequence consecutively with a fixed stride. In our work, dilation factors [17] are introduced as an additional parameter of the 1-D convolutions. The dilation factor determines the spacing between the values in the filter, allowing the filter to sample non-adjacent elements of the input sequence during convolution. By adjusting the dilation factor, the receptive field can be effectively expanded, enabling the capture of a wider range of contextual information and thereby enhancing the model's ability to understand and analyze complex patterns in sequential data. Here, the dilation factor is tied to the depth of the network, with larger dilation factors in deeper layers. Since the receptive field of the time series reconstruction model depends on the network depth n as well as on the dilation factor d, stabilizing deeper and larger models becomes important. In our design, a generic residual network [18] and the rectified linear unit (ReLU) [19] as the non-linearity are employed. The residual connections facilitate effective learning of complex features and address the vanishing gradient problem often encountered in deep neural networks, while the ReLU activation introduces non-linearity into the model, enabling it to capture nonlinear relationships in the data. In the encoder module, the input and output of a block have different widths; to account for this discrepancy, an additional 1 × 1 convolution ensures that the elementwise addition ⊕ receives tensors of the same shape. Two dropout layers are applied within the eight CNN blocks to avoid overfitting. The input data comprise a 100 × 40 matrix. The encoder module brings the feature maps up to a size of 100 × 1024, after which an FC layer with 100 × 1152 nodes produces the output.
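A PyTorch sketch of one dilated residual block and of the eight-block encoder with a fully connected output layer is given below. The kernel size, the channel progression toward the 1024-wide encoder output, and the placement of the dropout layers are assumptions, so this should be read as an illustration of the architecture described above rather than the exact model of Fig. 3.

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """One encoder block: two dilated 1-D convolutions with a residual connection.
    Kernel size and the optional dropout rate are assumptions."""
    def __init__(self, in_ch, out_ch, dilation, kernel_size=3, dropout=0.0):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2       # keep the sequence length (100) unchanged
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size, padding=pad, dilation=dilation)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size, padding=pad, dilation=dilation)
        self.drop = nn.Dropout(dropout)
        # 1 x 1 convolution so the elementwise addition receives tensors of the same shape
        self.match = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.act = nn.ReLU()

    def forward(self, x):                             # x: (batch, channels, 100)
        y = self.act(self.conv1(x))
        y = self.drop(self.act(self.conv2(y)))
        return self.act(y + self.match(x))            # residual addition

class TimeSeriesReconstructor(nn.Module):
    """Eight blocks with dilation growing with depth, then a fully connected output layer."""
    def __init__(self, channels=(40, 64, 128, 128, 256, 256, 512, 512, 1024)):
        super().__init__()
        self.blocks = nn.ModuleList(
            DilatedResidualBlock(channels[i], channels[i + 1], dilation=2 ** i)
            for i in range(8)
        )
        self.fc = nn.Linear(1024, 1152)

    def forward(self, x):                             # x: (batch, 100, 40)
        x = x.transpose(1, 2)                         # -> (batch, 40, 100)
        for block in self.blocks:
            x = block(x)
        x = x.transpose(1, 2)                         # -> (batch, 100, 1024)
        return self.fc(x)                             # -> (batch, 100, 1152)
```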

Loss Function

Smooth L1 loss [20] is used as the loss function to improve on the MSE loss; it is described as follows:

$$\mathrm{smooth}_{L1}\left(x_i, f(x_i)\right)=\begin{cases}0.5\left(x_i - f(x_i)\right)^{2}, & \left|x_i - f(x_i)\right| < 1\\ \left|x_i - f(x_i)\right| - 0.5, & \text{otherwise}\end{cases}$$
(5)

where \(x_i\) and \(f(x_i)\) represent the elements of the SXR emissivity profiles given by the NSGPT model and by the time series reconstruction model, respectively. Using smooth L1 loss generally leads to better results than MSE loss in our CNN. In our regression task, where the targets are unbounded, training with MSE loss requires careful tuning of the learning rate to prevent exploding gradients; smooth L1 loss removes this sensitivity. In addition, unlike the pure L1 loss, smooth L1 loss avoids the abrupt gradient change near the origin, further mitigating the risk of unstable gradients.
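In practice, the smooth L1 loss of Eq. (5) is available directly in PyTorch; a minimal usage sketch with illustrative tensor shapes is:

```python
import torch
import torch.nn as nn

# PyTorch's SmoothL1Loss with the default beta = 1.0 coincides with Eq. (5).
criterion = nn.SmoothL1Loss()

pred   = torch.randn(8, 100, 1152, requires_grad=True)  # time series model output (illustrative)
target = torch.randn(8, 100, 1152)                       # NSGPT emissivity profiles (illustrative)
loss = criterion(pred, target)
loss.backward()   # gradients stay bounded for large residuals
```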

Fig. 3
figure 3

The architecture of the time series reconstruction model

Experiments and Results

In this section, the discussion is divided into two main parts. Section 4.1 systematically presents and analyzes the performance differences between the two models in the reconstruction task, along with the effect of different loss functions on their performance. Section 4.2 subjects the selected models to noise-robustness tests, providing evidence for the reliability and effectiveness of these models in practical applications.

Ablation Experiments

In this section, ablation experiments were conducted on the SXR data to validate the performance of the neural network models. The experiments are run on a single A100 GPU with CUDA 11.7. The algorithms in this paper are implemented in Python 3.9 using torch 2.0.1. The Adam optimizer is used to enhance the stability of the training process. Mean absolute error (MAE) and root mean square error (RMSE) are used as the metrics measuring the average difference between the predicted and target values in our regression task; lower MAE and RMSE values indicate better overall accuracy. To ensure a fair comparison, all models are trained from scratch and trained several times.
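For reference, the two metrics can be computed as follows; this is a minimal sketch, not the exact evaluation script used for Table 1.

```python
import torch

def mae(pred, target):
    """Mean absolute error between predicted and NSGPT target profiles."""
    return torch.mean(torch.abs(pred - target))

def rmse(pred, target):
    """Root mean square error between predicted and NSGPT target profiles."""
    return torch.sqrt(torch.mean((pred - target) ** 2))
```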

The evolution of the loss during the training of the three networks is depicted in Fig. 4. The loss of the baseline model and of the time series reconstruction model is shown over 5 and 20 epochs of training, respectively. The blue crosses represent the average loss on the training set, while the red line represents the average loss on the validation set. The time series networks decrease both the training and the validation loss. During baseline model training, the loss decreases gradually on the training set yet increases on the validation set. This behavior may stem from the baseline model overfitting the training set features and struggling to generalize to previously unseen validation data. Moreover, the time series network using the smooth L1 loss reaches a smaller validation loss than the network trained with MSE loss.

Fig. 4
figure 4

Evolution of MSE loss of the baseline model (a), the time series model with MSE loss function (b) and the time series model with smooth L1 loss function (c)

The corresponding relative MAE and RMSE values are presented in Table 1: the MAE of the baseline model is one order of magnitude higher than that of the time series model, and the RMSE of the baseline model is also larger. Furthermore, the time series model using the smooth L1 loss function exhibits lower MAE and RMSE values than the time series model with MSE loss. Figure 5 depicts sample reconstructions from the same test set: Fig. 5A shows the reconstruction of a steady discharge, while Fig. 5B shows the reconstruction of an MHD-perturbed discharge. Panels (a), (b), and (c) correspond to the results obtained from the Bayesian model, the baseline model, and the time series model with MSE and smooth L1 loss, respectively. For both the steady and the MHD-perturbed discharge, the time series model using smooth L1 loss produces smoother predictions that more closely resemble the shape of the target value. Figure 6 presents the relative errors between the predictions of the various models and the target values; the time series model using smooth L1 loss exhibits the smallest relative errors. These results suggest that the time series model with smooth L1 loss performs better in terms of accuracy and error metrics.

Fig. 5
figure 5

Sample reconstructions from the same test set

Fig. 6
figure 6

Relative errors between the predictions from various models and the target values

Table 1 MAE and RMSE obtained on test set

To compare the application speed of the different methods, the time spent on the test set of 8400 data points is reported in Table 2. The neural networks are at least 9000 times faster than the Bayesian method on a CPU and require only a few milliseconds per reconstruction on a GPU. This demonstrates the significant advantage in time efficiency that deep learning-based soft X-ray tomography (SXT) holds over the conventional Bayesian method.
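A minimal sketch of how such wall-clock timings can be collected is shown below, assuming a trained model and a prepared test tensor; it is not the exact benchmarking script behind Table 2.

```python
import time
import torch

def time_inference(model, test_inputs, device="cuda"):
    """Measure wall-clock inference time over the full test set."""
    model = model.to(device).eval()
    test_inputs = test_inputs.to(device)
    with torch.no_grad():
        if device == "cuda":
            torch.cuda.synchronize()   # avoid skew from asynchronous GPU kernels
        start = time.perf_counter()
        _ = model(test_inputs)
        if device == "cuda":
            torch.cuda.synchronize()
        return time.perf_counter() - start
```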

Table 2 Execution time obtained on test set

Robustness to Noise

During the HL-2A campaigns, the SXR signals used as model input are typically subject to a certain level of measurement noise, which can arise from various sources, including limitations in instrument accuracy, electromagnetic interference, and environmental noise. These factors affect the accuracy of the diagnostic signals, resulting in discrepancies between the model's input and the actual values. To make the model more applicable to real-world scenarios, Gaussian noise of varying magnitude is randomly added to the test data. In the anti-noise ablation experiment, the first step is to determine the mean and standard deviation of the Gaussian noise; three different standard deviations have been chosen for testing. Next, random numbers are drawn from a Gaussian distribution in a quantity matching the size of the model input. Finally, the generated Gaussian noise sequence is added to the original data points, effectively introducing random noise into the dataset. These steps were repeated 5 times, generating a different random noise sequence each time, and the time series reconstruction model performed inference on the resulting noisy test sets. The comparison results are shown in Table 3; Fig. 7 shows sample reconstructions obtained on the test sets with different levels of noise, and Fig. 8 presents the relative errors of the time series model under different levels of noise contamination. The experimental results indicate no significant decrease in the model's performance as the noise level increases, suggesting that the time series model with smooth L1 loss is robust to noise and can handle noisy inputs effectively. Such robustness is valuable during the HL-2A campaigns.
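The noise-injection procedure described above can be sketched as follows; the three standard deviations and the noise mean used in the experiment are not reproduced here, and the function name is illustrative.

```python
import numpy as np

def add_gaussian_noise(signals, std, mean=0.0, seed=None):
    """Add Gaussian noise to the SXR test inputs, matching the input size.

    signals : (n_windows, 100, 40) test inputs of the time series model
    std     : standard deviation of the noise (the values used in the
              experiment are not reproduced here)
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(mean, std, size=signals.shape)
    return signals + noise

# Repeat with different random noise sequences, as described in the text, e.g.:
# noisy_sets = [add_gaussian_noise(test_inputs, std=s, seed=k)
#               for s in chosen_stds for k in range(5)]
```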

Fig. 7
figure 7

Sample reconstructions of the time series model with different levels of noise

Fig. 8
figure 8

Relative errors of the time series model with different levels of noise

Table 3 The comparison results of noise testing

Conclusion

In this paper, deep learning is applied as a fast approximation of the Bayesian model for inferring emission profiles from measurements. First, based on HL-2A experimental SXR data and the Bayesian-based non-stationary Gaussian processes tomography method, training, validation and test datasets are built. Second, two typical neural networks are constructed and trained: an up-convolutional neural network and a time series neural network. Smooth L1 loss is adopted to improve the stability of the training process and the resilience against noisy or outlier data points. Third, ablation experiments on the SXR datasets are conducted, and the results show that the time series model has clear advantages in terms of fitting accuracy and inference speed. The time series model has back-projection error levels at around 0.0015%, close to that of the Bayesian tomography method, and an inference speed in the millisecond range. Moreover, the noise tests indicate its ability to constrain the SXR profile to match most of the data, even in noisy environments.

In the future, the network could be tested on a larger set of measurements collected in previous campaigns. Reasonable reconstructions can be stored as a secondary database for in-depth analysis of MHD mode structures, such as their shape, size and location. Meanwhile, cases where the reconstruction fails should be investigated individually, as they can provide valuable insight for further improving the model. Furthermore, the work described here is of interest not only for SXR diagnostic systems but also for other diagnostics that use Bayesian methods for plasma tomography. Besides the diagnostics available on HL-2A, the approach could also be applied to other fusion devices.