1 Introduction

High-speed planar laser-based diagnostics have been widely applied to the experimental study of reacting and non-reacting flows [1,2,3,4,5,6,7,8,9]. Several laboratories have developed high-frequency laser diagnostic facilities for studying turbulent flows and flames. For example, Slipchenko et al. [10] developed a burst-mode laser with output repetition rates of up to 1 MHz, which was used for planar laser-induced fluorescence (PLIF) measurement of formaldehyde (CH2O) in a lifted methane diffusion flame at 20 kHz. The fundamental wavelength (1064 nm) of this laser can also be used for high-speed planar laser-induced incandescence (PLII) measurement of soot concentration. In addition, Fu et al. [11] carried out simultaneous PLIF and three-dimensional particle image velocimetry (PIV) measurements at 20 kHz in an ethylene diffusion flame with acoustic excitation. Michael et al. [12] optimized the burst-mode laser system and successfully carried out 100 kHz CH2O PLIF measurements on a lifted diffusion flame. Beyond CH2O imaging, burst-mode lasers have also reached 100 kHz in other measurements, including Rayleigh-scattering temperature measurement [13, 14], coherent anti-Stokes Raman scattering (CARS) temperature measurement [15], and PIV measurement of turbulent flow fields [16]. Furthermore, Philo et al. [17] performed 100 kHz PIV in a liquid-fueled gas-turbine swirl combustor at 1 MPa, demonstrating the validity of high-repetition-rate measurements in practical combustion systems. It is also worth mentioning that researchers have used multi-pulsed Nd:YAG systems to generate short bursts for LIF measurements at 20-40 kHz in practical combustion systems such as spark-ignition engines [18].

While most of the measurements discussed above employed the harmonics of Nd:YAG lasers, many laser-based techniques require other excitation wavelengths; for example, PLIF measurement of hydroxyl (OH) radicals requires the Nd:YAG laser to be combined with either a dye laser [19,20,21] or an optical parametric oscillator (OPO) [22]. Sjöholm et al. [23] applied this approach to the diagnostics of other species such as CH, CH2O and toluene. Wang et al. [24] reported the first ultra-high-speed diagnostic technique that simultaneously probes the OH and formaldehyde distributions in a highly turbulent flame at a repetition rate of 50 kHz. Using fast Nd:YAG lasers and frequency-extension units, OH-PLIF imaging has been conducted at 50 kHz by Miller et al. [25] in a H2-air diffusion flame. However, further increasing the speed of imaging measurements is very challenging due to the limited repetition rates of laser systems and cameras, as well as the difficulty of storing and transferring the resulting large data volumes. Consequently, either the number of consecutive images is significantly reduced, or the spatial resolution and field of view must be sacrificed to maintain image quality.

Yet these issues can potentially be addressed by combining laser-based imaging with computational methods to artificially increase the imaging frequency, inspired by previous works [26,27,28]. As an effective approach to computational imaging, machine learning architectures, and particularly deep neural networks, have seen explosive growth, drawing on parallel progress in mathematical optimization and computing hardware. While these developments have long benefited image interpretation and machine vision, only recently has it become evident that deep neural networks can also be effective for computational image formation, aside from interpretation [29].

Generating high-speed imaging from low-frequency measurements essentially requires a temporal sequence prediction model. Such a model takes the reference image sequence (low-speed diagnostics) as the input and generates the temporally interpolated frames (usually more than one per input interval) as the output. What is more challenging is that each image in the sequence needs a high spatial resolution to reconstruct the turbulence structures. In machine learning, recurrent neural networks (RNNs) [30] and their variants, such as long short-term memory (LSTM) [31] and the gated recurrent unit (GRU) [32], are designed to handle this type of multi-sequence-input, multi-sequence-output problem, while the spatial structure of each image can be resolved by convolutional neural networks (CNNs) [33,34,35]. It is therefore of great interest to explore the feasibility of such a methodology for computing high-frequency imaging from relatively low-frequency experimental data, thereby reducing the dependence on high-speed laser-camera setups.

Many recent works have demonstrated the effectiveness of deep learning in generating sequences of images with high spatial resolution. For example, Hong et al. [36] used a specific combination of skip connections and a convolutional sequence-to-sequence auto-encoder to predict future weather situations from previous satellite images with high accuracy. However, that model only predicted the upcoming weather situations, which were scalar quantities rather than image matrices. Shi et al. [37] and Kim et al. [38] predicted future precipitation from historical multichannel radar reflectivity images with a convolutional long short-term memory (Conv-LSTM) network, which has become a seminal model in this area. Afterwards, Finn et al. [39] constructed a network based on this model for predicting the transformation of the next frame from a previous image sequence, and Lotter et al. [40] built a predictive model upon Conv-LSTM, mainly focusing on increasing the prediction quality of the next frame. However, these models predict only the next single frame from a sequence of input frames, whereas in the present work a sequence of one or more images needs to be predicted from a single input image, which is more challenging. Patraucean et al. [41] modified RNNs by introducing optical flow to model temporal dynamics; however, this methodology is difficult to apply due to its high additional computational cost. Such learning problems, regardless of their exact application, are nontrivial in the first place due to the high dimensionality of the spatio-temporal sequences, especially when multi-step predictions have to be made. Moreover, building an effective prediction model for high-speed imaging in turbulent flames is even more challenging due to the unsteady nature of turbulent combustion.

This paper adopts the Conv-LSTM network as the computational model, because convolutional layers can extract spatial features while the LSTM can capture temporal characteristics of high-speed imaging. As such, predictions with sufficient spatial and temporal resolution are expected from Conv-LSTM. Among the intermediate species of combustion, OH is a critical radical whose formation is commonly interpreted as a marker of the flame reaction zone. Furthermore, excitation of OH radicals cannot be directly achieved with the harmonics of the commonly employed Nd:YAG lasers [5]. Therefore, this work chose OH-PLIF as an example to explore the feasibility of developing a computational method to artificially increase the repetition rate of high-speed imaging in turbulent flames.

2 Experimental data

The experimental data in the present work were collected using an experimental setup described in a previous paper [42], with the details of the optical system summarized in Table 1. An ultra-high-speed laser (QuasiModo, Spectral Energies, LLC) was used, which is similar to but not identical with the system developed by Slipchenko et al. [10]. This laser system pumped an optical parametric oscillator (OPO, GWU, PremiScan/MB) to generate the excitation radiation for OH radicals in the flame. The pump beam was reduced to approximately 4 mm in diameter using a telescope placed before the β-barium borate (BBO) crystal. The OH excitation scan is not presented here for brevity; more information can be found in Ref. [24]. After the OH excitation scan, the laser radiation was tuned to 283.93 nm (A2Σ+−X2Π, 1–0 transition). A Pellin-Broca prism was used to separate the 284 nm beam from the 568 nm beam after the doubling crystal, and a 20 mm high laser sheet was formed by a cylindrical lens (f = −40 mm) and a spherical lens (f = +200 mm). The emission from the A–X (0, 0) transition was collected at around 308 nm through a bandpass filter (λT = 310 ± 10 nm) and a UV lens (B. Halle, f# = 2, f = 100 mm) mounted in front of a high-speed intensifier (LaVision HSIRO) and a CMOS camera (Photron Fastcam SA-Z). The resulting pulse energy of the 283.93 nm beam at 100 kHz was about 150 µJ/pulse, which generated a sufficient signal-to-noise ratio (SNR) in this application. The resolution of the CMOS camera for the OH-PLIF measurement was set to 600 × 240 pixels at 100 kHz, resulting in a field of view of 18 × 7 mm.

Table 1 Specifications of the optical setup

The burner employed in this study was a mixed porous plug/jet burner (LUPJ burner), the details of which can be found in previous works [42,43,44,45]. The main components of the burner are a porous sintered stainless-steel plug with a diameter of 61 mm and a central nozzle with a diameter of 1.5 mm. A premixed CH4/air mixture with an equivalence ratio of 1.0 formed a jet flame through the central nozzle. The jet flow speed was 66 m/s, corresponding to an exit Reynolds number of 6300 [46] and a turbulent Reynolds number of 95 at y/d = 30 [42]. The gas flow rates were regulated by mass flow controllers (Bronkhorst), calibrated at 300 K with an accuracy better than 98.5%.

In the present work, all the OH-PLIF images were preprocessed with binarization to enhance the discrimination of OH regions from those without OH radicals (Fig. 1). A threshold m, set to the mean intensity of all pixels in each image, was applied: pixels with an intensity below m were set to 0, while pixels with an intensity at or above m were set to 255.
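For clarity, a minimal NumPy sketch of this per-image thresholding is given below; the function name and the assumption of 8-bit grayscale frames are illustrative, not part of the original processing code.

```python
import numpy as np

def binarize(frame: np.ndarray) -> np.ndarray:
    """Binarize one 8-bit OH-PLIF frame with its own mean intensity m."""
    m = frame.mean()                                   # per-image threshold
    return np.where(frame >= m, 255, 0).astype(np.uint8)
```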

Fig. 1
figure 1

a Representative raw OH-PLIF images from the optical experiment and b the corresponding binarized images

3 Deep learning methodology

3.1 Convolutional LSTM

A modified version of the Conv-LSTM model reported by Shi et al. [37] was used in the present paper. To enable the hidden layers to learn from longer time scales, two extra parameters were introduced, steps and effective steps, representing the length of the OH-PLIF image sequence fed to the hidden layers and the length of the interpolated OH-PLIF image sequence, respectively. In addition, the eLU [47] activation function was used in the input layer and the SeLU [48] activation function in the output layer, instead of ReLU [49]. The task of generating OH-PLIF image sequences with sufficient temporal and spatial resolution can be addressed under the general sequence-to-sequence learning framework proposed by Sutskever et al. [50].

The main formulas of the Conv-LSTM used in this paper are given as follows:

$$i_{t}=\sigma \left(W_{xi} * \mathcal{X}_{t}+W_{hi} * \mathcal{H}_{t-1}+W_{ci}\circ \mathcal{C}_{t-1}+b_{i}\right)$$
(1)
$$f_{t}=\sigma \left(W_{xf} * \mathcal{X}_{t}+W_{hf} * \mathcal{H}_{t-1}+W_{cf}\circ \mathcal{C}_{t-1}+b_{f}\right)$$
(2)
$$\mathcal{C}_{t}=f_{t}\circ \mathcal{C}_{t-1}+i_{t}\circ \tanh\left(W_{xc} * \mathcal{X}_{t}+W_{hc} * \mathcal{H}_{t-1}+b_{c}\right)$$
(3)
$$o_{t}=\sigma \left(W_{xo} * \mathcal{X}_{t}+W_{ho} * \mathcal{H}_{t-1}+W_{co}\circ \mathcal{C}_{t}+b_{o}\right)$$
(4)
$$\mathcal{H}_{t}=o_{t}\circ \tanh\left(\mathcal{C}_{t}\right)$$
(5)

where ‘\(*\)’ represents the convolution operation and ‘\(\circ\)’ the Hadamard product. Here, \(i_{t}\), \(f_{t}\) and \(o_{t}\) represent the input gate, forget gate and output gate of the Conv-LSTM, respectively; they are all 3D tensors whose last two dimensions are the spatial dimensions (rows and columns). \(\mathcal{X}_{t}\), \(\mathcal{H}_{t}\) and \(\mathcal{C}_{t}\) are the input, the hidden state and the memory cell state at timestamp \(t\), respectively. The memory cell state \(\mathcal{C}_{t}\) acts as an accumulator of the state information, and the cell is accessed, written, and cleared by several self-parameterized controlling gates. When a new input arrives, its information is accumulated in the cell if the input gate is activated. The past cell state \(\mathcal{C}_{t-1}\) can be “forgotten” in this process if the forget gate \(f_{t}\) is on. Whether the latest cell state \(\mathcal{C}_{t}\) is propagated to the final state \(\mathcal{H}_{t}\) is further controlled by the output gate \(o_{t}\). One advantage of using the memory cell and gates to control the information flow is that the gradient is trapped in the cell (the so-called constant error carousel) and prevented from vanishing too quickly, which is a critical problem for the vanilla recurrent neural network (RNN) model.

If the states are viewed as the hidden representations of moving objects, a Conv-LSTM with a larger transitional kernel should be able to capture faster motions, while one with a smaller kernel can capture slower motions. This trend was confirmed during our model training. Also, as was discussed in [38], the inputs, cell outputs and hidden states of the traditional FC-LSTM may also be seen as 3D tensors whose last two dimensions are 1. In this sense, FC-LSTM is essentially a special case of Conv-LSTM with all features standing on a single cell.

To ensure that the states have the same number of rows and columns as the inputs, padding is needed before applying the convolution operation. Padding of the hidden states at the boundary can be interpreted as accounting for the area outside the photosensitive range. Usually, before the first input is fed to the network, all the states of the LSTM are initialized to zero, which corresponds to “total ignorance” of the future. Similarly, zero-padding of the hidden states (as used in this paper) effectively sets the state of the outside world to zero, assuming no prior knowledge of the region outside the camera’s photosensitive range.
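To make the cell structure concrete, the following is a minimal PyTorch sketch of a single Conv-LSTM cell implementing Eqs. (1)-(5). It is a sketch under stated assumptions rather than the exact implementation used here: the peephole weights W_ci, W_cf and W_co, which in Eqs. (1)-(4) act through the Hadamard product, are simplified to per-channel parameters, and the four gate convolutions are fused into one for brevity.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One Conv-LSTM cell following Eqs. (1)-(5), with 'same' padding so the
    states keep the spatial size of the inputs (zero-padded, as in Sec. 3.1)."""

    def __init__(self, in_ch: int, hid_ch: int, k: int = 5):
        super().__init__()
        # A single convolution evaluates W_x* X_t + W_h* H_{t-1} (+ biases)
        # for all four gates at once; padding k//2 preserves rows and columns.
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)
        self.w_ci = nn.Parameter(torch.zeros(1, hid_ch, 1, 1))  # peephole terms
        self.w_cf = nn.Parameter(torch.zeros(1, hid_ch, 1, 1))
        self.w_co = nn.Parameter(torch.zeros(1, hid_ch, 1, 1))

    def forward(self, x, h, c):
        gi, gf, gg, go = self.conv(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i = torch.sigmoid(gi + self.w_ci * c)         # input gate,   Eq. (1)
        f = torch.sigmoid(gf + self.w_cf * c)         # forget gate,  Eq. (2)
        c_new = f * c + i * torch.tanh(gg)            # cell state,   Eq. (3)
        o = torch.sigmoid(go + self.w_co * c_new)     # output gate,  Eq. (4)
        h_new = o * torch.tanh(c_new)                 # hidden state, Eq. (5)
        return h_new, c_new
```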

3.2 Encoding-forecasting architecture

Assume a spatio-temporal sequence \((S_{1},S_{2},S_{3},\dots ,S_{n})\) is measured at a frequency \(f\). To increase the measurement frequency by a factor of \(K\), at every timestamp \(t\) the model needs to generate a \(K\)-step prediction based on the previous observation, i.e. from \(S_{t}\) to \((\widehat{S_{t+1}},\widehat{S_{t+2}},\dots ,\widehat{S_{t+K}})\). Our encoding-forecasting network first encodes the observation into \(n\) layers of RNN states, \(\mathcal{H}_{t}^{1},\mathcal{H}_{t}^{2},\dots ,\mathcal{H}_{t}^{n}=h(S_{t})\), and then uses another \(n\) layers of RNNs with the same weights to generate the predictions from these encoded states, \(\widehat{S_{t+1}},\widehat{S_{t+2}},\dots ,\widehat{S_{t+K}}=p(\mathcal{H}_{t}^{1},\mathcal{H}_{t}^{2},\dots ,\mathcal{H}_{t}^{n})\). Figure 2 illustrates the encoding-forecasting structure for \(n=3\) and \(K=3\). The target images (ground truth) serve as the supervision of the network and are used to evaluate how close the predictions are to the ground truth. A formulation of the sequential prediction similar to that of [37] was adopted, but with a significant difference in input size.

Fig. 2
figure 2

The encoding-forecasting architecture used in the current deep learning model

As is shown in Fig. 2, the Conv-LSTM encoder compresses the whole input sequence into a hidden state tensor, and the Conv-LSTM forecaster unfolds this hidden state to form the final prediction. This structure is also similar to the LSTM future-predictor model proposed by Srivastava et al. [51], except that our input and output elements are all 3D tensors that preserve all the spatial information. Since the network has multiple stacked Conv-LSTM layers, it is able to predict sequences in complex dynamic systems, similar to the network applied to the precipitation nowcasting problem by Shi et al. [37].
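A minimal sketch of how such cells can be stacked into the encoding-forecasting structure of Fig. 2 is given below, reusing the ConvLSTMCell sketch above. It assumes the 3-layer, (16, 16, 32)-hidden-state configuration selected in Sect. 3.5; the 1 × 1 output convolution and sigmoid head are illustrative simplifications of the paper's eLU/SeLU input and output layers.

```python
class EncoderForecaster(nn.Module):
    """Encode the observed frame S_t into n layers of Conv-LSTM states, then
    unroll K steps with the same cells to predict (S_{t+1}, ..., S_{t+K})."""

    def __init__(self, in_ch=1, hid=(16, 16, 32), k=5, K=3):
        super().__init__()
        self.hid, self.K = hid, K
        chans = (in_ch,) + tuple(hid)
        self.cells = nn.ModuleList(
            [ConvLSTMCell(chans[i], chans[i + 1], k) for i in range(len(hid))])
        self.head = nn.Conv2d(hid[-1], in_ch, kernel_size=1)  # H_t -> frame

    def forward(self, s_t):
        B, _, H, W = s_t.shape
        # Zero-initialized states: "total ignorance" of the future (Sec. 3.1)
        states = [[s_t.new_zeros(B, n, H, W), s_t.new_zeros(B, n, H, W)]
                  for n in self.hid]
        x, preds = s_t, []
        for _ in range(self.K):                       # K-step forecast
            for i, cell in enumerate(self.cells):     # stack layers vertically
                h, c = cell(x, *states[i])
                states[i] = [h, c]
                x = h                                 # hidden state feeds upward
            x = torch.sigmoid(self.head(x))           # predicted frame in [0, 1]
            preds.append(x)                           # fed back as next input
        return torch.stack(preds, dim=1)              # (B, K, C, H, W)
```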

3.3 Model description

Before training the network, the OH-PLIF sequences were first split into a training set (80% of the data) and a test set (20% of the data). The training set is used to train the Conv-LSTM model, optimizing its parameters through back-propagation [52]; the test set is reserved for calculating and analyzing the errors of the trained model.
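A trivial sketch of this split, assuming the preprocessed data live in a list `sequences` (a hypothetical container, not a variable from the paper):

```python
n_train = int(0.8 * len(sequences))        # 80/20 split by sequence
train_set = sequences[:n_train]            # used for back-propagation training
test_set = sequences[n_train:]             # held out for error analysis
```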

All the models in this study were trained by minimizing the mean square error (MSE) loss using back-propagation through time (BPTT) [53] and Adam [54] with a learning rate of \({10}^{-4}\). MSE is defined as follows:

$$\mathrm{MSE}\left(f,g\right)=\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}{({f}_{ij}-{g}_{ij})}^{2}$$
(6)

where \({f}_{ij}\) and \({g}_{ij}\) represent the pixel intensity of image \(f\) and \(g\) at row \(i\) and column \(j\), respectively; \(M\) and \(N\) represent the number of rows and columns, respectively.
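The corresponding training loop can be sketched as follows, using the EncoderForecaster defined above; `train_loader`, yielding (input frame, K-frame target sequence) pairs, is an assumed data pipeline, and K = 1 corresponds to the P50-100 kHz case introduced below.

```python
import torch
import torch.nn as nn

model = EncoderForecaster(K=1)
optim = torch.optim.Adam(model.parameters(), lr=1e-4)   # Adam [54], lr = 1e-4
loss_fn = nn.MSELoss()                                  # pixel-averaged Eq. (6)

for epoch in range(100):                                # 100 epochs (Sec. 3.5)
    for s_t, target in train_loader:                    # target: (B, K, C, H, W)
        loss = loss_fn(model(s_t), target)
        optim.zero_grad()
        loss.backward()                     # BPTT through the K unrolled steps
        optim.step()
```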

The models were implemented in Python under the PyTorch framework, and all the experiments were run on eight NVIDIA RTX 2080 Ti GPUs. To investigate the predictive capability of the deep learning model, we interpolated 1, 2 and 4 images between consecutive experimental frames, corresponding to generating 100 kHz OH-PLIF images from 50 kHz, 33.3 kHz and 20 kHz data, respectively. In addition, we also artificially generated 200 kHz OH-PLIF images from the 100 kHz experimental sequence, a repetition rate that has not yet been achieved by real-world PLIF. These predictions are denoted P50-100 kHz, P33.3-100 kHz, P20-100 kHz and P100-200 kHz, respectively, in the following sections. The models are trained on 1791 OH-PLIF sequences and tested on 100 sequences, with each sequence containing 15 images.

3.4 Quantitative evaluation of model accuracy

In addition to the MSE mentioned in the last section, the structural similarity index (SSIM) [55] was also used in this study to quantify the degree of similarity between the predicted image P and the experimental supervision T, both of which are binary images. SSIM is defined as:

$$\mathrm{SSIM}\left(P,T\right)=l\left(P,T\right)c\left(P,T\right)s\left(P,T\right)$$
(7)

where the relevance of luminance \(l\), contrast \(c\) and structure \(s\) are further defined as follows:

$$l\left(P,T\right)=\frac{2\mu_{P}\mu_{T}+C_{1}}{\mu_{P}^{2}+\mu_{T}^{2}+C_{1}}$$
(8)
$$c\left(P,T\right)=\frac{2\sigma_{P}\sigma_{T}+C_{2}}{\sigma_{P}^{2}+\sigma_{T}^{2}+C_{2}}$$
(9)
$$s\left(P,T\right)=\frac{\sigma_{PT}+C_{3}}{\sigma_{P}\sigma_{T}+C_{3}}$$
(10)

where \(\mu_{P}\) and \(\mu_{T}\) represent the mean pixel intensities of images \(P\) and \(T\); \(\sigma_{P}\) and \(\sigma_{T}\) represent the standard deviations of images \(P\) and \(T\); and \(\sigma_{PT}\) represents the covariance of images \(P\) and \(T\). \(C_{1}\), \(C_{2}\) and \(C_{3}\) are constants that avoid a zero denominator, determined as follows: \(C_{1}=(K_{1}\times L)^{2}\), \(C_{2}=(K_{2}\times L)^{2}\), \(C_{3}=C_{2}/2\), with \(K_{1}=0.01\), \(K_{2}=0.03\) and \(L=255\).
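A direct NumPy transcription of Eqs. (7)-(10) is given below; note that it computes a single global SSIM over the whole image rather than the sliding-window average of Ref. [55], which is an assumption about how the index was evaluated here.

```python
import numpy as np

def ssim_global(P: np.ndarray, T: np.ndarray, L: float = 255.0) -> float:
    """Global SSIM of Eqs. (7)-(10) with K1 = 0.01, K2 = 0.03."""
    P, T = P.astype(np.float64), T.astype(np.float64)
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    C3 = C2 / 2.0
    muP, muT = P.mean(), T.mean()
    sP, sT = P.std(), T.std()
    sPT = ((P - muP) * (T - muT)).mean()                 # covariance
    l = (2 * muP * muT + C1) / (muP**2 + muT**2 + C1)    # luminance, Eq. (8)
    c = (2 * sP * sT + C2) / (sP**2 + sT**2 + C2)        # contrast,  Eq. (9)
    s = (sPT + C3) / (sP * sT + C3)                      # structure, Eq. (10)
    return l * c * s                                     # Eq. (7)
```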

Furthermore, another index, Correlation, was also employed to quantify the similarity between prediction and measurement. Correlation is defined as follows:

$$\mathrm{Correlation} = \frac{{\sum }_{i,j}{P}_{i,j}{T}_{i,j}}{\sqrt{\left(\sum_{i,j}{P}_{i,j}^{2}\right)\left(\sum_{i,j}{T}_{i,j}^{2}\right)}+\mathcal{E}}$$
(11)

where \(P_{i,j}\) and \(T_{i,j}\) represent the pixel intensities of images \(P\) and \(T\) at row \(i\) and column \(j\), and \(\mathcal{E}={10}^{-9}\) avoids a zero denominator.
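Equation (11) translates directly into a few lines of NumPy, sketched here for completeness:

```python
def correlation(P: np.ndarray, T: np.ndarray, eps: float = 1e-9) -> float:
    """Normalized cross-correlation of Eq. (11); eps avoids a zero denominator."""
    P, T = P.astype(np.float64), T.astype(np.float64)
    return float((P * T).sum() / (np.sqrt((P**2).sum() * (T**2).sum()) + eps))
```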

For the prediction \(P\) and its ground truth \(T\), the corresponding binary images \(\widehat{P}\) and \(\widehat{T}\) can be acquired using the binarization threshold, which in this study is the mean intensity of the image. The intersection over union (IoU) can then be calculated by overlapping the two binary images, as shown in Eq. (12). The IoU is used to quantify the prediction accuracy of signal occurrence; specifically, the difference between the prediction and ground truth can be identified, as shown by the red region in Fig. 3. It should be noted that sparse noise remains after binarization; to reduce its influence while preserving the major parts of the OH clusters, a 10 \(\times\) 10 pixel (0.31 mm \(\times\) 0.31 mm) smoothing window was applied to the binary images before calculating the IoU.

$$\mathrm{IoU}=\frac{|\widehat{P} \cap \widehat{T}|}{|\widehat{P} \cup \widehat{T}|}$$
(12)
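A sketch of the IoU evaluation is given below. The paper specifies only a “10 × 10 pixel smoothing window”; realizing it as a mean filter followed by re-thresholding is one plausible reading and is labeled as an assumption here.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def iou(P_bin: np.ndarray, T_bin: np.ndarray, win: int = 10) -> float:
    """IoU of Eq. (12) on binary images after 10 x 10 smoothing (Sec. 3.4)."""
    P_s = uniform_filter(P_bin.astype(float), size=win) > 0.5  # smooth, re-binarize
    T_s = uniform_filter(T_bin.astype(float), size=win) > 0.5
    inter = np.logical_and(P_s, T_s).sum()
    union = np.logical_or(P_s, T_s).sum()
    return float(inter / union)
```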

3.5 Architecture optimization and parametric study

To evaluate the performance of different architectures and parameters of the deep-learning (DL) model, the modified Conv-LSTM model employed in the current work was compared with the original Conv-LSTM model as well as other mainstream DL architectures such as FC-LSTM and Conv-GRU. The comparison was performed for the P50-100 kHz case. For the FC-LSTM network, the same structure was used as the unconditional model in [51], with two 2048-node LSTM layers. For Conv-LSTM and Conv-GRU, the convolutional kernel size and the number of hidden layers were fixed to allow a direct comparison: the kernel sizes are all 5 × 5, with 3 hidden layers and 16, 16 and 32 hidden states, respectively. The results show that the Conv-LSTM model employed in the current work yields the lowest MSE among all the models studied. These models are trained on 1791 image sequences and tested on 100 image sequences, with each sequence 15 images long. The images were resized from [1024, 400] to [512, 200].

Table 2 shows that the FC-LSTM network yields a relatively large MSE, mainly because a fully connected network is poor at capturing spatial correlations, whereas sequential OH-PLIF images exhibit strong spatial correlation, i.e., the motion of the flame is highly consistent within a local region. Such a model is therefore unlikely to capture these local consistencies. Also, the Conv-LSTM model with eLU and SeLU activation functions outperforms Conv-GRU and the original Conv-LSTM, mainly for two reasons. First, Conv-LSTM with eLU and SeLU activation functions can learn more nonlinear relationships between input and output than the original Conv-LSTM and Conv-GRU, whose activation function is ReLU. Second, the LSTM structure consists of three gates, making it more expressive than the two-gate GRU. Training the model takes about 36 h, and once trained, the model can generate one interpolated frame in a few seconds.

Table 2 Performance of the DL model under different architecture of neural networks

The performance of the DL model with different structures/configurations of the network was also tested and is summarized in Table 3. Therein, the first 5 × 5 term represents the input-to-state kernel size, and the following 5 × 5 terms represent the state-to-state kernel sizes. The values ‘16’, ‘32’ and ‘64’ represent the numbers of hidden states in the hidden layers, and the ‘Step’ value represents the length of the temporal image sequence fed into the network during training. The images were resized from [1024, 400] to [256, 100] in consideration of time and computational cost.

Table 3 Performance of the DL model under different structures of the Conv-LSTM network

The test MSE of the different model configurations is presented in Table 3. The 1-layer network contains one Conv-LSTM layer with 64 hidden states, the 2-layer network has two Conv-LSTM layers with 32 hidden states each, and the 3-layer network has 16, 16 and 32 hidden states in its three Conv-LSTM layers. All the input-to-state and state-to-state kernels are of size 5 × 5. The deeper model (3 Conv-LSTM layers) yields the lowest MSE loss. To investigate the relationship between the time step of the model and the characteristic time scales of the flow itself, step lengths of 5, 10, 15, 20 and 25 images were tested, which are related to the Taylor time scale (100 µs; at 100 kHz, 10 frames span 100 µs), as well as 45 images, which is related to the integral time scale (447 µs; 45 frames span 450 µs). The results are presented in Table 3. The test MSE values for the different steps are quite close once the step exceeds 10; the statistically best performance is achieved at a step of 45, with longer sequences providing lower MSE loss. However, a long sequence length is not preferred as a training parameter if a very similar prediction capability can be achieved with a shorter one. Additionally, increasing the step size reduces the number of sequences available for training and testing, meaning that a large step size demands an even bigger dataset for the model to be properly trained. Considering these factors, a step size of 15 was finally chosen for the subsequent study.

In addition to the comparison of model configurations, the influence of the input-to-state and state-to-state convolutional kernel sizes was also evaluated, as shown in Table 4, by comparing the model performance for kernel sizes of 3 × 3, 5 × 5, 7 × 7, 9 × 9 and 11 × 11. For kernel sizes larger than 5 × 5, the train MSE loss decreases but the test MSE loss increases, indicating that the model is overfitted at larger kernel sizes. The reason is that a larger kernel introduces more parameters into the model; for example, the number of convolution weights scales with the square of the kernel size, so moving from 5 × 5 to 11 × 11 kernels multiplies it by (11/5)² ≈ 4.8. If the training set is not large enough to constrain these parameters, the model tends to overfit, with a low train MSE loss and a high test MSE loss. Moreover, further increasing the kernel size would sacrifice the spatial resolution of the predicted results and is therefore not preferred. For the rest of the study, a convolutional kernel size of 5 × 5 was therefore chosen.

Table 4 Comparison of Train MSE and Test MSE for different convolutional kernel sizes

In short, the network in the present work was trained for 100 epochs with 3 hidden layers, corresponding to 16, 16 and 32 hidden states. Both the input-to-state and state-to-state convolutional kernel sizes were 5 × 5, and the length of the OH-PLIF sequences fed to the network was 15. The activation of the input layer was eLU and that of the output layer was SeLU.

4 Results and discussion

4.1 Transient performance of the DL model

The performance of the model under the three different conditions is presented in Fig. 3, where the frames enclosed in red dashed squares are DL predictions and the remaining frames are experimental results. The DL model has the capacity to generate high-speed OH-PLIF images from their low-frequency measured counterparts, especially for the P50-100 kHz case, for which the profile and fluidity are particularly similar to the experimental data. Comparing the different cases, it can be found that the performance of the model degrades as the number of prediction steps increases. A quantitative evaluation of the model is summarized in Tables 5 and 6, using the indices defined in Eqs. (6)-(12) to quantify the accuracy of the model; the values shown in Tables 5 and 6 were calculated from 100 image pairs. The statistics in the tables are consistent with Fig. 3: the performance of the model decreases with the number of interpolation steps, but still maintains a relatively satisfactory accuracy. For example, the average correlation (a similarity index) between ground truth and DL prediction is calculated to be 0.917, 0.854 and 0.752 for the P50-100 kHz, P33.3-100 kHz and P20-100 kHz cases, respectively.

Table 5 Average values of similarity indices under different modeling conditions
Table 6 Standard deviation of similarity indices under different modeling conditions

Furthermore, the IoU (intersection over union) of the predicted and measured binary images is also presented in Fig. 3, making it easier to identify which regions are correctly predicted and which are not. From the IoU, it can be found that the deviation between prediction and measurement mainly lies in the transition (or switching) regions, and that the deviation increases with the number of inserted frames, as shown in the last columns of Tables 5 and 6. According to previous research, IoU and SSIM values of 0.65 and above [26, 56] suggest that the model has reconstructed the images with high similarity.

Fig. 3
figure 3

Experimental OH-PLIF data, DL predictions, and IoU imaging for the P50–100 kHz, P33.3–100 kHz and P20–100 kHz cases

To interpret the difference in model performance under the three conditions, it is important to discuss the characteristic time scales of the turbulent structures. As described above, the jet flow speed is 66 m/s for the flame studied; the integral time scale is approximately 447 µs [45], the Kolmogorov time scale is 46 µs, and the Taylor time scale is about 100 µs at y/d = 30 [57]. The Taylor time scale indicates the characteristic time over which a small vortex, or in this case a flame structure, remains correlated with itself. Here, a Taylor time scale of 100 µs is a reasonable measure of the flame self-correlation time because, as can be seen from Figs. 3 and 7, within around 50 µs the flame has moved upwards by half of the frame length. With a Taylor time scale on the order of 100 µs, the prediction for the P50-100 kHz case can reasonably be made, since a flame structure correlates highly with itself in the next frame. However, as more time steps are skipped in the interpolation, for example in the extreme P20-100 kHz case, the correlation between flame structures is statistically much lower, making it difficult for the model to accurately predict the intermediate steps as most of the turbulent structures have begun to disappear in the next ground-truth image. Therefore, the prediction capability of the model is related to the characteristic time scales of the flow itself.

Nevertheless, it is worth noting that the above discussion applies to models trained on 1791 image sequences, with each sequence 15 frames long. Further increasing the training data size is anticipated to improve model performance, because this is a supervised model: the ground truth (experimental data) at 100 kHz was fed into the model for parameter optimization. With enough training data, the model is therefore expected to capture the spatio-temporal correlation among consecutive frames even when consecutive ground truths are only weakly correlated with each other.

Figure 4 presents the probability density functions of MSE, SSIM, Correlation and IoU calculated for 100 pairs of testing images under the three conditions. The SSIM values mostly range from 0.65 to 0.9, which indicates that the DL model has the potential to predict the signal appearance and structure of OH-PLIF from low-speed experimental data. Another obvious trend is that the statistics of SSIM and Correlation decrease with the number of temporal interpolations, indicating that increasing the prediction steps compromises the model accuracy. It is worth noting that MSE presents the opposite trend to SSIM and Correlation, as expected, because the former quantifies the difference while the latter two evaluate the degree of similarity.

Fig. 4
figure 4

Probability density function (PDF) curve for a MSE, b SSIM, c correlation and d IoU of test sequences

Figure 5a and b present the binarized OH-PLIF images from the experiments and the DL model, respectively. It can be seen that the proposed DL method is able to reconstruct the OH profile. In addition, the perimeter of the OH signal in each image was calculated to evaluate how well the profile of the signal was reproduced. To minimize the influence of noise in the images, morphological erosion and dilation operations [58] were applied to smooth the images, as shown in Fig. 5c and d; a “marching squares” method [59] was then used to compute the contours of the input 2D array at a particular signal intensity, with the matrix of pixel intensities linearly interpolated to provide better precision for the output contours. To obtain the connected areas for the perimeter calculation, we used the array-based union-find method proposed by Wu et al. [60], which is reported to be effective in removing noise. It is worth noting that the perimeter was calculated from the contours enclosed within the dashed boxes in Fig. 5e, f, i.e., only the vertical profiles within the dashed boxes were evaluated, to avoid interference from the ‘horizontal’ edges caused by the low excitation power at the edges of the laser sheet. Figure 5g shows the IoU of the experimental and DL-predicted binary images, presented here for comparison.
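The perimeter pipeline can be sketched with scikit-image as follows. The structuring-element size and minimum cluster area are illustrative assumptions, the union-find labelling of Wu et al. [60] is approximated by skimage's connected-component tools, and the restriction to the vertical profiles inside the dashed boxes is omitted for brevity.

```python
import numpy as np
from skimage import measure, morphology

def oh_perimeter(binary_img: np.ndarray, level: float = 0.5) -> float:
    """Erosion + dilation to suppress noise, small-object removal, then
    marching-squares contours [59] summed into a total perimeter."""
    img = binary_img.astype(bool)
    img = morphology.binary_erosion(img, morphology.square(3))
    img = morphology.binary_dilation(img, morphology.square(3))
    img = morphology.remove_small_objects(img, min_size=50)   # drop specks
    # find_contours implements marching squares with linear interpolation
    contours = measure.find_contours(img.astype(float), level)
    return float(sum(np.linalg.norm(np.diff(c, axis=0), axis=1).sum()
                     for c in contours))
```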

Fig. 5
figure 5

a Binarized experimental OH-PLIF image; b DL predicted OH-PLIF image; c transformed experimental image by erosion and dilation; d transformed DL prediction by erosion and dilation; e contours of the experimental image in c; f contours of the DL prediction image in d; g IoU of the experimental and the DL predicted binary images

Figure 6 presents the perimeters of the measured and predicted OH signals at various times under the different modeling conditions. The average perimeter errors between ground truth and prediction were also calculated for all the conditions studied: 2.05% for P50-100 kHz, 4.83% for P33.3-100 kHz and 5.42% for P20-100 kHz. Consistent with the results shown in Fig. 3, these results indicate that the model performed well for P50-100 kHz, while increasing the length of the interpolated sequence reduces the prediction accuracy. It is also worth noting that the model almost always underestimates the perimeter of the OH boundary. This is because the DL model was constructed with multiple 5 × 5 kernels (filters); after these filtering operations, some subtle features on the boundary of the OH cluster are lost, so the total perimeter of the DL prediction is shorter than the experimental one.

Fig. 6
figure 6

Perimeters of the measured and predicted OH-PLIF signals at various times under different conditions

The Conv-LSTM model was also used to generate 200 kHz OH-PLIF images from the 100 kHz experimental data, as shown in Fig. 7. These predictions are not validated by experiments, as no such measurements are available yet. However, the predicted flame structure appears reasonable, and the flame fluidity is also reflected to some extent. Specifically, the predicted images preserve most of the spatial structures of the experimental data, while the overall structure moves upwards by a reasonable extent at each time step. For example, the target area marked by the red circle moves up 452 pixels in total from 0 \(\mathrm{\mu s}\) to 45 \(\mathrm{\mu s}\), i.e. about 45 pixels every 5 \(\mathrm{\mu s}\).

Fig. 7
figure 7

Predictions of OH-PLIF images shown in dashed boxes for the case of P100–200 kHz

4.2 Time-averaged predictions of the DL model

While the indices described above were employed to quantify the similarity between prediction and measurement for instantaneous images, that of the time-averaged images is also worth evaluating. Figure 8 presents the time-averaged OH-PLIF images for the measured and DL-predicted results, with each sub-figure averaged from 100 instantaneous images. It can be seen that the signal profiles and intensity distributions of prediction and measurement are similar, but the degree of similarity slightly decreases as the number of interpolations increases. Specifically, the SSIM between ground truth and DL prediction is 0.937, 0.920 and 0.915 for the P50-100 kHz, P33.3-100 kHz and P20-100 kHz cases, respectively. This is, again, consistent with the results shown in Figs. 4 and 6: longer temporal interpolation deteriorates the performance of the DL model.

Fig. 8
figure 8

Time-averaged images of OH-PLIF, for the measured and DL predicted results, each sub-figure averaged from 100 instantaneous images

5 Conclusion

In this paper, we artificially accelerated the high-speed planar imaging of turbulent flames by building a deep learning-based computational imaging model. An end-to-end trainable model for accelerating OH-PLIF imaging was established by incorporating a Conv-LSTM model into an encoding-forecasting structure. It was found that the model has the capacity to generate 100 kHz OH-PLIF images from 50 kHz, 33.3 kHz and 20 kHz experimental data. The accuracy of the predictions was quantified with similarity indices: the average SSIM between ground truth and DL prediction was calculated to be 0.833, 0.804 and 0.732 for the 50-100 kHz, 33.3-100 kHz and 20-100 kHz cases, respectively. The accuracy of the model for longer interpolation sequences is expected to increase with the training data size. Furthermore, 200 kHz OH-PLIF imaging was also generated by the DL model from the 100 kHz experimental results, with reasonable spatial structure and fluidity.