Introduction

Blast furnace (BF) ironmaking is a typical process-manufacturing industry with complex mechanisms and high energy consumption (Pan et al. 2018). It is mainly composed of the BF body and its auxiliary supporting systems (Wang et al. 2015), as shown in Fig. 1. The feeding system delivers the raw materials, including iron ore and coke, to the top of the BF body. The coal injection system continuously injects pulverized coal into the BF body. The hot blast system constantly supplies the BF body with hot air above 1000 °C. The steel tapping system discharges the final iron and slag from the bottom of the BF body (Li et al. 2022). Iron ore and coke intermittently enter the top of the BF body, hot air is continuously blasted into its bottom, and molten iron and slag intermittently exit the bottom (Li et al. 2021). In addition, waste heat and gases are collected and treated by the top gas treatment system and the top gas recovery turbine (TRT) system. It is therefore crucial to monitor the condition of this complex BF ironmaking system effectively and in real time to ensure continuous and stable production.

Fig. 1

Blast furnace body and its auxiliary system. The BF body can be roughly divided into five zones: throat, shaft, belly, bosh, and hearth, and the auxiliary supporting systems mainly include the ore and coke feeding system, the top gas treatment system and the top gas recovery turbine (TRT) system, the pulverized coal injection system, the hot blast system, and the steel tapping system

The BF heat load refers to the heat carried away by the cooling equipment per unit area per unit time (Zhou et al. 2017). It is a critical index for monitoring the state of the BF body. The total heat load of the BF is accumulated from the heat loads of seven parts: the furnace bottom, the hearth, the belly, the bosh, and the lower, middle, and upper parts of the shaft. However, the traditional method of measuring the heat load from heat transfer and cooling-water flow has significant hysteresis (Zhou et al. 2018), making it difficult to grasp the changing trend of the heat load in time. Furthermore, in the actual ironmaking process, when the temperature and pressure in the BF body fluctuate violently, the heat load can change exponentially within a short time (Semenov et al. 2017), but current heat load monitoring methods cannot capture such rapid changes in time (Wang et al. 2018a, 2018b). In other words, measurements may be missing, inaccurate, or delayed. Once the heat load exceeds the controllable range, it causes BF erosion and may even lead to accidents (Kurunov et al. 2006). Therefore, it is essential to predict the heat load and take cooling measures in time to ensure the stability of the BF body.

In the existing literature, there is little research on heat load prediction for BF systems; more attention has been paid to smaller-scale boiler systems such as utility boilers and solar boilers (Xing et al. 2017). Traditional heat load prediction methods are mainly based on mechanism analysis. Taler et al. (2019) proposed a mathematical model of a supercritical power boiler for simulating rapid changes in boiler heat load. In addition, to address boiler tube failure caused by heat load differences, Xu et al. (2000) proposed a heat load monitoring model based on the thermal variables, thermal deviation theory, and flow deviation theory of power plants. These modeling and analysis methods have performed well in boiler analysis and optimization scenarios. However, they rely mainly on empirical knowledge and assumptions, and their performance is usually demonstrated by post-inspection, which makes it challenging to obtain the state of the BF body in advance.

As an alternative in recent years, data-driven prediction methods have been used to extract hidden features from complex industrial data, mainly employing machine learning and neural network models to predict the heat load. Dong et al. (2011) proposed a modified Takagi-Sugeno-type neuro-fuzzy system that realizes online identification of thermal processes. To further improve the accuracy of heat load prediction in time-varying and uncertain environments, Xie (2017) proposed a Back Propagation Neural Network (BPNN)-Markov prediction model. Scholars have also studied the simultaneous prediction of multiple indicators: Yukun et al. (2011) proposed a hybrid prediction model based on an artificial neural network (ANN) that predicts furnace temperature, NOx emissions, and other variables simultaneously. For the modeling and optimization of heat load prediction, Zhao and Wang (2009) proposed a hybrid model based on support vector regression (SVR) and an improved center particle swarm optimization (ICPSO) algorithm, which improves the accuracy of heat load prediction. Moreover, to enhance the stability and robustness of heat load prediction, Hu et al. (2017) and Li et al. (2013) proposed ANN-based heat load prediction models built on a generalized regression neural network (GRNN) and the fruit fly optimization algorithm (FOA).

In conclusion, data-driven methods represented by neural network models have been widely used for prediction, and optimization methods have been employed to improve the accuracy and robustness of the prediction models. However, most of them cannot accurately predict the heat load because of the complicated and changeable characteristics of the BF ironmaking process. The main reasons are as follows. First, many variables are related to the heat load, including both temporal and non-temporal series variables, which results in high dimensionality and redundancy in the multivariate data. Second, the state monitoring data contain diverse multi-granularity features (Liao et al. 2018; Lv et al. 2021; Zhao et al. 2020): coarse-grained features of different measurement objects, such as temperature, pressure, and load, as well as fine-grained features of the same measurement object at different positions, such as throat temperature, shaft temperature, belly temperature, and the temperatures of other parts. Shallow neural network models cannot effectively extract these features. Last but not least, most existing methods ignore the rapid changes of the heat load over short periods, resulting in insufficient accuracy when predicting rapidly changing heat loads.

Therefore, we propose a novel data-driven framework for BF heat load prediction that addresses the three challenges above, designing an attention-based one-dimensional convolutional neural network (1DCNN) combined with a bidirectional long short-term memory (BiLSTM) network. The heat load prediction framework is shown in Fig. 2 and consists of two main parts: two-stage data pre-processing and prediction model establishment. The main processes and contributions are as follows:

1. A two-stage data pre-processing model combining data cleaning with a mutual information-based feature selection model is designed to address the high dimensionality and redundancy of the multivariate BF ironmaking data, realizing dimension reduction and key variable selection.

2. We consider both coarse-grained and fine-grained features in the multivariate data to realize the output prediction. The 1DCNN model extracts the fine-grained features, and the BiLSTM extracts the coarse-grained features for the prediction output under the designed fusion training mechanism.

3. To extract the rapid change characteristics of the heat load caused by the violent oxidation–reduction reactions in the BF body, an attention mechanism is added to the 1DCNN model to extract local features of the prominent fluctuations.

Fig. 2

The framework of attention mechanism-based deep learning for heat load prediction

The rest of this article is organized as follows. Section "Related works" presents the related works, followed by the two-stage data pre-processing in section "Two-stage data pre-processing". In section "Prediction model", the composition and fusion strategy of the prediction model are introduced. The validity and accuracy of the prediction model are then evaluated in section "Industrial data-based experiments", and the conclusion and future work are presented in section "Conclusion".

Related works

1DCNN model

The convolutional neural network (CNN) is a successful deep learning architecture in artificial intelligence, first proposed by Lecun et al. (1998), with strong feature extraction ability and high robustness. It has been widely applied in pattern recognition (Wang et al. 2020), image processing (Zhang et al. 2020), and natural language processing (Wang et al. 2021). The 1DCNN model in particular is well suited to temporal series feature extraction (Eren et al. 2019). The difference between the two models is that the 1DCNN model convolves along one dimension only, while the two-dimensional convolutional neural network (2DCNN) model convolves along two dimensions simultaneously, as shown in Fig. 3. The brown box is the convolution kernel, the detector for feature extraction; the arrow indicates the direction in which the kernel moves, and a one-way convolution kernel extracts the sequence features. Because the movement is one-dimensional, the kernel moves in a single direction rather than back and forth across multiple dimensions. In the 1DCNN layer, the input data are therefore convolved and passed through the activation function before flowing to the next layer, as in (1):

$${h}_{t}={\sigma }_{cnn}({W}_{cnn}*{x}_{t}+{b}_{cnn})$$
(1)

where \({W}_{cnn}\) represents the weight coefficients of the filter, namely the convolution kernel; \({x}_{t}\) represents the input sample at time t; * represents the discrete convolution operation between \({x}_{t}\) and \({W}_{cnn}\); \({b}_{cnn}\) is a bias parameter learned during training; \({\sigma }_{cnn}(\cdot )\) represents the activation function; and \({h}_{t}\) represents the output of the convolution operation.

Fig. 3

Difference between 2DCNN and 1DCNN

The convolution operation establishes a mapping relationship between the layers. After convolution, however, the dimensionality of the data grows while the feature map changes little, and several successive convolutions generate a large number of parameters, which greatly increases the difficulty of network training and easily causes overfitting. Therefore, to reduce the computational effort, pooling layers are used to reduce the dimensionality of the convolutional features and lower the risk of overfitting. The calculation is given in (2):

$${h}_{t}=down({h}_{t-1})$$
(2)

where \({h}_{t-1}\) and \({h}_{t}\) represent the feature values before and after pooling, and down(·) represents the pooling function.

To obtain a larger receptive field, the 1DCNN model can use a wider convolution kernel to capture the features of the sequence more comprehensively, and feature extraction is then realized by stacking convolution and pooling layers.
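As a minimal illustration of Eqs. (1) and (2), the following PyTorch sketch stacks one convolution and one pooling layer; the channel sizes, kernel width, and ReLU activation are illustrative assumptions, not the configuration used in this paper.

```python
import torch
import torch.nn as nn

class Conv1DBlock(nn.Module):
    """One convolution + pooling stage of a 1DCNN feature extractor."""
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        # sigma_cnn(W_cnn * x_t + b_cnn), Eq. (1); ReLU stands in for sigma_cnn
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, padding="same")
        self.act = nn.ReLU()
        # h_t = down(h_{t-1}), Eq. (2); max pooling halves the temporal length
        self.pool = nn.MaxPool1d(kernel_size=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(self.act(self.conv(x)))

# Example: a batch of 8 sequences with 6 variables over 24 time steps
x = torch.randn(8, 6, 24)
h = Conv1DBlock(in_channels=6, out_channels=16)(x)  # -> shape (8, 16, 12)
```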

Attention mechanism

The attention mechanism was proposed on the basis of research into human vision and was introduced into computer vision and natural language processing to optimize existing models and focus limited resources on the most useful information (Li et al. 2017). Attention mechanisms are mainly divided into hard attention and soft attention (Zhong et al. 2018). The hard attention mechanism filters the region of interest as input; in image recognition, it focuses effectively on the target after removing meaningless background data. However, hard attention is not fully applicable to temporal series prediction because it directly truncates the input (Chen et al. 2020).

Comparatively speaking, the soft attention mechanism uses weights trained by the neural network to weigh the global input features across space or channels, achieving attention on specific spatial areas or channels (Hou et al. 2021). At the same time, this method is differentiable in the backward computation, so the attention network can be trained directly in an end-to-end manner.

In general, a key-value pair can be used to represent an input message, so that N input messages can be described as (K, V) = [(k1, v1), …, (ki, vi), …, (kN, vN)], where the keys are used to calculate the attention distribution \({\alpha }_{i}\) and the values are used to calculate the aggregated information. The attention mechanism can be thought of as a soft addressing operation: the input information is regarded as content stored in memory, whose elements consist of address keys and values. The weight of each value is first calculated from the similarity between its key and the query, and the weighted sum of the values then yields the final attention value, as in (3):

$$\begin{array}{ll}Att((K,V),q)& ={\sum }_{i=1}^{N} {\alpha }_{i}{v}_{i}\\ & ={\sum }_{i=1}^{N} \frac{\mathrm{exp}\left(s\left({k}_{i},q\right)\right)}{{\sum }_{j} \mathrm{exp}\left(s\left({k}_{j},q\right)\right)}{v}_{i}\end{array}$$
(3)

where \(s\left({k}_{i},q\right)\) represents the attention score calculated with the dot-product model, and \({\alpha }_{i}\) represents the normalization of the attention scores by the softmax function.
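A compact NumPy sketch of Eq. (3) under these definitions could look as follows; the dimensions and random inputs are illustrative.

```python
import numpy as np

def soft_attention(K: np.ndarray, V: np.ndarray, q: np.ndarray) -> np.ndarray:
    """K: (N, d) keys, V: (N, d_v) values, q: (d,) query."""
    scores = K @ q                         # s(k_i, q), the dot-product model
    alpha = np.exp(scores - scores.max())  # numerically stabilised softmax
    alpha /= alpha.sum()                   # attention distribution alpha_i
    return alpha @ V                       # weighted sum over the values

K = np.random.randn(5, 8)  # N = 5 key-value pairs with d = 8
V = np.random.randn(5, 4)
q = np.random.randn(8)
att = soft_attention(K, V, q)  # -> attention value of shape (4,)
```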

LSTM model

The long short-term memory (LSTM) model is a variant of the recurrent neural network (RNN), composed of input, hidden, and output layers (Tian et al. 2020). On top of the RNN model, the LSTM model adds three gate structures: the input gate, the forget gate, and the output gate. The cell state is the core of the LSTM model; it transmits relevant information along the temporal series and is updated through the three gate structures. The typical structure of the LSTM neural unit is shown in Fig. 4. The forget gate selectively forgets the input of the previous node, as in (4):

$${f}_{t}=\sigma ({W}_{f}\cdot [{h}_{t-1},{x}_{t}]+{b}_{f})$$
(4)

where \({h}_{t-1}\) represents the hidden layer information of the previous moment, \({x}_{t}\) represents the current input, \(\sigma (\cdot )\) is the sigmoid function, and \({W}_{f}\) and \({b}_{f}\) are training parameters. The input gate selectively remembers the current input and, together with the result of the forget gate, determines the cell state update, as in (5):

$$\left\{\begin{array}{l}{i}_{t}=\sigma \left({W}_{i}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{i}\right)\\ {\widetilde{C}}_{t}=tanh({W}_{c}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{c})\\ {C}_{t}={f}_{t}*{C}_{t-1}+{i}_{t}*{\widetilde{C}}_{t}\end{array}\right.$$
(5)

where \({C}_{t-1}\) represents the cell state at the last moment, \({\widetilde{C}}_{t}\) represents the new candidate value vector, \({C}_{t}\) represents the current cell state, \({W}_{i}\), \({b}_{i}\), \({W}_{c}\) and \({b}_{c}\) are training parameters.

Fig. 4

The basic unit of the LSTM

The output gate determines the output of the current state, as in (6):

$$\left\{\begin{array}{l}{o}_{t}=\sigma \left({W}_{o}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{o}\right)\\ {h}_{t}={o}_{t}*tanh({C}_{t})\end{array}\right.$$
(6)

where \({o}_{t}\) represents the result of the output gate, \({W}_{o}\) represents the weight, \({b}_{o}\) represents the bias, \(tanh(\cdot )\) is the activation function, and * represents the element-wise product.
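For concreteness, one LSTM step implementing Eqs. (4)–(6) can be written directly in NumPy; the weight shapes assume the concatenated input [h_{t-1}, x_t], and the sizes below are illustrative.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    z = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)        # forget gate, Eq. (4)
    i_t = sigmoid(W_i @ z + b_i)        # input gate, Eq. (5)
    C_tilde = np.tanh(W_c @ z + b_c)    # candidate values, Eq. (5)
    C_t = f_t * C_prev + i_t * C_tilde  # cell state update, Eq. (5)
    o_t = sigmoid(W_o @ z + b_o)        # output gate, Eq. (6)
    h_t = o_t * np.tanh(C_t)            # hidden output, Eq. (6)
    return h_t, C_t

d, h = 4, 8  # input and hidden sizes
rng = np.random.default_rng(0)
W = {g: rng.standard_normal((h, h + d)) for g in "fico"}
b = {g: np.zeros(h) for g in "fico"}
h_t, C_t = lstm_step(rng.standard_normal(d), np.zeros(h), np.zeros(h),
                     W["f"], b["f"], W["i"], b["i"],
                     W["c"], b["c"], W["o"], b["o"])
```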

Two-stage data pre-processing

In this section, data pre-processing is carried out to address the problems in the real data set. The first stage is data cleaning; in the second stage, a feature selection model based on mutual information is designed to select the key variables and reduce the feature dimension.

Data cleaning

Some of the collected data have missing values due to abnormal collection equipment and the high ambient temperature of the BF ironmaking process. The missing values are filled with the next collected values (Deng & Wang 2017). In addition, considering the dimensional differences between variables such as temperature, feed rate, and coal ratio, the collected BF ironmaking state variables need to be standardized (Wang et al. 2019). The max-min normalization method is therefore used, as in (7):

$${{x}_{i}}^{*}=\frac{{x}_{i}-{x}_{min}}{{x}_{max}-{x}_{min}}$$
(7)

where \({x}_{i}\) and \({{x}_{i}}^{*}\) are the variable values before and after normalization, and \({x}_{max}\) and \({x}_{min}\) are the maximum and minimum values before normalization, respectively. The original data are thus linearly scaled to the interval [0, 1].
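Assuming the collected state variables sit in a pandas DataFrame, the two pre-processing steps might be sketched as follows; the backward fill stands in for the cited next-value filling method.

```python
import pandas as pd

def clean_and_normalize(df: pd.DataFrame) -> pd.DataFrame:
    # Fill each missing value with the next collected value of that variable,
    # in the spirit of the next-value filling method cited above.
    df = df.bfill()
    # Max-min normalization of Eq. (7): scale every column to [0, 1].
    return (df - df.min()) / (df.max() - df.min())
```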

Feature selection

BF ironmaking is a complicated industrial process with intense changes of material and energy. The collected BF ironmaking state data contain different variables, including temporal and non-temporal series variables such as temperature, feed rate, coal ratio, and coke ratio. In addition, there are significant linear and nonlinear relationships between some variables, and some variables are measured at multiple locations, resulting in a certain degree of redundancy (Lv et al. 2020), so it is difficult to accurately identify the variables related to the heat load. Moreover, taking all variables as inputs increases the complexity of model training and slows the response of the prediction model, which conflicts with the requirement of real-time monitoring of the BF body's state. Therefore, a feature selection method (Xu et al. 2020) is needed to identify the key variables affecting the heat load and to take these critical variables as the input of the prediction model.

Mutual information is an index that measures the degree of interdependence between two variables (Qin et al. 2020) and can quantify the relationships between the BF ironmaking process variables and the heat load. The mutual information of continuous random variables, given in (8), is used to analyze the nonlinear relationship between each variable and the heat load.

$${I}_{c}\left({X}_{i};Y\right)={\int }_{Y}{\int }_{{X}_{i}}p({x}_{i},y)\mathit{log}\left(\frac{p({x}_{i},y)}{p({x}_{i})p(y)}\right)d{x}_{i}dy$$
(8)

where p(\({x}_{i}\), y) represents the joint probability density function of the current variable \({X}_{i}\) and the heat load Y, while p(\({x}_{i}\)) and p(y) represent the marginal probability density functions of \({X}_{i}\) and Y, respectively. The nonlinear relationship between each variable and the heat load is obtained through mutual information (Xu et al. 2022), and the variables are then ranked in descending order of their relevance to the heat load.
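A sketch of this selection step, assuming scikit-learn's mutual information estimator for continuous targets; note that this estimator returns unnormalized values in nats, so carrying over the 0.3 cut-off used later in the text is an assumption.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def select_features(X: np.ndarray, y: np.ndarray, names, threshold: float = 0.3):
    """X: (n_samples, n_features) process variables, y: (n_samples,) heat load."""
    mi = mutual_info_regression(X, y)  # estimates I_c(X_i; Y), Eq. (8)
    order = np.argsort(mi)[::-1]       # rank variables in descending order
    return [(names[i], float(mi[i])) for i in order if mi[i] >= threshold]
```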

Prediction model

In this section, a heat load prediction model combining an attention-based 1DCNN with a BiLSTM network is designed. An attention branch is added to the standard 1DCNN network to extract fine-grained features of local rapid changes in the heat load. Coarse-grained features of the latent temporal sequence are extracted by the BiLSTM, and multi-model fusion prediction is realized through the fully connected layer.

Attention-1DCNN-based feature extraction

In the BF ironmaking process, the heat load changes sharply. Long temporal series often contain more information because of the multi-granularity features in their short sub-series, which influence the changing trend of the actual heat load. Therefore, the soft attention mechanism is employed in the 1DCNN model to weigh all input features one by one and focus on specific spaces and channels, extracting significant fine-grained features of the temporal series. The structure of the 1DCNN unit based on the attention mechanism is shown in Fig. 5; it mainly includes three parts: temporal feature extraction, the parallel module, and feature fusion.

Fig. 5

The structure of attention mechanism-based 1DCNN model

Temporal feature extraction Different segmentation scales are designed for the 1DCNN model on the original temporal series data, and an attention mechanism is designed to extract the multi-granularity features of the temporal series. Each 1DCNN module sets its input subsequence to s; accordingly, the input subsequence of the attention mechanism module is set to sa. By setting sa \(<\) s, the receptive field of the attention mechanism is made to focus on the local area, which grasps rapid changes in the heat load better than the receptive field of the 1DCNN, so salient features with sharp transitions can be captured accurately. The features extracted by the 1DCNN model and the attention mechanism are calculated as in (9) and (10).

$$e=\sigma ({W}_{attention}\cdot sa+{B}_{attention})$$
(9)
$$x=\sigma ({W}_{1dcnn}\cdot s+{B}_{1dcnn})$$
(10)

where \({W}_{attention}\) and \({B}_{attention}\) denote the weight and bias of the attention mechanism, and \({W}_{1dcnn}\) and \({B}_{1dcnn}\) denote the weight and bias of the 1DCNN model.

Parallel module The parallel module adds a parallel soft attention branch to the standard 1DCNN to extract trend features of significant changes in the heat load. The attention branch is designed with a smaller input scale than the 1DCNN to narrow its perceptual field, giving it complete access to the temporal contextual information and letting it learn the importance of local sequence features. The attention module enhances the impact of critical temporal features on the final model and reduces the influence of unimportant ones, compensating for the model's inability to distinguish the varying importance of temporal features. More abundant short-sequence features are therefore extracted through multi-scale information, which prevents the input-scale bias of a single-scale model and increases the matching features available to the prediction model.

Feature fusion The output features of the 1DCNN module and the significance features output by the corresponding attention mechanism module are fused by multiplying the feature elements one by one, following the existing literature (Hyndman & Koehler 2006; Wang et al. 2017). Precisely, the relative importance of the features extracted by the two branches is measured by the parameter λ: if the output features of the 1DCNN are more critical, λ is closer to 0; conversely, the less essential the output features of the 1DCNN are, the closer λ is to 1. By default, the features extracted by the two branches of the parallel module are equally important, i.e., λ = 0.5. The parameter λ thus reflects the importance of the features so that essential features can be identified accurately. The feature fusion of the parallel module is shown in (11).

$$\widetilde{x}=\uplambda e\cdot (1-\uplambda )x$$
(11)

where e and x represent the attention feature and the convolution feature in the parallel module, respectively.
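One possible PyTorch reading of the parallel module and the fusion of Eq. (11) is sketched below; the single-layer branches, the sigmoid on the attention branch, and the interpolation used to align the two branch lengths are our assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelAttention1DCNN(nn.Module):
    def __init__(self, channels: int, sa: int = 6, lam: float = 0.5):
        super().__init__()
        self.sa, self.lam = sa, lam
        # x = sigma(W_1dcnn . s + B_1dcnn), Eq. (10): sees the full window s
        self.conv_branch = nn.Sequential(
            nn.Conv1d(channels, channels, 3, padding="same"), nn.ReLU())
        # e = sigma(W_attention . sa + B_attention), Eq. (9): sees only sa < s steps
        self.att_branch = nn.Sequential(
            nn.Conv1d(channels, channels, 3, padding="same"), nn.Sigmoid())

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        x = self.conv_branch(seq)
        e = self.att_branch(seq[..., -self.sa:])    # local receptive field
        e = F.interpolate(e, size=x.shape[-1])      # stretch e to the length of x
        return self.lam * e * (1.0 - self.lam) * x  # element-wise fusion, Eq. (11)

feat = ParallelAttention1DCNN(channels=6)(torch.randn(8, 6, 24))  # -> (8, 6, 24)
```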

BiLSTM-based prediction output

After attention-based feature extraction, features associated with changes in the heat load trend are obtained, and this feature information is used as the input of the subsequent prediction model. The BiLSTM predicts the output based on the entire temporal series and comprises LSTM networks in the forward and backward directions (Ribeiro et al. 2019). The training sequence is fed into two different LSTM networks, both connected to one output layer, which provides complete past and future context for each point in the input sequence. The network weights are updated during training through forward propagation and backpropagation of the output neurons (Wang et al. 2018a, 2018b; Zhang et al. 2018). The structure of the BiLSTM neural unit is shown in Fig. 6.

Fig. 6

Structure of BiLSTM unit

In the structure of the bidirectional recurrent neural network, the forward and reverse hidden states at temporal step t are denoted \({h}_{t}\) and \({{h}^{^{\prime}}}_{t}\), respectively, and are calculated as in (12):

$$ \left\{ {\begin{array}{*{20}c} {h_t = \sigma \left( {W \cdot x_t + W \cdot h_{t - 1} + b_h } \right)} \\ {h^{\prime}_t = \sigma \left( {W^{\prime} \cdot x_t + W^{\prime} \cdot h_{t + 1} + b^{\prime}_h } \right)} \\ \end{array} } \right. $$
(12)

where \({x}_{t}\) and \({h}_{t-1}\) represent the state input and hidden layer output for the forward hidden state, \({x}_{t}\) and \({h}_{t+1}\) represent the current state input and hidden layer output for the reverse hidden state, and \({b}_{h}\) and \(b^{\prime}_h\) represent the biases at the two levels. The hidden states \({h}_{t}\) and \(h^{\prime}_t\) of the two directions are then concatenated to obtain the whole hidden state \({H}_{t}\), which is fed to the output layer. The output layer calculates the output \({y}_{t}\) as in (13):

$${y}_{t}=tanh({H}_{t}\cdot W+b)$$
(13)

where \(tanh(\cdot )\) is the activation function, \(W\) and b are the model parameters of the output layer.
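A minimal PyTorch sketch of this BiLSTM prediction head, following Eqs. (12) and (13); the hidden size and the use of the final time step for the output are illustrative choices.

```python
import torch
import torch.nn as nn

class BiLSTMHead(nn.Module):
    def __init__(self, in_features: int, hidden: int = 32):
        super().__init__()
        # bidirectional=True runs the forward and reverse LSTMs of Eq. (12)
        self.bilstm = nn.LSTM(in_features, hidden, batch_first=True,
                              bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)  # y_t = tanh(H_t . W + b), Eq. (13)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        H, _ = self.bilstm(x)  # H_t concatenates h_t and h'_t at every step
        return torch.tanh(self.out(H[:, -1, :]))  # predict from the final step

y = BiLSTMHead(in_features=6)(torch.randn(8, 24, 6))  # -> (8, 1)
```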

Algorithm 1 The proposed attention-based deep learning method for heat load prediction

Joint training mechanism

The fused output of the 1DCNN model and the attention mechanism is taken as the input of the nodes in the BiLSTM unit. The BiLSTM performs serial modeling to realize the fused prediction of features on the long-period and short-period sequences, yielding the final heat load prediction. This fusion strategy accounts for both the long-period coarse-grained features of the temporal series and the sharp changes in the short-period fine-grained features. Algorithm 1 gives the details of the proposed attention-based deep learning method. First, the raw data set is normalized to obtain a standard data format for model training and testing. Mutual information-based feature selection follows: the mutual information between each variable and the heat load is computed to measure their correlation, and variables with correlation values greater than 0.3 are retained. These two steps yield the key parameters that affect the heat load, after which the attention mechanism-based heat load prediction model is initialized. Feature extraction and prediction are then realized through the attention mechanism-based 1DCNN and BiLSTM models. Finally, backpropagation is used to compute the parameter gradients, and the Adam algorithm is used to update the network parameters.
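The following condensed PyTorch sketch assembles these pieces into one trainable model; the layer sizes, λ = 0.5, and the dummy tensors are illustrative assumptions, and the attention branch is simplified to the same input length as the convolution branch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class HeatLoadModel(nn.Module):
    """Attention-1DCNN feature extraction followed by a BiLSTM predictor."""
    def __init__(self, channels: int = 6, hidden: int = 32):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv1d(channels, 16, 3, padding="same"), nn.ReLU())
        self.att = nn.Sequential(nn.Conv1d(channels, 16, 3, padding="same"), nn.Sigmoid())
        self.bilstm = nn.LSTM(16, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        feat = 0.5 * self.att(seq) * 0.5 * self.conv(seq)  # Eq. (11), lambda = 0.5
        H, _ = self.bilstm(feat.transpose(1, 2))           # (B, T, C) for the BiLSTM
        return torch.tanh(self.out(H[:, -1, :]))

# Dummy data standing in for the cleaned, feature-selected windows:
windows = torch.randn(64, 6, 24)  # 64 samples, 6 key variables, s = 24 hours
targets = torch.rand(64, 1)       # normalized heat load at the next hour

model = HeatLoadModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(windows), targets)  # forward pass
    loss.backward()                          # backpropagation computes the gradients
    optimizer.step()                         # Adam updates the network parameters
```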

Industrial data-based experiments

In this section, the actual industrial data and the comparative ablation experiments are introduced first, and the feature selection experiment is then analyzed. Finally, the validity of the proposed model is examined by comparing the prediction results of different models. The comparison methods in these experiments are the multi-layer perceptron (MLP), LSTM, and 1DCNN-BiLSTM models, which are among the more practical and novel models in the existing literature.

Data description and experiment setting

The proposed model has been validated with experiments on the No. 4 BF of a large iron and steel plant in Shanghai. The BF has a production capacity of 4 million tons per year, equivalent to a pig iron output of roughly 10,000 tons per day, and the largest inner volume in operation in China. Figure 7 is a schematic diagram of this real BF ironmaking system and its heat load data collection system (Li et al. 2021). We collected 26 features from the No. 4 BF; after sifting out vacant or invalid data, we retained the 14 features shown in Table 1. The sampling interval is one hour, and the data cover six months of production. Table 1 lists the directly measured multi-granularity features, including fine-grained features of the same temperature type distributed over different parts and coarse-grained features of different measurement objects. Based on the actual ironmaking process data, the mutual information-based feature selection model is applied, and the statistical results of the prediction models under each evaluation metric are analyzed. Finally, the predicted and actual heat loads over the next three days are compared.

Fig. 7

Schematic diagram of BF and heat load data collection system

Table 1 Input and output variables used for the prediction model

The accuracy of the prediction model is evaluated by comparing the predicted heat load with the actual data. We use the macro averaging criterion to assess the proposed prediction model with four metrics: mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE), and maximum absolute error (MAX Error), defined as follows:

$$MAE(y,\widehat{y})=\frac{1}{n}\sum_{i=1}^{n}|{y}_{i}-{\widehat{y}}_{i}|$$
(14)
$$RMSE(y,\widehat{y})=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{({y}_{i}-{\widehat{y}}_{i})}^{2}}$$
(15)
$$MAPE(y,\widehat{y})=\frac{100\%}{n}\sum_{i=1}^{n}|\frac{{y}_{i}-{\widehat{y}}_{i}}{{y}_{i}}|$$
(16)
$$MAX Error(y,\widehat{y})=MAX(\left|{y}_{i}- {\widehat{y}}_{i}\right|)$$
(17)

where \({y}_{i}\) represents the actual data and \({\widehat{y}}_{i}\) represents the predicted heat load.
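These four metrics transcribe directly into NumPy:

```python
import numpy as np

def metrics(y: np.ndarray, y_hat: np.ndarray) -> dict:
    err = y - y_hat
    return {
        "MAE": np.mean(np.abs(err)),               # Eq. (14)
        "RMSE": np.sqrt(np.mean(err ** 2)),        # Eq. (15)
        "MAPE": 100.0 * np.mean(np.abs(err / y)),  # Eq. (16)
        "MAX_Error": np.max(np.abs(err)),          # Eq. (17)
    }
```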

Ablation comparative experiments

The attention-based 1DCNN-BiLSTM proposed in this paper was compared with current mainstream deep learning and temporal-series models: the MLP, LSTM, and 1DCNN-BiLSTM models. Tenfold cross-validation of the proposed attention-based 1DCNN-BiLSTM model was performed, and the configuration with the best loss function evaluation was selected as the final set of hyperparameters, shown in Table 2. We use the parameter a to set the ratio between the 1DCNN and attention sequence lengths and designed three alternative values: 0.25, 0.5, and 0.75. The parameter tests revealed that a = 0.25 best identifies sharp heat load changes; thus, the sequence length for 1DCNN feature extraction is CNN_s = 24 and the sequence length for the attention branch is Attention_sa = 6. In the BF ironmaking process, timely and effective acquisition of the heat load trend is critical for taking preventive measures in advance and ensuring the safety of the BF body, so fast and stable convergence of the prediction model is vital for deploying the algorithm. Therefore, the convergence speed of the proposed model is evaluated and compared with the MLP, LSTM, and 1DCNN-BiLSTM models; the loss functions on the validation set are compared in Fig. 8. The figure shows that the MLP model, which lacks temporal-series processing ability, converges slowly during training, and the LSTM and 1DCNN-BiLSTM temporal-series models take more time to reach stable convergence on the validation set. The proposed attention-based 1DCNN-BiLSTM algorithm converges faster than the other prediction algorithms, which meets the requirements of real-time online prediction.
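As a sketch, hourly windows of length CNN_s = 24 might be built from the selected variables as follows; the array layout and the choice of column 0 as the heat load are assumptions for illustration.

```python
import numpy as np

def make_windows(series: np.ndarray, s: int = 24):
    """series: (T, n_features) -> X: (N, n_features, s), y: (N,)."""
    X, y = [], []
    for t in range(len(series) - s):
        X.append(series[t:t + s].T)  # one s-hour window, channels first
        y.append(series[t + s, 0])   # heat load at the next hour (column 0 assumed)
    return np.stack(X), np.array(y)

X, y = make_windows(np.random.rand(200, 6))  # -> X: (176, 6, 24), y: (176,)
```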

Table 2 Hyperparameter settings of the Attention-1DCNN-BiLSTM model
Fig. 8

The comparison of loss functions in the validation set

Feature selection results

In this section, the feature selection method is applied by referring to the technique in the existing literature (Xu et al. 2020), with mutual information selecting the critical influencing factors. These key factors are then used as inputs to the prediction model, providing reliable data for grasping the changing trend of the heat load in a timely and accurate manner and improving the response time of the prediction model.

By calculating the nonlinear correlation between variables through mutual information, the heat map of the relationships is obtained, as shown in Fig. 9. Following the description by Xu et al. (2020), if the correlation coefficient between two variables is less than 0.3, the correlation between them is weak and they are usually considered irrelevant. Therefore, we finally select six key variables from the 14 input variables: 3-pulverized coal injection, 4-blast capacity, 5-oxygen consumption, 6-humidity, 12-Fe content, and 13-FeO content.

Fig. 9

The heat map of mutual information-based relationships

Metrics evaluation of different algorithms

To reduce the contingency of a single prediction experiment, each model was run 30 times, and the recorded results were drawn as box plots, shown in Fig. 10. The plot shows that the MLP model performs worst on every evaluation indicator and that the fluctuation of its predictive performance is the most prominent. Compared with the MLP model, the errors of the LSTM model decreased on all indicators, indicating that a model with time-series feature processing capability is better suited to characterizing the trend of the heat load. Further extending the LSTM model with the 1DCNN model lowered the mean value of each metric, but the magnitude of the abnormal fluctuations of each metric was greater than that of the LSTM model.

Fig. 10

The performance of the four prediction algorithms in different evaluation metrics

Further analysis of the box plot values for each metric is shown in Table 3. Although the mean values of the metrics decreased, the maximum and minimum values of the prediction results did not decrease but instead increased, indicating that the improved 1DCNN-BiLSTM model cannot effectively extract the characteristics of rapid changes in the heat load. Finally, the attention mechanism is added to the 1DCNN-BiLSTM model to extract the short-term fluctuation characteristics of the heat load in a targeted manner. The results on each metric show that this improvement is effective, significantly reducing the errors and considerably decreasing their fluctuation.

Table 3 Statistical results of the MLP, LSTM, 1DCNN-BiLSTM, and Attention-1DCNN-BiLSTM prediction models

Comparison of the heat load prediction results of different algorithms

With the selected key variables as inputs and the heat load over the next three days as output, the prediction results of the different models are shown in Fig. 11. As shown in Fig. 11a, the prediction results of the MLP model were relatively accurate in the first 20 h, but the subsequent prediction errors became larger, especially from the highest point at the 20th hour to the lowest point at the 40th hour. This indicates that the MLP model has some advantages in short-term time series prediction but insufficient ability to extend the prediction over a more extended period. In Fig. 11b, the LSTM model outperforms the MLP model in the same situation thanks to its time-series feature extraction and memory function. Although its output follows the actual heat load variation trend, the highest and lowest points of the heat load are not accurately predicted. Analysis of the collected real data shows that the data related to the heat load have multi-granularity features, i.e., both coarse- and fine-grained features. Therefore, a 1DCNN feature extraction module is added to the LSTM model to enhance the feature extraction ability, and the LSTM is changed to a BiLSTM model to enhance the memory of temporal features. However, the improved 1DCNN-BiLSTM model does not significantly improve the heat load trend prediction compared with the original LSTM model, as shown in Fig. 11c; the maximum and minimum values of the heat load are still not accurately predicted. The reason may be that adding new functional modules also increases the complexity of the model, and the model's ability to sense extreme cases is insufficient, resulting in poor prediction. We therefore further add an attention mechanism to the improved 1DCNN-BiLSTM model and simplify the model by reducing the feature extraction and temporal prediction layers, so that the model can focus on learning heat load trends in extreme cases. As expected, these improvements led to a significant increase in prediction accuracy, with high precision in both the prediction of heat load extremes and the prediction of change trends, as shown in Fig. 11d.

Fig. 11

The comparison between the predicted values and the real values. a MLP; b LSTM; c 1DCNN-BiLSTM; and d Attention-1DCNN-BiLSTM

Statistical analysis of the heat load prediction results of different algorithms

Further, we perform a statistical analysis of the prediction results of the different models. The scatter plots corresponding to the four prediction models are shown in Fig. 12, where the horizontal and vertical coordinates represent the actual and predicted heat load. The red and blue dividing lines mark prediction errors of +0.03% and −0.03%, respectively. When a point falls between the two lines, the prediction is close to the actual heat load, and the error is considered acceptable in engineering applications. Figure 12a shows that only 47.89% of the points of the MLP model fall within the acceptable error range, while 49.3% of the points have errors greater than +0.03% and 2.82% have errors less than −0.03%. From Fig. 12b, more than 87.32% of the heat load prediction errors of the LSTM model lie within the interval [−0.03%, +0.03%], a great improvement over the MLP model. As shown in Fig. 12c, 78.87% of the predictions of the 1DCNN-BiLSTM model are within the acceptable error range, significantly lower than the LSTM model, showing that the high model complexity causes a decrease in prediction accuracy. Finally, from Fig. 12d, the attention-based 1DCNN-BiLSTM model, obtained by simplifying the 1DCNN-BiLSTM model and adding an attention mechanism, keeps 94.37% of its prediction errors within the acceptable range, a higher accuracy than the MLP, LSTM, and other advanced models. Both our proposed model and the traditional LSTM model achieve good prediction accuracy, and although the LSTM model is less complex, our proposed model is better in convergence speed and accuracy. This is because the complexity of the original data is significantly reduced through dimensionality reduction and attention-based 1DCNN feature extraction, after which the BiLSTM model can better learn the intrinsic behavior of the parameters, resulting in better performance than the LSTM model alone.

Fig. 12

Scatter plots of heat load with different models a MLP; b LSTM; c 1DCNN-BiLSTM; and d Attention-1DCNN-BiLSTM

The SOTA comparison tests

To verify the superiority of the proposed method, we further carried out state-of-the-art (SOTA) comparison tests against two typical SOTA models: the temporal convolutional network (TCN) (Kok et al. 2020) and the Transformer model (Acheampong et al. 2021). The predicted output of each model is compared with the true value to obtain the error at each moment; the errors within each 24-h cycle are counted over four consecutive days of heat load data, as shown in Fig. 13. The comparison shows that the TCN and Transformer models do not perform well. This may be because TCN models are generally unidirectional in structure and fail to accurately identify sudden heat load changes with backward and forward correlations, while the Transformer model is weaker than RNN and CNN models at acquiring local information and its top-layer gradients tend to vanish. Therefore, the attention mechanism network designed in this paper for the rapid-change characteristics of the heat load achieves better prediction results.

Fig. 13

Error frequency histogram of heat load with different models a the first cycle; b the second cycle; c the third cycle; and d the fourth cycle

Conclusion

This research proposes an attention mechanism-based deep learning method for heat load prediction in the BF ironmaking process. Its main contribution is a systematic data-driven framework for the BF heat load prediction problem, comprising two parts: two-stage data pre-processing and an attention mechanism-based prediction model. In the second part in particular, we focus on the lag of traditional methods and the unpredictability of trends when the heat load fluctuates sharply; an attention mechanism is added to ensure real-time heat load prediction and the identification of abnormal fluctuations. The experimental results of the different prediction models show that the proposed model accurately grasps the rapid change trend of the heat load, reaching an accuracy of 94.37%, higher than that of the comparison models. However, this research still has some deficiencies. The feature selection section provides a nonlinear, mutual information-based feature selection method whose threshold is determined mainly from empirical values; next, we will consider adaptive threshold determination methods and the correlational coupling effects between variables. In addition, the impact of model complexity on prediction performance is evident from the experiments; in future work, we will investigate other models and study simplification methods for complex models so that simple models can achieve high prediction accuracy.