1 Introduction

Dams are indispensable infrastructure components used in flood control, power generation, irrigation and shipping. With improvements in engineering technology, the scale and maximum height of concrete dams have increased rapidly in recent years. The Jinping I arch dam is the highest concrete dam built in the world, with a maximum dam height of 305 m. However, dams bear complex static and dynamic loads and long-term environmental erosion during operation, and the actual load may exceed the design scenario; for example, the increase in thermal load caused by global climate change in recent years is likely to create safety hazards [1]. To ensure dam safety, structural health monitoring plays an important role in dam construction and operation management [2]. In this regard, benefitting from the rapid development of analysis theory and computer technology, dam safety management has gradually transformed into the ‘digital dam’ and is now developing towards the ‘smart dam’, which is characterised by the extensive intelligent real-time analysis of massive monitoring data and the full integration of in situ monitoring, numerical simulations and intelligent control [3].

Among all measured quantities, displacement is the most intuitive reflection of the structural state of concrete dams. The most frequently used mathematical monitoring model is the hydraulic, seasonal and time (HST) model, which represents the thermal deformation effect of concrete dams by the seasonal harmonic factor [4, 5]. To adapt to the development of technology and the emergence of new engineering problems, the HST model has been optimised in two aspects. First, new causal factors have been established to explain the abnormal deformation behaviours of some concrete dams. Hu et al. [6] added a crack opening component to the HST model, and the new component was used to quantitatively evaluate the influence of radial penetrating cracks on the Chencun arch dam. To interpret the measured hysteretic hydraulic deformation behaviour, Wang et al. [7, 8] established a hydraulic, hysteretic, seasonal and time (HHST) model and a hydraulic, exponential, seasonal and time (HEST) model for the Jinping I arch dam, by which the viscoelastic parameters of dam concrete were back-analysed. Second, with the increase in dam height and the application of thermal insulation measures in severely cold areas, the time lag effects of reservoir water depth and operation mode on the temperature field of concrete dams become more complex, which makes it difficult for the seasonal harmonic factor in the HST model to accurately reflect the actual thermal deformation effect of high concrete dams. Therefore, using the measured dam temperature to establish the temperature deformation factor is an effective way to improve the interpretation and prediction accuracy of the displacement monitoring model of concrete dams [9]. However, there are hundreds or even thousands of thermometers embedded in each high concrete dam; thus, it is unrealistic and unreasonable to use all of them as modelling factors.
How to extract effective information from the massive measured temperature data remains a key problem in current research. To achieve this goal, Kang et al. [10] took the nonequidistant piecewise average values of the previous air temperature at the dam site as temperature deformation factors, and the length of the previous air temperature period used was determined according to the dam type and dam body thickness. Tatin et al. [11] established the HST-Grad and HST-Layer models using the measured dam temperature in each elevation layer. Mata et al. [12] and Prakash et al. [13] used the main principal components as temperature deformation factors; in the former study, these were extracted from the measured temperature time series of multiple monitoring points. Based on the shape similarity of temperature time series, Wang et al. [14] proposed a spatial clustering method for the measured dam temperature field, and principal components were then extracted from the temperature time series of all monitoring points in the same cluster. Belmokre et al. [15] used the concrete temperatures of four points as thermal inputs of the random forest-based displacement monitoring model of arch dams. Two temperature points are located on the upstream and downstream faces, and the other two are 2 m inwards of each face. These concrete temperatures can first be calculated through a deterministic thermal model proposed by the same team [16].

Based on optimised causal factors, it is important to improve the prediction and interpretation ability of mathematical monitoring models using different modelling methods. The traditional multilinear regression (MLR) method, including multiple regression and stepwise regression, regards dam response as a linear explicit function of causal factors. In practice, causal mechanisms are nonlinear and dynamic. To solve this problem, artificial intelligence algorithm-integrated machine learning models, such as artificial neural network (ANN), support vector machine (SVM), extreme learning machine (ELM), regression tree (RT), random forest (RF), and long short-term memory (LSTM), have strong nonlinear data mining ability and have been widely used in structural health monitoring [15, 17–21]. With an optimal kernel function, the prediction performance of machine learning models is generally better than that of the MLR model. However, the overfitting problem of the former should be handled carefully, and the Bayesian regularisation method and the deformation spatial association-coupled double objective optimisation method have been proven to be effective for alleviating overfitting [22]. Another disadvantage of machine learning models is that they are usually considered black box models, in which the nonlinear implicit relationship between model input and output is based on network structures. As a result, the application of machine learning models mainly focuses on prediction, while the actual demands for the causal interpretation ability of mathematical models in the field of dam safety monitoring are ignored [23].

The volume of a high arch dam is very large, and the lag mechanisms by which air temperature and reservoir water temperature influence the dam temperature field are extremely complex [16]. This paper aims to select a very small number of the most representative members from hundreds or even thousands of dam body temperature monitoring points and to use their measured temperatures directly as temperature deformation factors. To this end, the importance of modelling factors is evaluated by the SVM, and automatic and manual criteria are formulated to eliminate the large number of unimportant or similar temperature monitoring points. Then, a hydraulic, exponential, thermal and time (HETT) model is established for the Jinping I arch dam, and the causal mechanisms of the hydraulic and thermal effects on dam displacement are quantitatively interpreted by a partial dependence plot (PDP).

2 Measured dam temperature-based displacement monitoring model

For the Jinping I arch dam, this paper establishes an HETT model for dam displacement based on the SVM and the measured temperatures of the dam body. The modelling process is shown in Fig. 1. The main approach is to select the most representative dam body temperature monitoring points through an importance evaluation of SVM input factors; the selected dam temperature time series are then denoised and used as temperature deformation factors in the dam displacement monitoring model. For comparison with the seasonal harmonic temperature factor-based model, the prediction accuracy of the proposed HETT model is evaluated by the mean square error (MSE), maximum error (ME) and correlation coefficient (\(R^{2}\)), and its causal interpretation ability is evaluated by the hydraulic and temperature components separated by the PDP.

Fig. 1 Modelling process of the measured temperature-based HETT model for the Jinping I arch dam

2.1 The HETT model

For the Jinping I arch dam discussed in this paper, previous research results indicate that the measured dam deformation behaviour shows an obvious viscoelastic hysteretic effect, and a hysteretic hydraulic component should be added to the HST model [7]. Therefore, the displacement of this dam needs to be interpreted by four causal components. (1) An instantaneous elastic hydraulic component caused by water pressure, \(\delta_{He}\). (2) A viscoelastic creep-induced hysteretic hydraulic component, \(\delta_{Hv}\), which characterises the abnormal phenomenon that the measured radial displacement of the dam body continues to increase towards the downstream direction. This phenomenon mainly appears during the period in which the upstream reservoir water level is maintained at the elevation of 1880 m, approximately 100–170 days every year, and it can be represented by a step-type exponential function [8]. (3) A temperature component, \(\delta_{T}\), for which the measured temperature of the dam body is used to establish the temperature deformation factor; it is then compared with the traditional harmonic temperature factor. (4) An irreversible time effect component, \(\delta_{\theta }\), which mainly accumulates from creep, plastic deformation, material deterioration, bank slope extrusion and other factors of dam concrete and foundation rock mass. Based on the traditional HST model, the HETT model can then be established as follows:

$$\delta = \sum\limits_{i = 1}^{4} {a_{i} H^{i} } + \sum\limits_{j = 1}^{N} {b_{j} T_{j} } + c_{1} \theta + c_{2} \ln \theta + f(\tau ) \cdot \left\{ {d_{1} [1 - \exp ( - \alpha \tau )] + d_{2} [1 - \exp ( - \beta \tau )]} \right\},$$
(1)

where \(H\) is the water depth of the upstream reservoir on the displacement monitoring day, \(T_{j}\) is the measured temperature of the jth selected dam body temperature monitoring point, \(N\) is the total number of temperature monitoring points used, \(t\) is the number of cumulative days from the initial monitoring day, and \(\theta = t/100\). \(f(\tau )\) is the step-type function; its value is 1 when the upstream reservoir water level is maintained at an elevation of 1880 m and 0 otherwise. \(\tau\) is the duration in days of the current stable water level stage, and \(a_{i}\), \(b_{j}\), \(c_{1}\), \(c_{2}\), \(d_{1}\) and \(d_{2}\) are regression coefficients. \(\alpha = E_{K1} /\eta_{K1}\), and \(\beta = E_{K2} /\eta_{K2}\). Here, \(E_{K1}\) and \(E_{K2}\) are hysteretic elastic moduli in the generalised Kelvin model of dam concrete, and \(\eta_{K1}\) and \(\eta_{K2}\) are viscosity coefficients. For the Jinping I arch dam, \(\alpha = 0.5283\), and \(\beta = 0.0052\) [8].
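As an illustration of how the factors of Eq. (1) could be assembled into a model input matrix, a minimal Python sketch is given below. The function name `hett_factors`, its argument names and the column order are assumptions made for this example only; the values of \(\alpha\) and \(\beta\) are those quoted above for the Jinping I arch dam.

```python
import numpy as np

ALPHA = 0.5283  # alpha = E_K1 / eta_K1, value quoted in the text [8]
BETA = 0.0052   # beta = E_K2 / eta_K2, value quoted in the text [8]


def hett_factors(H, T, t, tau, at_stable_level):
    """Assemble the input factors of Eq. (1), one column per factor.

    H: (n,) upstream water depth on each monitoring day
    T: (n, N) measured temperatures of the N selected monitoring points
    t: (n,) cumulative days since the initial monitoring day (t > 0)
    tau: (n,) days elapsed in the current stable water level stage
    at_stable_level: (n,) booleans, True when f(tau) = 1
    """
    H = np.asarray(H, dtype=float)
    T = np.asarray(T, dtype=float).reshape(len(H), -1)
    theta = np.asarray(t, dtype=float) / 100.0
    tau = np.asarray(tau, dtype=float)
    f_tau = np.asarray(at_stable_level, dtype=float)

    hydraulic = np.column_stack([H ** i for i in range(1, 5)])  # H, H^2, H^3, H^4
    time_effect = np.column_stack([theta, np.log(theta)])       # theta, ln(theta)
    hysteretic = np.column_stack([
        f_tau * (1.0 - np.exp(-ALPHA * tau)),
        f_tau * (1.0 - np.exp(-BETA * tau)),
    ])
    return np.hstack([hydraulic, T, time_effect, hysteretic])
```

With \(N\) temperature points, the matrix has \(4 + N + 2 + 2\) columns: four hydraulic, \(N\) temperature, two time effect and two hysteretic hydraulic factors.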

2.2 SVM model

The essence of SVM regression is to find an optimal regression hyperplane in a high-dimensional feature space such that the deviation of the training samples from it is minimised. Consider a training set \(\left\{ {\left( {{\varvec{x}}_{{\varvec{i}}} ,{\varvec{y}}_{{\varvec{i}}} } \right),i = 1,2, \cdots ,n} \right\}\), where \({\varvec{x}}_{{\varvec{i}}} = \left[ {x_{i}^{1} ,x_{i}^{2} , \cdots ,x_{i}^{l} } \right]^{T}\) contains \(l\) input factors, \({\varvec{y}}_{{\varvec{i}}} \in R\) is the corresponding output of the SVM model, and \(n\) is the total number of samples. A linear regression model can be established in the high-dimensional feature space as follows:

$$f\left( x \right) = {\varvec{w}}\phi \left( x \right) + m,$$
(2)

where \(\phi \left( x \right)\) is a nonlinear mapping function, \({\varvec{w}}\) is the weight, and \(m\) is a constant.

The linear \(\varepsilon\)-insensitive loss function is defined as follows:

$$L\left( {f\left( x \right),y,\varepsilon } \right) = \begin{cases} 0, & \left| {y - f\left( x \right)} \right| \le \varepsilon \\ \left| {y - f\left( x \right)} \right| - \varepsilon , & \left| {y - f\left( x \right)} \right| > \varepsilon \end{cases}.$$
(3)
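For illustration, the loss in Eq. (3) can be evaluated sample by sample with a few lines of Python. This is a minimal sketch; the function name `eps_insensitive_loss` is an assumption made for this example.

```python
import numpy as np


def eps_insensitive_loss(y, f_x, eps):
    """Eq. (3): zero inside the eps-tube, linear in the residual outside it."""
    residual = np.abs(np.asarray(y, dtype=float) - np.asarray(f_x, dtype=float))
    return np.maximum(residual - eps, 0.0)
```

Residuals no larger than \(\varepsilon\) therefore contribute nothing to the training error, which is what makes the fitted function depend only on a subset of the samples (the support vectors).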

Using the Lagrange function, the above regression problem can be transformed into a dual optimisation problem as follows [4]:

$$\begin{gathered} \mathop {\max }\limits_{{\alpha ,\alpha^{*} }} \left[ { - \frac{1}{2}\sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {\left( {\alpha_{i} - \alpha_{i}^{*} } \right)\left( {\alpha_{j} - \alpha_{j}^{*} } \right)K\left( {x_{i} ,x_{j} } \right)} } - \varepsilon \sum\limits_{i = 1}^{n} {\left( {\alpha_{i} + \alpha_{i}^{*} } \right)} + \sum\limits_{i = 1}^{n} {\left( {\alpha_{i} - \alpha_{i}^{*} } \right)y_{i} } } \right] \hfill \\ {\text{s.t.}}\;\sum\limits_{i = 1}^{n} {\left( {\alpha_{i} - \alpha_{i}^{*} } \right)} = 0;\quad 0 \le \alpha_{i} ,\alpha_{i}^{*} \le C, \hfill \\ \end{gathered}$$
(4)

where \(K\left( {x_{i} ,x_{j} } \right) = \phi (x_{i} )\phi (x_{j} )\) is the kernel function, \(C\) is the penalty factor, and \(\alpha_{i}\) and \(\alpha_{i}^{*}\) are the Lagrange multipliers.

The optimal solution of the Lagrange multipliers can be obtained as \({{\varvec{\upalpha}}} = \left[ {\alpha_{1} ,\alpha_{2} , \cdots ,\alpha_{n} } \right]\) and \({{\varvec{\upalpha}}}^{*} = \left[ {\alpha_{1}^{*} ,\alpha_{2}^{*} , \cdots ,\alpha_{n}^{*} } \right]\), and the SVM regression function can then be expressed as follows:

$$f\left( x \right) = w^{*} \phi \left( x \right) + b^{*} = \sum\limits_{i = 1}^{n} {\left( {\alpha_{i} - \alpha_{i}^{*} } \right)K(x_{i} ,x)} + b^{*} ,$$
(5)

where \(w^{*}\) and \(b^{*}\) are the optimal values of the weight \({\varvec{w}}\) and the constant \(m\) in Eq. (2), respectively.

2.3 Measured temperature deformation factors

2.3.1 Importance analysis of SVM modelling factors

On the premise of ensuring accuracy, the number of modelling factors used in the dam displacement SVM model can be reduced by eliminating unimportant or similar input factors from the initial SVM model. The importance of an input factor can be expressed as the partial derivative of the model output \(f\left( {x_{i} } \right)\) with respect to that input factor, so the importance degree of the rth modelling factor can be quantified as follows [24]:

$$S_{r} = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {\frac{{\partial f\left( {x_{i} } \right)}}{{\partial x_{ir} }}} \right|} \quad r = 1,2,\ldots , l,$$
(6)

where n is the total number of model training samples.

For the frequently used radial basis kernel function, the partial derivative in Eq. (6) can be calculated as follows:

$$\begin{gathered} \frac{\partial f(x)}{{\partial x_{r} }} = \frac{\partial }{{\partial x_{r} }}\left( {\sum\limits_{i = 1}^{{n_{s} }} {(\hat{\alpha }_{i} - \hat{\alpha }_{i}^{ * } )K(s_{i} ,x)} + \hat{b}} \right) \hfill \\ \, = \frac{\partial }{{\partial x_{r} }}\left( {\sum\limits_{i = 1}^{{n_{s} }} {(\hat{\alpha }_{i} - \hat{\alpha }_{i}^{ * } )\exp \left( { - \gamma \sum\limits_{j = 1}^{l} {(s_{ij} - x_{j} )^{2} } } \right)} + \hat{b}} \right) \hfill \\ \, = 2\gamma \sum\limits_{i = 1}^{{n_{s} }} {(\hat{\alpha }_{i} - \hat{\alpha }_{i}^{ * } )\exp \left( { - \gamma \sum\limits_{j = 1}^{l} {(s_{ij} - x_{j} )^{2} } } \right)(s_{ir} - x_{r} )} , \hfill \\ \end{gathered}$$
(7)

where \(n_{s}\) is the total number of support vectors, \(s_{i}\) is the ith support vector with components \(s_{ij}\), \(x_{j}\) is the jth component of the input \(x\), \(\hat{\alpha }_{i}\), \(\hat{\alpha }_{i}^{ * }\) and \(\hat{b}\) are the optimal parameters of the regression function in Eq. (5), and \(\gamma\) is the parameter optimised in the kernel function.
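The importance measure of Eqs. (6) and (7) can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the study: the function names and the arrays `sv` (support vectors), `coef` (signed support vector weights appearing in Eq. (7)) and `b` (constant term) are assumptions, and they would normally be taken from an already trained radial basis SVM.

```python
import numpy as np


def rbf_svr_predict(x, sv, coef, b, gamma):
    """Regression function of an RBF-kernel SVM for a single input x of shape (l,)."""
    sq_dist = np.sum((sv - x) ** 2, axis=1)          # squared distance to each support vector
    return float(coef @ np.exp(-gamma * sq_dist) + b)


def rbf_svr_gradient(x, sv, coef, gamma):
    """Eq. (7): partial derivatives of the model output with respect to every input factor."""
    diff = sv - x                                    # (n_s, l), entries s_ir - x_r
    k = np.exp(-gamma * np.sum(diff ** 2, axis=1))   # kernel value per support vector
    return 2.0 * gamma * (coef * k) @ diff           # (l,)


def factor_importance(X, sv, coef, gamma):
    """Eq. (6): mean absolute partial derivative over the n training samples."""
    grads = np.array([rbf_svr_gradient(x, sv, coef, gamma) for x in X])
    return np.mean(np.abs(grads), axis=0)
```

Because Eq. (6) takes the absolute value of the derivative, the sign convention chosen for the support vector weights does not affect the resulting importance degrees. The input factors should be scaled consistently before training, since the derivative with respect to a factor depends on its units.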

2.3.2 Optimisation of the most representative dam temperature monitoring points

In practice, there are hundreds or even thousands of temperature monitoring points arranged on the dam body of a high concrete arch dam, and the temperature evolution laws of different monitoring points have both similarities and differences, which depend on the spatial distance between them. Therefore, the main issue in establishing a measured temperature-based displacement monitoring model is to select a very small number of the most representative dam temperature monitoring points. To achieve this goal, based on the importance analysis of the input factors of the SVM model, the temperature monitoring points whose measured temperatures have the lowest effect on the performance of the displacement SVM model can be eliminated in turn. The optimisation method can be implemented as follows:

  • Step 1: Take all effective temperature monitoring points of the dam body as temperature deformation factors, and establish an initial SVM model that includes all input factors.

  • Step 2: According to Eq. (6), calculate the importance degree of each input factor used in the current SVM model.

  • Step 3: Eliminate the least important input factor and create a new input factor set with all remaining factors.

  • Step 4: Establish a new SVM model based on the optimised input factor set.

  • Step 5: Calculate the mean square error (MSE) of the new SVM model, which is used to evaluate the performance of the SVM model, and return to Step 2 until only one modelling factor remains.

  • Step 6: Find the minimum MSE among all the above SVM models and mark it as MSEmin. On this basis, search backwards from the last SVM model towards the first, and determine the best SVM model in the automatic elimination process according to the first exceeding criterion of the MSE, namely, the point at which the MSE exceeds the value of (1 + preset threshold) × MSEmin for the first time.

  • Step 7: Implement the manual elimination process to further optimise the temperature deformation factors automatically retained in Step 6.

The implementation process of establishing the measured dam temperature-based HETT model is shown in Fig. 2. In Step 6, the preset threshold is the maximum allowable increase ratio of the MSE of the best SVM model relative to MSEmin, and it can be set to 5% according to the conventional accuracy requirement of engineering projects. If the best SVM model has fewer modelling factors than the SVM model corresponding to MSEmin, but the reduction does not exceed 5% of the total number of initial modelling factors, the latter can be adopted as the best SVM model to improve the accuracy. As the elimination proceeds, each remaining modelling factor has a greater effect on the prediction performance of the SVM model, and its elimination causes a larger increase in the MSE; thus, the selection of the best SVM model should be conducted from the last towards the first SVM model in turn. Although the automatic process in Step 6 can effectively eliminate all unimportant factors before the MSE threshold is exceeded, a small number of similar temperature factors remain. Therefore, if the temperature time series of two automatically retained monitoring points and their importance degrees are both similar, the less important point in each duplicate pair can be manually eliminated, provided that the MSE increase ratio of the new SVM model does not exceed the threshold.
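The automatic part of this procedure (Steps 1 to 6) can be sketched in Python as follows. The sketch is illustrative only and rests on several assumptions: the function names are hypothetical; the SVM training, MSE evaluation and importance calculation are delegated to a user-supplied callable `fit_score_importance`; and Step 6 is read as "scan from the last model towards the first and keep the first model whose MSE is within (1 + threshold) × MSEmin", followed by the 5% adjustment rule described above.

```python
import numpy as np


def backward_elimination(X, y, fit_score_importance):
    """Steps 1-5: drop the least important factor until one factor remains.

    fit_score_importance(X_sub, y) must return (mse, importance_per_column).
    Returns the retained column indices and the MSE of every model, in order.
    """
    active = list(range(X.shape[1]))
    history, mses = [], []
    while True:
        mse, importance = fit_score_importance(X[:, active], y)
        history.append(list(active))
        mses.append(mse)
        if len(active) == 1:
            break
        active.pop(int(np.argmin(importance)))   # Step 3: remove least important factor
    return history, mses


def select_best_model(mses, n_initial, threshold=0.05, min_reduction_ratio=0.05):
    """Step 6 as read here: scan from the last model towards the first and
    return the index of the first model whose MSE is within the allowed limit."""
    mses = np.asarray(mses, dtype=float)
    k_min = int(np.argmin(mses))
    limit = (1.0 + threshold) * mses[k_min]
    best = next(k for k in range(len(mses) - 1, -1, -1) if mses[k] <= limit)
    # Model k has eliminated k factors, so best - k_min is the extra reduction
    # gained over the MSEmin model; if it is too small, keep the MSEmin model.
    if best - k_min <= min_reduction_ratio * n_initial:
        best = k_min
    return best
```

Passing the scoring step as a callable keeps the elimination logic independent of the SVM library used; the manual elimination of Step 7 is not covered by this sketch.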

Fig. 2 Implementation process of establishing the measured dam temperature-based displacement monitoring model

2.3.3 Wavelet denoising of measured temperature time series

Measured temperatures of a dam body are usually affected by solar radiation, reservoir water level fluctuation and monitoring errors. In particular, the monitoring points arranged near the dam surface are directly exposed to sunlight, so their temperature fluctuations are more severe. If these temperatures are directly used to establish a displacement monitoring model, the machine learning model tends to overfit the fluctuations in the data, which ultimately reduces the prediction accuracy of the SVM model. Therefore, it is necessary to denoise any temperature time series with large fluctuations.

Wavelet multiresolution analysis is a signal analysis method in the time–frequency domain. It can decompose a temperature time series into subcomponents with different frequency characteristics, and some noise components with high frequency can then be eliminated. In the decomposition process, only the low-frequency component obtained in the previous step is decomposed again, and the decomposition process can be expressed as follows:

$$f_{0} = f_{1} + d_{1} = f_{2} + d_{2} + d_{1} = \cdots = f_{N} + d_{N} + d_{N - 1} + \cdots + d_{2} + d_{1} = f_{N} + \sum\limits_{i = 1}^{N} {d_{i} } ,$$
(8)

where \(f_{0}\) is the original signal, and \(f_{i}\) and \(d_{i}\) are the low- and high-frequency subcomponents.
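The decomposition in Eq. (8) can be illustrated with the simplest wavelet, the Haar wavelet, for which the low-frequency component at level k is the mean of the signal over blocks of \(2^{k}\) samples. The sketch below is an assumption-laden illustration: the text does not state which wavelet basis, decomposition level or thresholding rule is used, and the function name `haar_multiresolution` is hypothetical.

```python
import numpy as np


def haar_multiresolution(f0, levels):
    """Return (f_N, [d_1, ..., d_N]) such that f_0 = f_N + d_1 + ... + d_N (Eq. 8).

    Uses the Haar wavelet; the series length must be a multiple of 2**levels.
    """
    f0 = np.asarray(f0, dtype=float)
    if len(f0) % (2 ** levels) != 0:
        raise ValueError("series length must be a multiple of 2**levels")
    approx_prev, details = f0, []
    for k in range(1, levels + 1):
        block = 2 ** k
        # Only the previous low-frequency component is decomposed again:
        # averaging it over blocks of 2**k samples gives the level-k approximation.
        approx = np.repeat(approx_prev.reshape(-1, block).mean(axis=1), block)
        details.append(approx_prev - approx)   # d_k = f_{k-1} - f_k
        approx_prev = approx
    return approx_prev, details
```

A denoised series is then obtained by discarding some or all of the high-frequency components, for example `f_N + details[-1]` to drop the finest levels while retaining the coarsest detail.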

2.4 Causal interpretation of dam deformation behaviour based on the SVM model

The nonlinear relationship between the input and output of a machine learning model is implicit and based on network structure, so it is difficult to interpret the causal mechanism of dam deformation behaviour. To solve this problem, the PDP can be used to mine the implicit relationship of machine learning models. For an established machine learning model, when using the PDP to quantify the influence of an input factor on model output, only this factor is determined as the independent variable, and its input data are all replaced by a fixed value, which continuously increases from the actual minimum value to the maximum value, while all other input factors are used as control variables with their actual values. Through the evolution law of the model output, a characteristic function that only depends on the studied input factor can then be obtained. Finally, the causal interpretation of a machine learning model can be represented by calculating the relative increment of each input factor to the model output. Therefore, in this paper, the PDP can be used to separate the hydraulic and temperature components in machine learning models of dam displacement. However, it is not completely reasonable to use the PDP to quantify the influences of the two time effect factors because their input values in the training set increase sequentially, and each value appears only once.

If the model input factor \(X\) is divided into \(x_{s}\) and its complement \(x_{c} = X\backslash x_{s}\), the partial dependence of the model output on \(x_{s}\) can be defined as follows:

$$f_{s} \left( {x_{s} } \right) = E_{{x_{c} }} \left[ {\hat{f}\left( {x_{s} ,x_{c} } \right)} \right] = \int {\hat{f}\left( {x_{s} ,x_{c} } \right)p_{c} \left( {x_{c} } \right)dx_{c} } ,$$
(9)

where \(p_{c} \left( {x_{c} } \right)\) is the marginal probability density of \(x_{c}\), namely \(p_{c} \left( {x_{c} } \right) = \int {p\left( x \right)} dx_{s}\).

This can be further estimated from a set of discrete training data as follows:

$$f_{s} \left( {x_{s} } \right) = \frac{1}{n}\sum\limits_{i = 1}^{n} {\hat{f}\left( {x_{s} ,x_{i,c} } \right)} ,$$
(10)

where \(x_{i,c} \left( {i = 1,2, \cdots ,n} \right)\) is the value of \(x_{c}\) in the ith training sample.
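Eq. (10) translates almost directly into code. The following Python sketch is a minimal illustration under stated assumptions: `predict` stands for any trained model that maps an input matrix to a vector of outputs, `s` is the column index of the studied factor, and the function name is hypothetical.

```python
import numpy as np


def partial_dependence(predict, X, s, grid):
    """Eq. (10): average model output with column s fixed at each grid value.

    predict: callable mapping an (n, l) array to an (n,) array of outputs
    X: (n, l) training inputs; s: index of the studied factor
    grid: values of the studied factor, e.g. from its minimum to its maximum
    """
    X = np.asarray(X, dtype=float)
    values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, s] = v                       # replace the studied factor by a fixed value
        values.append(float(np.mean(predict(X_mod))))
    return np.array(values)
```

A typical call would use `grid = np.linspace(X[:, s].min(), X[:, s].max(), 50)`, which matches the description of increasing the fixed value from the actual minimum to the actual maximum.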

2.5 Model performance evaluation

The fitting and prediction accuracy of mathematical monitoring models are frequently evaluated by the MSE, ME and \(R^{2}\), as shown in Eqs. (11) to (13), in which a smaller MSE and ME and a larger \(R^{2}\) indicate better performance of the model:

$$MSE = \frac{1}{n}\sum\limits_{t = 1}^{n} {\left( {\delta_{t} - \hat{\delta }_{t} } \right)^{2} } ,$$
(11)
$$ME = \max \left| {\delta_{t} - \hat{\delta }_{t} } \right|,$$
(12)
$$R^{2} = \frac{{\left[ {\sum\limits_{t = 1}^{n} {(\hat{\delta }_{t} - \overline{\hat{\delta }})(\delta_{t} - \overline{\delta })} } \right]^{2} }}{{\sum\limits_{t = 1}^{n} {(\hat{\delta }_{t} - \overline{\hat{\delta }})^{2} } \sum\limits_{t = 1}^{n} {(\delta_{t} - \overline{\delta })^{2} } }},$$
(13)

where \(\overline{\hat{\delta }}\) and \(\overline{\delta }\) are average values of the fitted (predicted) displacement \(\hat{\delta }_{t}\) and measured displacement \(\delta_{t}\), respectively.
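The three indices in Eqs. (11) to (13) can be computed as in the following Python sketch, in which the function name `evaluate` is an assumption made for this example.

```python
import numpy as np


def evaluate(measured, predicted):
    """Return (MSE, ME, R2) following Eqs. (11)-(13)."""
    d = np.asarray(measured, dtype=float)
    p = np.asarray(predicted, dtype=float)
    err = d - p
    mse = float(np.mean(err ** 2))                       # Eq. (11)
    me = float(np.max(np.abs(err)))                      # Eq. (12)
    cov = np.sum((p - p.mean()) * (d - d.mean()))
    r2 = float(cov ** 2 / (np.sum((p - p.mean()) ** 2) * np.sum((d - d.mean()) ** 2)))  # Eq. (13)
    return mse, me, r2
```

Note that \(R^{2}\) as defined in Eq. (13) is the squared correlation between the predicted and measured series, so it is insensitive to a constant offset or scale error in the prediction; the MSE and ME are needed alongside it to detect such errors.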

3 Case study

The Jinping I arch dam, located on the main stream of the Yalong River in Liangshan Prefecture, Sichuan Province, China, is currently the highest constructed concrete dam in the world. The maximum dam height is 305 m. It is a double curvature arch dam and consists of 26 dam sections, of which the top and bottom thicknesses of the No. 13 central cantilever dam section are 16 and 63 m, respectively. Dam construction started in 2005, and on December 23, 2013, the dam body was fully poured to the dam crest elevation of 1885 m. The upstream reservoir water level reached the designed normal elevation of 1880 m on Aug. 24, 2014 for the first time, and it then cycled with an annual period between elevations of 1800 and 1880 m.

The downstream view of the Jinping I arch dam and layout of the plumb line monitoring system are shown in Fig. 3a and b, respectively. To monitor the temperature field of the dam body, thermometer monitoring systems are arranged in the No. 9, 13 and 19 dam sections. Generally, the temperature field of the central cantilever dam section is the most representative for concrete arch dams. Therefore, all temperature monitoring points of the No. 13 dam section, as shown in Fig. 3c, are preliminarily selected as temperature deformation factors. Except for the three bottom elevation layers, along the centreline of the dam section, there are five thermometers embedded on each elevation layer; two are 5 cm inside the upstream and downstream surface, respectively, and the other three are arranged with equal intervals between these two.

Fig. 3 a Downstream view of the Jinping I arch dam, b layout of the plumb line monitoring system, and c layout of thermometers in the No. 13 dam section

The measured radial dam displacement and temperature used in this paper cover the period from September 1, 2015 to December 31, 2018 with a sampling frequency of once a day, and the observation data after May 28, 2018 are used for testing the model prediction performance. To avoid drawing conclusions from the modelling results of a single monitoring point, a total of seven dam displacement monitoring points are modelled by the proposed approach: all normal plumb line monitoring points on the No. 13 central cantilever dam section, together with the PL9-1 and PL11-1 monitoring points on the dam crest.
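The chronological split described above can be reproduced with the Python standard library, as sketched below. The sketch assumes that one observation exists for every calendar day (no gaps) and that the testing set starts on the day after May 28, 2018; the constant and function names are hypothetical.

```python
from datetime import date, timedelta

START, END = date(2015, 9, 1), date(2018, 12, 31)
LAST_TRAIN_DAY = date(2018, 5, 28)   # data after this day are used for testing


def split_days(start=START, end=END, last_train_day=LAST_TRAIN_DAY):
    """Split the daily monitoring period chronologically into training and testing days."""
    days = [start + timedelta(days=k) for k in range((end - start).days + 1)]
    train = [d for d in days if d <= last_train_day]
    test = [d for d in days if d > last_train_day]
    return train, test
```

A chronological split, rather than a random one, keeps the testing period strictly after the training period, which is the situation faced when the model is used for prediction.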

4 Results and discussion

4.1 Measured temperature-based temperature deformation factors

There are 148 modelling factors in the initial SVM model, including 4 hydraulic factors, 140 measured temperature factors, 2 time effect factors and 2 hysteretic hydraulic factors. The main objective of the importance analysis of modelling factors is to reduce the total number of modelling factors, during which a large number of unimportant or similar members from the total 140 initial temperature factors are eliminated.

4.1.1 Optimisation of temperature monitoring points

The evolution of the MSE during the successive elimination process of modelling factors is shown in Fig. 4. Overall, before the 130th SVM model, the MSE of the SVM model decreases slightly or remains basically unchanged after the least important modelling factors are removed; thus, the elimination of these modelling factors does not weaken the prediction performance of the SVM model. On the contrary, in most cases, it improves the performance. In addition, the modelling factors eliminated early are all measured temperature factors, and only a small number of non-temperature factors are eliminated shortly before the MSE first exceeds the threshold. For comparison with the seasonal harmonic factor-based HEST model, these non-temperature factors are still retained in the final HETT model. In this study, the threshold of the MSE increase ratio relative to MSEmin is set to 5%. According to Fig. 4, the best SVM model can be determined and is shown in Table 1.

Fig. 4 Evolution of the MSE during the successive elimination process of modelling factors

Table 1 Sequence number of the best SVM model in the automatic elimination process

Taking the PL13-3 monitoring point as an example, Fig. 5a and Table 2 show that, among the automatically selected temperature monitoring points, the temperature time series of T13-146 and T13-139, T13-160 and T13-158, T13-65 and T13-70, and T13-15 and T13-38 are similar in pairs, and the former in each pair is eliminated earlier. If T13-146, T13-160, T13-65 and T13-15 are eliminated in manual mode, the MSE of the new SVM model, compared with the previous SVM model before eliminating these factors in the overall elimination process, increases by no more than 5% or even decreases, so these four temperature deformation factors can be effectively eliminated.

Fig. 5 Measured temperature series selected for PL13-3

Table 2 Modelling factors of PL13-3 retained by automatic elimination

Manual elimination processes are also conducted for the other six displacement monitoring points, and the final retained modelling factors are given in Table 3. The measured temperature time series of the selected dam body temperature monitoring points for each displacement monitoring point are shown in Fig. 6. In addition, as seen in Table 3, the two time effect modelling factors, \(\theta\) and \(\ln \theta\) in Eq. (1), have both been removed for all seven analysed displacement points in the automatic elimination process. The reason is that these two factors are used to model the trend component of dam displacement, but their evolution laws duplicate those of some of the measured temperature factors shown in Fig. 6.

Table 3 Retained modelling factors after automatic and manual elimination
Fig. 6 Measured temperature time series selected for each displacement monitoring point

The numbers of finally selected measured temperature factors for PL9-1, PL11-1, PL13-1, PL13-2, PL13-3, PL13-4 and PL13-5 are 9, 4, 8, 7, 11, 12 and 11, respectively. On average, among the 140 temperature monitoring points in the central cantilever dam section, only 6.3% need to be used to represent the thermal deformation effect of the dam, which indicates the high efficiency of the proposed optimisation approach.
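The quoted proportion can be checked directly from the seven counts listed above:

$$\frac{9 + 4 + 8 + 7 + 11 + 12 + 11}{7} = \frac{62}{7} \approx 8.86,\qquad \frac{8.86}{140} \approx 6.3\% .$$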

4.1.2 Rationality of the selected temperature deformation factors

In the displacement monitoring model of concrete dams, the influence of a temperature deformation factor is mainly reflected in two aspects: shape characteristics of the used time series and its time lag effect with respect to the air temperature. Therefore, the measured temperature time series shown in Fig. 6 can be divided into six categories according to their shape characteristics and the lag effect mechanism of the dam temperature field. The measured temperature time series of some typical temperature monitoring points in each category are shown in Fig. 7, and their evolution characteristics and distributions are summarised as follows:

Fig. 7 Measured temperature time series of typical temperature deformation factors

  1. Category I: The downstream surface and crest of the dam body are directly affected by solar radiation, and thus their measured temperature time series are very similar to the air temperature of the dam site and have great fluctuations, such as T13-81, T13-121 and T13-153. To reduce the interference of data fluctuation on machine learning models, temperature time series in Category I need to be denoised.

  2. Category II: Temperature monitoring points in Category II are mainly distributed in the upstream surface and middle of the dam body that are affected by the beneficial regulation of the reservoir water level, such as T13-65, T13-102 and T13-112. In the annual rising stage of air temperature, the measured temperatures of these monitoring points rise rapidly due to the low reservoir water level. However, when the air temperature annually drops, the reservoir water level at this stage is maintained at the highest elevation of 1880 m; thus, the reservoir water-induced heat conduction leads to an obvious time lag between the temperature change at these monitoring points and the air temperature.

  3. Category III: The annual periodic evolution law of measured temperatures is basically in accordance with the seasonal harmonic factor, such as T13-47 and T13-149, and these points are mainly distributed in the upstream surface of the middle and lower elevation parts of the dam body and the interior of the dam crest. The influence of the annual cycle change in the upstream reservoir water level on these areas is gradually weakened.

  4. Category IV: This category is similar to Category III, and the temperature evolution law is also basically consistent with the seasonal harmonic factor, but the annual variation amplitude of measured temperatures is significantly smaller, such as T13-80, T13-119 and T13-135. Category IV is mainly distributed in the middle and upper elevation parts of the interior dam body.

  5. Category V: The measured temperature time series shows a linear downwards trend, such as T13-19, T13-43 and T13-79, which are mainly distributed inside the dam heel and the middle and lower elevation parts of the interior dam body. The upstream reservoir water depth and dam body thickness in these areas are both very large, which makes them less affected by the ambient temperature. In addition, the influence of concrete hydration heat in these areas is significantly reduced, and the dam temperature will gradually drop to the joint closure temperature field and form a stable temperature field.

  6. Category VI: This category is obviously affected by the water storage process of the upstream reservoir, and the measured temperatures rise sharply during this period, such as T13-77, T13-82 and T13-109. The upstream surface of the dam body near the dead water level elevation of 1800 m, with a height range of approximately 100 m, belongs to Category VI.

In summary, for the seven modelled displacement monitoring points in this case study, the evolution characteristics of the measured dam temperature time series used for PL9-1, PL13-1, PL13-2 and PL13-3 are the most complete. The main reason is that these displacement monitoring points are located in the middle and upper elevation parts of the dam body and are widely affected by temperature changes in the whole dam body. The other displacement monitoring points have also selected these representative temperature time series, which verifies the rationality of the proposed approach for optimising the measured temperature deformation factors.
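The time lag with respect to air temperature, which is one of the two criteria used to separate the categories in this subsection, can be illustrated with a simple lagged-correlation search. The sketch below is not the procedure of the original study. It assumes that the lag is estimated as the shift maximising the Pearson correlation, and it uses synthetic daily series whose 45-day lag and amplitudes are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def estimate_lag(air, dam, max_lag):
    """Return (lag, r): the shift in samples by which `dam` follows `air`,
    chosen to maximise the correlation between the aligned series."""
    best_lag, best_r = 0, -2.0
    for lag in range(max_lag + 1):
        # Pair air[i] with dam[i + lag].
        r = pearson(air[:len(air) - lag], dam[lag:])
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r

# Synthetic three-year daily series (hypothetical values, for illustration only):
# the dam temperature is a damped copy of the air temperature, delayed by 45 days.
days = range(3 * 365)
air = [15.0 + 10.0 * math.sin(2.0 * math.pi * d / 365.0) for d in days]
dam = [14.0 + 4.0 * math.sin(2.0 * math.pi * (d - 45) / 365.0) for d in days]

lag, r = estimate_lag(air, dam, max_lag=180)
print(f"estimated lag: {lag} days (correlation {r:.3f})")
```

A series in Category I would be expected to give a lag near zero under this measure, whereas the nearly linear series of Category V have no meaningful annual lag and would need to be identified by their trend instead.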

4.1.3 Wavelet denoising of measured temperature time series

Temperature monitoring points in Category I are distributed on the downstream surface of the dam body and the dam crest; thus, the measured temperature is directly affected by solar radiation and fluctuates greatly. To ensure the prediction accuracy of the SVM model, these temperature time series should be denoised first. The temperature time series of T13-121 before and after wavelet denoising are shown in Fig. 8. After some strongly fluctuating high-frequency components are removed, the denoised temperature time series is smooth and still maintains the original evolution characteristics. Temperature time series of T13-11, T13-77, T13-81, T13-102, T13-116, T13-131 and T13-153 are also denoised by wavelet multiresolution analysis.

Fig. 8 Temperature time series of T13-121
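The idea of removing high-frequency components by wavelet multiresolution analysis can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for Fig. 8. The wavelet basis, the decomposition depth and the number of discarded detail levels are not given in this section, so the Haar basis, four levels and three discarded levels used here are assumptions chosen for brevity, and the noisy series is synthetic.

```python
import math
import random

SQRT2 = math.sqrt(2.0)

def haar_decompose(signal, levels):
    """Orthonormal Haar transform: return (approximation, details),
    with details ordered from the finest level to the coarsest."""
    approx, details = list(signal), []
    for _ in range(levels):
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / SQRT2 for a, b in pairs])
        approx = [(a + b) / SQRT2 for a, b in pairs]
    return approx, details

def haar_reconstruct(approx, details):
    """Invert haar_decompose."""
    approx = list(approx)
    for det in reversed(details):
        rebuilt = []
        for a, d in zip(approx, det):
            rebuilt.extend([(a + d) / SQRT2, (a - d) / SQRT2])
        approx = rebuilt
    return approx

def denoise(signal, levels=4, drop=3):
    """Zero the `drop` finest detail levels (drop <= levels) and reconstruct."""
    approx, details = haar_decompose(signal, levels)
    for k in range(drop):
        details[k] = [0.0] * len(details[k])
    return haar_reconstruct(approx, details)

def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Synthetic surface temperature: annual cycle plus random daily fluctuation.
random.seed(0)
n = 1024  # the length must be divisible by 2**levels in this simple sketch
clean = [15.0 + 10.0 * math.sin(2.0 * math.pi * i / 365.0) for i in range(n)]
noisy = [c + random.gauss(0.0, 2.0) for c in clean]
smooth = denoise(noisy, levels=4, drop=3)

print(f"RMSE before denoising: {rmse(noisy, clean):.2f}")
print(f"RMSE after denoising:  {rmse(smooth, clean):.2f}")
```

A smoother basis than Haar avoids the step-like appearance of the reconstructed series; in practice a wavelet library would normally replace the hand-written transform.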

4.2 Prediction performance of the HETT-SVM model

Combined with the instantaneous hydraulic component, hysteretic hydraulic component and time effect component, the HETT and HEST models are established using the measured temperature factor and the seasonal harmonic temperature factor, respectively. An HETT-o model, in which the selected dam temperature time series are not denoised, is also established for comparison. The SVM is used to conduct the nonlinear modelling. The measured and fitted (predicted) radial displacement time series of PL11-1 and PL13-4 are shown in Fig. 9, in which positive and negative values represent radial displacement towards the downstream and upstream directions, respectively. The results of the other displacement monitoring points are similar to these two. For all seven analysed displacement monitoring points, in the fitting stage, the displacement time series of the HETT, HETT-o and HEST models are basically consistent and very close to the measured values. In the prediction stage, PL11-1 has the best accuracy, while PL13-4 has the largest deviation. As the period during which the reservoir water level is held at an elevation of 1880 m lengthens, the prediction deviations of some displacement monitoring points increase gradually. In general, the displacement predicted by the HETT model is closer to the measured value than that predicted by the other two models.

Fig. 9 Radial displacement time series of the measured and fitted (predicted) values

The prediction performance evaluation indices of the three models are shown in Fig. 10. As seen in the figure, the prediction performance of the HETT-o model fluctuates greatly. The reason is that this model is established without wavelet denoising, whereas the measured temperatures of some monitoring points are seriously affected by solar radiation and fluctuate obviously. After wavelet denoising, the prediction MSEs of the HETT models of PL9-1, PL13-1, PL13-2, PL13-3, PL13-4 and PL13-5 are significantly smaller than those of the HEST models. Overall, for the seven analysed displacement monitoring points, the average prediction MSE of the HETT model is only 60.1% of that of the HEST model. The prediction MEs of the HETT models of PL9-1, PL11-1, PL13-3, PL13-4 and PL13-5 are also significantly smaller than those of the HEST models, which indicates that the HETT model can better describe the actual thermal deformation effect of concrete arch dams. The average ME and R² of the HETT model over the seven monitoring points are 86.5% and 101.5% of those of the HEST model, respectively. In conclusion, compared with the traditional seasonal harmonic temperature factor, the measured temperature deformation factor, optimally selected through the importance analysis of SVM modelling factors, can better characterise the thermal deformation effect of arch dams, and the established displacement monitoring model has higher prediction accuracy. However, wavelet denoising must be conducted for the temperature time series of those monitoring points that are directly affected by solar radiation.

Fig. 10 Evaluation indices of model prediction performance
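For readers who wish to reproduce indices of this kind, a minimal sketch is given below. The definitions are assumptions, because this section does not restate them: MSE is taken as the mean squared error, ME as the maximum absolute error and R² as the coefficient of determination. The displacement values are invented for illustration and are not data from the Jinping I arch dam.

```python
def mse(measured, predicted):
    """Mean squared error."""
    return sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured)

def max_error(measured, predicted):
    """Maximum absolute error (assumed meaning of ME)."""
    return max(abs(m - p) for m, p in zip(measured, predicted))

def r2(measured, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot

# Hypothetical radial displacements in mm (illustrative only).
measured = [10.0, 12.0, 14.0, 16.0]
pred_hett = [10.5, 11.5, 14.5, 15.5]
pred_hest = [11.0, 11.0, 15.0, 15.0]

for name, pred in (("HETT", pred_hett), ("HEST", pred_hest)):
    print(name, mse(measured, pred), max_error(measured, pred), r2(measured, pred))

# Ratio of the kind quoted in the text (HETT index as a percentage of the HEST index).
mse_ratio = 100.0 * mse(measured, pred_hett) / mse(measured, pred_hest)
print(f"MSE ratio: {mse_ratio:.1f}%")
```

For MSE and ME a ratio below 100% favours the HETT model, whereas for R² a ratio above 100% does.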

4.3 Causal interpretation of the HETT-SVM model

Based on the PDP, the evolution law of hydraulic displacement separated from the SVM model with respect to reservoir water depth is shown in Fig. 11, and the temperature component is shown in Fig. 12. The results of the other displacement monitoring points are similar. On the whole, when the measured temperature factor is used instead of the seasonal harmonic factor, the hydraulic and temperature components of displacement monitoring points in the middle and upper elevation parts of the dam body are almost unchanged, but they change obviously for displacement monitoring points at lower elevations. Figure 11 shows that the influence laws of reservoir water depth on the radial displacement of the dam body obtained by the three models are basically the same. The radial hydraulic displacement increases with reservoir water depth and with the elevation of the displacement monitoring point, and the nonlinear relationship is more obvious for the dam crest area. Except for PL13-5, which is at a lower elevation, the relationship between hydraulic displacement and water depth can be expressed by a quartic polynomial function, which is consistent with the deformation theory of concrete arch dams.

Fig. 11 Relationship between the SVM model-separated hydraulic radial displacement and upstream reservoir water depth

Fig. 12 Radial displacement time series of the SVM model-separated temperature component
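The way a partial dependence curve separates one component from a fitted black-box model can be sketched in a few lines. The example below is illustrative only: `toy_model` is a hypothetical stand-in for the trained SVM, with an invented quartic dependence on water depth and a linear dependence on one temperature factor, and the sample rows are not monitoring data.

```python
def partial_dependence(model, rows, feature, grid):
    """For each grid value, force column `feature` to that value in every row
    and average the model output over all rows."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature] = value
            total += model(modified)
        curve.append(total / len(rows))
    return curve

def toy_model(row):
    """Hypothetical fitted model: radial displacement from water depth (m)
    and one temperature factor (deg C). The coefficients are invented."""
    depth, temperature = row
    h = depth / 100.0
    return 2.0 * h ** 2 + 0.5 * h ** 4 - 0.3 * temperature

# Hypothetical samples: [water depth, temperature factor].
rows = [[200.0, 10.0], [250.0, 14.0], [300.0, 18.0]]
depth_grid = [200.0, 250.0, 300.0]

hydraulic_curve = partial_dependence(toy_model, rows, feature=0, grid=depth_grid)
for depth, value in zip(depth_grid, hydraulic_curve):
    print(f"depth {depth:.0f} m -> partial dependence {value:.3f}")
```

Because the stand-in model is additive, the curve equals the depth term plus a constant offset from the mean temperature effect. For a real SVM the same averaging marginalises over the other factors numerically.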

The overall evolution laws of the temperature components in Fig. 12 are also basically the same, and they change with an annual periodicity, which is in accordance with the actual situation of arch dams. The temperature component of PL13-5 shows an increasing trend towards the upstream direction. This phenomenon arises because this displacement monitoring point is at a lower elevation, where the measured temperature near the dam heel shows that the area is currently in an overall temperature drop state, as shown for Category V in Fig. 7. Therefore, the local temperature drop effect in the dam heel area causes the upstream temperature deformation trend of the low and medium elevation parts of the dam body. Although the HEST model also captures this temperature deformation trend, its evolution process is discontinuous, and the trend component of the temperature displacement suddenly increases in the fourth year. In contrast, the trend temperature displacements obtained by the HETT and HETT-o models are gentler and in accordance with the actual situation of the Jinping I arch dam, the measured temperature field of which slowly drops to the joint closure temperature field at the current operation stage.

The comparison results show that temperature monitoring points selected through the importance analysis of SVM modelling factors can better represent the measured temperature field and the thermal deformation effect of high arch dams, and thus the proposed approach can be used to extract effective information from the massive temperature monitoring data of the dam body. Compared with the traditional seasonal harmonic temperature factor-based HEST model, the newly established HETT model has better prediction and interpretation performance.

5 Conclusion

To use the measured dam temperature as the temperature deformation factor of high arch dams, the key issue is to extract effective information from the massive temperature monitoring data of the dam body. In this paper, the importance analysis of SVM modelling factors is used to select the most representative dam temperature monitoring points, and the proposed HETT model has good prediction performance and causal interpretation ability for the Jinping I arch dam. The following conclusions can be drawn:

(1) The importance analysis of SVM modelling factors can effectively optimise the measured temperature deformation factors used in the dam displacement monitoring model, and the temperature time series of the selected temperature monitoring points can comprehensively describe the temporal and spatial evolution characteristics of the measured temperature field of high arch dams. For the seven analysed displacement monitoring points of the Jinping I arch dam, the selected maximum and minimum numbers of the measured temperature deformation factors are 12 and 4, respectively, with an average number of 8.9 and an average rate of 6.3% from the total 140 effective temperature monitoring points in the central cantilever dam section.

(2) The prediction performance of the measured temperature factor-based HETT model is better than that of the seasonal harmonic temperature factor-based HEST model: the average prediction MSE, ME and R² of multiple displacement monitoring points of the former are 60.1%, 86.5% and 101.5% of those of the latter, respectively. However, the temperature time series of the solar radiation-affected temperature monitoring points on the dam surface must be denoised.

(3) The PDP can be used to explore the causal interpretation of dam deformation behaviour modelled by machine learning models. The hydraulic and temperature components separated from the SVM model agree with the deformation mechanism of high arch dams. The temperature at the middle and lower elevation parts of a high arch dam is mainly affected by the stable temperature field of the upstream reservoir water, and thus the measured temperature evolution laws of these dam areas obviously deviate from the seasonal harmonic temperature factor. The measured temperature factor has a better interpretation ability for the thermal deformation behaviour of high arch dams.