1 Introduction

Dams are important infrastructure that brings significant economic and social benefits, such as flood control, power generation, water supply, and irrigation. The operational state of a dam is complicated by its relations with the water level, ambient temperature, dam material properties, and geo-mechanical factors [1]. The failure of a dam can release an uncontrolled flood and cause disaster in downstream areas. The past century has witnessed many severe dam failures worldwide, such as in China (Gouhou CFRD, 1993), France (Malpasset Arch Dam, 1959), Italy (Gleno Multiple-Arch Dam, 1923; Vajont Arch Dam, 1963), Spain (Tous Dam, 1982), and the USA (St. Francis Gravity Dam, 1928; Teton Earth Dam, 1976) [2].

Timely and effective detection of anomalies in observational data against a monitoring model may expose hidden dangers and prevent accidents. In daily practice, prediction models of dam displacement fall into three groups: statistical models, hybrid models, and deterministic models, which are widely used during the construction, impoundment, and operation periods of dam engineering [1, 3, 4].

The deterministic model, also called the numerical model, is established using numerical methods such as the finite element method and the discrete element method [5, 6]. It can interpret dam displacement in mechanical terms, but the modeling relies on extensive numerical computation and structural simplification. Modeling and calculation are therefore time-consuming given the variety of geometries and operating conditions [5]. Moreover, owing to limitations of computational techniques and parameter settings, some monitoring effects (e.g., seepage, uplift pressure) are difficult to predict accurately with a deterministic model. Sometimes a deterministic model cannot account for thermal effects because temperature measurements are lacking. In this case, the hybrid model is a good solution: the thermal effect is represented by periodic time functions while the other effects are treated as in the deterministic model [1].

Statistical models are based on historical data and basic mathematical functions. The best-known statistical models in dam safety practice are the hydrostatic-season-time (HST) and hydrostatic-temperature-time (HTT) models [7]; the former is widely used in both the forward and inverse analysis of dam health monitoring [5, 8, 9]. The unknown coefficients can be obtained by regression techniques such as multiple linear regression (MLR) [10, 11], partial least squares regression (PLSR) [12], and stepwise regression [13].

However, linear regression-based models have some disadvantages. On the one hand, they are not well suited to modeling nonlinear interactions between the input factors and dam displacement [6]. On the other hand, they are easily ill-conditioned [14]. These limitations have motivated dam engineers to develop new approaches for dam behavior modeling [6]. With the development of machine learning, a great number of methods have been proposed in recent years and applied in dam engineering, including dam health monitoring [6], reliability analysis [15, 16], seismic evaluation [17], computational cost reduction [18], and uncertainty quantification [19]. The artificial neural network (ANN) [20] and the support vector machine (SVM) [21] are the most popular methods, with good computational performance on nonlinear problems. There are various types of ANN models, most applications being based on the multilayer perceptron (MLP), whose major challenges are training time and structure selection. Single hidden layer feedforward neural networks (SLFNs), such as the radial basis function neural network (RBF-NN) [22, 23] and the extreme learning machine (ELM) [24], have been tested for dam health monitoring owing to their simple structure and efficient algorithms. The RBF-NN is an SLFN that uses radial basis functions as activation functions in the hidden layer, its output being a linear combination of the hidden neuron responses [20]. The ELM was proposed by Huang et al. [25]; compared with other standard SLFNs, it requires less training time and fewer parameter settings. Despite the poor output stability caused by the stochastic untrained input-to-hidden weights and biases, the average performance of the ELM has been verified to be superior to standard SLFNs, the stepwise regression model, and the MLR model in dam displacement prediction [24]. The SVM, a kernel-based technique, is among the most popular machine learning methods. It solves nonlinear problems well, especially for data with few samples and high dimensionality. To enable the SVM to solve regression problems, the insensitive loss coefficient \(\varepsilon\) was introduced and support vector regression (SVR) was developed [21]. Several researchers have reported the superior performance of SVR in dam structural monitoring [26,27,28]. Besides the aforementioned models, the adaptive neuro-fuzzy inference system (ANFIS) [29], multivariate adaptive regression splines (MARS) [30], and Gaussian process regression (GPR) [31, 32] are also competitive ML models used in dam health monitoring, though with high computational cost and complexity. A detailed literature review of ML models used in dam health monitoring can be found in [5, 6].

The relevance vector machine (RVM) is a predictive machine learning model proposed by Tipping [33]. The RVM is a flexible and powerful tool that recasts the principal ideas behind the SVM in a Bayesian framework and takes a similar functional form. Its advantages are the capacity to provide sound inferences at low computational cost and several improvements over the SVM, including the admissibility of non-Mercer kernels, reduced sensitivity to hyper-parameter settings, and probabilistic outputs based on fewer relevance vectors for a given dataset [33]. The RVM is suitable for complex regression and classification problems and has been verified in many practical settings. Imani et al. [34] examined the capability of RVM models for predicting sea-level variations and concluded that the RVM approach was superior to the ELM in accuracy during the test periods. Zhang et al. [35] utilized the RVM for stability inference of soil slopes. Wang et al. [36] used a multiclass RVM approach to classify faulty samples of a multilevel inverter system. Kong et al. [37] utilized the RVM for real-time monitoring of tool wear in the machining process. Most previous RVM applications were based on the Gaussian kernel, with trial-and-error or pilot calculation used to determine the hyper-parameter value. Trial-and-error and pilot calculation are time-consuming when the dataset is large and the iteration step is small [35]. In fact, the kernel function and the hyper-parameter values are important factors affecting the sparsity and generalization performance of the RVM.

Currently, there is no general consensus on the appropriate choice of kernel function and hyper-parameters. Determining the kernel function and the corresponding hyper-parameter values of the RVM for a given problem can be cast as a constrained optimization problem. Evolutionary algorithms and swarm intelligence algorithms are two important families of population-based heuristics [38] that have been widely applied to engineering problems. The genetic algorithm [39] and the artificial immune algorithm [40] are two typical evolutionary algorithms; particle swarm optimization [41], the artificial fish swarm algorithm [42], and the artificial bee colony algorithm [43] are popular swarm intelligence algorithms. In addition, a variety of other algorithms work on the principles of different natural phenomena. The Jaya algorithm is a recently proposed global optimization method. Compared with other popular algorithms, it has no algorithm-specific parameters to tune, which makes it convenient to implement in practical applications [37, 38, 44, 45].

The purpose of this paper is to develop a novel monitoring model for the probabilistic prediction of concrete dam displacement. An efficient optimization framework for the RVM parameters is developed based on the parallel Jaya algorithm (PJA). The proposed optimized relevance vector machine (ORVM) estimates the optimal hyper-parameter values of the RVM efficiently and provides reliable predictions of concrete dam displacement. In addition, this paper compares the nonlinear mapping capabilities of ORVM models with different kernel functions (simple kernels and multi-kernels) and discusses the most suitable choice for given data. The developed ORVM model is applied to a super-high concrete arch dam located in China and compared with equivalent SVR, RBF-NN, ELM, and HST-MLR models.

The rest of the paper is organized as follows. In Section 2, related methodologies, such as the statistical monitoring model of concrete dam displacement and description of the proposed ORVM, are illustrated in detail. Data collection, detailed analyses, and comparisons of predicted results are shown in Section 3. The conclusion and future work are summarized in Section 4.

2 Methodologies

2.1 Statistical model for concrete dam displacement monitoring

As a comprehensive response of dam structural behavior, dam displacement is a nonlinear function of hydrostatic pressure, temperature, time effect, and other unknown factors [1, 5]. The hydrostatic-seasonal-time (HST) model is one of the most popular statistical models for dam deformation monitoring [6]. It is grounded in structural and mechanical analysis, and the displacement can be quantitatively interpreted and approximated by the following expression:

$$\delta = \delta_{H} \left( t \right) + \delta_{T} \left( t \right) + \delta_{\theta } \left( t \right)$$
(1)

where the water pressure component \(\delta_{H} \left( t \right)\) denotes the reversible effect of hydrostatic pressure, the temperature component \(\delta_{T} \left( t \right)\) denotes the reversible effect of seasonal and ambient temperature variations, and \(\delta_{\theta } \left( t \right)\) denotes the time (aging) component.

Under the action of water pressure, \(\delta_{H} \left( t \right)\) can be described by a polynomial in the reservoir water level \(H\) with coefficients \(a_{i}\), as given in Eq. (2). The degree \(h\) depends on the dam type: \(h = 3\) for gravity dams and \(h = 4\) for arch dams.

$$\delta_{H} \left( t \right) = \sum\limits_{i = 1}^{h} {a_{i} H^{i} }$$
(2)

The temperature component \(\delta_{T} \left( t \right)\) describes the displacement caused by temperature changes in the bedrock and dam concrete. Its calculation depends on the layout of the thermometers. If enough thermometers are installed and the measured data are sufficient and continuous, these measurements describe the dam temperature field well and \(\delta_{T} \left( t \right)\) can be calculated by Eq. (3). Otherwise, \(\delta_{T} \left( t \right)\) can be calculated by the combination of harmonic functions given in Eq. (4).

$$\delta_{T} \left( t \right) = \sum\limits_{i = 1}^{{l_{1} }} {b_{i} T_{i} } \;{\text{or}}\;\delta_{T} \left( t \right) = \sum\limits_{i = 1}^{{l_{2} }} {b_{1i} \bar{T}_{i} } + \sum\limits_{i = 1}^{{l_{2} }} {b_{2i} \beta_{i} }$$
(3)
$$\delta_{T} \left( t \right) = b_{1} \sin \left( d \right) + b_{2} \cos \left( d \right) + b_{3} \sin \left( d \right)\cos \left( d \right) + b_{4} \sin^{2} \left( d \right)$$
(4)

where \(b_{i}\), \(b_{1i}\), and \(b_{2i}\) are coefficients; \(T_{i}\) is the observed value of the \(i\)th thermometer; \(l_{1}\) denotes the number of thermometers used for modeling; \(\bar{T}_{i}\) and \(\beta_{i}\) denote the average measured temperature at the \(i\)th layer and the corresponding temperature gradient, respectively; \(l_{2}\) denotes the number of layers at which thermometers are installed; \(d = 2\pi t/365\); and \(t\) is the number of days from the observation date to the beginning of the monitoring sequence.

The time component \(\delta_{\theta } \left( t \right)\) reflects the irreversible deformation of the dam body or foundation in a certain direction over time. For a normal concrete dam, \(\delta_{\theta } \left( t \right)\) changes rapidly in the initial service life and then stabilizes. Following current practice [1], strictly monotone functions can be used to model the time effect, as shown in Eq. (5).

$$\delta_{\theta } \left( t \right) = c_{1} \theta + c_{2} \ln \left( \theta \right) + c_{3} \left( {1 - e^{ - \theta } } \right)$$
(5)

where \(c_{i}\) are coefficients; \(\theta = t/100\).
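To make the HST formulation concrete, the following Python sketch evaluates Eqs. (1)~(5) for an arch dam. It is an illustration only: the function name and the use of NumPy are ours, and the coefficient vectors \(a\), \(b\), and \(c\) must come from a fitted model.

```python
import numpy as np

def hst_displacement(H, t, a, b, c, h=4):
    """Evaluate the HST model of Eqs. (1)-(5); h = 4 for an arch dam.

    H : reservoir water level; t : days since monitoring began (t >= 1);
    a (len h), b (len 4), c (len 3) : coefficients from a fitted model.
    """
    d = 2.0 * np.pi * t / 365.0
    theta = t / 100.0
    delta_H = sum(a[i] * H ** (i + 1) for i in range(h))                # Eq. (2)
    delta_T = (b[0] * np.sin(d) + b[1] * np.cos(d)
               + b[2] * np.sin(d) * np.cos(d) + b[3] * np.sin(d) ** 2)  # Eq. (4)
    delta_theta = (c[0] * theta + c[1] * np.log(theta)
                   + c[2] * (1.0 - np.exp(-theta)))                     # Eq. (5)
    return delta_H + delta_T + delta_theta                              # Eq. (1)
```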

2.2 Optimized relevance vector machine with multi-kernel

2.2.1 Theory of relevance vector machine

The RVM, originally proposed by Tipping [33], is a predictive machine learning model with a functional form comparable to that of the SVM, as shown in Eq. (6). The RVM can be used for regression and provides probabilistic estimates, as opposed to the SVM's point estimates. Given a set of input-target pairs \(\{ {\mathbf{x}}_{n} ,t_{n} \}_{n = 1}^{N}\), assume that \(t_{n} = y\left( {{\mathbf{x}}_{n} ,{\mathbf{w}}} \right) + \varepsilon_{n}\), where \(\varepsilon_{n} \sim {\mathcal{N}}\left( {0,\sigma^{2} } \right)\) follows a zero-mean normal distribution with variance \(\sigma^{2}\). The output, expressed through the kernel function \(K\left( {x,x_{n} } \right)\), can be written as

$${\mathbf{y}} = f({\mathbf{x}}) = \sum\limits_{n = 1}^{N} {w_{n} K\left( {x,x_{n} } \right)} + b$$
(6)

where \(w_{n}\) are the weights to be adjusted on the training set and \(b\) denotes the bias.

The probabilistic formulation of RVM model can be defined as

$$p\left( {\left. {t_{n} } \right|{\mathbf{X}}} \right) = {\mathcal{N}}\left( {\left. {t_{n} } \right|y({\mathbf{x}}_{n} ),\sigma^{2} } \right)$$
(7)

where \({\mathcal{N}}\) denotes the normal distribution over \(t_{n}\) with mean \(y({\mathbf{x}}_{n} )\) and variance \(\sigma^{2}\); \(y({\mathbf{x}})\) is the linearly weighted sum of nonlinear fixed basis functions defined in Eq. (6). Assuming the \(t_{n}\) are independent, the likelihood of the whole dataset is

$$p\left( {\left. {\mathbf{t}} \right|{\mathbf{w}},\sigma^{2} } \right) = \left( {2\pi \sigma^{2} } \right)^{ - N/2} \exp \left( { - \frac{1}{{2\sigma^{2} }}\left\| {{\mathbf{t}} - {\varvec{\Phi}}{\mathbf{w}}} \right\|^{2} } \right)$$
(8)

where \({\mathbf{t}} = \left( {t_{1} , \ldots ,t_{N} } \right)^{\text{T}}\), \({\mathbf{w}} = \left( {w_{0} , \ldots ,w_{N} } \right)^{\text{T}}\), and \({\varvec{\Phi}}\left( {x_{n} } \right) = \left[ {1,K\left( {x_{n} ,x_{1} } \right),K\left( {x_{n} ,x_{2} } \right), \ldots ,K\left( {x_{n} ,x_{N} } \right)} \right]^{\text{T}}\). Several types of kernel function, both simple kernels and multi-kernels, can be used in \({\varvec{\Phi}}\); they are discussed in detail in Section 2.2.2.
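As an illustration of how \({\varvec{\Phi}}\) is assembled, the sketch below builds the \(N \times (N+1)\) design matrix with a Gaussian kernel. The exact scaling convention inside the exponent is an assumption, since several equivalent parameterizations are in common use.

```python
import numpy as np

def gaussian_kernel(x, y, r):
    # K(x, y) = exp(-||x - y||^2 / r^2); scaling by r^2 is one common convention
    return np.exp(-np.sum((x - y) ** 2) / r ** 2)

def design_matrix(X, r):
    """Build Phi of Eq. (8): row n is [1, K(x_n, x_1), ..., K(x_n, x_N)]."""
    N = X.shape[0]
    Phi = np.ones((N, N + 1))          # first column is the bias term
    for n in range(N):
        for m in range(N):
            Phi[n, m + 1] = gaussian_kernel(X[n], X[m], r)
    return Phi
```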

To avoid over-fitting when estimating \({\mathbf{w}}\) and \(\sigma^{2}\) by maximum likelihood, an additional constraint is imposed on the parameters by adding a complexity penalty to the likelihood or error function. A zero-mean Gaussian prior probability distribution over \({\mathbf{w}}\) is adopted, as shown in Eq. (9).

$$p\left( {\left. {\mathbf{w}} \right|{\varvec{\upalpha}}} \right) = \prod\limits_{i = 0}^{N} {{\mathcal{N}}\left( {\left. {w_{i} } \right|0,\alpha_{i}^{ - 1} } \right)}$$
(9)

where \({\varvec{\upalpha}}\) is a vector of \(N + 1\) hyper-parameters.

By utilizing Bayesian posterior inference, the posterior distribution over \({\mathbf{w}}\) is given as follows.

$$p\left( {\left. {\mathbf{w}} \right|{\mathbf{t}},{\varvec{\upalpha}},\sigma^{2} } \right) = \frac{{p\left( {\left. {\mathbf{t}} \right|{\mathbf{w}},\sigma^{2} } \right)p\left( {\left. {\mathbf{w}} \right|{\varvec{\upalpha}}} \right)}}{{p\left( {\left. {\mathbf{t}} \right|{\varvec{\upalpha}},\sigma^{2} } \right)}}$$
(10)

Eq. (10) can be written as follows

$$p\left( {\left. {\mathbf{w}} \right|{\mathbf{t}},{\varvec{\upalpha}},\sigma^{2} } \right) = (2\pi )^{ - (1 + N)/2} \left| \Sigma \right|^{ - 1/2} \exp \left[ { - \frac{1}{2}\left( {{\mathbf{w}} - \mu } \right)^{T} \Sigma^{ - 1} \left( {{\mathbf{w}} - \mu } \right)} \right]$$
(11)

Here, the posterior covariance \(\Sigma\) and mean \(\mu\) are given by

$$\Sigma = \left( {\sigma^{ - 2} {\varvec{\Phi}}^{T} {\varvec{\Phi}} + {\mathbf{A}}} \right)^{ - 1}$$
(12)
$$\mu = \sigma^{ - 2} \Sigma {\varvec{\Phi}}^{T} {\mathbf{t}}$$
(13)

where \({\mathbf{A}} = {\text{diag}}\left( {\alpha_{0} ,\alpha_{1} , \ldots ,\alpha_{N} } \right)\).

Under uniform hyperpriors over \(\sigma^{2}\) and \({\varvec{\upalpha}}\), the marginal likelihood \(p\left( {\left. {\mathbf{t}} \right|{\varvec{\upalpha}},\sigma^{2} } \right)\) needs to be maximized, and it is given by

$$p\left( {\left. {\mathbf{t}} \right|{\varvec{\upalpha}},\sigma^{2} } \right) = (2\pi )^{ - N/2} \left| {\sigma^{2} {\mathbf{I}} + {\varvec{\Phi}}{\mathbf{A}}^{ - 1} {\varvec{\Phi}}^{T} } \right|^{ - 1/2} \exp \left[ { - \frac{1}{2}{\mathbf{t}}^{T} \left( {\sigma^{2} {\mathbf{I}} + {\varvec{\Phi}}{\mathbf{A}}^{ - 1} {\varvec{\Phi}}^{T} } \right)^{ - 1} {\mathbf{t}}} \right]$$
(14)

The values of \(\sigma^{2}\) and \({\varvec{\upalpha}}\) that maximize Eq. (14) can be obtained iteratively using the following updating rules:

$$\left( {\alpha_{i} } \right)^{\text{New}} = \frac{{\gamma_{i} }}{{\mu_{i}^{2} }},$$
(15)
$$\left( {\sigma^{2} } \right)^{\text{New}} = \frac{{\left\| {{\mathbf{t}} - {\varvec{\Phi \upmu }}} \right\|^{2} }}{{N - \sum\nolimits_{i} {\gamma_{i} } }}$$
(16)

where \(\mu_{i}\) is the \(i\)th element of the posterior mean weight vector \(\mu\). The quantities \(\gamma_{i} \equiv 1 - \alpha_{i} \Sigma_{ii}\) measure how well-determined each weight is, where \(\Sigma_{ii}\) denotes the \(i\)th diagonal element of the posterior covariance matrix \(\Sigma\) from Eq. (12).

The maximization of \(p\left( {\left. {\mathbf{t}} \right|{\varvec{\upalpha}},\sigma^{2} } \right)\) is known as the type-II maximum likelihood method [46] or the evidence procedure for hyper-parameters [47]. Once the iterative procedure has converged to the most probable values \({\varvec{\upalpha}}_{MP}\) and \(\sigma_{MP}^{2}\), the predictive distribution for a new input \(x_{*}\) can be written as

$$p\left( {\left. {t_{*} } \right|{\mathbf{t}},{\varvec{\upalpha}}_{\text{MP}} ,\sigma_{\text{MP}}^{2} } \right) = \int {p\left( {\left. {t_{*} } \right|{\mathbf{w}},\sigma_{\text{MP}}^{2} } \right)p\left( {\left. {\mathbf{w}} \right|{\mathbf{t}},{\varvec{\upalpha}}_{\text{MP}} ,\sigma_{\text{MP}}^{2} } \right){\text{d}}{\mathbf{w}}} = {\mathcal{N}}\left( {\left. {t_{*} } \right|y_{*} ,\sigma_{*}^{2} } \right)$$
(17)

where \(y_{*} = \mu^{T} \phi (x_{*} )\) and \(\sigma_{*}^{2} = \sigma_{\text{MP}}^{2} + \phi (x_{*} )^{T} \Sigma \phi (x_{*} )\). The mean \(y_{*}\) is the RVM prediction at the test point \(x_{*}\), and the variance \(\sigma_{*}^{2}\) captures the uncertainty of the predictive distribution there. For example, the 95% confidence interval (CI) of the predicted result is \(\left[ {y_{*} - 1.96\sigma_{*} ,y_{*} + 1.96\sigma_{*} } \right]\).
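A minimal NumPy sketch of the re-estimation loop of Eqs. (12)~(16) and the predictive distribution of Eq. (17) is given below. For brevity it omits the pruning of basis functions whose \(\alpha_{i}\) diverge, so it costs \(O(N^{3})\) per iteration; the initialization heuristics and numerical guards are ours.

```python
import numpy as np

def train_rvm(Phi, t, n_iter=500, alpha_cap=1e9):
    """Type-II maximum likelihood re-estimation, Eqs. (12)-(16)."""
    N, M = Phi.shape
    alpha = np.ones(M)                 # one hyper-parameter per weight
    sigma2 = 0.1 * np.var(t)           # initial noise variance (heuristic)
    for _ in range(n_iter):
        A = np.diag(alpha)
        Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + A)            # Eq. (12)
        mu = Sigma @ Phi.T @ t / sigma2                            # Eq. (13)
        gamma = 1.0 - alpha * np.diag(Sigma)                       # well-determinedness
        alpha = np.minimum(gamma / (mu ** 2 + 1e-12), alpha_cap)   # Eq. (15)
        denom = max(N - np.sum(gamma), 1e-6)                       # guard the denominator
        sigma2 = np.sum((t - Phi @ mu) ** 2) / denom               # Eq. (16)
    return mu, Sigma, sigma2

def predict_rvm(phi_star, mu, Sigma, sigma2):
    """Predictive mean, variance and 95% CI at one test point, Eq. (17)."""
    y_star = mu @ phi_star
    var_star = sigma2 + phi_star @ Sigma @ phi_star
    half = 1.96 * np.sqrt(var_star)
    return y_star, var_star, (y_star - half, y_star + half)
```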

2.2.2 Multi-kernel technique

In RVM modeling there is no constraint on the type of kernel function (e.g., the kernels need not satisfy the Mercer condition) [33]. Nevertheless, a suitable kernel function must be selected empirically and appropriate hyper-parameter values determined. Constructing entirely new high-performance kernel functions is complicated and requires many trials and considerable computing resources, so applying simple mathematical operations to simple kernel functions to construct a new kernel is an effective alternative. In this study, six kernel functions are considered: three simple kernels and three multi-kernels constructed from them with a weighted combination strategy. The Gaussian, Polynomial, and Laplace kernels are the commonly used simple kernels; their weighted combinations yield three multi-kernels (the SumGL, SumGP, and SumLP kernels). These kernel functions are summarized in Table 1, where \(r\) denotes the hyper-parameters, whose values must be optimized so that the kernels map the data with high performance.

Table 1 The used simple kernels and the constructed multi-kernels
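Since Table 1 gives only the functional forms, the sketch below illustrates one plausible reading of the weighted combination strategy: the SumGP kernel as a convex combination of the Gaussian and Polynomial kernels. The weight convention and the polynomial degree are assumptions; note that any weighted sum of these kernels remains admissible because the RVM does not require the Mercer condition.

```python
import numpy as np

def gaussian(x, y, r):
    return np.exp(-np.sum((x - y) ** 2) / r ** 2)

def laplace(x, y, r):
    return np.exp(-np.linalg.norm(x - y) / r)

def polynomial(x, y, r, p=2):
    return (x @ y + r) ** p

def sum_gp(x, y, r_g, r_p, w=0.5):
    """SumGP multi-kernel: w * Gaussian + (1 - w) * Polynomial."""
    return w * gaussian(x, y, r_g) + (1.0 - w) * polynomial(x, y, r_p)
```

The SumGL and SumLP kernels would be formed analogously from the Gaussian/Laplace and Laplace/Polynomial pairs.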

2.2.3 Parallel Jaya algorithm

The Jaya algorithm, a powerful state-of-the-art optimization algorithm, was proposed by Rao [38]. Its advantage is that it has no algorithm-specific control parameters. The Jaya algorithm is built on the idea that a candidate solution for a given problem should move towards the best solution in the population and away from the worst. The algorithm is described as follows.

Let \(f(X)\) be the function to be optimized, \(m\) the number of parameters to be determined, and \(n\) the population size (candidates \(k = 1, \ldots ,n\)). The total population can therefore be considered a matrix of dimension \((m, n)\). Let \(f(x)_{\text{best}}\) be the best value of the objective function, produced by the best candidate, and \(f(x)_{\text{worst}}\) the worst objective value, produced by the worst candidate. Each solution is updated according to its difference from both the best and the worst candidates. If \(X_{j,k,i}\) denotes the value of the \(j\)th variable for the \(k\)th candidate during the \(i\)th iteration, this value is updated by Eq. (18).

$$X^{\prime}_{j,k,i} = X_{j,k,i} + r_{1j,i} \left( {X_{{j,{\text{best}},i}} - \left| {X_{j,k,i} } \right|} \right) - r_{2j,i} \left( {X_{{j,{\text{worst}},i}} - \left| {X_{j,k,i} } \right|} \right)$$
(18)

where \(r_{1j,i}\) and \(r_{2j,i}\) are two different random numbers uniformly distributed in \([0,\;1]\), and \(X_{{j,{\text{best}},i}}\) and \(X_{{j,{\text{worst}},i}}\) denote the values of the \(j\)th variable for the best and worst candidates, respectively. A detailed description of the Jaya algorithm can be found in [48]. To improve computational efficiency, the concept of multiple populations is introduced to establish the parallel Jaya algorithm (PJA) based on a static multi-population scheme [49]: the population is divided into several sub-populations, and this sub-population structure is used to parallelize the sequential algorithm. The flowchart of the multi-population-based PJA is shown in Fig. 1, and a minimal update-step sketch is given after the figure.

Fig. 1
figure 1

Flowchart of the multi-population-based parallel Jaya algorithm (PJA)
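The following sketch implements one Jaya iteration over a population according to Eq. (18); the boundary clipping and the function name are ours.

```python
import numpy as np

def jaya_step(pop, fitness, lb, ub, rng):
    """One Jaya update of an (n, m) population, Eq. (18): move towards
    the current best candidate and away from the worst."""
    best = pop[np.argmin(fitness)]
    worst = pop[np.argmax(fitness)]
    r1 = rng.random(pop.shape)
    r2 = rng.random(pop.shape)
    new = pop + r1 * (best - np.abs(pop)) - r2 * (worst - np.abs(pop))
    return np.clip(new, lb, ub)        # keep candidates inside the search space
```

The greedy acceptance rule (keep a candidate only if its fitness improves) is applied after evaluation, as in step (4) of Section 2.2.4.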

2.2.4 Parameters optimization method for RVM using parallel Jaya algorithm

As mentioned above, the kernel function and its hyper-parameters have a significant impact on the performance and sparsity of RVM-based models. In general, the kernel function and hyper-parameter values are defined before the model is implemented. To obtain optimal hyper-parameter values and prevent over-fitting on the validation data, the RVM parameters to be optimized are encoded in the solutions of the Jaya algorithm. A solution is represented as \(s = \left( {s_{1} ,s_{2} , \ldots ,s_{r} } \right)\), where the \(s_{i}\) are the kernel parameters and \(r\) is the number of parameters to be optimized. \(k\)-fold cross-validation is a popular way to estimate model generalization performance: the selected dataset is partitioned into \(k\) equal subsets, a single subset is used for validation while the remaining \(k - 1\) subsets are used for training, and the procedure is carried out \(k\) times so that each subset is used exactly once for validation.

The target function should be defined in a proper form. In this study, the root mean square error (RMSE) of the solution is chosen as the target function, as shown in Eq. (19).

$$F_{\text{RMSE}} \left( \varvec{s} \right) = \frac{1}{K}\sum\limits_{k = 1}^{K} {\sqrt {\frac{1}{{N_{k} }}\sum\limits_{i = 1}^{{N_{k} }} {\left( {y_{i} - y(i)_{\varvec{s}} } \right)^{2} } } }$$
(19)

where \(K\) is the number of subsets and \(N_{k}\) is the number of validation samples in the \(k\)th subset; \(y_{i}\) denotes the target value and \(y(i)_{\varvec{s}}\) denotes the value predicted by the RVM model with parameters \(\varvec{s}\). In this paper, 5-fold cross-validation is carried out for model training, so \(K\) is set to 5.

The adaption of hyper-parameters using PJA contains the following steps:

  1. (1)

    Set the population size of the Jaya algorithm, initialize the solutions, and select the kernel function. Evaluate the initial solutions with the target function in Eq. (19).

  2. (2)

    Split the population into \(P\) sub-populations and build parallel calculation structure. Find the best solution and worst solution in each population.

  3. (3)

    In each sub-population, update each solution \(\varvec{s}_{i}\) to a candidate solution \(\varvec{c}_{i}\) by Eq. (18), and evaluate the target function of each candidate by carrying out 5-fold cross-validation.

  4. (4)

    For each solution, if \(f\left( {\varvec{c}_{i} } \right) < f\left( {\varvec{s}_{i} } \right)\), update \(\varvec{s}_{i}\) with \(\varvec{c}_{i}\); else, do not update \(\varvec{s}_{i}\).

  5. (5)

    Repeat steps (3)~(4) until the maximum number of iterations is reached.

  6. (6)

    Record the optimal solution in each sub-population.

  7. (7)

    Record the best solution among the optimal solutions obtained from the sub-populations; this solution minimizes the target function.

For a specific monitoring dataset, the optimal hyper-parameters \(r_{opt}\) and the fitted or predicted outputs can be obtained by performing the above steps; a compact sketch of this procedure is given below.
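The steps above can be condensed into the following sketch, which reuses the `jaya_step` function from Section 2.2.3. The cross-validation target follows Eq. (19); `build_and_eval` stands for any routine that trains an RVM with hyper-parameters \(r\) and predicts the validation fold, and the sequential loop over sub-populations would be dispatched to parallel processes in a real PJA implementation.

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_rmse(r, X, t, build_and_eval):
    """Target function of Eq. (19): mean RMSE over 5-fold cross-validation."""
    scores = []
    for tr, va in KFold(n_splits=5).split(X):
        pred = build_and_eval(r, X[tr], t[tr], X[va])   # train RVM, predict fold
        scores.append(np.sqrt(np.mean((t[va] - pred) ** 2)))
    return np.mean(scores)

def pja_optimize(f, lb, ub, n_sub=4, sub_size=10, n_iter=150, seed=0):
    """Static multi-population PJA: evolve each sub-population with Jaya
    updates and return the best solution found across sub-populations."""
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    best_s, best_f = None, np.inf
    for _ in range(n_sub):                              # steps (2)-(6)
        pop = lb + rng.random((sub_size, lb.size)) * (ub - lb)
        fit = np.array([f(s) for s in pop])
        for _ in range(n_iter):
            cand = jaya_step(pop, fit, lb, ub, rng)     # Eq. (18)
            cfit = np.array([f(s) for s in cand])
            better = cfit < fit                         # greedy acceptance, step (4)
            pop[better], fit[better] = cand[better], cfit[better]
        k = np.argmin(fit)
        if fit[k] < best_f:                             # step (7)
            best_s, best_f = pop[k].copy(), fit[k]
    return best_s, best_f
```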

2.3 Procedure of ORVM for the prediction of concrete dam displacement

As mentioned in Section 2.1, the water pressure, temperature, and time components are selected as the independent variables of the model, and the displacement is adopted as the dependent variable. Note that the initial values must be deducted when constructing the hydrostatic pressure and time factors. The input \(\varvec{x}\) of the model is therefore the vector shown below.

$$\begin{aligned} \varvec{x} & = \left\{ {H - H_{0} ,\left( {H - H_{0} } \right)^{2} ,\left( {H - H_{0} } \right)^{3} ,\left( {H - H_{0} } \right)^{4} ,} \right. \\ & \quad \left. {\sin \left( d \right),\cos \left( d \right),\sin \left( d \right)\cos \left( d \right),\sin^{2} \left( d \right),t - t_{0} ,\left( {e^{{ - t_{0} }} - e^{ - t} } \right),\ln \left( t \right) - \ln \left( {t_{0} } \right)} \right\} \\ \end{aligned}$$
(20)

where \(H_{0}\) denotes the water level on the initial monitoring day and \(t_{0}\) denotes the initial monitoring day. The other symbols have the same meanings as in Eqs. (2)~(5).

To eliminate the influence of dimension, the input data are normalized to the range [0, 1] by

$$f\left( {x_{i} } \right) = \frac{{x_{i} - x_{i\hbox{min} } }}{{x_{i\hbox{max} } - x_{i\hbox{min} } }}$$
(21)

where \(x_{i}\) represents the value to be normalized. \(x_{i\hbox{max} }\) and \(x_{i\hbox{min} }\) denote the maximum and minimum value of the data to be normalized, respectively.
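A sketch of the feature construction of Eq. (20) and the min-max scaling of Eq. (21) is given below. Eq. (20) is followed literally; in practice \(t\) may be rescaled (e.g., \(\theta = t/100\) as in Eq. (5)) so that the exponential terms do not underflow for large \(t\), which is an implementation choice not specified in the text.

```python
import numpy as np

def hst_features(H, t, H0, t0):
    """Input vector x of Eq. (20) for one observation (arch dam, h = 4)."""
    d = 2.0 * np.pi * t / 365.0
    dH = H - H0
    return np.array([dH, dH ** 2, dH ** 3, dH ** 4,
                     np.sin(d), np.cos(d), np.sin(d) * np.cos(d), np.sin(d) ** 2,
                     t - t0, np.exp(-t0) - np.exp(-t),
                     np.log(t) - np.log(t0)])            # 11 factors in total

def min_max_normalize(X):
    """Column-wise scaling to [0, 1], Eq. (21)."""
    Xmin, Xmax = X.min(axis=0), X.max(axis=0)
    return (X - Xmin) / (Xmax - Xmin)
```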

The flowchart of the proposed ORVM-based probabilistic prediction model for dam displacement is illustrated in Fig. 2, and the main procedure is described as follows.

Fig. 2
figure 2

The flowchart of the ORVM predictive model of concrete dam displacement

  1. (1)

    Choose the influential components and determine the displacement to be modeled.

  2. (2)

    Data preparation and normalization. Collect the monitoring data from the dam monitoring system and build the inputs of the model. All the data should be normalized within a range of [0, 1].

  3. (3)

    Dataset division. Based on the obtained data, establish the training set and testing set for modeling.

  4. (4)

    Optimization of model parameters. Select a kernel function and determine the hyper-parameter values using the parameter optimization method for the RVM described in Section 2.2.4.

  5. (5)

    Model establishment. The RVM-based model is built using the training data, the selected kernel function, and the optimal hyper-parameter values.

  6. (6)

    Performance verification. Use the testing set to verify whether the trained ORVM model generalizes well to unknown monitoring data.

In this study, six statistical metrics are used to comprehensively evaluate predictive performance: the coefficient of determination (\(R^{2}\)), the root mean square error (RMSE), the mean absolute error (MAE), the maximum absolute error (ME), the average width of the confidence interval (AWCI), and the average variance of the confidence interval (AVCI); their expressions are given in Appendix A. A model is more precise if it attains lower RMSE, MAE, and ME values together with a higher \(R^{2}\) value on both the training and testing datasets. AWCI and AVCI reflect the stability and smoothness of the 95% CI obtained by the ORVM: a smaller AWCI indicates more reliable predictions, and a smaller AVCI indicates a smoother and more stable 95% CI. A brief computational sketch of these criteria is given below.
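The sketch below computes the six criteria for a set of predictions with 95% CI bounds. \(R^{2}\), RMSE, MAE, and ME follow their standard definitions; since AWCI and AVCI are defined in Appendix A, which is not reproduced here, the versions below (mean CI width and variance of the CI width) are our assumptions for illustration.

```python
import numpy as np

def evaluation_metrics(y, yhat, lower, upper):
    """Six criteria of Section 2.3 for targets y, predictions yhat,
    and 95% CI bounds (lower, upper)."""
    resid = y - yhat
    width = upper - lower
    return {
        "R2":   1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2),
        "RMSE": np.sqrt(np.mean(resid ** 2)),
        "MAE":  np.mean(np.abs(resid)),
        "ME":   np.max(np.abs(resid)),
        "AWCI": np.mean(width),        # average width of the 95% CI (assumed form)
        "AVCI": np.var(width),         # variability of the CI width (assumed form)
    }
```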

3 Application

3.1 Dam engineering profile

The Jinping-I hydropower station is located in Sichuan Province, China. It mainly comprises a double-curvature concrete arch dam, an underground power plant, and water conveyance structures. The dam crest elevation is 1885 m and the maximum height of the arch dam is 305 m, making Jinping-I currently the highest arch dam in the world; the dam consists of 26 sections with a crest length of 552 m. The dam is equipped with an advanced automatic monitoring system composed of various instruments, including water level gauges, pendulums, thermometers, strain gauges, osmometers, and piezometers. In this study, the radial displacement measured at reading station PL13-3 of the central pendulum system is analyzed for modeling. An overview of the dam and the location of the pendulums in the No. 13 dam section are shown in Fig. 3.

Fig. 3
figure 3

The concrete arch dam: (a) downstream view; (b) Location of pendulums in No. 13 dam section

3.2 Data collection and preparation

The radial displacement measurements of monitoring point PL13-3 used in this study were recorded from 1st September 2013 to 7th November 2016, with 380 data samples in total. Figs. 4 and 5 illustrate the time evolution of the measured radial displacement and the reservoir water level, respectively. Since there are not enough continuous dam body temperature measurements near PL13-3, the harmonic functions given in Eq. (4) are used to model the temperature effect indirectly. A total of 11 factors are therefore selected as the independent variables of the models, and the measured radial displacement of PL13-3 is the dependent variable.

Fig. 4
figure 4

Measured radial displacement

Fig. 5
figure 5

The reservoir water level

The first 320 samples, corresponding to the period between 3rd September 2013 and 30th July 2016, are used for cross-validation and training. The remaining 60 samples, corresponding to the period between 31st July 2016 and 7th October 2016, are used as the testing set for evaluating model performance. The testing set is subdivided into six parts of 10, 20, 30, 40, 50, and 60 samples, respectively, to test the predictive performance and robustness of the RVM for dam displacement. Detailed information on the training and testing sets is listed in Table 2. Note that the deformation data were recorded weekly rather than daily from September 2014 to March 2016 owing to instrument maintenance and debugging of the monitoring system.

Table 2 Training and testing sets

3.3 Performance evaluation of the ORVM models with different kernel functions

For the ORVM model, different simple kernel functions and multi-kernel functions are selected for testing and comparison. The hyper-parameter values of the kernel functions are obtained by the optimization method introduced in Section 2.2.4. For the PJA, the population size is set to 40 and the maximum iteration number to 150. To establish the parallel calculation structure, the population is divided equally into 4 sub-populations of size 10 in the multi-population structure.
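As a hypothetical usage example, these settings map onto the `pja_optimize` sketch of Section 2.2.4 as follows; the variables `X_train`, `t_train`, and `build_and_eval`, as well as the search bounds, are placeholders (the actual ranges are listed in Table 3).

```python
# 4 sub-populations of 10 (total population 40), 150 iterations,
# optimizing a single Gaussian-kernel width r within illustrative bounds.
objective = lambda s: cv_rmse(s[0], X_train, t_train, build_and_eval)
r_opt, f_opt = pja_optimize(objective, lb=[0.01], ub=[50.0],
                            n_sub=4, sub_size=10, n_iter=150)
```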

The search space of the hyper-parameter values and the obtained optimal values of the different ORVM models are listed in Table 3, and the convergence curves of the simple kernel-based and multi-kernel ORVM models are shown in Fig. 6. From Fig. 6 and Table 3 it can be observed that the PJA performs well for RVM parameter optimization, as the fitness values of the six models remain stable after 100 iterations. The fitness value of the Gaussian kernel-based ORVM model (G-ORVM), 0.592, is the smallest among the three simple kernel-based models, and the fitness value of the SumGP kernel-based ORVM (GP-ORVM), 0.566, is the smallest among the three multi-kernel models. On the whole, the fitness value of GP-ORVM is the smallest of the six models, suggesting that GP-ORVM performs best.

Table 3 Search space of the RVM hyper-parameters
Fig. 6
figure 6

Convergence characteristics of RVM with different kernels

Taking testing set 3 as an example, four evaluation metrics of the ORVM models with different kernel functions are listed in Table 4, with the statistically superior results shown in boldface. In general, all six ORVM models achieve satisfactory performance, as the coefficients of determination on the training and testing sets exceed 0.95. The three multi-kernel ORVM models provide particularly good results on the testing set, where the GP-ORVM model has the smallest RMSE, MAE, and ME values of 0.2765, 0.2441, and 0.4726, respectively. Among the simple kernel ORVM models, the G-ORVM model provides the best predictive performance. Overall, Table 4 shows that the GP-ORVM and G-ORVM models outperform the other four models in generalization and predictive accuracy.

Table 4 The performance of ORVM models based on different kernel functions

Figs. 7 and 8 depict the advantage of probabilistic prediction of concrete dam displacement using the GP-ORVM and G-ORVM models, where the confidence level is set to 95% and the blue line denotes the measured displacement. On the training sets, the upper and lower bounds of the 95% CI closely track the measured displacement, and except for individual peak points, most measurements fall inside the 95% CI. On the testing sets, all the measured displacements fall inside the 95% CIs, but GP-ORVM provides a narrower CI than G-ORVM. A narrower CI is significant because it is more sensitive and can capture abnormal displacement data more effectively.

Fig. 7
figure 7

The output results of the GP-ORVM-based prediction model with 95% CI

Fig. 8
figure 8

The output results of the G-ORVM-based prediction model with 95% CI

To test the probabilistic prediction performance of GP-ORVM and G-ORVM on different data, the six testing sets are used for simulation, and the calculated AWCI and AVCI values are shown in Fig. 9. The AWCI values of the GP-ORVM model fluctuate around 1.5 mm and are slightly smaller than those of the G-ORVM model, meaning that the 95% CI predicted by GP-ORVM is somewhat compressed and its predictions are more reliable. The AVCI values of both ORVM models are small, lying between 0.08 and 0.24; however, the AVCI values of the GP-ORVM model are significantly smaller than those of the G-ORVM model, reflecting a more stable and smoother 95% CI. A smoother CI improves the reliability of anomaly recognition.

Fig. 9
figure 9

Performance evaluation of the 95% CIs obtained by GP-ORVM and G-ORVM

3.4 Performance comparison of the existing models

In this section, the RBF-NN, SVR, ELM, and HST-based multiple linear regression (HST-MLR) models are selected as benchmarks for comparison with the proposed ORVM models. For the HST-MLR model, the regression coefficients are computed by the least squares method. For the RBF-NN, SVR, and ELM models, the hyper-parameters are determined rigorously in the same optimized manner: 5-fold cross-validation and the PJA are used to estimate the hyper-parameters, with the same target function as Eq. (19).

In the RBF-NN model, the spread \(S_{R}\) and the number of hidden layer neurons \(N_{R}\) are the parameters to be optimized. The training objective for the mean square error is set to \(10^{-4}\). The spread \(S_{R}\) is searched in the interval [0.01, 100] and the number of hidden layer nodes \(N_{R}\) in the interval [11, 50]; the obtained optimal values are 7.99 and 14, respectively.

For the ELM model, the sigmoidal function is chosen as the activation function. The number of hidden layer nodes \(N_{E}\) is searched in the interval [11, 50], and the calculated optimal value is 14. Note that this result is the average performance over fifty consecutive training runs, which reduces the uncertainty introduced by the random input weights.

For the SVR model, the penalty factor \(C\), the kernel parameter \(\sigma^{\prime}\), and the insensitive loss coefficient \(\varepsilon\) are the three control parameters. The Gaussian kernel is selected as the kernel function. The penalty factor is searched in the interval [0.01, 100], the kernel parameter in [0.01, 50], and the insensitive loss coefficient in [0.001, 0.1]; the obtained optimal values are 2.82, 0.09, and 0.038, respectively.

The search range of the control parameters and the obtained optimal parameter values of the G-ORVM, GP-ORVM, SVR, RBF-NN and ELM models are summarized in Table 5.

Table 5 Search range of the control parameters and the obtained optimal values of different models

In the same manner, taking testing set 3 as an example, a detailed comparison of the prediction performance of the six models is carried out. Evaluation metrics of the fitted and predicted results are listed in Table 6, with the best results in boldface, and the fitted and predicted results are shown in Fig. 10. The \(R^{2}\) values of the fitted results of all six models are close to 1.0, reflecting satisfactory fitting performance. The RMSE, MAE, and ME values of the GP-ORVM predictions are the smallest, indicating the best predictive performance. Fig. 10 also shows that the residuals of the GP-ORVM and HST-MLR predictions at peak values (e.g., the data on 2016-8-10) are significantly smaller than those of the other models.

Table 6 Performance of different models using the testing set 3
Fig. 10
figure 10

Performance comparison of the six models

In the training period, the GP-ORVM model retains 6.56% of the training data as relevance vectors and the G-ORVM model retains 5.94%. By contrast, the number of support vectors required by the SVR model is close to the size of the training set. The developed RVM models thus obtain a much sparser solution with very few relevance vectors; that is, most of the \(\alpha_{i}\) tend to infinity and the corresponding \(w_{i} = 0\). Consequently, both the possibility of over-training and the computational time are minimized.

To evaluate the performance of the proposed ORVM models objectively, a detailed comparison using the different unknown testing sets is made. Bar charts of the predictive performance of the models on the six testing sets are shown in Fig. 11, and the evaluation metrics are listed in Table 7. The GP-ORVM and G-ORVM models are noticeably more robust and reliable than the other models regardless of testing set size. The GP-ORVM model has the smallest RMSE on testing set 1 and testing sets 3~6, with values of 0.165, 0.277, 0.329, 0.466, and 0.501, respectively; on testing set 2, its RMSE of 0.267 is the second lowest, only slightly higher than the 0.248 and 0.259 of the HST-MLR and G-ORVM models. The GP-ORVM model also has the smallest MAE on testing set 1 and testing sets 3~6, with values of 0.129, 0.203, 0.244, 0.286, and 0.378, respectively; on testing set 2, its MAE of 0.202 is the second lowest. Notably, as the number of testing samples increases, the predictive precision of the HST-MLR model decreases markedly, even though it performs well on the small testing sets (testing sets 1~3). For the SVR and RBF-NN models, the variation of predictive performance is not monotonic, and their performance in the short-term prediction of concrete dam displacement is not satisfactory.

Fig. 11
figure 11

Performance evaluation of the six models for dam displacement prediction

Table 7 Comparison of the RMSE and MAE values in six testing sets

Overall, the predictive performance of the two ORVM-based models is satisfactory. The GP-ORVM model has the minimum average RMSE and MAE, indicating that it is the most effective of the listed models. The G-ORVM model performs similarly, with average RMSE and MAE values only slightly higher than those of GP-ORVM across the different testing sets.

4 Conclusion and future work

In this study, a novel probabilistic prediction model of concrete dam displacement is presented to support the structural health monitoring framework and to mine the effects of the hydrostatic, seasonal, and irreversible time components on dam deformation. The model combines the RVM, the multi-kernel technique, the HST statistical model, and the PJA, using initial-service-life monitoring measurements collected from a super-high concrete arch dam. The proposed parameter optimization method is verified to be effective in enabling the RVM to achieve accurate and robust predictions. Different kernel functions, namely the Gaussian, Laplace, and Polynomial kernels and the multi-kernels formed by their weighted combination, are also exploited in building the ORVM to assess their impact on predictive performance. The main conclusions and contributions are summarized as follows:

  • The developed ORVM model is suitable for the prediction of non-stationary and nonlinear concrete dam displacement, providing satisfactory performance on both training and testing sets. The proposed optimization framework for parameter estimation using the PJA can optimize the hyper-parameters of the RVM effectively and avoid falling into local optima, improving the model's predictive performance and robustness.

  • The kernel functions and hyper-parameters have significant impacts on the performance of the RVM model. The results suggest that the weighted combination strategy for multi-kernel construction is feasible and that the multi-kernel ORVM models perform better than the simple kernel ORVM models.

  • Compared with the listed benchmark models, the ORVM proves robust and effective for building dam health monitoring models that predict concrete dam displacement. The developed ORVM models with the SumGP and Gaussian kernels outperform the optimized SVR, RBF-NN, ELM, and HST-MLR models on most of the testing sets, reducing the prediction residuals. In addition, the developed ORVM is sparser than the SVM.

  • The developed ORVM model not only obtains the most accurate results in the single-point prediction of dam displacement measurements but also provides probabilistic CIs, which can be used to quantify uncertainty and identify abnormal displacement values. Moreover, the multi-kernel ORVM compresses and smooths the CI, improving the reliability of displacement anomaly recognition.

In future work, the proposed model can be adopted for the analysis and prediction of other monitoring measurements in concrete dam engineering, such as tangential displacement, settlement, and seepage. Future studies should also develop a multi-output ORVM model to solve high-dimensional regression tasks and provide more reliable prediction and identification of dam spatial deformation.