Abstract
In this article, we consider the problem of estimating quantiles related to the outcome of experiments with a technical system given the distribution of the input together with an (imperfect) simulation model of the technical system and (few) data points from the technical system. The distribution of the outcome of the technical system is estimated in a regression model, where the distribution of the residuals is estimated on the basis of a conditional density estimate. It is shown how Monte Carlo simulation can be used to estimate quantiles of the outcome of the technical system on the basis of the above estimates, and the rate of convergence of the quantile estimate is analyzed. Under suitable assumptions, it is shown that this rate of convergence is faster than the rate of convergence of standard estimates which ignore either the (imperfect) simulation model or the data from the technical system; hence, it is crucial to combine both kinds of information. The results are illustrated by applying the estimates to simulated and real data.
1 Introduction
The design of complex technical systems by engineers always has to take into account some kind of uncertainty. This uncertainty might occur because of lack of knowledge about future use or about properties of the materials used to build the technical system (e.g., the exact value of the damping coefficient of a spring–mass damper). In order to take this uncertainty into account, we model in the sequel the outcome Y of the technical system by a random variable. For simplicity, we restrict ourselves to the case that Y is a real-valued random variable. Thus, we are interested in properties of the distribution of Y; for example, we are interested in quantiles
\(q_{Y,\alpha } = \inf \{ y \in \mathbb {R}: {\mathbf P}\{Y \le y\} \ge \alpha \}\)  (1)
for \(\alpha \in (0,1)\) (which for \(\alpha \) close to one describe values that we expect to be upper bounds on the values occurring in an application), or in the density \(g_Y:\mathbb {R}\rightarrow \mathbb {R}\) of Y with respect to the Lebesgue–Borel measure, which we assume later to exist.
In the sequel, we model the lack of knowledge about the future use of the system or about properties of materials used in it by introducing an additional \(\mathbb {R}^d\)-valued random variable X, which contains values for uncertain parameters describing the system or its future use, and of which we assume either to know the distribution or to be able to generate an arbitrary number of independent realizations. Furthermore, we assume that we have available a model describing the relation between X and Y by a function \(\bar{m}:\mathbb {R}^d\rightarrow \mathbb {R}\). This function \(\bar{m}\) might be constructed by using a physical model of our technical system, and in some sense \(\bar{m}(X)\) is an approximation of Y. However, like all models, ours is imperfect in the sense that \(Y=\bar{m}(X)\) does not hold. This might be due to the fact that Y cannot be exactly characterized by a function of X (since X might not describe the randomness of Y completely), or because the relation between Y and X is not correctly specified by \(\bar{m}\), or both. So although we know \(\bar{m}\) and can generate an arbitrary number of independent copies \(X_1\), \(X_2\), ... of X, we cannot use \(\bar{m}(X_1)\), \(\bar{m}(X_2)\), ... as observations of Y, since there is an error between these values and a sample of Y.
In order to control this error, we assume that we have available \(n \in \mathbb {N}\) observations of the Y-values corresponding to the first n values of X. To formulate our prediction problem precisely, let (X, Y), \((X_1,Y_1)\), \((X_2,Y_2)\), ...be independent and identically distributed and let \(L_n, N_n \in \mathbb {N}\). We assume that we are given the data
and we want to use these data in order to estimate the quantiles \(q_{Y,\alpha }\) or the density \(g_Y\) of Y (which we later assume to exist). The main difficulty in solving this problem is that the sample size n of the observations of Y (which corresponds to the number of experiments we are making with the technical system) is rather small (since these experiments are time consuming or costly).
Before we describe various existing approaches to this problem from the literature, we illustrate the problem by an example. Here, we consider a demonstrator for a suspension strut, which was built at the Technische Universität Darmstadt and which serves as an academic demonstrator to study uncertainty in load distributions and the ability to control vibrations, stability and load paths in suspension struts such as aircraft landing gears. A photograph of this suspension strut and its experimental test setup is shown in Fig. 1(left); a CAD illustration of this suspension strut can be found in Fig. 1(middle).
This suspension strut consists of upper and lower structures, where the lower structure contains a spring–damper component and an elastic foot. The spring–damper component transmits the axial forces between the upper and lower structures of the suspension strut. The aim of our analysis is to study the behavior of the maximum relative compression of the spring–damper component when the free fall height is chosen randomly. Here, we assume that the free fall heights are independent and normally distributed with mean 0.05 m and standard deviation 0.0057 m.
We analyze the uncertainty in the maximum relative compression in our suspension strut using a simplified mathematical model of the suspension strut [cf., Fig. 1(right)], where the upper and the lower structures of the suspension strut are two lumped masses m and \(m_1\), the spring–damper component is represented by a stiffness parameter k and a suitable damping coefficient b, and the foot is represented by another stiffness parameter \(k_{ef}\). Using a linear stiffness and an axiomatic damping, it is possible to compute the maximum relative compression by solving a differential equation with a Runge–Kutta algorithm [cf., model a) in Mallapur and Platz (2017)]. Figure 2 shows \(L_n=500\) data points from the computer experiment together with \(n=20\) experimental data points. Since the two data sets do not look like they come from the same source, our computer model is obviously imperfect. Our aim in the sequel is to use the \(n=20\) data points from our experiments with the suspension strut together with the \(L_n=500\) data points from the computer experiments in order to analyze the uncertainty in the above-described experiments with the suspension strut. This can be done, e.g., by making statistical inference about quantiles or the density of the maximal compression occurring in experiments with the suspension strut. Here, we do not only want to adjust for a constant shift in order to match the simulator and the experimental data closely, but we also want to take into account that the values of Y are not a deterministic function of X.
There are various possible approaches to solve the above estimation problem. The simplest idea is to ignore the model \(\bar{m}(X)\) completely and to make inference about \(q_{Y,\alpha }\) and \(g_Y\) using only the observations
of Y. For example, we can estimate the quantile \(q_{Y,\alpha }\) by the plug-in estimate
corresponding to the estimate
of the cumulative distribution function (cdf) \(G(y)={\mathbf P}\{Y \le y\}\) of Y, which results in an order statistic as the quantile estimate. Alternatively, we can estimate the density \(g_Y\) of Y by the well-known kernel density estimate of Rosenblatt (1956) and Parzen (1962), where we first choose a density \(K:\mathbb {R}\rightarrow \mathbb {R}\) (the so-called kernel) and a so-called bandwidth \(h_n>0\) and define our estimate by
However, since the sample size n of our data (3) is rather small, this will in general not lead to satisfying results.
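A minimal sketch of these two simple estimates based only on the n experimental observations (Python; the data and the Gaussian kernel are illustrative choices, and all names are ours):

```python
import numpy as np

def empirical_quantile(y_sample, alpha):
    """Plug-in quantile estimate: the ceil(alpha * n)-th order statistic."""
    y_sorted = np.sort(y_sample)
    return y_sorted[int(np.ceil(alpha * len(y_sample))) - 1]

def kernel_density(y_sample, h, grid):
    """Rosenblatt-Parzen kernel density estimate with a Gaussian kernel K
    and bandwidth h, evaluated on all points of grid."""
    u = (grid[:, None] - y_sample[None, :]) / h
    return (np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)).mean(axis=1) / h

rng = np.random.default_rng(0)
y = rng.normal(size=20)                 # n = 20 experimental outcomes of Y
print(empirical_quantile(y, 0.95))      # quantile estimate from the data alone
```

With only n = 20 points, the 0.95-quantile estimate is simply the 19th order statistic, which explains the large variance of this estimate.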
Another simple idea is to ignore the real data (3), and to use the model data
as a sample of Y with additional measurement errors, and to use this sample to define quantile and density estimates as above. In this way, we estimate \(q_{Y,\alpha }\) by
where
and we can estimate the density g of Y by
Since the function \(\bar{m}\) of our model \(\bar{m}(X)\) of Y might be costly to evaluate (e.g., in case that its values are defined as solutions of a complicated partial differential equation) and consequently \(L_n\) might not be very large, it makes sense to use in a first step the data
to compute a surrogate model
of \(\bar{m}\), and to compute in the second step the quantile and density estimates \(\hat{q}_{\hat{m}_{L_n}(X),N_n,\alpha }\) and \(\hat{g}_{\hat{m}_{L_n}(X),N_n}\) using the data
Surrogate models have been introduced and investigated with the aid of simulated and real data in connection with quadratic response surfaces in Bucher and Bourgund (1990), Kim and Na (1997) and Das and Zheng (2000), in the context of support vector machines in Hurtado (2004), Deheeger and Lemaire (2010) and Bourinet et al. (2011), in connection with neural networks in Papadrakakis and Lagaros (2002), and in the context of kriging in Kaymaz (2005) and Bichon et al. (2008).
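The two-step surrogate approach can be sketched as follows (a Python sketch with a hypothetical computer model; a cubic polynomial fitted by least squares stands in for the smoothing spline surrogate used later in this paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def m_bar(x):
    """Hypothetical (costly) computer model of the technical system."""
    return np.exp(x) + 0.1 * x**2

# Step 1: fit a cheap surrogate m_hat to the L_n simulator evaluations.
# (A cubic polynomial stands in for the smoothing spline estimate.)
X_model = rng.normal(size=500)                       # L_n = 500 design points
m_hat = np.poly1d(np.polyfit(X_model, m_bar(X_model), deg=3))

# Step 2: Monte Carlo with the surrogate: evaluate m_hat on N_n fresh
# realizations of X and take the corresponding order statistic.
X_mc = rng.normal(size=100_000)                      # N_n realizations of X
sample = np.sort(m_hat(X_mc))
q_surrogate = sample[int(np.ceil(0.95 * len(sample))) - 1]
print(q_surrogate)
```

If \(\bar{m}(X) \ne Y\), this estimate inherits the full model error, which motivates combining it with the experimental data.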
Under the assumption that we have \(\bar{m}(X)=Y\), the above estimates have been theoretically analyzed in Devroye et al. (2013), Bott et al. (2015), Felber et al. (2015a, b), Enss et al. (2016) and Kohler and Krzyżak (2018).
However, in practice there usually will be an error in the approximation of Y by \(\bar{m}(X)\), and it is unclear how this error influences the error of the quantile and density estimates.
Kohler et al. (2016) and Kohler and Krzyżak (2016) used the data
obtained by experiments with the technical system in order to control this error. In particular, confidence intervals for quantiles and confidence bands for densities are derived there. Wong et al. (2017) used the above data of the technical system in order to calibrate a computer model and estimated the error of the resulting model by using bootstrap. Kohler and Krzyżak (2017) used these data in order to improve the surrogate model and analyzed the density estimate based on the improved surrogate model.
Kohler et al. (2016) and Kohler and Krzyżak (2016, 2017) try to approximate Y by some function of X and make statistical inference on the basis of this approximation. Wong et al. (2017) do this similarly, but take into account additional measurement errors of the y-values. The basic new idea in this article is to estimate instead a regression model
where
is the residual error of our model \(\bar{m}(X)\), which is not related to measurement errors but instead is due to the fact that an approximation of Y by a function of X cannot be perfect. In this model, we estimate simultaneously \(\bar{m}\) and the conditional distribution \({\mathbf P}_{\bar{\epsilon } | X=x}\) of \(\bar{\epsilon }\) given \(X=x\). As soon as we have available estimates \(\hat{m}_{L_n}\) and \(\hat{{\mathbf P}}_{\bar{\epsilon } | X=x}\) of both, we generate data
(where \(\hat{\epsilon }(x)\) has the distribution \(\hat{{\mathbf P}}_{\bar{\epsilon } | X=x}\) conditioned on \(X=x\)) and use these data to define corresponding quantile estimates.
We assume in the sequel that the conditional distribution of \(\bar{\epsilon }\) given X has a density with respect to the Lebesgue–Borel measure. In order to estimate this conditional density, we use the well-known conditional kernel density estimate introduced already in Rosenblatt (1969). Concerning existing results on conditional density estimates, we refer to Fan et al. (1996), Fan and Yim (2004), Gooijer and Zerom (2003), Efromovich (2007), Bott and Kohler (2016, 2017) and the literature cited therein.
Our main result, which is formulated in Sect. 3, shows that our newly proposed quantile estimates achieve under suitable regularity conditions rates of convergence which are faster than the rates of convergence of the estimates (4), (6) and the modifications of (6) using \(\hat{m}_{L_n}\) instead of \(\bar{m}\). Furthermore, we show with simulated data that in the situations which we consider in our simulations this effect also occurs for finite sample sizes, and illustrate the usefulness of our newly proposed method by applying it to the spring–damper system introduced earlier.
Throughout this paper, we use the following notation: \(\mathbb {N}\), \(\mathbb {N}_0\) and \(\mathbb {R}\) are the sets of positive integers, nonnegative integers and real numbers, respectively. Let \(p=k+\beta \) for some \(k \in \mathbb {N}_0\) and \(0 < \beta \le 1\), and let \(C>0\). A function \(m:\mathbb {R}^d \rightarrow \mathbb {R}\) is called (p, C)-smooth, if for every \(\alpha =(\alpha _1, \ldots , \alpha _d) \in \mathbb {N}_0^d\) with \(\sum _{j=1}^d \alpha _j = k\) the partial derivative \(\frac{\partial ^k m}{\partial x_1^{\alpha _1}\ldots \partial x_d^{\alpha _d}}\) exists and satisfies
\(\left| \frac{\partial ^k m}{\partial x_1^{\alpha _1}\ldots \partial x_d^{\alpha _d}}(x) - \frac{\partial ^k m}{\partial x_1^{\alpha _1}\ldots \partial x_d^{\alpha _d}}(z) \right| \le C \cdot \Vert x-z\Vert ^{\beta }\)
for all \(x,z \in \mathbb {R}^d\). If X is a random variable, then \({\mathbf P}_X\) is the corresponding distribution, i.e., the measure associated with the random variable. If (X, Y) is an \(\mathbb {R}^d\times \mathbb {R}\)-valued random variable and \(x \in \mathbb {R}^d\), then \({\mathbf P}_{Y|X=x}\) denotes the conditional distribution of Y given \(X=x\). Let \(D \subseteq \mathbb {R}^d\) and let \(f:\mathbb {R}^d \rightarrow \mathbb {R}\). We write \(x = \arg \min _{z \in D} f(z)\) if \(\min _{z \in D} f(z)\) exists and if x satisfies
For \(x \in \mathbb {R}^d\) and \(r >0\), we denote the (closed) ball with center x and radius r by \(S_r(x)\). If A is a set, then \(I_A\) is the indicator function corresponding to A, i.e., the function which takes on the value 1 on A and is zero elsewhere. For \(A \subseteq \mathbb {R}\), we denote the infimum of A by \(\inf A\), where we use the convention \(\inf \emptyset = \infty \). If \(x \in \mathbb {R}\), then we denote the smallest integer greater than or equal to x by \(\lceil x \rceil \).
The outline of this paper is as follows: In Sect. 2, the construction of the newly proposed quantile estimate is explained. The main results are presented in Sect. 3 and proven in Sect. 5. The finite sample size performance of our estimates is illustrated in Sect. 4 by applying it to simulated and real data.
2 Definition of the estimate
In the sequel, we assume that we are given data (2), where \(n,L_n,N_n \in \mathbb {N}\), the \(\mathbb {R}^d\times \mathbb {R}\)-valued random variables (X, Y), \((X_1,Y_1)\), \((X_2,Y_2)\), ... are independent and identically distributed, and where \(\bar{m}:\mathbb {R}^d\rightarrow \mathbb {R}\) is measurable. Our aim is to estimate the quantile \(q_{Y,\alpha }\) defined in (1) for some \(\alpha \in (0,1)\).
To do this, we start by constructing an estimate of \(\bar{m}\). For this, we use the data
and define the penalized least squares estimates of \(\bar{m}\) by
and
for some \(\beta _{L_n}>0\), where \(k \in \mathbb {N}\) with \(2 k > d\),
is a penalty term penalizing the roughness of the estimate, \(W^k (\mathbb {R}^d)\) denotes the Sobolev space
and where \(\lambda _{L_n}>0\), \(T_L(x)=\max \{-L,\min \{L,x\}\}\) (for \(L>0\)) is the truncation operator and \(L_2(\mathbb {R}^d)\) denotes the set of square integrable functions on \(\mathbb {R}^d\). The condition \(2k>d\) implies that the functions in \(W^k(\mathbb {R}^d)\) are continuous, hence the value of a function at a point is well defined.
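For \(d=1\), the penalized least squares idea can be sketched as follows (Python; a small ridge penalty on the coefficients of a polynomial basis is a crude stand-in for the Sobolev roughness penalty, the truncation operator is omitted, and the model function is our own illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)

def m_bar(x):
    """Hypothetical computer model to be approximated."""
    return np.sin(2.0 * x) + x

# Noise-free simulator evaluations on the L_n design points.
X = rng.uniform(-2.0, 2.0, size=500)
Y = m_bar(X)

# Penalized least squares over a degree-7 polynomial basis; lam plays the
# role of the smoothing weight lambda_{L_n} (here only a tiny ridge term).
deg, lam = 7, 1e-6
B = np.vander(X, deg + 1)                       # basis matrix (powers of x)
coef = np.linalg.solve(B.T @ B + lam * np.eye(deg + 1), B.T @ Y)
m_hat = lambda x: np.vander(np.atleast_1d(x), deg + 1) @ coef

x0 = np.linspace(-1.5, 1.5, 7)
err = np.abs(m_hat(x0) - m_bar(x0)).max()       # small approximation error
print(err)
```

Since the simulator data are noise-free, even this crude surrogate approximates \(\bar{m}\) closely on the design region.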
Then, we compute the residuals of this estimate on the data \((X_1,Y_1), \ldots , (X_n,Y_n)\), i.e., we set
We use these residuals in order to estimate the conditional distribution of \(\bar{\epsilon }=Y-\bar{m}(X)\) given \(X=x\). Here, we assume that this distribution has a density and estimate this density by applying a conditional density estimator to the data
To do this, we set \(G=I_{[-1,1]}\) and let \(K:\mathbb {R}\rightarrow \mathbb {R}\) be a density, let \(h_n,H_n>0\) and set
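A sketch of the conditional kernel density estimate for \(d=1\) (Python; the residuals are simulated here, and we use the convention \(0/0:=0\)):

```python
import numpy as np

def G(u):
    """Naive kernel I_{[-1,1]}."""
    return (np.abs(u) <= 1.0).astype(float)

def K(u):
    """Epanechnikov kernel (a bounded symmetric density)."""
    return 0.75 * np.clip(1.0 - u**2, 0.0, None)

def cond_density(y, x, X, eps_hat, h, H):
    """Conditional kernel density estimate of eps = Y - m_bar(X) given X = x,
    with the convention 0/0 := 0."""
    w = G(np.abs(x - X) / H)                      # localizing weights in x
    if w.sum() == 0.0:
        return 0.0
    return (w * K((y - eps_hat) / h)).sum() / (h * w.sum())

rng = np.random.default_rng(3)
X = rng.normal(size=20)
eps_hat = 0.5 * X + 0.2 * rng.normal(size=20)     # simulated residuals
grid = np.linspace(-3.0, 3.0, 1201)
dens = np.array([cond_density(y, 0.0, X, eps_hat, 0.3, 0.5) for y in grid])
print(dens.sum() * (grid[1] - grid[0]))  # about 1 if local weights are nonzero
```

The estimate is a mixture of kernel bumps centered at those residuals whose X-value lies within distance H of x, so it is itself a density whenever at least one weight is nonzero.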
Once we have constructed the estimates \(\hat{m}_n\) and \(\hat{g}_{\hat{\epsilon }|X}\), we construct a sample of size \(N_n\) of the distribution of
where the random variable \(\hat{\epsilon }(X)\) has the conditional density \(\hat{g}_{\hat{\epsilon }|X}(\cdot ,X)\) given X, and estimate the quantile by the empirical quantile corresponding to this sample. To do this, we use an inversion method: We define for \(u \in (0,1)\) and \(x \in \mathbb {R}^d\)
choose independent and identically distributed random variables \(U_1\), \(U_2\), ..., with uniform distribution on (0, 1), such that they are independent of all other previously introduced random variables, and set
This implies in case
that \(\hat{\epsilon }_{n+L_n+i}\) conditioned on \(X_{n+L_n+i}\) has the density \(\hat{g}_{\hat{\epsilon }|X}(\cdot ,X_{n+L_n+i})\).
With these random variables, we estimate the cdf of Y by setting
and
and use the corresponding plug-in estimate
as an estimate of \(q_{Y,\alpha }\).
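The whole construction of this section can be sketched end to end as follows (Python; the model, kernels, bandwidths and grid are illustrative choices, and for brevity the penalized least squares step is replaced by the exact simulator, i.e., \(\hat{m}_{L_n}=\bar{m}\)):

```python
import numpy as np

rng = np.random.default_rng(4)

m_bar = lambda x: x**2                        # hypothetical computer model
n, N_n = 20, 5_000
X_n = rng.normal(size=n)
Y_n = m_bar(X_n) + 0.3 * rng.normal(size=n)   # experimental data
eps_hat = Y_n - m_bar(X_n)                    # residuals of the surrogate

G = lambda u: (np.abs(u) <= 1.0).astype(float)       # naive kernel
K = lambda u: 0.75 * np.clip(1.0 - u**2, 0.0, None)  # Epanechnikov kernel
h, H = 0.3, 0.8
grid = np.linspace(eps_hat.min() - 1.0, eps_hat.max() + 1.0, 1000)

def sample_eps(x, u):
    """Inversion method: u-quantile of the estimated conditional density."""
    w = G(np.abs(x - X_n) / H)
    dens = (w[None, :] * K((grid[:, None] - eps_hat[None, :]) / h)).sum(axis=1)
    cdf = np.cumsum(dens)                     # unnormalized Riemann-sum cdf
    if cdf[-1] == 0.0:                        # no local information: return 0
        return 0.0
    return grid[np.searchsorted(cdf, u * cdf[-1])]

# Generate the sample m_hat(X_i) + eps_hat(X_i), then take the plug-in
# quantile, i.e., the order statistic of the generated sample.
X_mc, U = rng.normal(size=N_n), rng.uniform(size=N_n)
Y_hat = np.array([m_bar(x) + sample_eps(x, u) for x, u in zip(X_mc, U)])
q_hat = np.sort(Y_hat)[int(np.ceil(0.95 * N_n)) - 1]
print(q_hat)
```

Dividing by `cdf[-1]` inside the inversion normalizes the discretized cdf, so the Riemann-sum approximation error does not distort the sampled quantile levels.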
Remark 1
Since there is no measurement error in the observation from the simulator \(\bar{m}\), we could also use an interpolation estimate (instead of the penalized least squares estimate \(\hat{m}_{L_n}\)) in order to estimate \(\bar{m}\). For example, in this context we could apply the spline estimate from Bauer et al. (2017).
3 Main result
Before we formulate our main result, we summarize some important notation.
- Y:
Outcome of the experiment
- X:
Parameters of the experiment
- \(\bar{m}\):
Function \(\bar{m}:\mathbb {R}^d\rightarrow \mathbb {R}\) describing the computer model
- \(\bar{\epsilon }\):
Residual of the computer model
- \(g_{\bar{\epsilon }|X}\):
Conditional density of \(\bar{\epsilon }\) given X.
Our main result is the following theorem, which gives a nonasymptotic bound on the error of our quantile estimate.
Theorem 1
Let (X, Y), \((X_1,Y_1)\), \((X_2,Y_2)\), ...be independent and identically distributed \(\mathbb {R}^d\times \mathbb {R}\)-valued random variables, and let \(\bar{m}:\mathbb {R}^d\rightarrow \mathbb {R}\) be a measurable function. Let \(g_{\bar{\epsilon }|X} : \mathbb {R}\times \mathbb {R}^d\rightarrow \mathbb {R}\) be a measurable function with the property that \(g_{\bar{\epsilon }|X}(\cdot ,X)\) is a density of the conditional distribution of \(\bar{\epsilon }=Y-\bar{m}(X)\) given X. Assume that the following regularity conditions hold for some \(C_1,C_2>0\), \(r,s \in (0,1]\):
- (A1)
\(| g_{\bar{\epsilon }|X}(y,x_1) - g_{\bar{\epsilon }|X}(y,x_2)| \le C_1 \cdot \Vert x_1-x_2\Vert ^r\) for all \(x_1,x_2 \in \mathbb {R}^d, y \in \mathbb {R}\),
- (A2)
\(| g_{\bar{\epsilon }|X}(u,x)-g_{\bar{\epsilon }|X}(v,x)| \le C_2 \cdot |u-v|^s\) for all \(u,v \in \mathbb {R}, x \in \mathbb {R}^d\).
Let \(n, L_n , N_n \in \mathbb {N}\) and assume \(N_n^2 \ge 8 \cdot \log n\). For \(\alpha \in (0,1)\), define the estimate \(\hat{q}_{\hat{Y},N_n,\alpha }\) of the quantile \(q_{Y,\alpha }\) [given by (1)] as in Sect. 2, where \(h_n, H_n>0\), G is the naive kernel and where \(K:\mathbb {R}\rightarrow \mathbb {R}\) is a bounded and symmetric density, which decreases monotonically on \(\mathbb {R}_+\) and satisfies
Let \(\gamma _n>0\), assume \(2 \cdot \sqrt{d} \cdot \gamma _n \ge H_n\), and for \(x \in \mathbb {R}^d\) let \(-\infty< a_n(x) \le b_n(x) < \infty \). Set
where
and set
Let \(e_n>0\) and assume that the cdf of Y satisfies
and
Then,
Remark 2
Assume that Y has a density \(g_Y:\mathbb {R}\rightarrow \mathbb {R}\) with respect to the Lebesgue measure which satisfies for some \(c_2,c_3>0\)
Assume that the positive numbers \(\epsilon _n\), \(\delta _n\), \(\eta _n\) defined in Theorem 1 satisfy
and set
Then, (10) and (11) hold, and consequently, we can conclude from Theorem 1
Indeed, the assumptions above imply
Consequently, because of the assumption on the density of Y we have
By the definition of \(e_n\), we have
which implies (10). In the same way, one can show (11).
Remark 3
The rate of convergence in Remark 2 depends on \(\epsilon _n\), \(\delta _n\) and \(\eta _n\). Here, \(\epsilon _n\) is by its definition related to the \(L_2\) error of \(\hat{m}_{L_n}\). It follows from the proof of Theorem 1 (cf., Lemma 1) that \(\delta _n\) is related to the \(L_1\) error of the conditional density estimate \(\hat{g}_{\hat{\epsilon }|X}\) and \(\eta _n\) is related to the probability that this estimate is not a density.
Remark 4
Set \(\gamma _n=\log (n)\). Assume that the conditional distribution of \(\bar{\epsilon }\) given \(X=x\) has compact support contained in \([a_n(x),b_n(x)]\), which implies that we have
Under suitable smoothness assumptions on \(\bar{m}:\mathbb {R}^d\rightarrow \mathbb {R}\), suitable assumptions on the tails of \(\Vert X\Vert \) and in case that \(\lambda _{L_n}\) and \(\beta _{L_n}\) are suitably chosen it is well known that the expected \(L_2\) error of the smoothing spline estimate satisfies
(cf., e.g., Theorem 2 in Kohler and Krzyżak (2017)). Thus, for \(L_n\) large compared to n and under suitable assumptions on the tails of \(\Vert X\Vert \) it follows from Remark 2 that the error of our quantile estimate in Theorem 1 is up to some constant given by
Minimizing the expression above with respect to \(h_n\) and \(H_n\) as in the proof of Corollary 2 in Bott and Kohler (2017) shows that in case of a suitable choice of the bandwidths \(h_n,H_n>0\) the error of our quantile estimate in Theorem 1 is up to some logarithmic factor given by the minimum of
and
Assume that the distribution of (X, Y) and \(\bar{m}\) change with increasing sample size and that \(|b_n(x)-a_n(x)|\) is the diameter of the support of the conditional distribution of \(\bar{\epsilon }\) given \(X=x\). Then, the error of our quantile estimate can converge to zero arbitrarily fast in case that \(\int _{[-\log (n),\log (n)]^d} |b_n(x)-a_n(x)| \, {\mathbf P}_X(\mathrm{d}x)\) goes to zero fast enough. In particular, the rate of convergence of our quantile estimate might be much better than the well-known rate of convergence \(1/\sqrt{n}\) of the simple quantile estimate (4), and in case of imperfect models, it will also be better than the rate of convergence of the surrogate quantile estimate.
Remark 5
The results in Remark 4 require that the parameters of the estimates (e.g., \(h_n\) and \(H_n\)) are suitably chosen. A data-dependent way of choosing these parameters in an application will be proposed in the next section, and by using simulated data, it will be shown that in this case our newly proposed estimates outperform the other estimates for finite sample size in the situations which we consider there.
4 Application to simulated and real data
In this section, we illustrate the finite sample size performance of our estimates by applying them to simulated and real data. We start with an application to simulated data, where we compare the simple order statistics estimate (est. 1) defined by (4) and a surrogate quantile estimate (est. 2) defined by (6) (where we replace \(\bar{m}\) by \(\hat{m}_{L_n}\) and evaluate this function on \(N_n\) x-values) with our newly proposed estimate based on estimation of the conditional density (est. 3) as defined in Sect. 2.
In the implementation of est. 2 and est. 3, we use thin plate splines (with smoothing parameter chosen by generalized cross-validation) as implemented by the routine Tps() in R in order to estimate a surrogate model for our computer experiment. Here, the implementation of the surrogate quantile estimate est. 2 computes a sample of size \(N_n=100{,}000\) of \(\hat{m}_{L_n}(X)\) and estimates the quantile by the corresponding order statistic.
In the implementation of our newly proposed est. 3, we use the naive kernel \(G(x)=I_{[-1,1]}(x)\) and the Epanechnikov kernel \(K(y)=(3/4)\cdot (1-y^2)_+\) for the conditional density estimate
Here, the bandwidths h and H are chosen in a data-dependent way from the sets
and
where IQR denotes the interquartile range, i.e., the distance between the 25th and 75th percentiles. To choose the bandwidths, we use the combinatorial method for bandwidth selection of kernel density estimates introduced in Devroye and Lugosi (2001), which aims at choosing a bandwidth minimizing the \(L_1\) error. Here, we apply a variant of this method for conditional density estimation introduced and described in Bott and Kohler (2016). To do this, we choose the bandwidths by minimizing
with respect to \(h \in {\mathcal {P}}_h\) and \(H \in {\mathcal {P}}_H\), where \(n_l=\lfloor n/2 \rfloor \), \(n_t=n-n_l\),
and
In the implementation of this method, we approximate the integral
by a Riemann sum based on an equidistant grid of
consisting of 200 grid points (which enables an “efficient” implementation of the above minimization problem by first computing \(\hat{g}_{\hat{\epsilon }|X}^{(n_l,(h,H))}(y,X_i)\) for all grid points y, all \(h \in {\mathcal {P}}_h\), all \(H \in {\mathcal {P}}_H\) and all \(i=n_l+1, \ldots , n\)). After the computation of \(\hat{g}_{\hat{\epsilon }|X}\), we use the inversion method to generate random variables with the conditional density \(\hat{g}_{\hat{\epsilon }|X}(\cdot ,X_i)\). Here, we do not have to consider values outside of the above interval, since our density estimate is zero outside of this interval. In order to implement the inversion method, we discretize the corresponding conditional cumulative distribution function
by considering only its values on an equidistant grid of
consisting of 1000 points, and by approximating the above integral by a Riemann sum corresponding to this grid. This again enables an “efficient” computation of the values of the conditional empirical cumulative distribution function by computing in advance
for all grid points z and all \(i=1, \ldots ,n\). Using the values of the random variables computed in this way, we generate a sample of size \(N_n=100{,}000\) of Y and estimate the quantile by the corresponding order statistic.
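A strongly simplified sketch of this data-splitting bandwidth choice for an unconditional density estimate (Python; intervals with endpoints on an equidistant grid replace the Yatracos class, so this is only a coarse stand-in for the combinatorial method of Devroye and Lugosi (2001) and its conditional variant in Bott and Kohler (2016)):

```python
import numpy as np

rng = np.random.default_rng(5)
eps = rng.normal(scale=0.5, size=100)        # stand-in for the residuals
n_l = len(eps) // 2
learn, test = eps[:n_l], eps[n_l:]           # split as in the text

grid = np.linspace(eps.min() - 1.0, eps.max() + 1.0, 200)
K = lambda u: 0.75 * np.clip(1.0 - u**2, 0.0, None)  # Epanechnikov kernel

def criterion(h):
    """Max discrepancy, over all grid intervals, between the estimate fitted
    on the learning half and the empirical measure of the test half."""
    dens = K((grid[:, None] - learn[None, :]) / h).mean(axis=1) / h
    F_hat = np.cumsum(dens) * (grid[1] - grid[0])    # Riemann-sum cdf
    F_emp = (test[None, :] <= grid[:, None]).mean(axis=1)
    D = F_hat - F_emp          # interval discrepancy equals the spread of D
    return D.max() - D.min()

candidates = 0.5 * np.std(eps) * 2.0 ** -np.arange(6)   # geometric grid
h_best = min(candidates, key=criterion)
print(h_best)
```

The candidate sets \({\mathcal {P}}_h\), \({\mathcal {P}}_H\) of the text play the role of `candidates` here; precomputing the density values on the grid is what makes the minimization cheap.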
We compare the above three estimates in the regression model
where X is a standard normally distributed random variable,
and the conditional distribution of \(\epsilon \) given X is normally distributed with mean zero and standard deviation
Here, \(\sigma >0\) is a parameter of our distribution for which we allow the values 0.5, 1 and 2. Furthermore, we assume that our simulation model is based on the function
where \(\delta \in \mathbb {R}\) is the constant model error of our model for which we consider the values 0 (i.e., no error) and 1 (i.e., negative error). Here, we consider a negative value for the model error, since the surrogate quantile estimate tends to underestimate the quantile in the above example, so that a positive error might accidentally improve the surrogate quantile estimate.
We apply our estimates to samples of size \(n \in \{20, 50, 100\}\) of (X, Y) and \(L_n=500\) of \((X, \bar{m}(X))\), and use them to estimate quantiles of order \(\alpha =0.95\) and \(\alpha =0.99\).
In order to judge the errors of our quantile estimates, we use a simple order statistic with sample size 1,000,000 applied to a sample of Y as a reference value for the (unknown) quantile \(q_{Y,\alpha }\) and compute the relative errors
Of course, our estimates \(\hat{q}_{Y,\alpha }\) and hence also the above relative errors depend on the random samples selected above, and hence are random. Therefore, we repeat the computation of the above error 100 times with newly generated independent samples and report the median and the interquartile ranges of the 100 errors in each of the considered cases for \(\alpha \), \(\sigma \), \(\delta \) and n, which results in errors for \(2 \cdot 3 \cdot 2 \cdot 3=36\) different situations. The values we obtained in case \(\alpha =0.95\) and in case \(\alpha =0.99\) are reported in Tables 1 and 2, respectively.
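The evaluation scheme just described can be sketched as follows (Python; shown only for the simple order statistics estimate (4), with a hypothetical standard normal outcome Y):

```python
import numpy as np

rng = np.random.default_rng(6)
alpha, n, n_rep = 0.95, 20, 100

def order_stat_quantile(y, alpha):
    """Plug-in quantile estimate: the ceil(alpha * n)-th order statistic."""
    return np.sort(y)[int(np.ceil(alpha * len(y))) - 1]

# Reference value for the unknown quantile from a sample of size 1,000,000.
q_ref = order_stat_quantile(rng.normal(size=1_000_000), alpha)

# 100 repetitions with freshly generated independent samples of size n = 20.
rel_errors = np.array([
    abs(order_stat_quantile(rng.normal(size=n), alpha) - q_ref) / abs(q_ref)
    for _ in range(n_rep)
])
print(np.median(rel_errors))                                    # median error
print(np.percentile(rel_errors, 75) - np.percentile(rel_errors, 25))  # IQR
```

Reporting the median and IQR over repetitions, rather than a single run, separates the systematic behavior of an estimate from the randomness of one sample.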
Looking at the results in Tables 1 and 2, we see that our newly proposed estimate outperforms the order statistics estimate in all 36 settings of the simulations. Furthermore, it outperforms the surrogate quantile estimate whenever the model error is not zero, and also in case of zero model error whenever \(\sigma \) is large. There are a few cases with a small value of \(\sigma \) and zero model error where the surrogate quantile estimate is better than our newly proposed estimate, but in these cases the difference between the errors is small compared to the improvement that our estimate achieves over the surrogate quantile estimate in most of the other cases.
Finally, we illustrate the usefulness of our newly proposed method for uncertainty quantification by using it to analyze the uncertainty occurring in experiments with the suspension strut in Fig. 1(left) described in the Introduction. We use the results of \(L_n=500\) computer experiments to construct a surrogate estimate \(\hat{m}_{L_n}\) as described above, and we apply the method proposed in Sect. 2 to compute the conditional density of the residuals. To do this, we choose as described above the bandwidths h and H from the sets
and
by using the combinatorial method of Bott and Kohler (2016). This results in \(h=0.000191\) and \(H= 0.0043\). As described above, we use the corresponding density estimate together with the surrogate model to generate an approximate sample of size 100,000 of Y and estimate the \(\alpha =0.95\) quantile of Y by the corresponding order statistic, which results in the estimate 0.0855. In contrast, the simple order statistics estimate of the quantile based only on the \(n=20\) experimental data points yields the smaller value 0.0849.
5 Proofs
5.1 Estimation of quantiles on the basis of conditional density estimates
Let (X, Y), \((X_1,Y_1)\), \((X_2,Y_2)\), ... be independent and identically distributed \(\mathbb {R}^d\times \mathbb {R}\)-valued random vectors and let \(\bar{m}:\mathbb {R}^d\rightarrow \mathbb {R}\) be a measurable function. Assume that the conditional distribution of \(\bar{\epsilon }=Y-\bar{m}(X)\) given X has the density \(g_{\bar{\epsilon }|X}(\cdot ,X):\mathbb {R}\rightarrow \mathbb {R}\) with respect to the Lebesgue–Borel measure, where \(g_{\bar{\epsilon }|X}:\mathbb {R}\times \mathbb {R}^d\rightarrow \mathbb {R}\) is measurable. Let \(n, L_n, N_n \in \mathbb {N}\) and set
Let \(\hat{m}_{L_n}(\cdot )=\hat{m}_{L_n}(\cdot ,{\mathcal {D}}_n):\mathbb {R}^d\rightarrow \mathbb {R}\) and let
be a measurable function satisfying
Let U, \(U_1\), \(U_2\), ...be independent random variables which are uniformly distributed on (0, 1) and which are independent of (X, Y), \((X_1,Y_1)\), ...and set
and
Set
For \(\alpha \in (0,1)\) set
where
and
where
Lemma 1
Let \(\alpha \in (0,1)\), \(n \in \mathbb {N}\) and \(L_n , N_n \in \mathbb {N}\) and define the estimate \(\hat{q}_{Y,N_n,\alpha }\) of \(q_{Y,\alpha }\) as above. Assume that \(\hat{g}_{\hat{\epsilon }|X}\) satisfies
Let \(\epsilon _n, \delta _n, \eta _n, e_n>0\) be such that
and
Then
Proof
Set
By the Dvoretzky–Kiefer–Wolfowitz inequality [cf., Massart (1990)] applied conditionally on \({\mathcal {D}}_n\) we get
Since
it suffices to show that
and
imply
By the definition of \(\hat{q}_{Y,N_n,\alpha }\), we know that (22) is implied by
and
so it suffices to show that (18)–(21) imply (23) and (24), which we do next.
So assume from now on that (18)–(21) hold. Before we start with the proof of (23) we show
Indeed, we observe first
Furthermore, we have
Since we have
(which follows from assumption (15)) and which implies
and
this implies (25).
Next we prove (23). Using (18), (21), (25) and (19), we get
where the last inequality follows from (16).
In the same way, we argue that
which finishes the proof. \(\square \)
5.2 A bound on the \(L_1\) error of a conditional density estimate
Lemma 2
Let (X, Y), \((X_1,Y_1), \ldots , (X_n,Y_n)\) be independent and identically distributed \(\mathbb {R}^d\times \mathbb {R}\)-valued random vectors. Assume that the conditional distribution \({\mathbf P}_{Y|X}\) of Y given X has the density \( g_{Y|X}(\cdot ,X):\mathbb {R}\rightarrow \mathbb {R}\) with respect to the Lebesgue–Borel measure, where
is a measurable function which satisfies
and
for some \( r,s \in (0,1]\) and some \(C_1,C_2 >0\). Let \(\gamma _n>0\). For \(x \in \mathbb {R}^d\) let \(- \infty<a_n(x) \le b_n(x) < \infty \) be such that
Set \(G=I_{[-1,1]}\) and let \(K:\mathbb {R}\rightarrow \mathbb {R}\) be a density satisfying
Let \(h_n,H_n>0\) be such that \(2 \cdot \sqrt{d} \cdot \gamma _n \ge H_n\), and set
where \(\frac{0}{0} :=0\). Then
where the constant
does not depend on \({\mathbf P}_{(X,Y)}\), \(C_1\) or \(C_2\).
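The conditional density estimate of Lemma 2 is a kernel estimate with the naive kernel \(G=I_{[-1,1]}\) in the \(x\)-direction and a density kernel \(K\) in the \(y\)-direction, with the convention \(\frac{0}{0}:=0\). The following sketch implements an estimator of this form; the Gaussian choice of \(K\) is only an example satisfying the density requirement, not prescribed by the lemma.

```python
import numpy as np

def cond_density_estimate(y, x, X, Y, h_n, H_n):
    """Kernel conditional density estimate of the type in Lemma 2:
    naive kernel G = I_{[-1,1]} applied to ||x - X_j|| / H_n, density
    kernel K (here Gaussian, as an example) applied to (y - Y_j) / h_n,
    with the convention 0/0 := 0."""
    G = (np.linalg.norm(X - x, axis=1) / H_n <= 1.0).astype(float)
    denom = G.sum()
    if denom == 0.0:          # the 0/0 := 0 convention
        return 0.0
    K = np.exp(-0.5 * ((y - Y) / h_n) ** 2) / np.sqrt(2.0 * np.pi)
    return float((G * K).sum() / (h_n * denom))
```

Whenever at least one \(X_j\) lies within distance \(H_n\) of \(x\), the map \(y \mapsto \) estimate is itself a density, which is the property exploited via Scheffé's lemma in the proof of Theorem 1.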
In the proof, we will need the following well-known auxiliary result:
Lemma 3
Let \(n \in \mathbb {N}\), let \(H_n, \gamma _n>0\) be such that \(2 \cdot \sqrt{d} \cdot \gamma _n \ge H_n\), and let X be an \(\mathbb {R}^d\)-valued random variable. Then, it holds:
Proof
The assertion follows from the proof of equation (5.1) in Györfi et al. (2002); a complete proof is available from the authors upon request. \(\square \)
Proof of Lemma 2
By the triangle inequality, we have
In the first step of the proof, we show
The Cauchy–Schwarz inequality implies
hence it suffices to show
To show this, we observe first that the independence of the data implies
Hence,
Application of Lemma 4.1 in Györfi et al. (2002) and Lemma 3 yields
which completes the proof of (32).
In the second step of the proof, we show
Using the independence of the data and arguing similarly to the proof of inequality (32), we get
By Lemma 3, we get
Furthermore, by the triangle inequality and assumptions (26) and (27), which imply
we get
Summarizing the above results we get the assertion. \(\square \)
5.3 Proof of Theorem 1
In the proof of Theorem 1, we will use Lemmas 1 and 2 and the following auxiliary result from Bott et al. (2015).
Lemma 4
Let \(K:\mathbb {R}\rightarrow \mathbb {R}\) be a symmetric and bounded density which is monotonically decreasing on \(\mathbb {R}_+\). Then, it holds
for arbitrary \(z_1,z_2 \in \mathbb {R}\).
Proof
See Lemma 1 in Bott et al. (2015). \(\square \)
Proof of Theorem 1
By Lemma 1 and the Markov inequality, it suffices to show
and
In the first step of the proof, we observe that (34) is a trivial consequence of the independence of the data and the definition of \(\epsilon _n\).
In the second step of the proof, we show (35). In case \(\sum _{j=1}^n G \left( \frac{\Vert x-X_j\Vert }{H_n} \right) \ne 0\) we have that \(\hat{g}_{\hat{\epsilon }|X}(\cdot ,x)\) is a density, and we can conclude by Scheffé's lemma and the triangle inequality
In case \(\sum _{j=1}^n G \left( \frac{\Vert x-X_j\Vert }{H_n} \right) =0\), we have
and the above chain of inequalities holds trivially.
Using this, we get
where
Application of Lemma 4 yields
where the last inequality follows from the fact that G is the naive kernel. Using this together with the independence of the data, Lemma 4.1 in Györfi et al. (2002) and Lemma 3 we get
Application of Lemma 2 yields
Summarizing the above results, the proof of (35) is complete.
In the third step of the proof, we show (36). As in the proof of Lemma 2, we get
Summarizing the above results, the proof is complete. \(\square \)
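Scheffé's lemma, invoked in the second step of the proof above, yields for any two densities \(f\) and \(g\) the identity \(\int |f-g| \, dy = 2 \int (f-g)_+ \, dy\), since \(\int (f-g) \, dy = 0\). A quick numerical illustration with two normal densities (the densities chosen here are arbitrary examples, not from the paper):

```python
import numpy as np

# Scheffé identity for densities: integral |f - g| = 2 * integral (f - g)_+ .
ys = np.linspace(-8.0, 8.0, 2001)
dy = ys[1] - ys[0]
f = np.exp(-0.5 * ys ** 2) / np.sqrt(2.0 * np.pi)          # N(0,1) density
g = np.exp(-0.5 * (ys - 1.0) ** 2) / np.sqrt(2.0 * np.pi)  # N(1,1) density
l1 = np.sum(np.abs(f - g)) * dy
pos = 2.0 * np.sum(np.clip(f - g, 0.0, None)) * dy
# closed form for these two densities: 2 * (2 * Phi(1/2) - 1) ~ 0.7658
```

The two numerically integrated quantities agree, and both match the closed-form \(L_1\) distance between the two normal densities.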
References
Bauer, B., Devroye, L., Kohler, M., Krzyżak, A., Walk, H. (2017). Nonparametric estimation of a function from noiseless observations at random points. Journal of Multivariate Analysis, 160, 90–104.
Bichon, B., Eldred, M., Swiler, L., Mahadevan, S., McFarland, J. (2008). Efficient global reliability analysis for nonlinear implicit performance functions. AIAA Journal, 46, 2459–2468.
Bott, A., Kohler, M. (2016). Adaptive estimation of a conditional density. International Statistical Review, 84, 291–316.
Bott, A., Kohler, M. (2017). Nonparametric estimation of a conditional density. Annals of the Institute of Statistical Mathematics, 69, 189–214.
Bott, A. K., Felber, T., Kohler, M. (2015). Estimation of a density in a simulation model. Journal of Nonparametric Statistics, 27, 271–285.
Bourinet, J.-M., Deheeger, F., Lemaire, M. (2011). Assessing small failure probabilities by combined subset simulation and support vector machines. Structural Safety, 33, 343–353.
Bucher, C., Bourgund, U. (1990). A fast and efficient response surface approach for structural reliability problems. Structural Safety, 7, 57–66.
Das, P.-K., Zheng, Y. (2000). Cumulative formation of response surface and its use in reliability analysis. Probabilistic Engineering Mechanics, 15, 309–315.
Deheeger, F., Lemaire, M. (2010). Support vector machines for efficient subset simulations: \(^{2}\)SMART method. In Proceedings of the 10th international conference on applications of statistics and probability in civil engineering (ICASP10), Tokyo, Japan.
Devroye, L., Lugosi, G. (2001). Combinatorial methods in density estimation. New York: Springer.
Devroye, L., Felber, T., Kohler, M. (2013). Estimation of a density using real and artificial data. IEEE Transactions on Information Theory, 59(3), 1917–1928.
Efromovich, S. (2007). Conditional density estimation in a regression setting. Annals of Statistics, 35, 2504–2535.
Enss, C., Kohler, M., Krzyżak, A., Platz, R. (2016). Nonparametric quantile estimation based on surrogate models. IEEE Transactions on Information Theory, 62, 5727–5739.
Fan, J., Yim, T. H. (2004). A crossvalidation method for estimating conditional densities. Biometrika, 91, 819–834.
Fan, J., Yao, Q., Tong, H. (1996). Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems. Biometrika, 83, 189–206.
Felber, T., Kohler, M., Krzyżak, A. (2015a). Adaptive density estimation based on real and artificial data. Journal of Nonparametric Statistics, 27, 1–18.
Felber, T., Kohler, M., Krzyżak, A. (2015b). Density estimation with small measurement errors. IEEE Transactions on Information Theory, 61, 3446–3456.
Gooijer, J. G. D., Zerom, D. (2003). On conditional density estimation. Statistica Neerlandica, 57, 159–176.
Györfi, L., Kohler, M., Krzyżak, A., Walk, H. (2002). A distribution-free theory of nonparametric regression. New York: Springer.
Hurtado, J. E. (2004). Structural reliability: Statistical learning perspectives. Lecture notes in applied and computational mechanics (Vol. 17). Berlin: Springer.
Kaymaz, I. (2005). Application of Kriging method to structural reliability problems. Structural Safety, 27, 133–151.
Kim, S.-H., Na, S.-W. (1997). Response surface method using vector projected sampling points. Structural Safety, 19, 3–19.
Kohler, M., Krzyżak, A. (2016). Estimation of a density from an imperfect simulation model (submitted).
Kohler, M., Krzyżak, A. (2017). Improving a surrogate model in uncertainty quantification by real data (submitted).
Kohler, M., Krzyżak, A. (2018). Adaptive estimation of quantiles in a simulation model. IEEE Transactions on Information Theory, 64, 501–512.
Kohler, M., Krzyżak, A., Mallapur, S., Platz, R. (2018). Uncertainty quantification in case of imperfect models: A non-Bayesian approach. Scandinavian Journal of Statistics. https://doi.org/10.1111/sjos.12317.
Mallapur, S., Platz, R. (2017). Quantification and evaluation of uncertainty in the mathematical modelling of a suspension strut using Bayesian model validation approach. In Proceedings of the international modal analysis conference IMAC-XXXV, Garden Grove, California, USA, Paper 117, 30 January–2 February, 2017.
Massart, P. (1990). The tight constant in the Dvoretzky–Kiefer–Wolfowitz inequality. Annals of Probability, 18, 1269–1283.
Papadrakakis, M., Lagaros, N. (2002). Reliability-based structural optimization using neural networks and Monte Carlo simulation. Computer Methods in Applied Mechanics and Engineering, 191, 3491–3507.
Parzen, E. (1962). On the estimation of a probability density function and the mode. Annals of Mathematical Statistics, 33, 1065–1076.
Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. Annals of Mathematical Statistics, 27, 832–837.
Rosenblatt, M. (1969). Conditional probability density and regression estimates. In P. R. Krishnaiah (Ed.), Multivariate analysis II (pp. 25–31). New York: Academic Press.
Wong, R. K. W., Storlie, C. B., Lee, T. C. M. (2017). A frequentist approach to computer model calibration. Journal of the Royal Statistical Society, Series B, 79, 635–648.
Acknowledgements
The authors would like to thank an anonymous referee for invaluable comments and suggestions, and they would like to thank Caroline Heil, Audrey Youmbi and Jan Benzing for pointing out an error in an early version of this manuscript. The first author would like to thank the German Research Foundation (DFG) for funding this project within the Collaborative Research Centre 805. The second author would like to acknowledge the support from the Natural Sciences and Engineering Research Council of Canada under Grant RGPIN 2015-06412.
Kohler, M., Krzyżak, A. Estimating quantiles in imperfect simulation models using conditional density estimation. Ann Inst Stat Math 72, 123–155 (2020). https://doi.org/10.1007/s10463-018-0683-8