
11.1 Introduction and Context

In this paper, we consider efficient estimation of the probability of large deviations of a multivariate sum of independent, identically distributed, light-tailed, and non-lattice random vectors.

Consider n i.i.d. random vectors \(\mathbf{X}_{1}^{n}:= \left (\mathbf{X}_{1},\ldots,\mathbf{X}_{n}\right )\) with known common density \(p_{\mathbf{X}}\) on \(\mathbb{R}^{d}\), \(d\geqslant 1\), copies of \(\mathbf{X}:= \left (\mathbf{X}^{(1)},\ldots,\mathbf{X}^{(d)}\right ).\) The superscript (j) pertains to the coordinate of a vector and the subscript i pertains to replications. Consider also a measurable function u from \(\mathbb{R}^{d}\) to \(\mathbb{R}^{s}.\) Define \(\mathbf{U}:= u(\mathbf{X})\) with density \(p_{\mathbf{U}}\) and

$$\displaystyle{ \mathbf{U}_{1,n}:=\sum _{ i=1}^{n}\mathbf{U}_{ i}. }$$

We intend to estimate for large but fixed n

$$\displaystyle{ P_{n}:= P\left (\mathbf{U}_{1,n} \in nA\right ) }$$
(11.1)

where A is a non-empty measurable subset of \(\mathbb{R}^{s}\) such that \(E[u\left (\mathbf{X}\right )]\notin A.\) In [3], the authors consider in detail the case where \(d = s = 1\), \(A:= A_{n} = (a_{n},\infty )\) and \(a_{n}\) is a convergent sequence.

The basic estimate of \(P_{n}\) is defined as follows: generate L i.i.d. samples \(X_{1}^{n}(l)\) with underlying density \(p_{\mathbf{X}}\) and define

$$\displaystyle{ \widetilde{P_{n}}:= \frac{1} {L}\sum _{l=1}^{L}\mathbb{1} _{ \mathcal{E}_{n}}\left (X_{1}^{n}(l)\right ) }$$

where

$$\displaystyle{ \mathcal{E}_{n}:= \left \{(x_{1},\ldots,x_{n}) \in \left (\mathbb{R}^{d}\right )^{n}: \left (u\left (x_{ 1}\right ) + \cdots + u\left (x_{n}\right )\right ) \in nA\right \}. }$$
(11.2)
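For concreteness, the sketch below is a minimal Monte Carlo version of \(\widetilde{P_{n}}\) in the toy setting d = s = 1, u(x) = x, A = (a,∞) and X standard Gaussian; the threshold and sample sizes are illustrative choices of ours, not values taken from the paper.

```python
import numpy as np

def naive_estimate(n, L, a, seed=0):
    """Crude Monte Carlo estimate of P_n = P(U_{1,n} in nA) in the toy case
    d = s = 1, u(x) = x, A = (a, infinity), X ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(L):
        x = rng.standard_normal(n)      # one path X_1^n(l) drawn under p_X
        hits += x.sum() > n * a         # indicator 1_{E_n}(X_1^n(l))
    return hits / L

# With a = 0.5 the target probability is about 2.9e-7, so with L = 10,000
# the crude estimator almost surely records no hit at all and returns 0.
print(naive_estimate(n=100, L=10_000, a=0.5))
```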

The Importance Sampling estimator of \(P_{n}\) with sampling density g on \(\left (\mathbb{R}^{d}\right )^{n}\) is

$$\displaystyle{ \widehat{P_{n}}:= \frac{1} {L}\sum _{l=1}^{L}\hat{P}_{ n}(l)\mathbb{1} _{\mathcal{E}_{n}}\left (Y _{1}^{n}(l)\right ) }$$
(11.3)

where \(\hat{P}_{n}(l)\) is called the “importance factor” and can be written

$$\displaystyle{ \hat{P}_{n}(l):= \frac{\prod \limits _{i=1}^{n}p_{\mathbf{X}}\left (Y _{i}(l)\right )} {g\left (Y _{1}^{n}(l)\right )} }$$
(11.4)

where the L samples \(Y _{1}^{n}(l):= \left (Y _{1}(l),\ldots,Y _{n}(l)\right )\) are i.i.d. with common density g; the coordinates of \(Y_{1}^{n}(l)\), however, need not be i.i.d. It is known that the optimal choice for g is the density of \(\mathbf{X}_{1}^{n}:= \left (\mathbf{X}_{1},\ldots,\mathbf{X}_{n}\right )\) conditioned upon \(\left (\mathbf{X}_{1}^{n} \in \mathcal{E}_{n}\right )\), leading to a zero-variance estimator. We refer to [5] for the background of this section.
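The bookkeeping of (11.3)–(11.4) is illustrated below in the same toy setting as above. The sampling density g is a simple product of shifted Gaussians, an assumption made only to keep the example self-contained; it is a stand-in, not the scheme proposed in this paper.

```python
import numpy as np

def is_estimate(n, L, a, mu_g, seed=0):
    """IS estimate of P_n = P(X_1 + ... + X_n > n*a) for X ~ N(0, 1),
    d = s = 1, u(x) = x.  The sampling density g is a product of N(mu_g, 1)
    densities -- only the importance factor (11.4) is being illustrated."""
    rng = np.random.default_rng(seed)
    terms = np.empty(L)
    for l in range(L):
        y = rng.normal(mu_g, 1.0, size=n)                 # one path under g
        # log importance factor: sum log p_X(Y_i) - log g(Y_1^n)
        log_w = -0.5 * np.sum(y ** 2) + 0.5 * np.sum((y - mu_g) ** 2)
        terms[l] = np.exp(log_w) if y.sum() > n * a else 0.0
    return terms.mean(), terms.std() / np.sqrt(L)         # estimate and its s.e.

# Centring g on the boundary point a makes the event frequent under g.
print(is_estimate(n=100, L=10_000, a=0.5, mu_g=0.5))
```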

The state-independent IS scheme for rare event estimation (see [6] or [12]) rests on two basic ingredients: the sampling distribution is fitted to the so-called dominating point of the set to be measured (the point where the quantity to be estimated is mostly captured; see [11]), and independent and identically distributed replications are performed under this sampling distribution. More recently, a state-dependent algorithm leading to a strongly efficient estimator was provided by [2] when d = s, u(x) = x and A has a smooth boundary and a unique dominating point. Indeed, adaptive tilting defines a sampling density for the i-th r.v. in the run which depends both on the target event \(\left (\mathbf{U}_{1,n} \in nA\right )\) and on the current state of the path up to step i − 1. Jointly with an ad hoc stopping rule controlling the excursion of the current state of the path, this algorithm provides an estimate of \(P_{n}\) with a coefficient of variation independent of n. This result shows that nearly optimal estimators can be obtained without approximating the conditional density.

The main issue of the method described above is finding the dominating point. When the dimension of the set A increases, finding a dominating point can be very tricky or even impossible. A solution would be to divide the set under consideration into smaller subsets and to find a dominating point for each of these subsets. Doing so makes the implementation of an IS scheme harder and harder as the dimension increases.

Our proposal is somewhat different, since it is based on a sharp approximation result for the conditional density of long runs. The approximation holds for any point conditioning of the form \(\left (\mathbf{U}_{1,n} = nv\right ).\) Then sampling v in A according to the distribution of \(\mathbf{U}_{1,n}\) conditioned upon \(\left (\mathbf{U}_{1,n} \in nA\right )\) produces the estimator. By its very definition this procedure does not make use of any dominating point, since it randomly explores the set A. Indeed, our proposal rests on two choices: first, do not make use of the notion of dominating point and explore the whole target set instead (no part of the set A is neglected); secondly, do not use i.i.d. replications, but merely sample long runs of variables under a proxy of the optimal sampling scheme.

We will propose an IS sampling density which approximates this conditional density very sharply on its first components \(y_{1},\ldots,y_{k}\), where \(k = k_{n}\) is very large, namely \(k/n \rightarrow 1.\) However, except in the Gaussian case, k should satisfy \(\left (n - k\right ) \rightarrow \infty \) by the very construction of the approximation. The IS density on \(\left (\mathbb{R}^{d}\right )^{n}\) is obtained by multiplying this proxy by the product density of a much simpler state-independent IS scheme following [13].

The paper is organized as follows. Section 11.2 is devoted to notation and hypotheses. In Sect. 11.3, we expose the approximation scheme for the conditional density of \(X_{1}^{k}\) under \(\left (\mathbf{U}_{1,n} = nv\right )\). Our IS scheme is introduced in Sect. 11.4. Simulation results are presented in Sect. 11.5, which highlight the gain of the present approach over state-independent Importance Sampling schemes.

We rely on [7] where the basic approximation (and proofs) used in the present paper can be found. The real case is studied in [4] and applications for IS estimators can be found in [3].

11.2 Notations and Hypotheses

We consider approximations of the density of the vector \(X_{1}^{k}\) on \(\left (\mathbb{R}^{d}\right )^{k}\) when the conditioning event is as in (11.1) and \(k:= k_{n}\) is such that

$$\displaystyle{ 0\leqslant \limsup _{n\rightarrow \infty }\frac{k} {n}\leqslant 1 }$$
(K1)
$$\displaystyle{ \lim _{n\rightarrow \infty }(n - k) = +\infty. }$$
(K2)

Therefore we may consider the asymptotic behavior of the density of the random walk on long runs.

Throughout the paper the value of a density \(p_{\mathbf{Z}}\) of some continuous random vector \(\mathbf{Z}\) at point z may be written \(p_{\mathbf{Z}}(z)\) or \(p\left (\mathbf{Z} = z\right ),\) whichever proves more convenient according to the context.

Let \(p_{nv}\) (and distribution \(P_{nv}\)) denote the density of \(X_{1}^{k}\) under the local condition \(\left (\mathbf{U}_{1,n} = nv\right )\)

$$\displaystyle{ p_{nv}\left (\mathbf{X}_{1}^{k} = Y _{ 1}^{k}\right ):= p(\left.\mathbf{X}_{ 1}^{k} = Y _{ 1}^{k}\right \vert \mathbf{U}_{ 1,n} = nv) }$$
(11.5)

where \(Y_{1}^{k}\) belongs to \(\left (\mathbb{R}^{d}\right )^{k}\) and v belongs to A.

We will also consider the density \(p_{nA}\) (and distribution \(P_{nA}\)) of \(X_{1}^{k}\) conditioned upon \(\left (\mathbf{U}_{1,n} \in nA\right )\)

$$\displaystyle{ p_{nA}\left (\mathbf{X}_{1}^{k} = Y _{ 1}^{k}\right ):= p(\left.\mathbf{X}_{ 1}^{k} = Y _{ 1}^{k}\right \vert \mathbf{U}_{ 1,n} \in nA). }$$
(11.6)

The approximating density of \(p_{nv}\) is denoted \(g_{nv}\); the corresponding approximation of \(p_{nA}\) is denoted \(g_{nA}\). Explicit formulas for these densities are presented in the next section.

11.3 Multivariate Random Walk Under a Local Conditioning Event

Let \(\varepsilon_{n}\) be a positive sequence such that

$$\displaystyle{ \lim _{n\rightarrow \infty }\varepsilon _{n}^{2}(n - k) = \infty }$$
(E1)
$$\displaystyle{ \lim _{n\rightarrow \infty }\varepsilon _{n}(\log n)^{2} = 0 }$$
(E2)

It will be shown that \(\varepsilon _{n}\left (\log n\right )^{2}\) is the rate of accuracy of the approximating scheme.

We assume that \(\mathbf{U}:= u\left (\mathbf{X}\right )\) has a density \(p_{\mathbf{U}}\) (with probability measure \(P_{\mathbf{U}}\)) absolutely continuous with respect to the Lebesgue measure on \(\mathbb{R}^{s}.\) Furthermore, we assume that u is such that the characteristic function of U belongs to \(L^{r}\) for some \(r\geqslant 1.\)

Denote by \(\underline{0}\) the vector of \(\mathbb{R}^{s}\) with all coordinates equal to 0 and by \(V (\underline{0})\) a neighborhood of \(\underline{0}.\)

We assume that U satisfies the Cramér condition, meaning that

$$\displaystyle{\varPhi _{\mathbf{U}}(t):= E[\exp < t,\mathbf{U} >] < \infty,{\ \ }t \in V (\underline{0}) \subset \mathbb{R}^{s}.}$$

and define

$$\displaystyle\begin{array}{rcl} m(t):= ^{t}\nabla \log (\varPhi _{\mathbf{ U}}(t)),{\ \ }t \in V (\underline{0}) \subset \mathbb{R}^{s}& & {}\\ \end{array}$$

and

$$\displaystyle\begin{array}{rcl} \varkappa (t):= ^{t}\nabla \nabla \log (\varPhi _{\mathbf{ U}}(t)),{\ \ }t \in V (\underline{0}) \subset \mathbb{R}^{s}.& & {}\\ \end{array}$$

as the mean and the covariance matrix of the tilted density defined by

$$\displaystyle\begin{array}{rcl} \pi _{u}^{\alpha }(x):= \frac{\exp < t,u(x) >} {\varPhi _{\mathbf{U}}(t)} p_{\mathbf{X}}(x).& &{}\end{array}$$
(11.7)

where t is the unique solution of m(t) = α for α in the convex hull of the support of \(P_{\mathbf{U}}\). Conditions on \(\varPhi _{\mathbf{U}}(t)\) which ensure existence and uniqueness of t are referred to as steepness properties (see [1], p. 153 ff., for all properties of moment generating functions used in this paper).
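To make (11.7) concrete in the one-dimensional case u(x) = x with X ∼ N(μ, σ²), one has Φ_U(t) = exp(μt + σ²t²/2), m(t) = μ + σ²t, and the tilted density π^α is simply N(α, σ²). The sketch below shows this closed form together with a generic bisection solver for m(t) = α when only the log moment generating function is available; the function names and the numerical values (taken from the Gaussian example of Sect. 11.5 only for illustration) are ours, not the paper's.

```python
from math import isclose

def tilt_gaussian(alpha, mu=0.05, sigma2=1.0):
    """Tilting parameter for X ~ N(mu, sigma2), u(x) = x:
    m(t) = mu + sigma2 * t, so t = (alpha - mu) / sigma2 and
    the tilted density pi^alpha is N(alpha, sigma2)."""
    return (alpha - mu) / sigma2

def solve_tilt_numeric(alpha, log_mgf, t_lo=-50.0, t_hi=50.0, tol=1e-10):
    """Bisection for m(t) = alpha, with m(t) = d/dt log Phi_U(t) computed by
    central differences.  Assumes m is increasing (convexity of log Phi_U)
    and that the root lies in [t_lo, t_hi]."""
    h = 1e-6
    m = lambda t: (log_mgf(t + h) - log_mgf(t - h)) / (2 * h)
    lo, hi = t_lo, t_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if m(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Check: both routes give the same tilting parameter for N(0.05, 1).
log_mgf = lambda t: 0.05 * t + 0.5 * t ** 2
print(tilt_gaussian(0.28), solve_tilt_numeric(0.28, log_mgf))
```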

We now state the general form of the approximating density. Let v ∈ A and denote

$$\displaystyle\begin{array}{rcl} g_{0}(y_{1}\vert y_{0}):=\pi _{ u}^{v}(y_{ 1})& &{}\end{array}$$
(11.8)

with an arbitrary \(y_{0}\) and \(\pi_{u}^{v}\) defined in (11.7).

For \(1\leqslant i\leqslant k - 1\), we recursively define \(g(y_{i+1}\vert y_{1}^{i})\). Set \(t_{i} \in \mathbb{R}^{s}\) to be the unique solution to the equation

$$\displaystyle\begin{array}{rcl} m(t_{i}) = m_{i,n}:= \frac{n} {n - i}\left (v -\frac{u_{1,i}} {n} \right )& &{}\end{array}$$
(11.9)

where \(u_{1,i} = u(y_{1}) + \cdots + u(y_{i}).\)

Denote

$$\displaystyle\begin{array}{rcl} \varkappa _{(i,n)}^{j,l}:= \frac{d^{2}} {dt^{(j)}dt^{(l)}}\left (\log E_{\pi _{\mathbf{U}}^{m_{i,n}}}\exp < t,\mathbf{U} >\right )\left (\underline{0}\right )& & {}\\ \end{array}$$

and

$$\displaystyle\begin{array}{rcl} \varkappa _{(i,n)}^{j,l,m}:= \frac{d^{3}} {dt^{(j)}dt^{(l)}dt^{(m)}}\left (\log E_{\pi _{\mathbf{U}}^{m_{i,n}}}\exp < t,\mathbf{U} >\right )\left (\underline{0}\right ).& & {}\\ \end{array}$$

for j, l and m in \(\{1,\ldots,s\}\). In the sequel, \(\varkappa _{(i,n)}\) will denote the matrix with elements \(\left (\varkappa _{(i,n)}^{j,l}\right )_{1\leqslant j,l\leqslant s}.\)

Denote

$$\displaystyle\begin{array}{rcl} g(y_{i+1}\vert y_{1}^{i}):= C_{ i}\mathfrak{n}_{s}\left (u(y_{i+1});\beta \alpha +v,\beta \right )p_{\mathbf{X}}(y_{i+1})& &{}\end{array}$$
(11.10)

where \(C_{i}\) is a normalizing factor and \(\mathfrak{n}_{s}\left (u(y_{i+1});\beta \alpha +v,\beta \right )\) is the normal density at \(u(y_{i+1})\) with mean \(\beta\alpha + v\) and covariance matrix β. The quantities α and β are defined by

$$\displaystyle{\alpha:= \left (t_{i} + \frac{\varkappa _{(i,n)}^{-2}\gamma } {2(n - i - 1)}\right )}$$

and

$$\displaystyle{\beta:=\varkappa _{(i,n)}(n - i - 1)}$$

and γ defined by

$$\displaystyle{\gamma:= \left (\sum _{j=1}^{s}\varkappa _{ (i,n)}^{j,j,p}\right )_{ 1\leqslant p\leqslant s}.}$$

Then

$$\displaystyle\begin{array}{rcl} g_{nv}(y_{1}^{k}):= g_{ 0}(y_{1}\vert y_{0})\prod _{i=1}^{k-1}g(y_{ i+1}\vert y_{1}^{i})& &{}\end{array}$$
(11.11)
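A literal, one-dimensional (d = s = 1, u(x) = x) transcription of one recursion step (11.9)–(11.10) is sketched below purely to make the bookkeeping concrete; in one dimension \(\varkappa_{(i,n)}^{-2}\gamma\) reduces to \(\kappa_3/\kappa_2^{2}\). The callables p_X, solve_tilt, kappa2 and kappa3 are placeholders to be supplied by the user, and the per-step constant \(C_i\) is omitted, so the returned value is the unnormalized factor; sampling \(y_{i+1}\) from it can then be done, e.g., by accept–reject or a one-dimensional Metropolis step.

```python
import math

def gnv_step_factor(y_next, y_past, i, n, v, p_X, solve_tilt, kappa2, kappa3):
    """Unnormalized factor g(y_{i+1} | y_1^i) of (11.10) for d = s = 1, u(x) = x.

    p_X(y)          : density of X
    solve_tilt(m)   : t solving (d/dt) log Phi_U(t) = m          (eq. 11.9)
    kappa2(m), kappa3(m) : 2nd and 3rd cumulants of the tilted density pi^m
    """
    u_1i = sum(y_past)                                  # u(y_1) + ... + u(y_i)
    m_in = (n / (n - i)) * (v - u_1i / n)               # m_{i,n} of (11.9)
    t_i = solve_tilt(m_in)
    k2, k3 = kappa2(m_in), kappa3(m_in)
    alpha = t_i + (k3 / k2 ** 2) / (2 * (n - i - 1))    # alpha as displayed above
    beta = k2 * (n - i - 1)                             # beta as displayed above
    mean, var = beta * alpha + v, beta                  # normal part of (11.10)
    normal = math.exp(-0.5 * (y_next - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)
    return normal * p_X(y_next)                         # C_i omitted
```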

Theorem 1.

Assume (E1), (E2), (K1) and (K2).

  • Let \(Y_{1}^{k}\) be a sample from the density \(p_{nv}\). Then

    $$\displaystyle\begin{array}{rcl} p\left (\mathbf{X}_{1}^{k} = Y _{ 1}^{k}\vert \mathbf{U}_{ 1,n} = nv\right ) = g_{nv}(Y _{1}^{k})(1 + o_{ P_{nv}}(\varepsilon _{n}(\log n)^{2}))& & {}\end{array}$$
    (11.12)
  • Let \(Y_{1}^{k}\) be a sample from the density \(g_{nv}\). Then

    $$\displaystyle\begin{array}{rcl} p\left (\mathbf{X}_{1}^{k} = Y _{ 1}^{k}\vert \mathbf{U}_{ 1,n} = nv\right ) = g_{nv}(Y _{1}^{k})(1 + o_{ G_{nv}}(\varepsilon _{n}(\log n)^{2}))& & {}\end{array}$$
    (11.13)

Remark 11.1.

The approximation of the density of \(X_{1}^{k}\) is not performed on the sequence of entire spaces \(\left (\mathbb{R}^{d}\right )^{k}\) but merely on a sequence of subsets of \(\left (\mathbb{R}^{d}\right )^{k}\) which contain the trajectories of the conditioned random walk with probability going to 1 as n tends to infinity. The approximation is performed on typical paths. For the sake of applications to Importance Sampling, (11.13) is exactly what we need. Nevertheless, as proved in [7], the extension of our results from typical paths to the whole space \(\left (\mathbb{R}^{d}\right )^{k}\) holds: convergence of the relative error on large sets implies that the total variation distance between the conditioned measure and its approximation goes to 0 on the entire space.

Remark 11.2.

The rule which defines the value of k for a given accuracy of the approximation is stated in Sect. 5 of [7].

Remark 11.3.

When the \(X_{i}\)'s are i.i.d. multivariate Gaussian with diagonal covariance matrix and u(x) = x, the results of the approximation theorem hold for \(k = n - 1\) without the error term. Indeed, it holds that \(p(\left.\mathbf{X}_{1}^{n-1} = x_{1}^{n-1}\right \vert \mathbf{U}_{1,n} = nv) = g_{nv}\left (x_{1}^{n-1}\right )\) for all \(x_{1}^{n-1}\) in \(\left (\mathbb{R}^{d}\right )^{n-1}\).
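In this Gaussian setting the conditioned path can also be sampled exactly and sequentially, using the elementary fact that for i.i.d. \(N(\mu,\sigma^{2})\) variables conditioned on their total sum, the next coordinate given the past is Gaussian with mean \(m_{i,n}\) and variance \(\sigma^{2}(n-i-1)/(n-i)\). The sketch below (one-dimensional, function name ours) is our illustration of this fact, not the paper's code; a construction of this type is what the numerical experiment of Sect. 11.5 exploits with k = 99.

```python
import numpy as np

def sample_gaussian_conditional(n, k, v, sigma=1.0, seed=0):
    """Exact sequential sampling of X_1^k for i.i.d. N(mu, sigma^2) variables
    conditioned on X_1 + ... + X_n = n*v (u(x) = x, d = s = 1).

    Given the partial sum s_i = x_1 + ... + x_i, the next variable is
    N(m_i, sigma^2 (n-i-1)/(n-i)) with m_i = (n*v - s_i)/(n-i), the
    one-dimensional m_{i,n}.  The common mean mu drops out once the sum
    is fixed, so it does not appear below."""
    rng = np.random.default_rng(seed)
    x = np.empty(k)
    s = 0.0
    for i in range(k):
        m_i = (n * v - s) / (n - i)
        sd_i = sigma * np.sqrt((n - i - 1) / (n - i))
        x[i] = rng.normal(m_i, sd_i)
        s += x[i]
    return x

# With n = 100, k = 99 and v = 0.28 (the setting of Sect. 11.5), the path
# mean stays close to the conditioning point v.
print(sample_gaussian_conditional(n=100, k=99, v=0.28).mean())
```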

As stated above, the optimal choice for the sampling density is \(p_{nA}\). It holds that

$$\displaystyle{ p_{nA}(x_{1}^{k}) =\int _{ A}p_{nv}\left (\mathbf{X}_{1}^{k} = x_{ 1}^{k}\right )p(\left.\mathbf{U}_{ 1,n}/n = v\right \vert \mathbf{U}_{1,n} \in nA)dv }$$
(11.14)

so that, in contrast with [2] or [6], we do not follow the dominating point approach but merely realize a sharp approximation of the integrand at any point of A and consider the dominating contribution of all those distributions in the evaluation of the conditional density \(p_{nA}\).

11.4 Adaptive IS Estimator for Rare Event Probability

The IS scheme produces samples \(Y:= \left (Y _{1},\ldots,Y _{k}\right )\) distributed under \(g_{nA}\), which is a continuous mixture of the densities \(g_{nv}\) in (11.11) with mixing density \(p\left (\mathbf{U}_{1,n}/n = v\vert \mathbf{U}_{1,n} \in nA\right )\).

Simulation of samples of \(\mathbf{U}_{1,n}/n\) under this density can be performed through the Metropolis–Hastings algorithm, since

$$\displaystyle{ r(v,v^{{\prime}}):= \frac{p(\left.\mathbf{U}_{1,n}/n = v\right \vert \mathbf{U}_{1,n} \in nA)} {p(\left.\mathbf{U}_{1,n}/n = v^{{\prime}}\right \vert \mathbf{U}_{1,n} \in nA)} }$$

turns out to be independent of \(P\left (\mathbf{U}_{1,n} \in nA\right ).\) The proposal distribution of the algorithm should be supported by A.
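A generic random-walk Metropolis–Hastings sketch for drawing the points v is given below. The unnormalized log-target and the membership test for A are supplied as callables, and rejecting proposals that leave A is simply one convenient way of keeping the chain supported by A; these choices, the step size and the function names are ours, not prescriptions from the paper.

```python
import numpy as np

def metropolis_on_A(log_target, in_A, v0, n_steps, step=0.05, seed=0):
    """Random-walk Metropolis-Hastings producing points v in A.

    log_target(v) : unnormalized log-density of U_{1,n}/n given (U_{1,n} in nA);
                    only differences are used, so P(U_{1,n} in nA) never enters.
    in_A(v)       : membership test for A; proposals outside A are rejected."""
    rng = np.random.default_rng(seed)
    v = np.atleast_1d(np.asarray(v0, dtype=float))
    lv = log_target(v)
    chain = np.empty((n_steps, v.size))
    for it in range(n_steps):
        prop = v + step * rng.standard_normal(v.shape)
        if in_A(prop):
            lp = log_target(prop)
            if np.log(rng.uniform()) < lp - lv:   # symmetric proposal: target ratio only
                v, lv = prop, lp
        chain[it] = v
    return chain

# d = 1 instance of the example of Sect. 11.5: U_{1,100}/100 is N(0.05, 0.1^2)
# and A = {|v| > 0.28}, so the unnormalized conditional density is the
# Gaussian density restricted to A.
chain = metropolis_on_A(lambda v: float(-0.5 * np.sum(((v - 0.05) / 0.1) ** 2)),
                        lambda v: bool(np.abs(v[0]) > 0.28),
                        v0=[0.3], n_steps=5000)
print(chain.mean(axis=0))
```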

The density \(g_{nA}\) is extended from \(\left (\mathbb{R}^{d}\right )^{k}\) onto \(\left (\mathbb{R}^{d}\right )^{n}\) by completing the n − k remaining coordinates with i.i.d. copies of r.v.'s \(Y_{k+1},\ldots,Y_{n}\) with common tilted density

$$\displaystyle{ g_{nA}\left (\left.y_{k+1}^{n}\right \vert y_{ 1}^{k}\right ):=\prod \limits _{ i=k+1}^{n}\pi _{ u}^{m_{k} }(y_{i}) }$$
(11.15)

with \(m_{k}:= m(t_{k}) = \frac{n} {n-k}\left (v -\frac{u_{1,k}} {n} \right )\) and

$$\displaystyle{ u_{1,k} =\sum _{ i=1}^{k}u(y_{ i}). }$$

The last n − k r.v.'s \(Y_{i}\) are therefore drawn according to the state-independent i.i.d. scheme, in line with Sadowsky and Bucklew [13].

We now define our IS estimator of \(P_{n}\). Let \(Y_{1}^{n}(l):= \left(Y_{1}(l),\ldots,Y_{n}(l)\right)\) be generated under \(g_{nA}\). Let

$$\displaystyle{ \widehat{P_{n}}(l):= \frac{\prod _{i=1}^{n}p_{\mathbf{X}}(Y _{i}(l))} {g_{nA}(Y _{1}^{n}(l))} \mathbb{1} _{\mathcal{E}_{n}}\left (Y _{1}^{n}(l)\right ) }$$
(11.16)

and define

$$\displaystyle{ \widehat{P_{n}}:= \frac{1} {L}\sum _{l=1}^{L}\widehat{P_{ n}}(l). }$$
(11.17)

in accordance with (11.3).
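Putting the pieces together, one replicate of (11.16) and the average (11.17) can be organized as below. All callables are placeholders for the constructions of Sects. 11.3 and 11.4 (drawing v, sampling \(Y_1^k\) under \(g_{nv}\), completing the path with the n − k tilted variables of (11.15), and evaluating the path densities); their names and signatures are ours and are only meant as a sketch of the workflow.

```python
import numpy as np

def one_replicate(draw_v, sample_prefix, sample_tail,
                  log_pX_path, log_gnA_path, in_En):
    """One term of (11.16); all arguments are user-supplied placeholders.

    draw_v()               : a point v of A (e.g. a Metropolis-Hastings draw)
    sample_prefix(v)       : Y_1^k drawn under g_nv                 (Sect. 11.3)
    sample_tail(v, prefix) : Y_{k+1}^n drawn i.i.d. under (11.15)
    log_pX_path(y)         : sum of log p_X(Y_i) over the whole path
    log_gnA_path(y)        : log of the sampling density g_nA at the path
    in_En(y)               : indicator of E_n, i.e. u(Y_1)+...+u(Y_n) in nA"""
    v = draw_v()
    prefix = np.atleast_1d(sample_prefix(v))
    y = np.concatenate([prefix, np.atleast_1d(sample_tail(v, prefix))])
    if not in_En(y):
        return 0.0
    return float(np.exp(log_pX_path(y) - log_gnA_path(y)))   # importance factor

def is_estimator(L, replicate):
    """Average of L i.i.d. replicates, as in (11.17); replicate() is a
    zero-argument closure wrapping one_replicate with its ingredients."""
    return float(np.mean([replicate() for _ in range(L)]))
```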

Remark 11.4.

In the real case and for A = (a,∞), the authors of [3] show that under certain regularity conditions the resulting relative error of the estimator is proportional to \(\sqrt{n - k_{n}}\) and drops by a factor \(\sqrt{n - k_{n}}/\sqrt{n}\) with respect to the state-independent IS scheme. In [8], the authors propose a slight modification of the extension of \(g_{nA}\) which allows one to prove the strong efficiency of the estimator (11.17), using arguments from both [2] and [3].

11.5 When the Dimension Becomes Very High

This section compares the performance of the present approach with that of the standard tilted one using i.i.d. replications under (11.7), on an extension of a well-known example developed in [9] and in [10]. Let \(B:= \left (\mathcal{E}_{100}\right )^{d}\) be the d-fold Cartesian product of \(\mathcal{E}_{100}\) defined by

$$\displaystyle{ \mathcal{E}_{100}:= \left \{x_{1}^{100}: \frac{\left \vert x_{1} + \cdots + x_{100}\right \vert } {100} > 0.28\right \}. }$$

We want to estimate \(P_{100} = P[B]\) and explore the gain in relative accuracy as the dimension of the measured set increases. Consider 100 i.i.d. random vectors \(X_{i}\) in \(\mathbb{R}^{d}\) whose coordinates are i.i.d. N(0.05, 1). Our interest is to show that in this simple asymmetric case our proposal provides a good estimate, while the standard IS scheme ignores a part of the event B. The standard i.i.d. IS scheme introduces the dominating point \(a = {}^{t}(0.28,\ldots,0.28)\) and the family of i.i.d. tilted r.v.'s with common N(a, 1) distribution. It can be seen that a large part of B is never visited through this procedure, inducing a bias in the estimation. Indeed, the rogue path curse (see [9]) produces an overwhelming loss in accuracy, imposing a very large increase in runtime to get reasonable results. Under the present proposal the distribution of the Importance Factor concentrates around \(P_{100}\), avoiding rogue paths.

This example is not as artificial as it may seem; indeed, it leads to a situation with \(2^{d}\) dominating points, which is quite often met in real life. Exploring the set of interest at random avoids any search for dominating points. Drawing L i.i.d. points \(v_{1},\ldots,v_{L}\) according to the distribution of \(\mathbf{U}_{1,100}/100\) conditioned upon B, we evaluate \(P_{100}\) with k = 99; note that in the Gaussian case Theorem 1 provides an exact description of the conditional density of \(X_{1}^{k}\) for all k between 1 and n. The following figure shows the gain in relative accuracy w.r.t. the state-independent IS scheme as the dimension d grows. The value of \(P_{100}\) is \(10^{-2d}\) (Fig. 11.1).
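The stated value of \(P_{100}\) can be checked directly: each coordinate of \(\mathbf{U}_{1,100}/100\) is N(0.05, 0.1²), so the per-coordinate probability is P(|N(0.05, 0.1²)| > 0.28) ≈ 1.12 × 10⁻², and independence across coordinates gives \(P_{100}\approx 10^{-2d}\) up to the rounding implicit in the statement. A short check:

```python
from math import erfc, sqrt, log10

# Per coordinate: the mean of 100 i.i.d. N(0.05, 1) variables is N(0.05, 0.1^2);
# two-sided tail beyond 0.28.
p1 = (0.5 * erfc((0.28 - 0.05) / (0.1 * sqrt(2)))
      + 0.5 * erfc((0.28 + 0.05) / (0.1 * sqrt(2))))
for d in (1, 2, 5, 10):
    print(d, p1 ** d, f"= 10^{d * log10(p1):.2f}")
```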

Fig. 11.1 Relative accuracy of the adaptive estimate (dotted line) w.r.t. the i.i.d. tilted one (solid line) as a function of the dimension d, for L = 1,000

Conclusion

In this paper, we explore a new way to estimate multi-constraint large deviation probabilities. In future work, the author will investigate the theoretical behavior of the relative error of the proposed estimator.