In this chapter, we describe the mathematical knowledge necessary for understanding this book. First, we discuss matrices, open sets, closed sets, compact sets, the Mean Value Theorem, and Taylor expansions. All of these are topics covered in the first year of college. Next, we discuss absolute convergence and analytic functions. Then, we discuss the Law of Large Numbers and the Central Limit Theorem, and define the symbols \(O_P(\cdot )\) and \(o_P(\cdot )\) used in subsequent chapters. Finally, we define the Fisher information matrix and discuss its properties in the regular and realizable cases. For algebraic geometry and related topics, please refer to Chap. 6. Readers who already understand the content of this chapter may skip it as appropriate. At the end of the chapter, we provide the proof of Proposition 2, which was postponed in Chap. 1; it can be understood with the preliminary knowledge presented in this chapter.

4.1 Elementary Mathematics

Here we discuss matrices and eigenvalues, open sets, closed sets, compact sets, the Mean Value Theorem, and Taylor expansions.

4.1.1 Matrices and Eigenvalues

A matrix \(A\in {\mathbb R}^{n\times n}\) (\(n\ge 1\)) with the same number of rows and columns is called a square matrix. A diagonal matrix is a square matrix whose off-diagonal elements are all zero. A diagonal matrix with all diagonal elements equal to 1 is called an identity matrix, denoted as \(I_n\in {\mathbb R}^{n\times n}\). The sum of the diagonal elements of a square matrix is called the trace. A square matrix whose \((i,j)\) elements (\(i<j\)) are all zero is called a lower triangular matrix.

In the following, for a square matrix \(A\in {\mathbb R}^{n\times n}\), we assume that there exists an \(X\in {\mathbb R}^{n\times n}\) such that \(AX=I_n\), and we try to find it. To do this, we perform two types of operations on the matrix \([A\mid I_n]\in {\mathbb R}^{n\times 2n}\), which consists of A and \(I_n\) arranged side by side:

  1. Subtract a multiple of one row from another row

  2. Swap two rows

We thereby obtain a matrix whose left half is a lower triangular matrix. Assuming that we performed operation 2 a total of m times, the product of the diagonal elements of the left half of the matrix at this point, multiplied by \((-1)^m\), is called the determinant of the matrix A. In the following, we write the determinant of the matrix A as \(\det A\).

Starting from \(B=I_n\), these operations transform \([A\mid B]\) into \([A'\mid B']\), and any X satisfying \(AX=B\) also satisfies \(A'X=B'\). If the determinant of A is not zero, we perform the above two operations further to make the left half a diagonal matrix. Finally, by

  1. Dividing each row by the value of the diagonal element

we make the left half the identity matrix \(I_n\). If \(A''=I_n\) in the resulting matrix \([A''\mid B'']\), then \(A''X=B''\), so the right half \(B''\) at that point is X. Conversely, if the determinant of A is zero, such a matrix X does not exist. When a square matrix X exists such that \(AX=I_n\) (when the determinant of A is not zero), X is called the inverse matrix of A, denoted as \(X=A^{-1}\).
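The following is a minimal Python sketch of this procedure (the use of numpy and the function name det_and_inverse are our own choices, not part of the text); it computes \(\det A\) and, when \(\det A\not =0\), the inverse, using only the row operations described above.

```python
import numpy as np

def det_and_inverse(A):
    """Row operations on [A | I_n]: first make the left half lower triangular
    (this yields det A), then make it diagonal and divide each row by its
    diagonal element (the right half is then A^{-1})."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    M = np.hstack([A, np.eye(n)])                    # [A | I_n]
    swaps = 0
    for j in range(n - 1, 0, -1):                    # zero out the (i, j) entries with i < j
        if M[j, j] == 0:
            rows = [i for i in range(j) if M[i, j] != 0]
            if not rows:
                continue                             # column is already zero above the diagonal
            M[[j, rows[-1]]] = M[[rows[-1], j]]      # operation 2: swap two rows
            swaps += 1
        for i in range(j):
            M[i] -= M[i, j] / M[j, j] * M[j]         # operation 1
    det = (-1) ** swaps * np.prod(np.diag(M[:, :n]))
    if np.isclose(det, 0.0):
        return det, None                             # no inverse exists
    for j in range(n):                               # zero out the (i, j) entries with i > j
        for i in range(j + 1, n):
            M[i] -= M[i, j] / M[j, j] * M[j]
    for i in range(n):                               # divide each row by its diagonal element
        M[i] /= M[i, i]
    return det, M[:, n:]                             # the right half is A^{-1}

A = [[2.0, 1.0], [5.0, 3.0]]
print(det_and_inverse(A))   # det = 1.0, inverse [[3, -1], [-5, 2]]
```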

Example 24

In each of the cases \(d\not =0\) and \(d=0\),

$$\begin{aligned} \displaystyle \left[ \begin{array}{cc} a&{}b\\ c&{}d \end{array} \right] \rightarrow \left[ \begin{array}{cc} a-bc/d&{}0\\ c&{}d \end{array} \right] \ ,\ \left[ \begin{array}{cc} a&{}b\\ c&{}0 \end{array} \right] \rightarrow \left[ \begin{array}{cc} c&{}0\\ a&{}b \end{array} \right] \end{aligned}$$

can be done. The determinant is \((a-bc/d) \cdot d\cdot (-1)^0\) for \(d\not =0\), and \(cb\cdot (-1)^1\) for \(d=0\), both of which can be seen to be \(ad-bc\).\(\blacksquare \)

Example 25

When the determinant of A, \(\Delta =ad-bc\), is not zero, in particular when \(d\not =0\),

$$\begin{aligned} \left[ \begin{array}{cc|cc} a&{}b&{}1&{}0\\ c&{}d&{}0&{}1 \end{array} \right] &\rightarrow \left[ \begin{array}{cc|cc} a-bc/d&{}0&{}1&{}-b/d\\ c&{}d&{}0&{}1 \end{array} \right] {\rightarrow } \left[ \begin{array}{cc|cc} \Delta /d&{}0&{}1&{}-b/d\\ 0&{}d&{}-cd/\Delta &{}1+bc/\Delta \end{array} \right] \\ &{\rightarrow } \left[ \begin{array}{cc|cc} 1&{}0&{}d/\Delta &{}-b/\Delta \\ 0&{}1&{}-c/\Delta &{}a/\Delta \end{array} \right] \end{aligned}$$

can be done, and

$$\begin{aligned} \left[ \begin{array}{cc} a&{}b\\ c&{}d \end{array} \right] \cdot \frac{1}{\Delta } \left[ \begin{array}{cc} d&{}-b\\ -c&{}a \end{array} \right] = \left[ \begin{array}{cc} 1&{}0\\ 0&{}1 \end{array} \right] \end{aligned}$$

holds. That is, \(\displaystyle \frac{1}{\Delta } \left[ \begin{array}{cc} d&{}-b\\ -c&{}a \end{array} \right] \) becomes the inverse matrix of \(\displaystyle \left[ \begin{array}{cc} a&{}b\\ c&{}d \end{array} \right] \). \(\blacksquare \)

Moreover, when a constant \(\lambda \in {\mathbb C}\) and a vector \(u\in {\mathbb C}^n\) (\(u\not =0\)) exist such that \(Au=\lambda u\), \(\lambda \) is called an eigenvalue, and u is called an eigenvector. If the matrix \(A-\lambda I_n\) has an inverse, that is, if the determinant of \(A-\lambda I_n\) is not zero, then from \(u=(A-\lambda I_n)^{-1}0=0\), the u that satisfies \(Au=\lambda u\) is limited to \(u=0\). Eigenvalues are determined as solutions to the equation concerning \(\lambda \) (eigenvalue equation) stating that the determinant of \(A-\lambda I_n\) is zero. In other words,

$$\begin{aligned} Au=\lambda u, u\not =0 \Longleftrightarrow (A-\lambda I_n)u=0, u\not =0 \Longleftrightarrow \det (A-\lambda I_n)=0 \end{aligned}$$

holds.

Example 26

For \(n=2\), if we set \(\displaystyle A=\left[ \begin{array}{cc} a&{}b\\ c&{}d \end{array} \right] \), then \(\det (A-\lambda I_2)=(a-\lambda )(d-\lambda )-bc=0\) holds. Therefore, the solutions of the quadratic equation \(\lambda ^2-(a+d)\lambda +ad-bc=0\) are the eigenvalues. \(\blacksquare \)

A matrix \(A\in {\mathbb R}^{n\times n}\) whose \((i,j)\) components \(A_{i,j}\) and \((j,i)\) components \(A_{j,i}\) are all equal is called a symmetric matrix. In general, eigenvalues \(\lambda \) are not necessarily real numbers, but when the matrix \(A\in {\mathbb R}^{n\times n}\) is symmetric, \(\lambda \) is a real number. In fact, for \(\lambda \in {\mathbb C}\) and \(u\in {\mathbb C}^n\) (\(u\not =0\)), since \(A\overline{u}=\overline{Au}=\overline{\lambda u}=\overline{\lambda }\overline{u}\), we have

$$\begin{aligned} \langle {Au},\overline{u}\rangle =\langle \lambda {u},\overline{u}\rangle =\lambda \langle {u},\overline{u}\rangle \end{aligned}$$

and

$$\begin{aligned} \langle Au, \overline{u}\rangle =\langle u, A\overline{u}\rangle =\langle u, \overline{Au}\rangle =\langle u, \overline{\lambda u}\rangle =\langle u, \overline{\lambda }\overline{u}\rangle =\overline{\lambda }\langle u, \overline{u}\rangle , \end{aligned}$$

where \(\overline{z}\) denotes the complex conjugate of \(z\in {\mathbb C}\): for \(a,b\in {\mathbb R}\), we set \(\overline{a+ib}=a-ib\). Since \(\langle u,\overline{u}\rangle =\sum _i|u_i|^2>0\), comparing the two expressions gives \(\lambda =\overline{\lambda }\), that is, \(\lambda \in {\mathbb R}\).

Example 27

For the matrix \(\left[ \begin{array}{cc} a&{}b\\ c&{}d \end{array} \right] \), if we set \(b=c\), the eigenvalue equation becomes \(\lambda ^2-(a+d)\lambda +ad-b^2=0\), and its discriminant is \((a+d)^2-4(ad-b^2)=(a-d)^2+4b^2\ge 0\). Indeed, the eigenvalues are real numbers. \(\blacksquare \)

A symmetric matrix A is called non-negative definite when all of its eigenvalues are non-negative, and positive definite when all of its eigenvalues are positive.

Example 28

For the matrix \(\left[ \begin{array}{cc} a&{}b\\ c&{}d \end{array} \right] \) with \(b=c\), when \(a+d\ge 0\) and \(ad\ge b^2\), the two solutions of the eigenvalue equation are non-negative, and the matrix is non-negative definite. Furthermore, if both eigenvalues are positive, that is, if \(a+d\ge 0\) and \(ad> b^2\), it is positive definite. \(\blacksquare \)

Moreover, for a symmetric matrix \(A\in {\mathbb R}^{n\times n}\), \(z^\top Az\), \(z\in {\mathbb R}^n\) is called the quadratic form of A.

Proposition 4

A symmetric matrix \(A\in {\mathbb R}^{n\times n}\) being non-negative definite is equivalent to the quadratic form \(z^\top Az\) being non-negative for any \(z\in {\mathbb R}^n\). Furthermore, A being positive definite is equivalent to the quadratic form \(z^\top Az\) being positive for any \(0\not =z\in {\mathbb R}^{n}\).

For the proof, please refer to the appendix at the end of the chapter.
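As a numerical illustration of Proposition 4 (not a substitute for the proof in the appendix), the following Python sketch (assuming numpy) builds a positive definite matrix and checks that its eigenvalues and randomly sampled quadratic forms are all positive.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + 0.1 * np.eye(4)       # symmetric and positive definite by construction

# Positive definiteness via eigenvalues (all eigenvalues of a symmetric matrix are real)
eigvals = np.linalg.eigvalsh(A)
print(eigvals.min() > 0)            # True

# Equivalent characterization by the quadratic form z^T A z (Proposition 4)
z = rng.standard_normal((10000, 4))
quad = np.einsum('ij,jk,ik->i', z, A, z)   # z^T A z for each random z
print(quad.min() > 0)               # True for these samples
```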

4.1.2 Open Sets, Closed Sets, and Compact Sets

Let the Euclidean distance between \(x, y \in {\mathbb R}^d\) be denoted as dist(x, y). Let us denote the open ball (excluding the boundary) of radius \(\epsilon > 0\) centered at \(z \in {\mathbb R}^d\) as \(B(z, \epsilon ) := \{y \in {\mathbb R}^d \mid dist(z, y) < \epsilon \}\). If for every \(z \in M\) there exists a radius \(\epsilon > 0\) such that \(B(z, \epsilon ) \subseteq M\), the subset \(M\subseteq {\mathbb R}^d\) is called an open set. On the other hand, \(z \in {\mathbb R}^d\) is called an adherent point of \(M\subseteq {\mathbb R}^d\) if, for any \(\epsilon > 0\), the intersection of \(B(z, \epsilon )\) and M is non-empty. If M contains all of its adherent points as its elements, M is called a closed set (see Fig. 4.1). In general, the complement of a closed set is an open set, and the complement of an open set is a closed set.

In fact, if M is a closed set, no point of the complement \({M}^C\) is an adherent point of M, so for each \(z \in {M}^C\) the radius of the open ball can be chosen small enough that the ball does not intersect M; hence \({M}^C\) is open. Conversely, if M is an open set, for each \(z \in {M}\) the radius of the open ball can be chosen small enough that the ball does not intersect \({M}^C\). Therefore, no adherent point of \({M}^C\) lies in M, that is, \({M}^C\) contains all of its adherent points and is closed.

Example 29

Assume that \(d=1\) and \(d=3\) for items 1–4 and item 5, respectively.

  1. The open interval (a, b) is an open set, and the closed interval [a, b] is a closed set.

  2. The set of all real numbers \(\mathbb R\) and the set of all integers \(\mathbb Z\) are closed sets (\(\mathbb R\) is also an open set).

  3. The set \({\mathbb R} \cap {\mathbb Z}^C\), which is the set of all real numbers \(\mathbb R\) excluding the set of all integers \(\mathbb Z\), is an open set.

  4. The set of all rational numbers \(\mathbb Q\) is neither an open set nor a closed set.

  5. The region \(\{(x, y, z) \in {\mathbb R}^3 \mid x^2 + y^2 + z^2 < 1, z \ge 0\}\) is neither an open set nor a closed set.

\(\blacksquare \)

Fig. 4.1 Open sets, closed sets, and cases that are neither. An open set is one in which every point has a sufficiently small neighborhood contained in the set. An adherent point is one whose neighborhood intersects the set no matter how small it is made. A closed set is one that contains all of its adherent points

In addition, there is a concept of compact sets related to closed sets. A set M is called compact if, no matter how the mapping \(M \ni x \mapsto \epsilon (x) \in {\mathbb R}_{>0}\) is defined, a finite number of points \(z_1, \ldots , z_m \in M\) can be chosen such that the union of open balls \(\cup _{i=1}^m B(z_i, \epsilon (z_i))\) contains M as a subset. In this book, we only deal with subsets of \({\mathbb R}^d\) as the universal set and the Euclidean distance as the distance. In this case, it is known that compact sets are exactly the bounded closed sets, where we say a set M is bounded when there exists a positive constant \(L>0\) such that \(dist(x, y) < L\) for any \(x, y \in M\). Among the closed sets in Example 29, [a, b] is compact, but \(\mathbb R\) and \(\mathbb Z\) are not.

Although the proof is omitted, if a set M is compact, a continuous function with domain M attains maximum and minimum values.

Example 30

Neither \(M=(0,1]\) nor \(M=[1, \infty )\) is compact. The continuous function \(f(x)=1/x\) does not attain a maximum value on \(M=(0,1]\) and does not attain a minimum value on \(M=[1, \infty )\). \(\blacksquare \)

Also, let dist(x, a) be the distance between \(x,a \in M\) (for \(M = {\mathbb R}\), for example, \(dist(x, a) = |x-a|\)). A function f with domain M is said to be continuous at \(x=a\) if, for any \(\epsilon > 0\), there exists a \(\delta = \delta (\epsilon , a)\) such that:

$$\begin{aligned} dist(x, a) < \delta \Longrightarrow |f(x) - f(a)| < \epsilon . \end{aligned}$$

If f is continuous at every \(a \in M\), then f is said to be continuous on M.

The continuity of functions can be defined not only for \(f: {\mathbb R}\rightarrow {\mathbb R}\). From Chap. 4 onwards, we will examine the set C(K) of continuous functions defined on a compact set K. The distance between elements \(\phi \) and \(\phi '\) of C(K) is defined by the sup-norm (uniform norm):

$$\begin{aligned} dist(\phi ,\phi '):=\sup _{\theta \in K}|\phi (\theta )-\phi '(\theta )|. \end{aligned}$$
(4.1)

Then, the continuity of a function \(f: C(K)\rightarrow {\mathbb R}\) at \(\phi = \phi _a \in C(K)\) is defined by the existence of \(\delta = \delta (\epsilon , \phi _a)\) such that for any \(\epsilon > 0\),

$$\begin{aligned} dist(\phi ,\phi _a) < \delta \Longrightarrow |f(\phi )-f(\phi _a)| < \epsilon . \end{aligned}$$
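As an illustration of the distance (4.1), the following Python sketch (assuming numpy; the grid approximation of the supremum is our own simplification) computes the sup-norm distance between two elements of C(K) for \(K=[0,1]\).

```python
import numpy as np

# Approximate the sup-norm distance (4.1) between two elements of C(K), K = [0, 1],
# by taking the maximum over a fine grid (a numerical stand-in for the supremum).
theta = np.linspace(0.0, 1.0, 10001)
phi  = np.sin(2 * np.pi * theta)
phi2 = np.sin(2 * np.pi * theta) + 0.05 * theta   # a small perturbation of phi

dist = np.max(np.abs(phi - phi2))
print(dist)   # approximately 0.05: phi and phi2 are close in the uniform norm
```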

4.1.3 Mean Value Theorem and Taylor Expansion

Here we discuss the Mean Value Theorem and Taylor expansion, which will be used several times in the following chapters. They are particularly necessary for mathematical analysis when the sample size n is large.

The Mean Value Theorem asserts that for a differentiable function \(f: {\mathbb R}\rightarrow {\mathbb R}\), if \(a < b\), then there exists a c such that \(a < c < b\) satisfying

$$\begin{aligned} \frac{f(b) - f(a)}{b - a} = f'(c). \end{aligned}$$
(4.2)

Example 31

For \(f(x) = x^2 - 3x + 2\), \(a = 2\), and \(b = 4\), we have

$$\begin{aligned} \frac{(b^2 - 3b + 2) - (a^2 - 3a + 2)}{b - a} = a + b - 3 = 3\ ,\ f'(c) = 2c - 3. \end{aligned}$$

So, \(c = 3\) satisfies the condition. \(\blacksquare \)

Equation (4.2) can be written as \(f(b) = f(a) + f'(c)(b - a)\), which is extended to Taylor’s theorem.

Namely, if f is continuous up to the \((n-1)\)-th derivative and is n times differentiable,

$$\begin{aligned} f(b)=f(a)+\frac{f'(a)}{1!}(b-a)+\frac{f''(a)}{2!}(b-a)^2+\cdots +\frac{f^{(n-1)}(a)}{(n-1)!}(b-a)^{n-1}+R_n \end{aligned}$$
(4.3)

with

$$\begin{aligned} R_n=\frac{f^{(n)}(c)}{n!}(b-a)^n. \end{aligned}$$

for some c with \(a<c<b\). If \(n=1\), this reduces to the Mean Value Theorem. The intermediate point c is sometimes written as \(\theta a +(1-\theta )b\) for some \(0<\theta <1\).

Setting \(b=x\) in (4.3), we get

$$\begin{aligned} f(x)=f(a)+\frac{f'(a)}{1!}(x-a)+\frac{f''(a)}{2!}(x-a)^2+\cdots +\frac{f^{(n-1)}(a)}{(n-1)!}(x-a)^{n-1}+R_n \end{aligned}$$

which is called the Taylor expansion of the function f at \(x=a\). Furthermore, setting \(a=0\), we get

$$\begin{aligned} f(x)=f(0)+\frac{f'(0)}{1!}x+\frac{f''(0)}{2!}x^2+\cdots +\frac{f^{(n-1)}(0)}{(n-1)!}x^{n-1}+R_n \end{aligned}$$

which is called the Maclaurin expansion.

Example 32

When \(e^x\) and \(\log (1+x)\) are Maclaurin-expanded, there exist \(0<\theta <1\) for each of

$$\begin{aligned} e^x=1+x+\frac{x^2}{2}+\cdots + \frac{x^{n-1}}{(n-1)!}+ \frac{x^{n}}{n!}e^{\theta x} \end{aligned}$$
(4.4)

and

$$\begin{aligned} \log (1+x)=x-\frac{x^2}{2}+\frac{x^3}{3}-\cdots +(-1)^{n-2}\frac{x^{n-1}}{n-1}+(-1)^{n-1}\frac{x^n}{n}\cdot \frac{1}{(1+\theta x)^n}. \end{aligned}$$
(4.5)

\(\blacksquare \)
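As a numerical check of (4.4), the following Python sketch (plain Python, using only the math module; the choices of x and n are arbitrary) computes the remainder for a particular x and n and recovers a value \(\theta \) that indeed lies in (0, 1).

```python
from math import factorial, exp, log

# Check (4.4): for fixed x and n there is some 0 < theta < 1 such that
# e^x equals the degree-(n-1) Maclaurin polynomial plus x^n e^{theta x} / n!.
x, n = 0.7, 5
poly = sum(x**k / factorial(k) for k in range(n))   # terms up to x^{n-1}/(n-1)!
remainder = exp(x) - poly
# Solve for theta from the remainder form R_n = x^n e^{theta x} / n!
theta = log(remainder * factorial(n) / x**n) / x
print(remainder, theta, 0 < theta < 1)              # theta is indeed in (0, 1)
```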

For the case of two variables, for a function \(f: {\mathbb R}^2\rightarrow {\mathbb R}\) that is continuous up to the \((n-1)\)-th derivative and n times differentiable, the Taylor expansion at \((x,y)=(a,b)\) can be written, for some \(0<\theta <1\), as

$$\begin{aligned} f(x,y)= & {} \sum _{k=0}^{n-1}\sum _{i=0}^k \frac{1}{(k-i)!i!} (x-a)^i(y-b)^{k-i}\frac{\partial ^kf}{\partial x^i\partial y^{k-i}}(a,b)\\ {} & {} + \sum _{i=0}^n \frac{1}{(n-i)!i!} (x-a)^i(y-b)^{n-i}\frac{\partial ^nf}{\partial x^i\partial y^{n-i}}(\theta a+(1-\theta )x,\theta b+(1-\theta )y).\\ \end{aligned}$$

In the case of \(n=2\) for d variables, when f has continuous first derivatives and is twice differentiable, the Taylor expansion around \(x=(x_1,\ldots ,x_d)^\top =(a_1,\ldots ,a_d)^\top =a\) is written as

$$\begin{aligned} f(x)= & {} f(a)+\sum _{i=1}^d(x_i-a_i)\frac{\partial f}{\partial x_i}(a)+ \frac{1}{2}\sum _{i=1}^d\sum _{j=1}^d(x_i-a_i)(x_j-a_j)\frac{\partial ^2 f}{\partial x_i\partial x_j}(\theta a+(1-\theta )x)\\ = & {} f(a)+(x-a)^\top \{\nabla f(a)\}+\frac{1}{2}(x-a)^\top \{\nabla ^2 f(\theta a+(1-\theta )x)\}(x-a), \end{aligned}$$

where \(\nabla f: {\mathbb R}^d\rightarrow {\mathbb R}^d\) is a vector consisting of the d partial derivatives of f, \(\displaystyle \frac{\partial f}{\partial x_i}\), and \(\nabla ^2 f: {\mathbb R}^d\rightarrow {\mathbb R}^{d\times d}\) is a matrix (the Hessian matrix) consisting of the second partial derivatives of f, \(\displaystyle \frac{\partial ^2 f}{\partial x_i\partial x_j}\).
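The following Python sketch (assuming numpy; the function f and the point a are arbitrary choices) illustrates the \(d=2\) case: replacing the Hessian at the intermediate point by the Hessian at a gives an approximation whose error shrinks like the third power of \(|x-a|\).

```python
import numpy as np

# Near a, f(x) is approximated by f(a) + (x-a)^T grad f(a) + (1/2)(x-a)^T Hessian(a) (x-a),
# with an error of third order in |x - a| (here f(x) = exp(x1) sin(x2)).
def f(x):
    return np.exp(x[0]) * np.sin(x[1])

def grad(x):
    return np.array([np.exp(x[0]) * np.sin(x[1]), np.exp(x[0]) * np.cos(x[1])])

def hess(x):
    return np.array([[np.exp(x[0]) * np.sin(x[1]),  np.exp(x[0]) * np.cos(x[1])],
                     [np.exp(x[0]) * np.cos(x[1]), -np.exp(x[0]) * np.sin(x[1])]])

a = np.array([0.3, 0.8])
for eps in [1e-1, 1e-2, 1e-3]:
    h = eps * np.array([1.0, -0.5])
    approx = f(a) + h @ grad(a) + 0.5 * h @ hess(a) @ h
    print(eps, abs(f(a + h) - approx))   # error shrinks roughly like eps**3
```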

4.2 Analytic Functions

In the following, we will denote the set of non-negative integers by \(\mathbb N\). Firstly, for \(r=(r_1,\dots ,r_d)\in {\mathbb N}^d\), \(x=(x_1,\dots ,x_d),b=(b_1,\dots ,b_d)\in {\mathbb R}^d\), \(a_r=a_{r_1,\dots ,r_d}\in {\mathbb R}\), we define

$$\begin{aligned} a_r(x-b)^r :=a_{r_1,\dots ,r_d}(x_1-b_1)^{r_1}\dots (x_d-b_d)^{r_d}. \end{aligned}$$

A sum of such terms

$$\begin{aligned} f(x):=\sum _{r\in {\mathbb N}^d}a_r(x-b)^r =\sum _{r_1\in {\mathbb N}}\dots \sum _{r_d\in {\mathbb N}}a_{r_1,\dots ,r_d}(x_1-b_1)^{r_1}\dots (x_d-b_d)^{r_d}\ ,\ x\in {\mathbb R}^d \end{aligned}$$
(4.6)

is called a power series. When there are only a finite number of non-zero terms, we call f(x) a polynomial with real coefficients in \(x_1,\dots ,x_d\), and we denote the set of such polynomials as \({\mathbb R}[x]\) or \({\mathbb R}[x_1,\dots ,x_d]\). Furthermore, when there exists an open set U (\(b\in U\subseteq {\mathbb R}^d\)) such that \(\sum _r|a_r|\, |(x-b)^r|<\infty \) for any \(x\in U\), we say that f(x) converges absolutely on U. In this case, the value of the infinite series (4.6) is independent of the order of the sums \(\sum _{r_1},\dots ,\sum _{r_d}\) and is uniquely determined. We call such a function \(f: U\rightarrow {\mathbb R}\) a (real) analytic function.

Example 33

For the infinite series \(\sum _{n=0}^\infty a_n\) with \(a_n=(-1)^n\), we can write it in two ways:

$$\begin{aligned} (1-1)+(1-1)+\cdots =0+0+\cdots \end{aligned}$$

and

$$\begin{aligned} 1-(1-1)-(1-1)-\dots =1-0-0-\dots , \end{aligned}$$

which is due to the fact that \(\sum _{n=0}^\infty |a_n|=1+1+\cdots =\infty \). However, in the case of \(a_n=(-\frac{1}{2})^n\), we have \(\sum _{n=0}^\infty |a_n|=1+\frac{1}{2}+\cdots =2<\infty \), and hence the series converges to the same value regardless of the order of summation.

\(\blacksquare \)

Let \(\{a_n\}\) be a sequence of real numbers and \(c\in {\mathbb R}\). When the power series \(\sum _{n=0}^\infty a_n(x-c)^n\) converges absolutely if \(|x-c|<R\) and diverges if \(|x-c|>R\), we call R the radius of convergence (the case where \(|x-c|\) equals the radius of convergence needs to be investigated separately). If \(a_n\not =0\) except for a finite number of terms and the limit exists, \(R:=\lim _{n\rightarrow \infty } \left| \frac{a_{n}}{a_{n+1}}\right| \) is the radius of convergence. In fact, if the absolute ratio of adjacent terms

$$\begin{aligned} r:=\lim _{n\rightarrow \infty }\left| \frac{a_{n+1}(x-c)^{n+1}}{a_{n}(x-c)^{n}}\right| =\lim _{n\rightarrow \infty }\left| \frac{a_{n+1}}{a_n}\right| \cdot |x-c| \end{aligned}$$

is \(0\le r<1\), the series converges absolutely, and if \(1<r\le \infty \), it diverges.

Example 34

For

$$\begin{aligned} f(x)=\sum _{n=1}^\infty \frac{1}{n}x^n =x+\frac{1}{2}x^2+\frac{1}{3}x^3+\cdots \end{aligned}$$

the absolute ratio of adjacent terms is

$$\begin{aligned} \lim _{n\rightarrow \infty }\frac{|x|^{n+1}}{|x|^n}\frac{1/(n+1)}{1/n}={|x|}\lim _{n\rightarrow \infty }\frac{n}{n+1}={|x|} \end{aligned}$$

which is less than 1 if \(|x|<1\); therefore, the series converges absolutely for \(|x|<1\). When investigating the case of \(|x|=1\), it becomes

$$\begin{aligned} \sum _{n=1}^\infty \frac{1}{n}=1+\frac{1}{2}+\frac{1}{3}+\cdots >\lim _{n\rightarrow \infty }\int _1^n \frac{dx}{x}=\lim _{n\rightarrow \infty } \log n=\infty , \end{aligned}$$

so it does not converge absolutely when \(|x|=1\). Therefore, we can set the open set of the domain of f to be \(U=(-1,1)\). \(\blacksquare \)

Since taking the absolute value of each term makes every term non-negative, absolute convergence is a mode of convergence that does not depend on the order of summation. What problems, then, arise for convergence that does depend on the order of summation (conditional convergence)?

Example 35

If the series \(\displaystyle \sum _{n=1}^\infty (-1)^{n-1}\frac{1}{n}\) is summed in the order of

$$\begin{aligned} 1-\frac{1}{2}+\frac{1}{3}-\frac{1}{4}+\cdots =\sum _{k=1}^\infty \frac{(-1)^{k-1}}{k}, \end{aligned}$$
(4.7)

it becomes \(\log 2\). In fact, if we denote the right-hand side of

$$\begin{aligned} \sum _{k=1}^{2n}\frac{(-1)^{k-1}}{k}=\left( \sum _{k=1}^{2n}\frac{(-1)^{k-1}}{k}+2\sum _{k=1}^n \frac{1}{2k}\right) -2\sum _{k=1}^n \frac{1}{2k}= \sum _{k=1}^{2n}\frac{1}{k}-\sum _{k=1}^{n}\frac{1}{k}=\sum _{k=1}^n\frac{1}{n+k} \end{aligned}$$

as \(S_n\), the equations

$$\begin{aligned} S_n=\frac{1}{n}\sum _{k=1}^n \frac{1}{1+k/n}\le \int _0^1\frac{dx}{1+x}\le \frac{1}{n}\sum _{k=0}^{n-1}\frac{1}{1+k/n}=S_n+\frac{1}{2n} \end{aligned}$$

and

$$\begin{aligned} \int _0^1\frac{dx}{1+x}-\frac{1}{2n}\le S_n\le \int _0^1\frac{dx}{1+x}=\log 2 \end{aligned}$$

hold, so \(S_n\rightarrow \log 2\), which shows that (4.7) equals \(\log 2\). On the other hand, suppose we first add the terms for \(n=1,2,4\) and then, for each \(n\ge 2\), add the term with odd denominator \(2n-1\) (\(\ge 3\)), the term with denominator \(2(2n-1)\) (even, not divisible by 4, \(\ge 6\)), and the term with denominator 4n (a multiple of 4, \(\ge 8\)). Then the same terms, summed in this order, can be calculated as

$$\begin{aligned} {} & {} \sum _{n=1}^\infty (-1)^{n-1}\frac{1}{n}=1-\frac{1}{2}-\frac{1}{4} +\sum _{n=2}^\infty \{\frac{1}{2n-1}-\frac{1}{2(2n-1)}-\frac{1}{4n}\}= \frac{1}{4}+\frac{1}{2}\sum _{n=2}^\infty \{\frac{1}{2n-1}-\frac{1}{2n}\}\\ = & {} \frac{1}{2}\sum _{n=1}^\infty \{\frac{1}{2n-1}-\frac{1}{2n}\}=\frac{1}{2}\lim _{n\rightarrow \infty }S_n=\frac{1}{2}\log 2. \end{aligned}$$

\(\blacksquare \)
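A numerical confirmation of Example 35 (a Python sketch; the truncation length N is arbitrary): summing the same terms in two different orders gives two different limits.

```python
from math import log

N = 200000
# Natural order: 1 - 1/2 + 1/3 - 1/4 + ...  -> log 2
natural = sum((-1) ** (k - 1) / k for k in range(1, N + 1))

# Rearranged order: one positive term followed by two negative terms,
# 1 - 1/2 - 1/4 + 1/3 - 1/6 - 1/8 + ...     -> (1/2) log 2
rearranged = sum(1 / (2 * n - 1) - 1 / (2 * (2 * n - 1)) - 1 / (4 * n)
                 for n in range(1, N + 1))

print(natural, log(2))        # approximately 0.6931
print(rearranged, log(2) / 2) # approximately 0.3466
```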

Although the proof is omitted, it is known that any series that converges conditionally can be made to converge to any real number by changing the order of its sum (Riemann’s rearrangement theorem).

Example 36

In \(U:=\{(x,y)\in {\mathbb R}^2\}\), the series

$$\begin{aligned} \sum _{m=0}^\infty \sum _{n=0}^\infty \frac{x^my^n}{m!n!} \end{aligned}$$

converges absolutely. In fact, the ratio of the absolute values of any two adjacent terms converges to 0. Therefore, we can rearrange the order of the terms, and we obtain

$$\begin{aligned} \sum _{m=0}^\infty \frac{x^m}{m!}\cdot \sum _{n=0}^\infty \frac{y^n}{n!}=e^x\cdot e^y, \end{aligned}$$

so the function \(f: U\rightarrow {\mathbb R}\), \(f(x,y)=e^{x+y}\), is an analytic function. \(\blacksquare \)

Here, if a function is r times differentiable (\(r\ge 0\)) and its r-th derivative is continuous, the function is said to be of class \(C^r\). An analytic function can be differentiated term by term any number of times, and the asymptotic ratio of adjacent terms, and hence the radius of convergence, remains the same, so its derivatives are again analytic functions. That is, analytic functions are of class \(C^\infty \). Moreover, the power series expansion of an analytic function is unique and coincides with its Taylor expansion.

Note, however, that a function being of class \(C^\infty \) does not necessarily mean that it is an analytic function.

Example 37

The function

$$\begin{aligned} f(x)= \left\{ \begin{array}{ll} \exp (-1/x)\ ,&{} x> 0\\ 0\ ,&{} x\le 0\\ \end{array} \right. \end{aligned}$$

is of class \(C^\infty \) but not analytic. In fact, the Taylor expansion at \(x=0\) has all coefficients \(a_r=0\), \(r\in {\mathbb N}\) (Exercise 31), so the Taylor series is identically zero; if f were analytic at \(x=0\), by the uniqueness of the power series expansion f would vanish on a neighborhood of 0, contradicting \(f(x)>0\) for \(x>0\). \(\blacksquare \)

In Chap. 8, we will assume that the average log-likelihood ratio \(K(\theta )={\mathbb E}_X[\log \frac{p(X|\theta _*)}{p(X|\theta )}]\) and the prior distribution \(\varphi (\theta )\) are analytic functions of \(\theta \in \Theta \) and proceed with the discussion. Up to this point, for a power series with real coefficients \(a_r\)

$$\begin{aligned} \sum _{r\in {\mathbb N}^d}a_r(x-b)^r, \end{aligned}$$

we have considered whether \(\sum _{r\in {\mathbb N}^d}|a_r|\ |(x-b)^r|\) is finite. In this book, we further assume that the log-likelihood ratio \(f(x,\theta )=\log \frac{p(x|\theta _*)}{p(x|\theta )}\) is also an analytic function of \(\theta \in \Theta \). However, since the coefficients now depend on x, some preparation is needed to extend the definition: we regard the coefficients as functions \(a_r: \mathcal{X}\rightarrow {\mathbb R}\) and use a norm of \(a_r\) in place of the absolute value.

The set V with the properties

$$\begin{aligned} f,g\in V\Longrightarrow f+g\in V \end{aligned}$$

and

$$\begin{aligned} \alpha \in {\mathbb R}, f\in V\Longrightarrow \alpha f\in V \end{aligned}$$

is called a linear space. For a linear space V, a map \(\Vert \cdot \Vert : V\rightarrow {\mathbb R}\) satisfying the following conditions is called a norm on V: for \(\alpha \in {\mathbb R}\) and \(f,g\in V\),

$$\begin{aligned} \Vert \alpha f\Vert =|\alpha |\cdot \Vert f\Vert \ ,\ \Vert f+g\Vert \le \Vert f\Vert +\Vert g\Vert \ ,\ \Vert f\Vert \ge 0\ ,\ \Vert f\Vert =0\Longrightarrow f=0. \end{aligned}$$

In this book, we denote the set of \(f: \mathcal{X}\rightarrow {\mathbb R}\) for which

$$\begin{aligned} \Vert f\Vert _2:=\sqrt{\int _\mathcal{X}\{f(x)\}^2q(x)dx} \end{aligned}$$

is finite as \(L^2(q)\), where the true distribution q is used.

Here, the absolute value \(|\cdot |\) is a norm on the one-dimensional Euclidean space \(\mathbb R\), and \(\Vert \cdot \Vert _2\) is a norm on the linear space \(L^2(q)\) (Exercise 38). We call f an analytic function taking real values when \(a_r\in {\mathbb R}\) and an analytic function taking values in \(L^2(q)\) when \(a_r\in L^2(q)\); in this book, we simply call the former an analytic function. Also, writing either norm as \(\Vert \cdot \Vert \), the set of x for which \(\sum _{r\in {\mathbb N}^d}\Vert a_r\Vert \ |(x-b)^r|\) is finite becomes the domain.
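As an illustration of the norm \(\Vert \cdot \Vert _2\), the following Python sketch (assuming numpy; taking q to be the standard normal density and approximating the integral on a grid are our own choices) computes \(\Vert f\Vert _2\) for a simple f.

```python
import numpy as np

# Approximate ||f||_2 = sqrt( integral of f(x)^2 q(x) dx ) on a grid,
# with q the standard normal density and f(x) = x^2 - 1 as an example.
x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]
q = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # true density q(x)
f = x**2 - 1                                 # an example of f in L^2(q)

norm2 = np.sqrt(np.sum(f**2 * q) * dx)
print(norm2)   # sqrt(E[(X^2-1)^2]) = sqrt(2) for X ~ N(0,1), approximately 1.414
```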

For example, if the log-likelihood ratio \(f(x,\theta )\)

$$\begin{aligned} f(x,\theta )=\sum _{r\in {\mathbb N}^d}a_r(x)(\theta -\theta ^1)^r \end{aligned}$$

is an analytic function taking values in \(L^2(q)\), this means that there exists a convergence domain (with non-zero radius of convergence) on which

$$\begin{aligned} \sum _{r\in {\mathbb N}^d}\Vert a_r\Vert _2|(\theta -\theta ^1)^r|<\infty . \end{aligned}$$

4.3 Law of Large Numbers and Central Limit Theorem

4.3.1 Random Variables

By preparing a universal set \(\Omega \) and a set of its events in advance, when

$$\begin{aligned} \{\omega \in \Omega \mid X(\omega )\in O\} \end{aligned}$$

becomes an event for any open set O of \(\mathbb R\), we say that \(X: \Omega \ni \omega \mapsto X(\omega )\in {\mathbb R}\) is measurable. Also, X is called a random variable that takes values in \(\mathbb R\). However, the way to determine the probability needs to be defined separately.

Example 38

When \(\Omega =\{1,2,3,4,5,6\}\) and \(X(\omega )=(-1)^\omega \), it is necessary that at least \(\{1,3,5\}\) and \(\{2,4,6\}\) are events. That is, even if union, intersection, and complement operations are performed on the empty set \(\{\}\), the universal set \(\Omega \), and these two sets, no sets other than these four are generated. Also, by computing the set of \(\omega \in \Omega \) such that \(X(\omega )\in (0,1)\), the set of \(\omega \in \Omega \) such that \(X(\omega )\in (-2,1)\), and so on, we can see that for any open set O, the set of \(\omega \in \Omega \) with \(X(\omega )\in O\) is one of those four. The random variable X only determines the events; the probability needs to be specified separately according to the axioms. \(\blacksquare \)

Random variables can be defined not only as maps \(\Omega \rightarrow {\mathbb R}\). If \(\eta : \Omega \rightarrow C(K)\) is measurable, where C(K) is the set of continuous functions defined on a compact set K, then \(\eta \) is said to be a random variable that takes values in C(K). Rather than a random variable, it can be viewed as a random function. In defining measurability, the open sets of C(K) are defined using the distance induced by the uniform norm (4.1).

4.3.2 Order Notation

First, we shall define the limit of a sequence of real numbers, which appears frequently in this book.

An infinitely long sequence of real numbers \(\{a_n\}\) is said to converge to \(\alpha \) as \(n\rightarrow \infty \), or \(\lim _{n\rightarrow \infty }a_n=\alpha \), if for any \(\epsilon >0\), \(|a_n-\alpha |<\epsilon \) holds except for a finite number of n.

Also, for a function g(n) of positive integer n such as \(g(n)=1,n,n^2\), if \(|g(n)a_n|<\epsilon \) holds for any \(\epsilon >0\) except for a finite number of n, i.e., if \(g(n)a_n\) converges to 0 as \(n\rightarrow \infty \), we write \(a_n=o(\frac{1}{g(n)})\). On the other hand, if there exists an \(M>0\) such that \(|g(n)a_n|<M\) holds except for a finite number of n, i.e., if \(g(n)a_n\) is bounded, we write \(a_n=O(\frac{1}{g(n)})\). For example, if it is O(1/n), it is also o(1).

4.3.3 Law of Large Numbers

Next, we will examine whether the sequence of probabilities \(\{P(A_n)\}\) for a sequence of events \(\{A_n\}\) converges to 1. When the probability \(P(|X_n-\alpha |<\epsilon )\) converges to 1 as \(n\rightarrow \infty \) for any \(\epsilon >0\), the sequence of random variables \(\{X_n\}\) is said to stochastically converge (converge in probability) to \(\alpha \), and we write it as \(X_n\xrightarrow {P} \alpha \).

The Weak Law of Large Numbers is one of the most important theorems regarding stochastic convergence. Before introducing it, we shall show an important inequality.

Proposition 5

(Chebyshev’s Inequality) For a random variable X with mean \(\mu \) and variance \(\sigma ^2>0\), and for any constant \(k>0\), the inequality

$$\begin{aligned} P(|X-\mu |\ge k)\le \sigma ^2/k^2 \end{aligned}$$

holds.

Proof

Define I so that \(I(A)=1\) when event A occurs and \(I(A)=0\) otherwise. Then, the following inequality holds.

$$\begin{aligned} \sigma ^2={\mathbb E}_X[(X-\mu )^2]\ge {\mathbb E}_X[(X-\mu )^2I(|X-\mu |\ge k)]\ge k^2\cdot P(|X-\mu |\ge k). \end{aligned}$$

\(\blacksquare \)
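A Monte Carlo illustration of Proposition 5 (a Python sketch assuming numpy; the choice of the exponential distribution with \(\lambda =1\), for which \(\mu =\sigma ^2=1\), is arbitrary): the simulated probabilities never exceed the bound \(\sigma ^2/k^2\).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=1_000_000)   # mean 1, variance 1
mu, var = 1.0, 1.0

for k in [1.5, 2.0, 3.0]:
    lhs = np.mean(np.abs(X - mu) >= k)   # estimate of P(|X - mu| >= k)
    print(k, lhs, var / k**2)            # the estimate never exceeds sigma^2 / k^2
```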

Here, consider \(\left\{ X_{n}\right\} ^{\infty }_{n=1}\) and \(\left\{ \epsilon _{n}\right\} _{n=1}^{\infty }\) as sequences of random variables. When

$$\begin{aligned} \frac{X_{n}}{\epsilon _{n}} {\mathop {\longrightarrow }\limits ^{P}} 0 \end{aligned}$$

holds as \(n \rightarrow \infty \), we write \(X_{n}=o_{P}\left( \epsilon _{n}\right) \). Especially, when \(X_{n} {\mathop {\longrightarrow }\limits ^{P}} 0\) holds, we write \(X_{n}=o_{P}(1)\).

Moreover, when there exists an \(M>0\) (which may depend on \(\delta \)) such that, except for a finite number of n,

$$\begin{aligned} P\left( \left| X_{n}\right| \le M\left| \epsilon _{n}\right| \right) \ge 1-\delta \end{aligned}$$

for any \(\delta >0\), we write

$$\begin{aligned} X_{n}=O_{P}\left( \epsilon _{n}\right) . \end{aligned}$$

Especially, if \(P\left( \left| X_{n}\right| \le M\right) \ge 1-\delta \), we write \(X_{n}=O_{P}(1)\).

\(o_{P}\) and \(O_{P}\) have the following properties for sequences of random variables \(\left\{ \epsilon _{n}\right\} _{n=1}^{\infty }\) and \(\left\{ \delta _{n}\right\} _{n=1}^{\infty }\):

$$\begin{aligned} \text {If } X_{n}=o_{P}\left( \epsilon _{n}\right) \text { and } Y_{n}=o_{P}\left( \epsilon _{n}\right) , \text { then } X_{n} \pm Y_{n}=o_{P}\left( \epsilon _{n}\right) . \end{aligned}$$
(4.8)
$$\begin{aligned} \text {If } X_{n}=o_{P}\left( \epsilon _{n}\right) \text { and } Y_{n}=O_{P}\left( \delta _{n}\right) , \text { then } X_{n} Y_{n}=o_{P}\left( \delta _{n} \epsilon _{n}\right) . \end{aligned}$$
(4.9)

In particular, since \(o_{P}(\delta _n)\) implies \(O_{P}(\delta _n)\), (4.9) implies (4.10).

$$\begin{aligned} \text {If } X_{n}=o_{P}\left( \epsilon _{n}\right) \text { and } Y_{n}=o_{P}\left( \delta _{n}\right) , \text { then } X_{n} Y_{n}=o_{P}\left( \delta _{n} \epsilon _{n}\right) . \end{aligned}$$
(4.10)

Moreover, for \(a\in {\mathbb R}\) and a continuous function \(g: {\mathbb R}\rightarrow {\mathbb R}\), we have

$$\begin{aligned} \text {If } X_n\xrightarrow {P} a, \text { then } g(X_n) \xrightarrow {P} g(a). \end{aligned}$$
(4.11)

Equations (4.8)–(4.10) are known as Slutsky’s theorem and Eq. (4.11) is known as the Continuous Mapping Theorem. For proofs, see [16], for example. The notation \(O_P, o_P\) is not commonly used in general statistics, but it is frequently used in Watanabe’s Bayesian theory, so it is necessary to understand it well.

Example 39

An independent sequence of random variables \(X_1,X_2,\dots \) such that \(X_n\sim N(0,1)\) is \(O_P(1)\). Also, a sequence of random variables \(X_1,X_2,\dots \) such that \(X_n\sim N(0,1/n)\) stochastically converges to 0, hence \(X_n=o_P(1)\). \(\blacksquare \)
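A small simulation of Example 39 (a Python sketch assuming numpy): \(P(|X_n|>0.1)\) tends to 0, so \(X_n=o_P(1)\), while \(\sqrt{n}\,X_n\sim N(0,1)\) stays bounded in probability, so \(\sqrt{n}\,X_n=O_P(1)\), i.e., \(X_n=O_P(1/\sqrt{n})\).

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [10, 1000, 100000]:
    Xn = rng.normal(0.0, 1.0 / np.sqrt(n), size=10000)    # many draws of X_n ~ N(0, 1/n)
    print(n,
          np.mean(np.abs(Xn) > 0.1),                 # P(|X_n| > 0.1) -> 0      (o_P(1))
          np.mean(np.abs(np.sqrt(n) * Xn) <= 3.0))   # stays near 0.997         (O_P(1))
```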

Proposition 6

(Weak Law of Large Numbers) For a sequence of independent and identically distributed random variables \(\{X_n\}\), the average \(\displaystyle Z_n:=\frac{{X_1+\cdots +X_n}}{n}\) stochastically converges to its expected value \(\mu \).

Proof

First, the mean and variance of \(Z_n\) are, respectively, \(\displaystyle {\mathbb E}[Z_{n}]= {\mathbb E}[\frac{X_1+\cdots +X_n}{n}]={\mathbb E}[X_{1}]=\mu \) and \(\displaystyle {\mathbb V}[Z_{n}]={\mathbb V}[\frac{X_1+\cdots +X_n}{n}]={\mathbb V}[X_{1}]/n= \sigma ^2/n\). Applying these to Proposition 5, we obtain

$$\begin{aligned} P(|Z_n-\mu |\ge \epsilon )\le \frac{\sigma ^2}{n}/\epsilon ^2. \end{aligned}$$

Therefore, as \(n\rightarrow \infty \), the probability of the event \({(|Z_n-\mu |\ge \epsilon )}\) approaches 0.    \(\blacksquare \)

Fig. 4.2 A random number following a binomial distribution is generated \(n=200\) times, and the convergence of the sequence of random variables \(\{Z_n\}\) is illustrated. A sequence was generated eight times each for the probabilities of 1 occurring \(p=0.5\) and \(p=0.1\) (Example 40). Since the variances of \(X_n\) for \(p=0.5, 0.1\) are \(p(1-p)=0.25, 0.09\) respectively, the variance of \(Z_i\) at each \(i=1,\ldots ,n\) is 0.25/i, 0.09/i. It can be seen that the estimated values up to each point converge to \(p=0.5, 0.1\) respectively

Example 40

We generated random numbers following a binomial distribution 200 times (\(n=200\)), calculated \(Z_i\) for each point up to \(i=1,\ldots ,n\), and checked the degree of convergence (see Fig. 4.2). We generated \(Z_{n}\) 8 times each for \(p=0.5\) and \(p=0.1\).

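The original code listing is not reproduced here; the following is a minimal Python sketch of the same experiment (the use of numpy and matplotlib is an assumption).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 200

plt.figure(figsize=(8, 4))
for p in [0.5, 0.1]:
    for _ in range(8):                             # eight sequences for each p
        X = rng.binomial(1, p, size=n)             # X_1, ..., X_n taking values in {0, 1}
        Z = np.cumsum(X) / np.arange(1, n + 1)     # Z_i = (X_1 + ... + X_i) / i
        plt.plot(Z, linewidth=0.8)
    plt.axhline(p, linestyle="--")                 # the limit value p
plt.xlabel("i")
plt.ylabel("Z_i")
plt.show()
```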

\(\blacksquare \)

4.3.4 Central Limit Theorem

In the following, we denote the mean and variance of the true distribution q as \(\mu \) and \(\sigma ^2\), respectively. The Central Limit Theorem is, alongside the Law of Large Numbers, an important asymptotic property of a sequence of random variables \(\{X_n\}\).

Proposition 7

(Central Limit Theorem) For a sequence of independent random variables \(\{X_n\}\) each following the same distribution with mean \(\mu \) and variance \(\sigma ^2\),

$$\begin{aligned} Y_n:=\frac{X_1+\cdots +X_n-n\mu }{\sigma \sqrt{n}} \end{aligned}$$
(4.12)

converges in distribution to the standard normal distribution as \(n\rightarrow \infty \) (convergence in distribution is defined later in this section).

This book will not prove this theorem, but we will confirm its meaning by giving examples of this theorem and its extensions. First, it should be noted that each of the random variables in the sequence \(\{X_n\}\) does not necessarily need to follow a normal distribution.

Example 41

(Application of the Central Limit Theorem) Setting \(n=100\), for each distribution q below, we generated \(m=500\) random samples of (4.12) and plotted the distribution of \(Y_n\) (see Fig. 4.3).

  1. Standard normal distribution

  2. Exponential distribution with \(\lambda =1\)

  3. Binomial distribution with \(p=0.1\)

  4. Poisson distribution with \(\lambda =1\)

Note that the exponential distribution is a distribution with a probability density function that is 0 for \(x< 0\) and

$$\begin{aligned} q(x):=\lambda e^{-\lambda x} \end{aligned}$$

for \(x\ge 0\). The Poisson distribution takes values \(x=0,1,2,\ldots \), with probabilities \(q(x)=e^{-\lambda }\lambda ^x/x!\). The experiment was run using the following code:

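(The original listing is not reproduced here; below is a minimal Python sketch of the same experiment, with numpy and matplotlib as assumptions and the binomial distribution interpreted as a single 0/1 trial with \(p=0.1\).)

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n, m = 100, 500

samplers = {                     # each entry: (draw n samples, mean mu, standard deviation sigma)
    "normal":      (lambda: rng.standard_normal(n),  0.0, 1.0),
    "exponential": (lambda: rng.exponential(1.0, n), 1.0, 1.0),
    "binomial":    (lambda: rng.binomial(1, 0.1, n), 0.1, np.sqrt(0.1 * 0.9)),
    "poisson":     (lambda: rng.poisson(1.0, n),     1.0, 1.0),
}

fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for ax, (name, (draw, mu, sigma)) in zip(axes, samplers.items()):
    # compute (4.12) once per repetition, m times in total
    Y = np.array([(draw().sum() - n * mu) / (sigma * np.sqrt(n)) for _ in range(m)])
    ax.hist(Y, bins=20, density=True)
    ax.set_title(name)
plt.show()
```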

It can be seen that regardless of the shape of distribution q, even with \(n=100\), the shape is close to the standard normal distribution. \(\blacksquare \)

Fig. 4.3 We generated \(n=100\) random samples following the normal, exponential, binomial, and Poisson distributions, and calculated the value of \(Y_n\) once. This process was repeated \(m=500\) times to examine its distribution. Even with \(n=100\), the distribution is close to a normal distribution

The above Central Limit Theorem assumed that \(X_1,\ldots ,X_n\) were each real numbers (one-dimensional), and assumed \(\mu \in {\mathbb R}\), \(\sigma ^2>0\). Similar assertions hold even for two-dimensional and d-dimensional (\(d\ge 1\)) cases. Hereinafter, \(N(\mu ,\Sigma )\) denotes a d-dimensional normal distribution with mean \(\mu \in {\mathbb R}^d\) and covariance matrix \(\Sigma \in {\mathbb R}^{d\times d}\). The probability density function of \(X\sim N(\mu ,\Sigma )\) is as follows.

$$\begin{aligned} f(x)=\frac{1}{(2\pi )^{d/2}(\det \Sigma )^{1/2}}\exp \{-\frac{1}{2}(x-\mu )^\top \Sigma ^{-1}(x-\mu ) \}. \end{aligned}$$

In general, when the distribution functions \(\{F_n(x)\}\) of a sequence of real-valued random variables \(\{X_n\}\) converge to the distribution function \(F_X(x):=\int _{-\infty }^xq(t)dt\) of a random variable X at each continuity point x of \(F_X\) as \(n\rightarrow \infty \),

$$\begin{aligned} \lim _{n\rightarrow \infty }{}F_n(x)=F_X(x) \end{aligned}$$
(4.13)

we say that \(\{X_n\}\) converges in distribution to X, and write this as \(X_n\xrightarrow {d} X\). If the probability density function of X is q, we sometimes write this as \(X_n\xrightarrow {d} q\). For example, the Central Limit Theorem can be written as \(Y_n\xrightarrow {d} N(0,1)\). It is known that (4.13) is equivalent to

$$\begin{aligned} \lim _{n\rightarrow \infty }{\mathbb E}_n[g(X_n)]={\mathbb E}_X[g(X)] \end{aligned}$$
(4.14)

for any bounded and continuous function \(g: {\mathbb R}\rightarrow {\mathbb R}\) (Exercise 38), where \({\mathbb E}_n[\cdot ]\), \({\mathbb E}_X[\cdot ]\) are the operations of the mean with respect to the distribution functions \(F_n, F_X\) respectively.

Proposition 8

Consider independent and identically distributed random variables \(X_1,\ldots ,X_n\) with mean \(\mu \in {\mathbb R}^d\) and covariance matrix \(\Sigma \in {\mathbb R}^{d\times d}\) (they do not necessarily follow a normal distribution). Then, we have

$$\begin{aligned} \frac{X_1+\cdots +X_n-n\mu }{\sqrt{n}}\xrightarrow {d} N(0,\Sigma ). \end{aligned}$$

On the other hand, for a random variable \(\eta _n: \Omega \rightarrow C(K)\) that takes values in C(K), the concept of a distribution function does not exist because C(K) is not a Euclidean space. Therefore, for any bounded and continuous function \(g: C(K) \rightarrow {\mathbb R}\), we define the convergence in distribution of the sequence \(\eta _1,\eta _2,\dots \) to a random variable \(\eta \) taking values in C(K) (\(\eta _n\xrightarrow {d} \eta \)) as

$$\begin{aligned} \lim _{n\rightarrow \infty }{\mathbb E}_{n}[g(\eta _n)]={\mathbb E}_{\eta }[g(\eta )]. \end{aligned}$$
(4.15)

4.4 Fisher Information Matrix

The Fisher information matrix represents the smoothness of the log-likelihood \(\log p(X|\theta )\) at each \(\theta \in \Theta \), and is an important measure for analyzing the relationship between the true distribution and the statistical model.

In this book, we assume the following conditions.

Assumption 2

  1. The order of integration in \(\mathcal X\) and differentiation with respect to \(\theta \in \Theta \) in \(p(\cdot |\theta )\) can be exchanged.

  2. For each \((x,\theta )\in \mathcal{X}\times \Theta \), the partial derivatives \(\displaystyle \frac{\partial ^2 \log p(x|\theta )}{\partial \theta _i\partial \theta _j}\) exist, for \(i,j=1,\ldots ,d\).

The Fisher information matrix \(I(\theta )\) is defined as the covariance matrix of

$$\nabla \log p(X|\theta ) = \left[ \frac{\partial \log p(X|\theta )}{\partial \theta _1}, \ldots , \frac{\partial \log p(X|\theta )}{\partial \theta _d}\right] $$
$$\begin{aligned} I(\theta ):= & {} {\mathbb V}[\nabla \log p(X|\theta )]\nonumber \\ = & {} {\mathbb E}_X[\{\nabla \log p(X|\theta )-{\mathbb E}_{X'}[\nabla \log p(X'|\theta )]\}\nonumber \\ {} & {} \cdot \{\nabla \log p(X|\theta )-{\mathbb E}_{X''}[\nabla \log p(X''|\theta )]\}^\top ]\nonumber \\ = & {} {\mathbb E}_X[\nabla \log p(X|\theta )(\nabla \log p(X|\theta ))^\top ]-\nabla {\mathbb E}_X[\log p(X|\theta )]\nabla {\mathbb E}_X[\log p(X|\theta )]^\top \nonumber \\ \end{aligned}$$
(4.16)

and we denote \(I:=I(\theta _*)\in {\mathbb R}^{d\times d}\) for \(\theta _*\in \Theta _*\). Also, we define the matrix \(J:=J(\theta _*)\in {\mathbb R}^{d\times d}\) using

$$\begin{aligned} J(\theta ):={\mathbb E}_X[-\nabla ^2 \log p(X|\theta )]. \end{aligned}$$
(4.17)

Assuming regularity, there exists a unique \(\theta =\theta _*\) that minimizes \(D(q|| p(\cdot |\theta ))\), that is, minimizes \({\mathbb E}_X[-\log p(X|\theta )]\); there exists an open set containing \(\theta _*\) that is included in \(\Theta \), and \(-\nabla ^2 {\mathbb E}_X[\log p(X|\theta _*)]\) is positive definite. Since \(\theta _*\) is an interior minimizer, \(\nabla {\mathbb E}_X[\log p(X|\theta )]\) is 0 at \(\theta =\theta _*\). We write this as

$$\begin{aligned} \nabla {\mathbb E}_X[ \log p(X|\theta _*)]=0. \end{aligned}$$
(4.18)

Therefore, if it is regular, the following holds from (4.16).

$$\begin{aligned} I(\theta _*)= {\mathbb E}_X[\nabla \log p(X|\theta _*)(\nabla \log p(X|\theta _*))^\top ]. \end{aligned}$$
(4.19)

Example 42

Assume that the mean and variance of the true distribution q are \(\mu _{**}\) and \(\sigma ^2_{**}\), respectively (q is not necessarily a normal distribution). For the probability density function (normal distribution) with parameter \(\theta =(\mu ,\sigma ^2)\)

$$\begin{aligned} p(x|\theta )=\frac{1}{\sqrt{2\pi \sigma ^2}}\exp \left\{ -\frac{(x-\mu )^2}{2\sigma ^2}\right\} , \end{aligned}$$

we shall calculate the matrices I and J. From

$$\begin{aligned} \log p(x|\theta )=-\frac{1}{2}\log 2\pi \sigma ^2-\frac{(x-\mu )^2}{2\sigma ^2} \end{aligned}$$
$$\begin{aligned} \nabla [\log p(x|\theta )]= \left[ \begin{array}{c} \displaystyle \frac{x-\mu }{\sigma ^2}\\ \displaystyle -\frac{1}{2\sigma ^2}+\frac{(x-\mu )^2}{2(\sigma ^2)^2} \end{array} \right] = \left[ \begin{array}{c} \displaystyle \frac{x-\mu }{\sigma ^2}\\ \displaystyle \frac{(x-\mu )^2-\sigma ^2}{2(\sigma ^2)^2} \end{array} \right] \end{aligned}$$
(4.20)
$$\begin{aligned} \nabla ^2[\log p(x|\theta )]= \left[ \begin{array}{ll} \displaystyle -\frac{1}{\sigma ^{2}}&{} -\displaystyle \frac{x-\mu }{(\sigma ^2)^2}\\ -\displaystyle \frac{x-\mu }{(\sigma ^2)^2}&{} \displaystyle \frac{1}{2(\sigma ^2)^2}-\frac{(x-\mu )^2}{(\sigma ^2)^3}\\ \end{array} \right] \end{aligned}$$
(4.21)
$$\begin{aligned} {\mathbb E}_X[(X-\mu )^2]={\mathbb E}_X[(X-\mu _{**}+\mu _{**}-\mu )^2]=\sigma _{**}^2+(\mu _{**}-\mu )^2\ , \end{aligned}$$
(4.22)

we obtain

$$\begin{aligned} {\mathbb E}_X[\nabla \log p(X|\theta )]= \left[ \begin{array}{c} \displaystyle \frac{\mu _{**}-\mu }{\sigma ^2}\\ \displaystyle -\frac{1}{2\sigma ^2}+\frac{\sigma ^2_{**}+(\mu _{**}-\mu )^2}{2(\sigma ^2)^2} \end{array} \right] \end{aligned}$$
(4.23)
$$\begin{aligned} {} & {} {\mathbb V}[\nabla \log p(X|\theta )]\nonumber \\ = & {} {\mathbb E}_X[\left\{ \nabla \log p(X|\theta ) -{\mathbb E}_X[\nabla \log p(X|\theta )]\right\} \left\{ \nabla \log p(X|\theta )-{\mathbb E}_X[\nabla \log p(X|\theta )]\right\} ^\top ] \nonumber \\ = & {} {\mathbb E}_X \{\left[ \begin{array}{c} \displaystyle \frac{X-\mu _{**}}{\sigma ^2}\\ \displaystyle \frac{(X-\mu _{**})^2+2(\mu _{**}-\mu )(X-\mu _{**})-\sigma ^2_{**}}{2(\sigma ^2)^2} \end{array} \right] \nonumber \\ {} {} & {} \cdot \left[ \begin{array}{cc} \displaystyle \frac{X-\mu _{**}}{\sigma ^2}& \displaystyle \frac{(X-\mu _{**})^2+2(\mu _{**}-\mu )(X-\mu _{**})-\sigma ^2_{**}}{2(\sigma ^2)^2} \end{array} \right] \} \ . \end{aligned}$$
(4.24)

Let \(A:={\mathbb E}_X[(X-\mu _{**})^3]\) and \(B:={\mathbb E}_X[(X-\mu _{**})^4]\), then the (1,1), (1,2), and (2,2) elements of (4.24) are respectively

$$\begin{aligned} {\mathbb E}_X[\left( \frac{X-\mu _{**}}{\sigma ^2}\right) ^2]=\frac{\sigma _{**}^2}{(\sigma ^2)^2}, \end{aligned}$$
$$\begin{aligned} {} & {} {\mathbb E}_X[\frac{X-\mu _{**}}{\sigma ^2}\cdot \frac{(X-\mu _{**})^2+2(\mu _{**}-\mu )(X-\mu _{**})-\sigma _{**}^2}{2(\sigma ^2)^2}]= \frac{A+2(\mu _{**}-\mu )\sigma ^2_{**}}{2(\sigma ^2)^3}, \end{aligned}$$

and

$$\begin{aligned} {} & {} {\mathbb E}_X[\left\{ \frac{(X-\mu _{**})^2+2(\mu _{**}-\mu )(X-\mu _{**})-\sigma _{**}^2}{2(\sigma ^2)^2}\right\} ^2]\\ = & {} \frac{1}{4(\sigma ^2)^4}\left\{ {\mathbb E}_X[(X-\mu _{**})^4]+4(\mu _{**}-\mu ){\mathbb E}_X[(X-\mu _{**})^3]\right. \\ {} & {} \left. +4(\mu _{**}-\mu )^2{\mathbb E}_X[(X-\mu _{**})^2]+(\sigma ^2_{**})^2-2\sigma ^2_{**}{\mathbb E}_X[(X-\mu _{**})^2]\right\} \\ = & {} \frac{B-(\sigma ^2_{**})^2+4(\mu _{**}-\mu )A+4(\mu _{**}-\mu )^2\sigma _{**}^2}{4(\sigma ^2)^4}. \end{aligned}$$

Furthermore, by substituting \(\theta =\theta _*=(\mu _*,\sigma _*^2)\), (4.24) becomes as follows.

$$\begin{aligned} I= \left[ \begin{array}{ll} \displaystyle \frac{\sigma ^2_{**}}{(\sigma _*^2)^2}&{} \displaystyle \frac{A+2(\mu _{**}-\mu _*)\sigma ^2_{**}}{2(\sigma _*^2)^3} \\ \displaystyle \frac{A+2(\mu _{**}-\mu _*)\sigma ^2_{**}}{2(\sigma ^2_*)^3}&{} \displaystyle \frac{B-(\sigma ^2_{**})^2+4(\mu _{**}-\mu _*)A+4(\mu _{**}-\mu _*)^2\sigma _{**}^2}{4(\sigma _*^2)^4} \end{array} \right] . \end{aligned}$$
(4.25)

On the other hand, from (4.21), we obtain

$$\begin{aligned} J(\theta )= \left[ \begin{array}{ll} \displaystyle \frac{1}{\sigma ^{2}}&{} \displaystyle \frac{\mu _{**}-\mu }{(\sigma ^2)^2}\\ \displaystyle \frac{\mu _{**}-\mu }{(\sigma ^2)^2}&{} \displaystyle -\frac{1}{2(\sigma ^2)^2}+\frac{\sigma ^2_{**}+(\mu _{**}-\mu )^2}{(\sigma ^2)^3}\\ \end{array} \right] \end{aligned}$$

and

$$\begin{aligned} J= \left[ \begin{array}{ll} \displaystyle \frac{1}{\sigma _*^{2}}&{} \displaystyle \frac{\mu _{**}-\mu _*}{(\sigma _*^2)^2}\\ \displaystyle \frac{\mu _{**}-\mu _*}{(\sigma _*^2)^2}&{} \displaystyle -\frac{1}{2(\sigma ^2_*)^2}+\frac{\sigma ^2_{**}+(\mu _{**}-\mu _*)^2}{(\sigma _*^2)^3}\\ \end{array} \right] . \end{aligned}$$
(4.26)

Moreover, if it is regular, (4.23) becomes 0 at \(\theta =\theta _*\) by (4.18), so \((\mu _*,\sigma ^2_*)=(\mu _{**},\sigma ^2_{**})\). Therefore, we obtain

$$\begin{aligned} I= \left[ \begin{array}{ll} \displaystyle (\sigma _{*}^{2})^{-1}&{}\displaystyle \frac{A}{2(\sigma _*^2)^3}\\ \displaystyle \frac{A}{2(\sigma _*^2)^3}&{} \displaystyle \frac{B-(\sigma ^2_{*})^2}{4(\sigma _*^2)^4}\\ \end{array} \right] \end{aligned}$$
(4.27)

and

$$\begin{aligned} J= \left[ \begin{array}{ll} (\sigma _{*}^{2})^{-1}&{}0\\ 0&{}(\sigma _*^2)^{-2}/2\\ \end{array} \right] . \end{aligned}$$
(4.28)

Furthermore, if it is realizable, the true distribution q is also normal, and since \(A=0\) and \(B=3(\sigma ^2_{**})^2\) (as per Example 2), (4.27) coincides with (4.28). \(\blacksquare \)
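As a numerical companion to Example 42 (a Python sketch assuming numpy; the values of \(\mu \) and \(\sigma ^2\) are arbitrary), a Monte Carlo estimate of I in the realizable case approximately reproduces J in (4.28).

```python
import numpy as np

# Realizable case: the true q is N(mu, sigma^2) and theta_* = (mu, sigma^2), so I = J.
rng = np.random.default_rng(0)
mu, s2 = 1.0, 2.0
X = rng.normal(mu, np.sqrt(s2), size=2_000_000)

# Score vector: gradient of log p(x | mu, sigma^2), as in (4.20)
score = np.stack([(X - mu) / s2,
                  -1 / (2 * s2) + (X - mu) ** 2 / (2 * s2 ** 2)])
I_hat = np.cov(score)                 # I = V[grad log p(X | theta_*)]

J = np.array([[1 / s2, 0.0],          # (4.28)
              [0.0, 1 / (2 * s2 ** 2)]])
print(I_hat)
print(J)                              # the two matrices approximately agree
```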

Proposition 9

When the true distribution q is realizable for the statistical model \(\{p(\cdot |\theta )\}_{\theta \in \Theta }\) and is regular, \(I=J\) holds.

Proof

Since it is realizable, \(q=p(\cdot |\theta _*)\), and we can write

$$\begin{aligned} J= & {} {\mathbb E}_X[-\nabla ^2\log p(X|\theta _*)]={\mathbb E}_X\left[ -\nabla (\frac{\nabla p(X|\theta _*)}{p(X|\theta _*)})\right] \\ = & {} {\mathbb E}_X\left[ -\frac{\nabla ^2 p(X|\theta _*)}{p(X|\theta _*)}\right] + {\mathbb E}_X\left[ \frac{\nabla p(X|\theta _*)(\nabla p(X|\theta _*))^\top }{p(X|\theta _*)^2}\right] \\ = & {} -{\mathbb E}_X\left[ \frac{\nabla ^2 p(X|\theta _*)}{q(X)}\right] + {\mathbb E}_X\left[ \nabla \log p(X|\theta _*)(\nabla \log p(X|\theta _*))^\top \right] . \end{aligned}$$

Furthermore, from the first condition of Assumption 2, we have

$$\begin{aligned} {\mathbb E}_X[\frac{\nabla ^2 p(X|\theta _*)}{q(X)}]=\int _\mathcal{X}\nabla ^2 p(x|\theta _*)dx=\nabla ^2 \int _\mathcal{X}p(x|\theta _*)dx= \nabla ^2 1=0, \end{aligned}$$

and from the equation where we substitute \(\theta =\theta _*\) into (4.16), we can write

$$\begin{aligned} J=0+I(\theta _*)+\left( \nabla {\mathbb E}_X[-\log p(X|\theta _*)]\right) \left( \nabla {\mathbb E}_X[-\log p(X|\theta _*)]\right) ^\top . \end{aligned}$$

Furthermore, since it is regular, we can apply (4.18), and the proposition follows. \(\blacksquare \)

Example 43

In Example 42, if we change the parameter from \(\sigma ^2>0\) to \(\sigma \not =0\), the distributions at \(\theta =(\mu ,\sigma )\) and \(\theta =(\mu ,-\sigma )\) are identical (the parameter is not identifiable), but

$$\begin{aligned} \nabla [\log p(x|\theta )]=[\frac{x-\mu }{\sigma ^2},-\frac{1}{\sigma }+\frac{(x-\mu )^2}{\sigma ^3}]^\top , \end{aligned}$$
$$\begin{aligned} \nabla ^2[\log p(x|\theta )]= \left[ \begin{array}{ll} \displaystyle -\frac{1}{\sigma ^2}&{}\displaystyle -\frac{2(x-\mu )}{\sigma ^3}\\ \displaystyle \displaystyle -\frac{2(x-\mu )}{\sigma ^3}&{} \displaystyle \frac{1}{\sigma ^2}-\frac{3(x-\mu )^2}{\sigma ^4}\\ \end{array} \right] , \end{aligned}$$

and

$$\begin{aligned} J(\theta )= \left[ \begin{array}{ll} \displaystyle \frac{1}{\sigma ^2}&{}\displaystyle \frac{2(\mu _{**}-\mu )}{\sigma ^3}\\ \displaystyle \displaystyle \frac{2(\mu _{**}-\mu )}{\sigma ^3}&{} \displaystyle -\frac{1}{\sigma ^2}+3\frac{(\mu _{**}-\mu )^2+\sigma _{**}^2}{\sigma ^4}\\ \end{array} \right] \end{aligned}$$

follow, and the values of \(J(\theta )\) at these two parameter values do not coincide. When \(\mu =\mu _{**}\), they coincide at \(\pm \sigma \). \(\blacksquare \)

When limited to the exponential family, using the notation of Sect. 2.4, given that \(p(x|\theta )=u(x)\exp \{v(\theta )^\top w(x)\}\) and \(\nabla \log p(x|\theta )=\nabla \{v(\theta )^\top w(x)\}\), the Fisher information matrix can be written as

$$\begin{aligned} I(\theta )={\mathbb V}[\nabla \{v(\theta )^\top w(X)\}] \end{aligned}$$

and

$$\begin{aligned} J(\theta )=-{\mathbb E}_X[\nabla ^2 \{v(\theta )^\top w(X)\}]. \end{aligned}$$

Example 44

In Example 42, with \(J=4\) (the dimension of \(v(\theta )\) and w(x) in the notation of Sect. 2.4), we can write \(\displaystyle u(x)=\frac{1}{\sqrt{2\pi }}\), \(\displaystyle v(\theta )=[\frac{1}{\sigma ^2},\frac{\mu }{\sigma ^2},\frac{\mu ^2}{\sigma ^2},\log \sigma ^2]^\top \), and \(\displaystyle w(x)=[-\frac{x^2}{2},{x}, -\frac{1}{2},-\frac{1}{2}]^\top \) (see Example 16). Hence,

$$\begin{aligned} \nabla [v(\theta )^\top w(x)] =\nabla [-\frac{1}{2}\log \sigma ^2-\frac{(x-\mu )^2}{2\sigma ^2}]=[\frac{x-\mu }{\sigma ^2},-\frac{1}{2\sigma ^2}+\frac{(x-\mu )^2}{2(\sigma ^2)^2}]^\top \end{aligned}$$
$$\begin{aligned} \nabla ^2[v(\theta )^\top w(x)]= \left[ \begin{array}{cc} \displaystyle -\frac{1}{\sigma ^2}&{}\displaystyle -\frac{x-\mu }{(\sigma ^2)^2}\\ \displaystyle -\frac{x-\mu }{(\sigma ^2)^2}&{}\displaystyle \frac{1}{2(\sigma ^2)^2}-\frac{(x-\mu )^2}{(\sigma ^2)^3} \end{array} \right] \end{aligned}$$

can be obtained. We can calculate \({\mathbb E}_X[\cdot ]\) and \({\mathbb V}_{X}[\cdot ]\) using these and the true model. \(\blacksquare \)