Iterative minimization algorithm on a mixture family

Hayashi, Masahito

doi:10.1007/s41884-024-00140-5

Iterative minimization algorithm on a mixture family

Research Paper
Published: 13 August 2024

(2024)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Information Geometry Aims and scope Submit manuscript

Iterative minimization algorithm on a mixture family

Download PDF

Masahito Hayashi ORCID: orcid.org/0000-0003-3104-1000^1,2,3

54 Accesses
5 Altmetric
Explore all metrics

Abstract

Iterative minimization algorithms appear in various areas including machine learning, neural networks, and information theory. The em algorithm is one of the famous iterative minimization algorithms in the area of machine learning, and the Arimoto–Blahut algorithm is a typical iterative algorithm in the area of information theory. However, these two topics had been separately studied for a long time. In this paper, we generalize an algorithm that was recently proposed in the context of the Arimoto–Blahut algorithm. Then, we show various convergence theorems, one of which covers the case when each iterative step is done approximately. Also, we apply this algorithm to the target problem of the em algorithm, and propose its improvement. In addition, we apply it to other various problems in information theory.

Alternating Least Squares in Generalized Linear Models

Article 01 September 2019

A Gaussian Mixture Clustering Approach Based on Extremal Optimization

Clustering via finite nonparametric ICA mixture models

Article 28 August 2018

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Optimization over distributions is an important topic in various areas. For example, the minimum divergence between a mixture family and an exponential family has been studied by using the em algorithm in the areas of machine learning and neural networks [1,2,3,4]. The em algorithm is an iterative algorithm to calculate the above minimization and it is rooted in the study of Boltzmann machines [5]. In particular, the paper [3] formulated the em algorithm under the framework with Bregman divergence [6, 7]. The topic of the em algorithm has been mainly studied in the community of machine learning, neural networks, and information geometry. As another iterative algorithm, the Arimoto–Blahut algorithm is known as an algorithm to maximize the mutual information by changing the distribution on the input system [8, 9]. This maximization is needed to calculate the channel capacity [10]. This algorithm has been generalized to various settings including the rate distortion theory [9, 11,12,13], the capacity of a wire-tap channel [14], and their quantum extensions [15,16,17,18,19]. In particular, the two papers [13, 19] made very useful generalizations to cover various topics in information theory. The Arimoto–Blahut algorithm and its variants have been mainly studied in the community of information theory.

However, only a limited number of studies have discussed the relation between the two topics, the em algorithm and the Arimoto–Blahut algorithm, as follows. The papers [23, 24] pointed out that the Arimoto–Blahut algorithm can be considered as an alternating algorithm in a similar way to the EM and the em algorithms. Recently, the paper [20] pointed out that the maximization of the mutual information can be considered to be the maximization of the projected divergence to an exponential family by changing an element of a mixture family. The paper [21] generalized this maximization to the framework with Bregman divergence [6, 7] and applied this setting to various problems in information theory. Also, the recent paper [22] applied the em algorithm to the rate-distortion theory, which is a key topic in information theory.

In this paper, we focus on a generalized problem setting proposed in [19], which is given as an optimization over the set of input quantum states. As the difference from the former algorithm, their algorithm [19] has an acceleration parameter. Changing this parameter, we can enhance the convergence speed under a certain condition. To obtain wider applicability, we extend their problem setting to the minimization over a general mixture family. Although they discussed the convergence speed only when there is no local minimizer, our analysis covers the convergence speed to a local minimizer even when there exist several local minimizers. Further, since our setting covers a general mixture family as the set of input variables, our method can be applied to the minimum divergence between a mixture family and an exponential family, which is the objective problem in the em algorithm. That is, this paper presents a general algorithm including the em algorithm as well as the Arimoto–Blahut algorithm. This type of relation between the em algorithm and the Arimoto–Blahut algorithm is different from the relation pointed by the papers [23, 24].

There is a possibility that each iteration can be calculated only approximately. To cover such an approximated case, we evaluate the error of our algorithm with approximated iterations. Since the em algorithm has local minimizers in general, it is essential to cover the convergence to a local minimizer. Since our algorithm has the acceleration parameter, our application to the minimum divergence gives a generalization of the em algorithm. Also, our algorithm can be applied to the maximization of the projected divergence to an exponential family by changing an element of a mixture family.

In addition, our algorithm has various applications that were not discussed in the preceding study [19]. In channel coding, the decoding block error probability goes to zero exponentially under the proper random coding when the transmission rate is smaller than the capacity [25]. Also, the probability of correct decoding goes to zero exponentially when the transmission rate is greater than the capacity [26]. These exponential rates are written with the optimization of the so-called Gallager function. Recently, the paper [27] showed that the Gallager function can be written as the minimization of the Rényi divergence. Using this fact, we apply our method to these optimizations. Further, we apply our algorithm to the capacity of a wiretap channel. In addition, since our problem setting allows a general mixture family as the range of input, we apply the channel capacity with cost constraint. Also, we point out that the calculation of the commitment capacity is given as the minimization of the divergence between a mixture family and an exponential family. Hence, we discuss this application as well.

The remaining part of this paper is organized as follows. Section 2 formulates our minimization problem for a general mixture family. Then, we proposed several algorithms to solve the minimization problem. We derive various convergence theorems including the case with approximated iterations. The remaining sections apply our algorithm to various examples. In these sections, examples of objective functions are discussed. Section 3 applies our algorithm to various information theoretical problems. Then, Sect. 4 demonstrates the application to the minimum divergence between a mixture family and an exponential family. Section 5 shows how to apply our algorithm to the commitment capacity. Section 6 discusses the application of our algorithm to the maximization of the projected divergence to an exponential family by changing an element of a mixture family. Section 7 considers the application to information bottleneck, which is a powerful method for machine learning. Appendices are devoted to the proofs of the theorems presented in Sect. 2.

2 General setting

2.1 Algorithm with exact iteration

We consider a finite sample space $\mathcal{X}$ and focus on the set $\mathcal{P}(\mathcal{X})$ of distributions whose support is $\mathcal{X}$. Using k linearly independent functions $f_1, \ldots , f_k$ on $\mathcal{X}$ and constants $a=(a_1, \ldots , a_k)$, we define the mixture family $\mathcal{M}_a$ as follows

$$\begin{aligned} \mathcal{M}_a:= \{P \in \mathcal{P}(\mathcal{X})| P[f_i]=a_i \hbox { for } i=1, \ldots , k \}, \end{aligned}$$

(1)

where $P[f]:= \sum _{x \in \mathcal{X}} P(x)f(x)$. We add additional $l-k$ linearly independent functions $f_{k+1}, \ldots f_l$ and $|\mathcal{X}|=l+1$ such that the l functions $f_{1}, \ldots , f_l$ are linearly independent. Then, the distribution P can be parameterized by the mixture parameter $\eta =(\eta _1, \ldots , \eta _l)$ as $ \eta _i= P[f_i]$. That is, the above distribution is denoted by $P_\eta $. Then, we denote the e-projection of P to $\mathcal{M}_a$ by $\varGamma ^{(e)}_{\mathcal{M}_a}[P]$. That is, $\varGamma ^{(e)}_{\mathcal{M}_a}[P]$ is defined as follows [1, 2].

$$\begin{aligned} \varGamma ^{(e)}_{\mathcal{M}_a}[P]:= \mathop {\textrm{argmin}}\limits _{ Q \in \mathcal{M}_a} D(Q\Vert P), \end{aligned}$$

(2)

where the Kullback–Leibler divergence $D(Q\Vert P)$ is defined as

$$\begin{aligned} D(Q\Vert P):= \sum _{x \in \mathcal{X}}Q(x) (\log Q(x)- \log P(x)). \end{aligned}$$

(3)

Using the e-projection, we have the following equation for an element of $Q \in \mathcal{M}_a$, which is often called Pythagorean theorem.

$$\begin{aligned} D(Q\Vert P)= D(Q\Vert \varGamma ^{(e)}_{\mathcal{M}_a}[P]) + D(\varGamma ^{(e)}_{\mathcal{M}_a}[P]\Vert P). \end{aligned}$$

(4)

Given a continuous function $\varPsi $ from $\mathcal{M}_a$ to the set of functions on $\mathcal{X}$, we consider the minimization $\min _{P \in \mathcal{M}_a} \mathcal{G}(P)$;

$$\begin{aligned} \mathcal{G}(P):= \sum _{x \in \mathcal{X}} P(x) \varPsi [P](x). \end{aligned}$$

(5)

This paper aims to find

$$\begin{aligned} \overline{\mathcal{G}}(a):=\min _{P \in \mathcal{M}_a} \mathcal{G}(P), \quad P_{*,a}:=\mathop {\textrm{argmin}}\limits _{P \in \mathcal{M}_a} \mathcal{G}(P). \end{aligned}$$

(6)

For this aim, we propose an iterative algorithm based on a positive real number $\gamma >0$. Since the above formulation (5) is very general, we can choose the function $\varPsi $ dependently on our objective function. That is, different choices of $\varPsi $ lead to different objective functions.

For a distribution $Q \in \mathcal{P}(\mathcal{X})$, we define the distribution $\varPhi [Q] $ as

$$\begin{aligned} \varPhi [Q](x):= \frac{1}{\kappa [Q]}Q(x)\exp ( -\frac{1}{\gamma } \varPsi [Q](x)), \end{aligned}$$

(7)

where $\kappa [Q]$ is the normalization factor $\sum _{x \in \mathcal{X}} Q(x)\exp ( -\frac{1}{\gamma } \varPsi [Q](x))$. Then, depending on $\gamma >0$, we propose Algorithm 1. When the calculation of $ \varPsi [P]$ and the e-projection is feasible, Algorithm 1 is feasible.

Indeed, Algorithm 1 is characterized as the iterative minimization of the following two-variable function, i.e., the extended objective function;

$$\begin{aligned} J_\gamma (P,Q):=\gamma D(P\Vert Q)+\sum _{x \in \mathcal{X}} P(x) \varPsi [Q](x). \end{aligned}$$

(8)

To see this fact, we define

$$\begin{aligned} \mathcal{F}_1[P] := \mathop {\textrm{argmin}}\limits _{Q \in \mathcal{M}_a} J _\gamma (P,Q) ,\quad \mathcal{F}_2[Q] := \mathop {\textrm{argmin}}\limits _{P \in \mathcal{M}_a} J _\gamma (P,Q) . \end{aligned}$$

(9)

Then, $\mathcal{F}_2[Q]$ is calculated as follows.

Lemma 1

Under the above definitions, for any positive value $\gamma >0$, we have $\mathcal{F}_2[Q] =\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]] $, i.e.,

$$\begin{aligned} \min _{P \in \mathcal{M}_a} J_\gamma (P,Q)&= J _\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]],Q) \nonumber \\&= \gamma D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]\Vert \varPhi [Q]) - \gamma \log \kappa [Q] , \end{aligned}$$

(10)

$$\begin{aligned} J _\gamma (P,Q)&=\min _{P' \in \mathcal{M}_a} J _\gamma (P',Q) +\gamma D(P\Vert \varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]) \end{aligned}$$

(11)

$$\begin{aligned}&=J _\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]],Q) +\gamma D(P\Vert \varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]). \end{aligned}$$

(12)

Proof

We have the following relations.

$$\begin{aligned} J _\gamma (P,Q)&=\gamma \sum _{x \in \mathcal{X}} P(x) (\log P(x)- \log Q(x) + \frac{1}{\gamma } \varPsi [Q](x)) \nonumber \\&=\gamma \sum _{x \in \mathcal{X}} P(x) (\log P(x)- \log \varPhi [Q](x)- \log \kappa [Q]) \nonumber \\&=\gamma D(P\Vert \varPhi [Q])-\gamma \log \kappa [Q] \nonumber \\&=\gamma D(P\Vert \varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]) +\gamma D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]\Vert \varPhi [Q]) - \gamma \log \kappa [Q] , \qquad \end{aligned}$$

(13)

where the final equation follows from (4). Then, the minimum is given as (10), and it is realized with $\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]$.

Applying (10) to the final line of (13), we obtain (11). Since the minimum in (11) is realized when $P'=\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]$, we obtain (12). $\square $

We calculate $\mathcal{F}_1[Q]$. For this aim, we define

$$\begin{aligned} D_{\varPsi }(P \Vert Q):= \sum _{x \in \mathcal{X}} P(x) (\varPsi [P](x)- \varPsi [ Q ](x)). \end{aligned}$$

(14)

Lemma 2

Assume that two distributions $P,Q \in \mathcal{M}_a$ satisfy the following condition,

$$\begin{aligned} D_{\varPsi }(P \Vert Q) \le \gamma D(P\Vert Q). \end{aligned}$$

(15)

Then, we have $\mathcal{F}_1[P] =P$, i.e.,

$$\begin{aligned} J _\gamma (P,Q)\ge J _\gamma (P,P) . \end{aligned}$$

(16)

Proof

Eq. (15) guarantees that

$$\begin{aligned}&J _\gamma (P,Q)-J _\gamma (P,P) \nonumber \\&\quad = \gamma D(P\Vert Q)-\sum _{x \in \mathcal{X}} P(x) (\varPsi [P](x)- \varPsi [Q](x)) \ge 0 . \end{aligned}$$

(17)

$\square $

Remark 1

The preceding study [19] discussed the minimization of a function defined over the set of density matrices, i.e., the set of quantum states. When the function is given as a function only of the diagonal part of the density matrix, the function is given as a function of probability distribution composed of the diagonal part. That is, the preceding study [19] covers the case when the function is optimized over a set of probability distributions as a special case. The obtained result of this paper covers the case when the function is optimized over a mixture family. That is, the preceding study [19] does not consider the case with linear constraints. In this sense, the obtained result of this paper generalizes the above special case of the result of [19], and Algorithm 1 is a generalization of the algorithm given in [19].

Lemma 3.2 [19] is composed of several statements. The combination of Lemmas 1 and 2 is a generalization of the above special case of [19, Lemma 3.2]. That is, the classical restriction of [19, Lemma 3.2] is equivalent to the combination of Lemmas 1 and 2 without linear constraints.

Due to Lemmas 1 and 2, when all pairs $(P^{(t+1)},P^{(t)})$ satisfy (15), the relations

$$\begin{aligned} \mathcal {G}(P^{(t)})&=J _\gamma (P^{(t)},P^{(t)})\ge J _\gamma (P^{(t+1)},P^{(t)}) \nonumber \\ {}&\ge J _\gamma (P^{(t+1)},P^{(t+1)})= \mathcal {G}(P^{(t+1)}) \end{aligned}$$

(18)

hold under Algorithm 1. In addition, we have the following theorem.

Theorem 1

When all pairs $(P^{(t+1)},P^{(t)})$ satisfy (15), i.e., the positive number $\gamma $ is sufficiently large, Algorithm 1 converges to a local minimum.

Proof

Since $\{\mathcal{G}(P^{(t)}) \}$ is monotonically decreasing for t, we have

$$\begin{aligned} \lim _{n \rightarrow \infty } \mathcal{G}(P^{(t)})-\mathcal{G}(P^{(t+1)}) =0. \end{aligned}$$

(19)

Using (12), we have

$$\begin{aligned} \mathcal{G}(P^{(t)})&=J_{\gamma }(P^{(t)},P^{(t)}) \nonumber \\&= \gamma D(P^{(t)}\Vert P^{(t+1)}) + J_{\gamma }( P^{(t+1)}, P^{(t)}) \nonumber \\&\ge \gamma D(P^{(t)}\Vert P^{(t+1)}) + \mathcal{G}( P^{(t+1)}) . \end{aligned}$$

(20)

Thus, we have

$$\begin{aligned} \gamma D(P^{(t)}\Vert P^{(t+1)}) \le \mathcal{G}(P^{(t)})-\mathcal{G}(P^{(t+1)}) . \end{aligned}$$

(21)

Since due to (19) and (21), the sequence $\{\mathcal{G}(P^{(t)})\}$ is a Cauchy sequence, it converges. $\quad \square $

To discuss the details of Algorithm 1, we focus on the $\delta $-neighborhood $U(P^{0},\delta )$ of $P^{0}$ defined as

$$\begin{aligned} U(P^{0},\delta ):=\{ P \in \mathcal{M}_a | D(P^{0} \Vert P)\le \delta \}. \end{aligned}$$

(22)

In particular, we denote $\mathcal{M}_a$ by $U(P^{0},\infty )$. Then, we address the following conditions for the $\delta $-neighborhood $U(P^{0},\delta )$ of $P^{0}$;

(A0)
Any distribution $Q \in U(P^{0},\delta )$ satisfies the inequality
$$\begin{aligned} \mathcal{G}(\mathcal{F}_2[Q]) \ge \mathcal{G}(P^{0}). \end{aligned}$$
(23)
(A1)
Any distribution $Q \in U(P^{0},\delta )$ satisfies
$$\begin{aligned} D_{\varPsi }(\mathcal{F}_2[Q] \Vert Q) \le \gamma D(\mathcal{F}_2[Q] \Vert Q) . \end{aligned}$$
(24)
(A2)
Any distribution $Q \in U(P^{0},\delta )$ satisfies
$$\begin{aligned} D_{\varPsi }(P^{0} \Vert Q) \ge 0. \end{aligned}$$
(25)
(A3)
There exists a positive number $\beta >0$ such that any distribution $Q \in U(P^{0},\delta )$ satisfies
$$\begin{aligned} D_{\varPsi }(P^{0} \Vert Q) =\sum _{x \in \mathcal{X}} P^{0}(x) (\varPsi [P^{0}](x)- \varPsi [Q](x)) \ge \beta D(P^{0} \Vert Q). \end{aligned}$$
(26)

The condition (A3) is a stronger version of (A2).

However, the convergence to the global minimum is not guaranteed. As a generalization of [19, Theorem 3.3], the following theorem discusses the convergence to the global minimum and the convergence speed.

Theorem 2

Assume that the $\delta $-neighborhood $U(P^{0},\delta )$ of $P^{0}$ satisfies the conditions (A1) and (A2) with $\gamma $, and $P^{(1)} \in U(P^{0},\delta )$. Then, Algorithm 1 with $t_0$ iterations has one of the following two behaviors.

(i)
There exists an integer $t_1 \le t_0+1$ such that
$$\begin{aligned} \mathcal{G}(P^{(t_1)}) < \mathcal{G}(P^{0}). \end{aligned}$$
(27)
(ii)
Algorithm 1 satisfies the conditions $\{P^{(t)}\}_{t=1}^{t_0+1} \subset U(P^{0},\delta )$ and
$$\begin{aligned} \mathcal{G}(P^{(t_0+1)}) -\mathcal{G}(P^{0}) \le \frac{\gamma D(P^{0}\Vert P^{(1)}) }{t_0}. \end{aligned}$$
(28)

When the condition (A0) holds additionally, Algorithm 1 with $t_0$ iterations satisfies (ii).

The above theorem is shown in Appendix 10. Now, we choose an element $P^* \in \mathcal{M}_a$ to satisfy $\mathcal{G}(P^*)=\min _{P \in \mathcal{M}_a} \mathcal{G}(P)$. Then, the condition (A0) holds with $U(P^*,\infty )=\mathcal{M}_a$ and the choice $P^0=P^*$. When the conditions (A1) and (A2) hold with $U(P^*,\infty )=\mathcal{M}_a$ and the choice $P^0=P^*$, Theorem 2 guarantees the convergence to the minimizer $P^*$ in Algorithm 1. Although Theorem 2 requires the conditions (A1) and (A2), the condition (A2) is essential due to the following reason. When we choose $\gamma >0$ to be sufficiently large, the condition (A1) holds with the $\delta $-neighborhood $U(P^*,\delta )$ of $P^*$ because $U(P^*,\delta )$ is a compact set. Hence, it is essential to check the condition (A2) for Theorem 2.

However, as seen in (28), a larger $\gamma $ makes the convergence speed slower. Therefore, it is important to choose $\gamma $ to be small under the condition (A1). Practically, it is better to change $\gamma $ to be smaller when the point $P^{(t)}$ is closer to the minimizer $P^*$. In fact, as a generalization of [19, Proposition 3.6], we have the following exponential convergence under a stronger condition dependent on $\gamma $. In this sense, the parameter is called an acceleration parameter [19, Remark 3.4].

Theorem 3

Assume that the $\delta $-neighborhood $U(P^{0},\delta )$ of $P^{0}$ satisfies the conditions (A1) and (A3) with $\gamma $, and $P^{(1)} \in U(P^{0},\delta )$. Then, Algorithm 1 with $t_0$ iterations has one of the following two behaviors.

(i)
There exists an integer $t_1 \le t_0+1$ such that
$$\begin{aligned} \mathcal{G}(P^{(t_1)}) < \mathcal{G}(P^{0}). \end{aligned}$$
(29)
(ii)
Algorithm 1 satisfies the conditions $\{P^{(t)}\}_{t=1}^{t_0+1} \subset U(P^{0},\delta )$ and
$$\begin{aligned} \mathcal{G}(P^{(t_0+1)})-\mathcal{G}(P^{0}) \le (1-\frac{\beta }{\gamma })^{t_0} D(P^{0} \Vert P^{(1)}). \end{aligned}$$
(30)

When the condition (A0) holds additionally, Algorithm 1 with $t_0$ iterations satisfies (ii).

The above theorem is shown in Appendix 11. Next, we consider the case when there are several local minimizers $P^*_1, \ldots , P^*_{n} \in \mathcal{M}_a$ while the true minimizer is $P^*$. These local minimizers are characterized by the following corollary, which is shown in Appendix 10 as a corollary of Theorem 2.

Corollary 1

$$\begin{aligned} D_{\varPsi }(P^* \Vert P^*_i)= \sum _{x \in \mathcal{X}}P^*(x) (\varPsi [P^*](x)-\varPsi [P^*_i](x))= \mathcal{G}(P^*)-\mathcal{G}(P^*_i)<0. \qquad \end{aligned}$$

(31)

Hence, if there exist local minimizers, the condition (A2) does not hold with $U(P^*,\infty )=\mathcal{M}_a$ and the choice $P^0=P^*$. In this case, when the $\delta $-neighborhood $U(P^*_i,\delta )$ of $P^*_i$ satisfies the conditions (A0), (A1), and (A2), Algorithm 1 converges to the local minimizer $P^*_i$ with the speed (28) except for the case (i). Since $P^*_i$ is a local minimizer, the $\delta $-neighborhood $U(P^*_i,\delta )$ of $P^*_i$ satisfies the conditions (A0) and (A1) with sufficiently small $\delta >0$. When the following condition (A4) holds, as shown below, the $\delta $-neighborhood $U(P^*_i,\delta )$ of $P^*_i$ satisfies the condition (A2) with sufficiently small $\delta >0$. That is, when the initial point belongs to the $\delta $-neighborhood $U(P^*_i,\delta )$, Algorithm 1 converges to $P^*_i$.

(A4)
The function $\eta \mapsto \varPsi [P_\eta ](x)$ is differentiable, and the relation
$$\begin{aligned} \sum _{x \in \mathcal{X}}P_{\eta } (x) \Big (\frac{\partial }{\partial \eta _i}\varPsi [P_\eta ](x)\Big )= 0 \end{aligned}$$
(32)
holds for $i=k+1, \ldots , l$ and $P_{\eta } \in \mathcal{M}_a$.

Lemma 3

We consider the following two conditions for a convex subset $\mathcal{K} \subset \mathcal{M}_a$.

(B1)
The relation
$$\begin{aligned} D_{\varPsi }(P \Vert Q)=\sum _{x \in \mathcal{X}} P(x) (\varPsi [P](x)-\varPsi [Q](x)) \ge 0 \end{aligned}$$
(33)
holds for $P,Q \in \mathcal{K}$.
(B2)
$\mathcal{G}(P)$ is convex for the mixture parameter in $\mathcal{K}$.

The condition (B1) implies the condition (B2). In addition, when the condition (A4) holds, the condition (B2) implies the condition (B1).

We consider two kinds of mixture parameters. These parametrizations can be converted to each other via affine conversion, which preserves the convexity. Therefore, the condition (B2) does not depend on the choice of mixture parameter.

When the function $\eta \mapsto \varPsi [P_\eta ](x)$ is twice-differentiable, and the Hessian of $\mathcal{G}(P_\eta )$ is strictly positive semi-definite at a local minimizer $ P^*_i$, this function is convex in the $\delta $-neighborhood $U(P^*_i,\delta )$ of $P^*_i$ with a sufficiently small $\delta >0$ because the Hessian of $\mathcal{G}(P_\eta )$ is strictly positive semi-definite in the neighborhood due to the continuity.

Then, Lemma 3 guarantees the condition (A2) for the $\delta $-neighborhood $U(P^*_i,\delta )$. Algorithm 1 converges to the local minimizer $P^*_i$ with the speed (28) except for the case (i). The mathematical symbols introduced in Sect. 2.1 are summarized in Table 1.

Table 1 List of mathematical symbols for Sect. 2.1

Full size table

Proof of Lemma 3

Assume the condition (B1). Then, for $\lambda \in [0,1]$, we have

$$\begin{aligned} \varphi (\lambda )&:=\lambda \mathcal{G}(P)+(1-\lambda ) \mathcal{G}(Q) - \mathcal{G}(\lambda P+(1-\lambda )Q)\nonumber \\&=\lambda \sum _{x \in \mathcal{X}} P(x) (\varPsi [P](x)- \varPsi [\lambda P+(1-\lambda )Q](x)) \nonumber \\&\quad +(1-\lambda ) \sum _{x \in \mathcal{X}} Q(x) (\varPsi [Q](x)- \varPsi [\lambda P+(1-\lambda )Q](x))\nonumber \\&\ge 0, \end{aligned}$$

(34)

which implies (B2).

Assume the conditions (A4) and (B2). Since $\varphi (\lambda )\ge 0$ for $\lambda \in [0,1]$, we have

$$\begin{aligned} 0&\le \frac{d \varphi (\lambda )}{d \lambda }|_{\lambda =0} \nonumber \\&=\mathcal{G}(P)- \mathcal{G}(Q) -\sum _{x \in \mathcal{X}} (P(x)-Q(x))\varPsi [Q](x)\nonumber \\&\quad -\sum _{x \in \mathcal{X}} Q(x) \frac{d \varPsi [\lambda P+(1-\lambda )Q](x)}{d\lambda }|_{\lambda =0}\nonumber \\&{\mathop {=}\limits ^{(a)}}\mathcal{G}(P)- \mathcal{G}(Q) -\sum _{x \in \mathcal{X}} (P(x)-Q(x))\varPsi [Q](x), \end{aligned}$$

(35)

which implies (B1), where (a) follows from the condition (A4). $\square $

Remark 2

The preceding study [19, Theorem 3.3 and Proposition 3.6] consider similar statements as Theorems 2 and 3. As mentioned in Remark 1, the preceding study [19] covers the case when $\mathcal{M}_a$ is given as $\mathcal{P}(\mathcal{X})$, and does not cover the case with a general mixture family $\mathcal{M}_a$. In addition, the preceding study [19, Theorem 3.3 and Proposition 3.6] covers only the case when $P^0$ and $U(P^{0},\delta )$ are $P^*$ and $ \mathcal{P}(\mathcal{X})$, respectively. That is, the preceding study does not cover the case with local minimizers. In this sense, Theorems 2 and 3 are more general under the classical setting.

2.2 Algorithm with approximated iteration

In general, it is not so easy to calculate the e-projection $\varGamma ^{(e)}_{\mathcal{M}_a}(\varPhi [Q])$. We consider the case when it is approximately calculated. There are two methods to calculate the e-projection. One is the method based on the minimization in the given mixture family, and the other is the method based on the minimization in the exponential family orthogonal to the mixture family. In the first method, the e-projection $\varGamma ^{(e)}_{\mathcal{M}_a}(\varPhi [Q])$ is the minimizer of the following minimization;

$$\begin{aligned} \min _{P \in \mathcal{M}_a} D(P\Vert \varPhi [Q]). \end{aligned}$$

(36)

To describe the second method, using the functions $f_j$ used in (1), we define the exponential family

$$\begin{aligned} Q_{\theta }(x):= \varPhi [Q](x) e^{\sum _{j=1}^k \theta ^j f_j(x)- \phi [Q](\theta )}, \end{aligned}$$

(37)

where

$$\begin{aligned} \phi [Q](\theta ):= \log \sum _{x \in \mathcal{X}}\varPhi [Q](x) e^{\sum _{j=1}^k \theta ^j f_j(x)}. \end{aligned}$$

(38)

The projected element $\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]$ is the unique element of the intersection $ \{Q_\theta \}\cap \mathcal{M}_a$. For example, for this fact, see [22, Lemma 3]. Then, the e-projection $\varGamma ^{(e)}_{\mathcal{M}_a}(\varPhi [Q])$ is given as the solution of the following equation;

$$\begin{aligned} \frac{\partial \phi [Q]}{\partial \theta ^j }(\theta ) =\sum _{x\in \mathcal{X}}Q_\theta (x) f_j(x)= a_j \end{aligned}$$

(39)

for $j=1, \ldots , k$. The solution of (39) is given as the minimizer of the following minimization;

$$\begin{aligned} \min _{\theta \in \mathbb {R}^k} \phi [Q](\theta )- \sum _{j=1}^k \theta ^j a_j. \end{aligned}$$

(40)

We discuss the precision of our algorithm when each step in the above minimization has a certain error. Allowing certain errors in the first method, we propose Algorithm 2 instead of Algorithm 1.

However, the first method requires the minimization with the same number of parameters as the original minimization $\min _{P \in \mathcal{M}_a} \mathcal{G}(P)$. Hence, it is better to employ the second method. In fact, when $\mathcal{M}_a$ is given as a subset of $\mathcal{P}(\mathcal{X})$ with one linear constraint, the minimization (40) is written as a one-parameter convex minimization. Since any one-parameter convex minimization is performed by the bisection method, which needs $O(-\log \epsilon )$ iterations [35] to achieve a smaller error of the minimum of the objective function than $\epsilon $, the cost of this minimization is much smaller than that of the original minimization $\min _{P \in \mathcal{M}_a} \mathcal{G}(P)$. To consider an algorithm based on the minimization (40), we assume that $\varPsi $ is defined in $\mathcal{P}(\mathcal{X})$. In the multi-parameter case, we can use the gradient method and the accelerated proximal gradient method [53,54,55,56,57,58].

To consider the convergence of Algorithm 2, we extend the conditions (A1) and (A2). For this aim, we focus on the $\delta $-neighborhood $\bar{U}(P^{0},\delta )$ of $P^{0} \in \mathcal{M}_a$ defined as

$$\begin{aligned} \bar{U}(P^{0},\delta ):=\{ P \in \mathcal{P}(\mathcal{X}) | D(P^{0} \Vert P)\le \delta \}. \end{aligned}$$

(43)

Then, we introduce the following conditions for the $\delta $-neighborhood $\bar{U}(P^{0},\delta )$ of $P^{0}$ as follows.

(A1+)
Any distribution $Q \in \bar{U}(P^{0},\delta )\cap \mathcal{M}_a ={U}(P^{0},\delta )$ satisfies the following condition with a positive real number $\epsilon _2 >0$. When a distribution $P \in \mathcal{M}_a$ satisfies $D( P \Vert \mathcal{F}_2[Q]) \le \epsilon _2$, we have
$$\begin{aligned} \sum _{x \in \mathcal{X}} P(x) (\varPsi [P](x)- \varPsi [ Q ](x)) \le \gamma D(P \Vert Q). \end{aligned}$$
(44)
(A2+)
Any distribution $Q \in \bar{U}(P^{0},\delta )$ satisfies (25).

The convergence of Algorithm 2 is guaranteed in the following theorem.

Theorem 4

Assume that the $\delta $-neighborhood $\bar{U}(P^{0},\delta )$ of $P^{0}$ satisfies the conditions (A1+) and (A2+) with two positive real numbers $\gamma >0$, $\epsilon _2>0$, and $P^{(1)} \in U(P^{0},\delta )$. Then, for a positive real number $\epsilon _1>0$, Algorithm 2 satisfies the conditions

$$\begin{aligned} D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(t)}]]\Vert \bar{P}^{(t+1)} )&\le \epsilon _1 \end{aligned}$$

(45)

$$\begin{aligned} \mathcal{G}(P_f^{(t_1)}) -\mathcal{G}(P^*)&\le \frac{\gamma D(P^*\Vert P^{(1)}) }{t_1-1} + \epsilon _1 +\gamma \epsilon _2. \end{aligned}$$

(46)

The above theorem is shown in Appendix 12. We discussed the convergences of Algorithms 1 and 2 under several conditions. When these conditions do not hold, we cannot guarantee its global convergence but, the algorithms achieve a local minimum. Hence, we need to repeat these algorithms by changing the initial value. The mathematical symbols introduced in Sect. 2.2 are summarized in Table 2.

Table 2 List of mathematical symbols for Sect. 2.2

Full size table

Remark 3

To address the minimization with a cost constraint, the paper [13] added a linear penalty term to the objective function. However, this method does not guarantee that the obtained result satisfies the required cost constraint. Our method can be applied to any mixture family including the distribution family with cost constraint(s). Hence, our method can be applied directly without the above modification while we need to calculate the e-projection. As explained in this subsection, this e-projection can be obtained with the convex minimization whose number of variables is the number of the constraint to define the mixture family. If the number of the constraints is not so large, still the e-projection is feasible.

2.3 Combination of the gradient method and the Algorithm 1

Although we can use the gradient method to calculate (40) for a general mixture family $\mathcal{M}_a$, in order to calculate $\overline{\mathcal{G}}(a):=\min _{P \in \mathcal{M}_a} \mathcal{G}(P)$ with $a \in \mathbb {R}^k$, we propose another algorithm to combine the gradient method and Algorithm 1. This algorithm avoids the calculation of the e-projection $\varGamma ^{(e)}_{\mathcal{M}_a}$. For simplicity, we assume that the function $\overline{\mathcal{G}}(a)$ is convex and $\mathcal{M}_a$ is not empty. Then, we replace $f_i(x)$ by $f_i(x)-a_i$, which implies that $\overline{\mathcal{G}}(a)$ is changed to $\overline{\mathcal{G}}(0)$. In the following, we aim to calculate $\overline{\mathcal{G}}(0)$, and we denote the expectation of the function f under the distribution P by P[f].

Then, we consider the following functions by using Legendre transform; For $b=(b^1, \ldots , b^k)\in \mathbb {R}^k$ and $c=(c^1, \ldots , c^{l-k})\in \mathbb {R}^{l-k}$, we define

$$\begin{aligned} \mathcal{G}_*(b,c)&:= \sup _{P \in \mathcal{P}(\mathcal{X})} \sum _{i=1}^k b^i P[f_i] +\sum _{j=1}^{l-k} c^i P[f_{k+i}]-\mathcal{G}(P) , \end{aligned}$$

(47)

and

$$\begin{aligned} \overline{\mathcal{G}}_*(b)&:=\mathcal{G}_*(b,0) =\sup _{P \in \mathcal{P}(\mathcal{X})} \sum _{i=1}^k b^i P[f_i]-\mathcal{G}(P) = \sup _{a \in \mathbb {R}^k} \sum _{i=1}^k b^i a_i -\overline{\mathcal{G}}(a). \end{aligned}$$

(48)

In the following, we consider the calculation of $\overline{\mathcal{G}}(0)$ by assuming that the function $\eta \mapsto \mathcal{G}(P_\eta ) $ is $C^2$-continuous and convex. Since Legendre transform of $\overline{\mathcal{G}}_*(b)$ is $\overline{\mathcal{G}}(a)$ due to the convexity of $\overline{\mathcal{G}}(a)$, we have $\sup _{b \in \mathbb {R}^k} \sum _{i=1}^k b^i a_i -\overline{\mathcal{G}}_*(b) =\overline{\mathcal{G}}(a)$. As a special case, we have

$$\begin{aligned} -\inf _{b \in \mathbb {R}^k} \overline{\mathcal{G}}_*(b) =\overline{\mathcal{G}}(0). \end{aligned}$$

(49)

That is, when we find the minimizer $b_*:= \mathop {\textrm{argmin}}\limits _{a \in \mathbb {R}^k} \overline{\mathcal{G}}_*(b)$, we can calculate $\overline{\mathcal{G}}(0)$ as $\overline{\mathcal{G}}(0)= -\sup _{P \in \mathcal{P}(\mathcal{X})} \sum _{i=1}^k b_*^i P[f_i]-\mathcal{G}(P) =\inf _{P \in \mathcal{P}(\mathcal{X})} \mathcal{G}(P)-\sum _{i=1}^k b_*^i P[f_i] $.

To find it, we denote the gradient vector of a function f on $\mathbb {R}^k$ by $\nabla f$. That is, $\nabla f$ is the vector $ (\frac{\partial }{\partial x^1}f ,\ldots , \frac{\partial }{\partial x^k}f )$. Then, we choose a real number L that is larger than the matrix norm of the Hessian of $\overline{\mathcal{G}}_*$, which implies the uniform Lipschitz condition;

$$\begin{aligned} \Vert \nabla \overline{\mathcal{G}}_*(b)- \nabla \overline{\mathcal{G}}_*(b')\Vert \le L \Vert b-b'\Vert . \end{aligned}$$

(50)

Then, we apply the following update rule for the minimization of $\overline{\mathcal{G}}_*(b)$;

$$\begin{aligned} b_{t+1}:= b_t -\frac{1}{L}\nabla \overline{\mathcal{G}}_*(b_t). \end{aligned}$$

(51)

The following precision is guaranteed [52, Chapter 10] [53, 54];

$$\begin{aligned} |\overline{\mathcal{G}}_*(b_k)- \overline{\mathcal{G}}_*(b_*)| \le \frac{L}{2k} \Vert b_*-b_0\Vert ^2. \end{aligned}$$

(52)

We notice that

$$\begin{aligned} \nabla \overline{\mathcal{G}}_*(b)= \mathop {\textrm{argmax}}\limits _{a \in \mathbb {R}^k} \sum _{i=1}^k b^i a_i -\overline{\mathcal{G}}(a)= (Q_b[f_i])_{i=1}^k, \end{aligned}$$

(53)

where

$$\begin{aligned} Q_b&:=\mathop {\textrm{argmax}}\limits _{P \in \mathcal{P}(\mathcal{X})} \sum _{i=1}^k b^i P[f_i]-\mathcal{G}(P) \nonumber \\&=\mathop {\textrm{argmin}}\limits _{P \in \mathcal{P}(\mathcal{X})} \sum _{x \in \mathcal{X}}P(x) \Big (\varPsi [P](x) -\sum _{i=1}^k b^i f_i(x)\Big ) . \end{aligned}$$

(54)

However, the calculation of (54) requires a large calculation amount. Hence, replacing the update rule (51) by a one-step iteration in Algorithm 1, we propose another algorithm.

Using $\varPhi ^{b}[Q](x):= \frac{1}{\kappa }Q(x)\exp ( -\frac{1}{\gamma } \Big (\varPsi [P](x) -\sum _{i=1}^k b^i f_i(x)\Big )$ with the normalizing constant $\kappa $, we propose Algorithm 3.

It is not so easy to evaluate the convergence speed of Algorithm 3. But, when it converges, the convergent point is the true minimizer.

Theorem 5

When the pair (b, P) is a convergence point, we have $b=b_*$ and $P=P_*$.

Proof

Since the pair (b, P) is a convergence point, we have $P=\varPhi ^b[P]$, which implies

$$\begin{aligned} \sum _{i=1}^k b^i P[f_i]-\mathcal{G}(P) =\sup _{P' \in \mathcal{P}(\mathcal{X})} \sum _{i=1}^k b^i P'[f_i]-\mathcal{G}(P') =\overline{\mathcal{G}}_*(b). \end{aligned}$$

(55)

Since the pair (b, P) is a convergence point, we have $ P[f_i] =0$ for $i=1, \ldots , k$, i.e., the distribution P satisfies the required condition in (1). The relation (53) implies $\nabla \overline{\mathcal{G}}_*(b)=0$. Hence, (49) yields $\overline{\mathcal{G}}_*(b)=\overline{\mathcal{G}}(0)$, which implies $b=b_*$. Therefore, the relation (55) is rewritten as $\mathcal{G}(P) =\overline{\mathcal{G}}(0)$, which implies $P=P_*$. $\square $

Remark 4

We compare our algorithm with a general algorithm proposed in [13]. The input of the objective function in [13] forms a mixture family. The function f given in [13, (6)] satisfies the condition of $\mathcal{G}$ by considering the second line of [13, (6)] as $\varPsi $. Their algorithm is the same as Algorithm 1 with $\gamma =1$ when there is no constraint because their extended objective function g defined in [13, (16)] can be considered as $D(P\Vert Q)+ \sum _{x \in \mathcal{X}}P(x) \varPsi [Q](x)$, where the choice of q in [13] corresponds to the choice of P and the choice of $Q_1,\ldots , Q_K$ in [13] does to the choice of Q.

Also, we can show that the function f given in [13, (6)] satisfies the condition (A4). Since the condition (A4) holds, the convexity of f is equivalent to the condition (B1). This equivalence, in this case, was shown as [13, Proposition 4.1]. They showed the convergence of their algorithm as [13, Theorem 4.1], which can be considered as a special case of our Theorem 2.

However, our treatment for the constraint is different from theirs. They consider the minimization $\min _{P \in \mathcal{P}(\mathcal{X})} \mathcal{G}(P)-\sum _{i=1}^k b^i P[f_i] $ without updating the parameter b. Hence, their algorithm cannot achieve the minimum with the desired constraint while Algorithms 1, 2, and 3 achieve the minimum with the desired constraint. Although their algorithm is similar to Algorithm 3, Algorithm 3 updates the parameter b to achieve the minimum with the desired constraint.

3 Application to information theoretical problems

3.1 Channel capacity

In the same way as the reference [19], we apply our problem setting to the channel coding. A channel is given as a conditional distribution $W_{Y|X}$ on the sample space $\mathcal{Y}$ with conditions on the sample space $\mathcal{X}$, where $\mathcal{Y}$ is a general sample space with a measure $\mu $ and $\mathcal{X}$ is a finite sample space. For two absolutely continuous distributions $P_Y$ and $Q_Y$ with respect to $\mu $ on $\mathcal{Y}$, the Kullback–Leibler divergence $D(P_Y\Vert Q_Y)$ is given as

$$\begin{aligned} D(P_Y\Vert Q_Y):= \int _{\mathcal{Y}}p_Y(y) (\log p_Y(y)-\log q_Y(y))\mu (dy), \end{aligned}$$

(56)

where $p_Y$ and $q_Y$ are the probability density functions of $P_Y$ and $Q_Y$ with respect to $\mu $. This quantity is generalized to the Renyi divergence with order $\alpha >0$ as

$$\begin{aligned} D_\alpha (P_{Y} \Vert Q_Y):= \frac{1}{\alpha -1}\log \int _{\mathcal{Y}} \Big (\frac{p_Y(y)}{q_Y(y)}\Big )^{\alpha -1}p_Y(y) \mu (dy). \end{aligned}$$

(57)

The channel capacity $C(W_{Y|X})$ is given as the maximization of the mutual information $I(P_X,W_{Y|X})$ as [10]

$$\begin{aligned} C(W_{Y|X})&:=\max _{P_X} I(P_X,W_{Y|X}) \end{aligned}$$

(58)

$$\begin{aligned} I(P_X,W_{Y|X})&:=\sum _{x \in \mathcal{X}}P_X(x) D(W_{Y|X=x}\Vert W_{Y|X} \cdot P_X) \nonumber \\&= D(W_{Y|X} \times P_X \Vert (W_{Y|X} \cdot P_X) \times P_X), \end{aligned}$$

(59)

where $W_{Y|X} \cdot P_X$ and $W_{Y|X} \times P_X$ are defined as the following probability density functions $w_{Y|X} \cdot P_X$ and $w_{Y|X} \times P_X$;

$$\begin{aligned} (w_{Y|X} \cdot P_X)(y)&:= \sum _{x \in \mathcal{X}}P_X(x) w_{Y|X=x}(y) \end{aligned}$$

(60)

$$\begin{aligned} (w_{Y|X} \times P_X)(x, y)&:= P_X(x) w_{Y|X=x}(y) . \end{aligned}$$

(61)

However, the mutual information $I(P_X,W_{Y|X})$ has another form as

$$\begin{aligned} I(P_X, W_{Y|X})= \min _{Q_Y} \sum _{x \in \mathcal{X}}P_X(x) D(W_{Y|X=x}\Vert Q_{Y}). \end{aligned}$$

(62)

When we choose $\mathcal{M}_a$ and $\varPsi $ as $\mathcal{P}(\mathcal{X})$ and

$$\begin{aligned} \varPsi _{W_{Y|X}}[P_X](x):= - D(W_{Y|X=x}\Vert W_{Y|X} \cdot P_X), \end{aligned}$$

(63)

$-I(P_X,W_{Y|X})$ coincides with $\mathcal{G}(P_X)$ [19]. Since

$$\begin{aligned} D_{\varPsi }(P_X\Vert Q_X) = D(W_{Y|X} \cdot P_X\Vert W_{Y|X} \cdot Q_X) \ge 0, \end{aligned}$$

(64)

the condition (A2) holds with $\mathcal{P}(\mathcal{X})$. In addition, since the information processing inequality guarantees that

$$\begin{aligned} D(W_{Y|X} \cdot P_X\Vert W_{Y|X} \cdot Q_X) \le D(P_X\Vert Q_X), \end{aligned}$$

(65)

the condition (A1) holds with $\gamma =1$ and $\mathcal{P}(\mathcal{X})$. In this case, $\varPhi $ is given as

$$\begin{aligned} \varPhi [Q_X](x)=\frac{1}{\kappa _{W_{Y|X}}[Q_X]}Q_X(x) \exp ( \frac{1}{\gamma } D(W_{Y|X=x}\Vert W_{Y|X} \cdot Q_X)), \end{aligned}$$

(66)

where the normalizing constant $\kappa _{W_{Y|X}}[Q_X]$ is given as

$$\begin{aligned} \kappa _{W_{Y|X}}[Q_X]=\sum _{x \in \mathcal{X}}Q_X(x) \exp ( \frac{1}{\gamma } D(W_{Y|X=x}\Vert W_{Y|X} \cdot Q_X)). \end{aligned}$$

(67)

When $\gamma =1$, it coincides with the Arimoto–Blahut algorithm [8, 9]. Since $\varPhi [Q_X] \in \mathcal{P}(\mathcal{X})$, $P_X^{(t+1)}$ is given as $\varPhi [P_X^{(t)}]$.

Remark 5

The reference [19] covers the case when $\mathcal{M}_a$ is given as $\mathcal{P}(\mathcal{X})$, and the reference [19] presented the algorithms presented in this subsection in a more general form. Also, they proposed an adaptive choice of $\gamma $ in this case [19, (22)]. In addition, they numerically compared their adaptive choice with the case of $\gamma =1$ [19, Figs. 1,..., 6]. These comparisons show a significant improvement by their adaptive choice.

3.2 Reliability function in channel coding

In channel coding, we consider the reliability function, which was originally introduced by Gallager [25] and expresses the exponential decreasing rate of an upper bound of the decoding block error probability under the random coding. To achieve this aim, for $\alpha > 0 $, we define

$$\begin{aligned} I_{\alpha }(P_X,W_{Y|X}):= \frac{\alpha }{\alpha -1}\log \Big ( \int _{\mathcal{Y}} \Big (\sum _{x \in \mathcal{X}} P_X(x) w_{Y|X=x}(y)^{\alpha }\Big )^{\frac{1}{\alpha }} \mu (dy) \Big ). \end{aligned}$$

(68)

Then, when the code is generated with the random coding based on the distribution $P_X$, the decoding block error probability with coding rate R is upper bounded by the following quantity;

$$\begin{aligned} e^{n \min _{\rho \in [0,1]} \big (\rho R -\rho I_{\frac{1}{1+\rho }}(P_X,W_{Y|X})\big )} \end{aligned}$$

(69)

when we use the channel $W_{Y|X}$ with n times. Notice that $e^{-\rho I_{\frac{1}{1+\rho }}(P_X,W_{Y|X})}= \int _{\mathcal{Y}} \Big (\sum _{x \in \mathcal{X}} P_X(x) w_{Y|X=x}(y)^{\frac{1}{1+\rho }}\Big )^{1+\rho } \mu (dy)$. That is, the Gallager function [25] is given as $\rho I_{\frac{1}{1+\rho }}(P_X,W_{Y|X})$, i.e., the parameter $\alpha $ is different from the parameter $\rho $ in the Gallager function. Taking the minimum for the choice of $P_X$, we have

$$\begin{aligned} \min _{P_X} e^{\min _{\rho \in [0,1]} \big (\rho R -\rho I_{\frac{1}{1+\rho }}(P_X,W_{Y|X})\big )} = \min _{\alpha \in [1/2,1]} \Big (e^{-\frac{\alpha -1}{\alpha } R} \min _{P_X} e^{\frac{\alpha -1}{\alpha } I_{\alpha }(P_X,W_{Y|X})}\Big ) \end{aligned}$$

(70)

with $\alpha =\frac{1}{1-\rho }\in [1/2,1]$. Therefore, we consider the following minimization;

$$\begin{aligned} \min _{P_X} e^{\frac{\alpha -1}{\alpha } I_{\alpha }(P_X,W_{Y|X})} =\min _{P_X} \int _{\mathcal{Y}} \Big (\sum _{x \in \mathcal{X}} P_X(x) w_{Y|X=x}(y)^{\alpha }\Big )^{\frac{1}{\alpha }} \mu (dy) . \quad \end{aligned}$$

(71)

In the following, we discuss the RHS of (71) with $\alpha \in [1/2,1]$.

To apply our method, as a generalization of (62), we consider another expression of $I_{\alpha }(P_X,W_{Y|X})$;

$$\begin{aligned} I_{\alpha }(P_X,W_{Y|X})= \min _{Q_Y} D_\alpha (W_{Y|X}\times P_X \Vert Q_Y\times P_X), \end{aligned}$$

(72)

which was shown in [27, Lemma 2]. Using

$$\begin{aligned} Q_{Y|\alpha ,P_X}&:= \mathop {\textrm{argmin}}\limits _{Q_Y} D_\alpha (W_{Y|X}\times P_X \Vert Q_Y\times P_X) \nonumber \\&=\mathop {\textrm{argmax}}\limits _{Q_Y}\sum _{x\in \mathcal{X}}P_X(x)e^{ (\alpha -1)D_\alpha (W_{Y|X=x} \Vert Q_{Y})}, \end{aligned}$$

(73)

we have

$$\begin{aligned} \Big (\min _{P_X} e^{\frac{\alpha -1}{\alpha } I_{\alpha }(P_X,W_{Y|X})}\Big )^\alpha = \min _{P_X} \sum _{x\in \mathcal{X}}P_X(x)e^{ (\alpha -1) D_\alpha (W_{Y|X=x} \Vert Q_{Y|\alpha ,P_X})}. \end{aligned}$$

(74)

The probability density function $q_{Y|\alpha ,P_X}$ of the minimizer $Q_{Y|\alpha ,P_X}$ is calculated as

$$\begin{aligned} q_{Y|\alpha ,P_X}(y)= C \Big (\sum _{x \in \mathcal{X}} P_X(x) w_{Y|X=x}(y)^{\alpha }\Big )^{\frac{1}{\alpha }} , \end{aligned}$$

(75)

where C is the normalizing constant [27, Lemma 2].

To solve the minimization (74), we apply our method to the case when we choose $\mathcal{M}_a$ and $\varPsi $ as $\mathcal{P}(\mathcal{X})$ and

$$\begin{aligned} \varPsi _{\alpha ,W_{Y|X}}[P_X](x):= e^{(\alpha -1) D_\alpha (W_{Y|X=x} \Vert Q_{Y|\alpha ,P_X})}. \end{aligned}$$

(76)

Since (73) guarantees that

$$\begin{aligned}&\sum _{x\in \mathcal{X}}P_X(x) (\varPsi _{\alpha ,W_{Y|X}}[P_X](x) -\varPsi _{\alpha ,W_{Y|X}}[Q_X](x)) \nonumber \\&\quad = \sum _{x\in \mathcal{X}}P_X(x) \Big (e^{ (\alpha -1)D_\alpha (W_{Y|X=x} \Vert Q_{Y|\alpha ,P_X})} - e^{ (\alpha -1)D_\alpha (W_{Y|X=x} \Vert Q_{Y|\alpha ,Q_X})}\Big ) \ge 0, \end{aligned}$$

(77)

the condition (A2) holds with $\mathcal{M}_a=\mathcal{P}(\mathcal{X})$. The condition (A1) can be satisfied with sufficiently large $\gamma $. In this case, $\varPhi $ is given as

$$\begin{aligned} \varPhi _\alpha [Q_X](x)=\frac{1}{\kappa _{\alpha ,W_{Y|X}}[Q_X]}Q_X(x) \exp ( -\frac{1}{\gamma } e^{(\alpha -1) D_\alpha (W_{Y|X=x} \Vert Q_{Y|\alpha ,P_X})}), \end{aligned}$$

(78)

where the normalizing constant $\kappa _{\alpha ,W_{Y|X}}[Q_X]$ is given as $\kappa _{\alpha ,W_{Y|X}}[Q_X]=$ $\sum _{x \in \mathcal{X}}Q_X(x) \exp ( -\frac{1}{\gamma } e^{(\alpha -1) D_\alpha ( W_{Y|X=x} \Vert Q_{Y|\alpha ,P_X})})$. Since $\varPhi _\alpha [Q_X] \in \mathcal{M}_a$, $P_X^{(t+1)}$ is given as $\varPhi _\alpha [P_X^{(t)}]$.

3.3 Strong converse exponent in channel coding

In channel coding, we discuss an upper bound of the probability of correct decoding. This probability is upper bounded by the following quantity;

$$\begin{aligned} \max _{P_X} e^{n \min _{\rho \in [0,1]} \Big (-\rho R +\rho I_{\frac{1}{1-\rho }}(P_X,W_{Y|X})\Big )} \end{aligned}$$

(79)

when we use the channel $P_{Y|X}$ with n times and the coding rate is R [26]. Therefore, we consider the following maximization;

$$\begin{aligned} \max _{P_X} e^{\rho I_{\frac{1}{1-\rho }}(P_X,W_{Y|X})} =\max _{P_X} e^{\frac{\alpha -1}{\alpha } I_{\alpha }(P_X,W_{Y|X})} \end{aligned}$$

(80)

with $\alpha =\frac{1}{1-\rho }>1$. In the following, we discuss the RHS of (80) with $\alpha >1$.

To apply our method, we consider another expression (72) of $I_{\alpha }(P_X,W_{Y|X})$. Using (73), we have

$$\begin{aligned} \Big (\max _{P_X} e^{\frac{\alpha -1}{\alpha } I_{\alpha }(P_X,W_{Y|X})}\Big )^\alpha = \max _{P_X} \sum _{x\in \mathcal{X}}P_X(x)e^{ (\alpha -1) D_\alpha (W_{Y|X=x} \Vert Q_{Y|\alpha ,P_X})}. \end{aligned}$$

(81)

The maximization (81) can be solved by choosing $\mathcal{M}_a$ and $\varPsi $ as $\mathcal{P}(\mathcal{X})$ and

$$\begin{aligned} \varPsi _{\alpha ,W_{Y|X}}[P_X](x):= - e^{(\alpha -1) D_\alpha (W_{Y|X=x} \Vert Q_{Y|\alpha ,P_X})}. \end{aligned}$$

(82)

Since (73) guarantees that

$$\begin{aligned}&\sum _{x\in \mathcal{X}}P_X(x) (\varPsi _{\alpha ,W_{Y|X}}[P_X](x) -\varPsi _{\alpha ,W_{Y|X}}[Q_X](x)) \nonumber \\&\quad = \sum _{x\in \mathcal{X}}P_X(x) \Big (-e^{ (\alpha -1)D_\alpha (W_{Y|X=x} \Vert Q_{Y|\alpha ,P_X})} + e^{ (\alpha -1)D_\alpha (W_{Y|X=x} \Vert Q_{Y|\alpha ,Q_X})}\Big ) \ge 0, \end{aligned}$$

(83)

the condition (A2) holds with $\mathcal{M}_a=\mathcal{P}(\mathcal{X})$. Similarly, the condition (A1) can be satisfied with sufficiently large $\gamma $. In this case, $\varPhi $ is given as

$$\begin{aligned} \varPhi _\alpha [Q_X](x)=\frac{1}{\kappa _{\alpha ,W_{Y|X}}[Q_X]}Q_X(x) \exp ( \frac{1}{\gamma } e^{(\alpha -1) D_\alpha ( W_{Y|X=x} \Vert Q_{Y|\alpha ,P_X})}), \end{aligned}$$

(84)

where the normalizing constant $\kappa _{\alpha ,W_{Y|X}}[Q_X]$ is given as $\kappa _{\alpha ,W_{Y|X}}[Q_X]=$

$\sum _{x \in \mathcal{X}}Q_X(x) \exp ( \frac{1}{\gamma } e^{(\alpha -1) D_\alpha (W_{Y|X=x} \Vert Q_{Y|\alpha ,P_X})})$. Since $\varPhi _\alpha [Q_X] \in \mathcal{M}_a$, $P_X^{(t+1)}$ is given as $\varPhi _\alpha [P_X^{(t)}]$.

3.4 Wiretap channel capacity

3.4.1 General case

Given a pair of a channel $W_{Y|X}$ from $\mathcal{X}$ to a legitimate user $\mathcal{Y}$ and a channel $W_{Z|X}$ from $\mathcal{X}$ to a malicious user $\mathcal{Z}$, the wiretap channel capacity is given as [28, 29]

$$\begin{aligned} C(W_{Y|X},W_{Z|X}):=\max _{P_{VX}} I(P_V, W_{Y|X}\cdot P_{X|V})-I(P_V, W_{Z|X}\cdot P_{X|V}) \end{aligned}$$

(85)

with a sufficiently large discrete set $\mathcal{V}$. The recent papers showed that the above rate can be achieved even with the strong security [30,31,32] and the semantic security [33, 34].^{Footnote 1} Furthermore, the paper [34, Appendix D] showed the above even when the output systems are general continuous systems including Gaussian channels. The wiretap capacity (85) can be calculated via the minimization;

$$\begin{aligned} \min _{P_{VX}} -I(P_V, W_{Y|X}\cdot P_{X|V})+I(P_V, W_{Z|X}\cdot P_{X|V}). \end{aligned}$$

(86)

Here, $\mathcal{V}$ is an additional discrete sample space. When we choose $\mathcal{M}_a$ and $\varPsi $ as $\mathcal{P}(\mathcal{X} \times \mathcal{V})$ and

$$\begin{aligned}&\varPsi _{W_{Y|X},W_{Z|X}}[P_{VX}](v,x)\nonumber \\&\quad := D(W_{Z|X}\cdot P_{X|V=v}\Vert W_{Z|X}\cdot P_{X} ) -D(W_{Y|X}\cdot P_{X|V=v}\Vert W_{Y|X}\cdot P_{X} ), \end{aligned}$$

(87)

$-I(P_V, W_{Y|X}\cdot P_{X|V})+I(P_V, W_{Z|X}\cdot P_{X|V})$ coincides with $\mathcal{G}(P_{VX})$. Here, although $\varPsi _{W_{Y|X},W_{Z|X}}[P_{VX}]$ is a function of (v, x), the function value depends only on v. Hence, the general theory in Sect. 2 can be used for the minimization of (86). In this case, it is difficult to clarify whether the conditions (A1) and (A2) hold in general. $\varPhi $ is given as

$$\begin{aligned}&\varPhi [Q_{VX}](v,x)\nonumber \\&\quad =\frac{1}{\kappa _{W_{Y|X},W_{Z|X}}[Q_{VX}]}Q_{VX}(v,x) \exp \Big ( \frac{1}{\gamma } \Big ( D(W_{Y|X}\cdot P_{X|V=v}\Vert W_{Y|X}\cdot P_{X} )\nonumber \\&\qquad -D(W_{Z|X}\cdot P_{X|V=v}\Vert W_{Z|X}\cdot P_{X} ) \Big ) \Big ), \end{aligned}$$

(88)

where $\kappa _{W_{Y|X},W_{Z|X}}[Q_{VX}]$ is the normalizing constant. Since $\varPhi [Q_X] \in \mathcal{M}_a$, $P_X^{(t+1)}$ is given as $\varPhi [P_X^{(t)}]$.

3.4.2 Degraded case

However, when there exists a channel $W_{Z|Y}$ from $\mathcal{Y}$ to $\mathcal{Z}$ such that $ W_{Z|Y} \cdot W_{Y|X}=W_{Z|X}$, i.e., the channel $W_{Z|X}$ is a degraded channel of $W_{Y|X}$, we can define the joint channel $W_{YZ|X}$ with the following conditional probability density function

$$\begin{aligned} w_{YZ|X}(yz|x):= w_{Z|Y}(z|y)w_{Y|X}(y|x). \end{aligned}$$

(89)

Then, the maximization (85) is simplified as

$$\begin{aligned} C(W_{YZ|X}):=\max _{P_{X}} I(X;Y|Z)[P_X, W_{YZ|X}] \end{aligned}$$

(90)

where the conditional mutual information is given as

$$\begin{aligned} I(X;Y|Z)[P_X, W_{YZ|X}] := \sum _{x,z} P_{XZ}(x,z) D(P_{Y|X=x,Z=z}\Vert P_{Y|Z=z}), \end{aligned}$$

(91)

where the conditional distributions $P_{Y|XZ}$ and $P_{Y|Z}$ are defined from the joint distribution $W_{YZ|X}\times P_X$. To consider (90), we consider the following minimization with a general two-output channel $W_{YZ|X}$;

$$\begin{aligned} \min _{P_{X}} - I(X;Y|Z)[P_X, W_{YZ|X}] . \end{aligned}$$

(92)

When we choose $\mathcal{M}_a$ and $\varPsi $ as $\mathcal{P}(\mathcal{X})$ and

$$\begin{aligned} \varPsi _{W_{YZ|X}}[P_{X}](x):= - \sum _{z} P_{Z|X=x}(z) D(P_{Y|X=x,Z=z}\Vert P_{Y|Z=z}). \end{aligned}$$

(93)

$- I(X;Y|Z)[P_X, W_{YZ|X}]$ coincides with $\mathcal{G}(P_{X})$. Hence, the general theory in Sect. 2 can be used for the minimization of (92). In this case, as shown in Sect. 6.2, the conditions (A1) with $\gamma =1$ and (A2) hold. $\varPhi $ is given as

$$\begin{aligned}&\varPhi [Q_{X}](x)\nonumber \\&\quad =\frac{1}{\kappa _{W_{YZ|X}}[Q_{X}]}Q_{X}(x) \exp \Big ( \frac{1}{\gamma } \Big ( \sum _{z} P_{Z|X=x}(z) D(P_{Y|X=x,Z=z}\Vert P_{Y|Z=z}) \Big ) \Big ),\quad \end{aligned}$$

(94)

where $\kappa _{W_{YZ|X}}[Q_{X}]$ is the normalizing constant. Since $\varPhi [Q_X] \in \mathcal{M}_a$, $P_X^{(t+1)}$ is given as $\varPhi [P_X^{(t)}]$. The above algorithm with $\gamma =1$ coincides with the algorithm proposed by [14].

3.5 Capacities with cost constraint

Next, we consider the case when a cost constraint is imposed. Consider a function f on $\mathcal{X}$ and the following constraint for a distribution $P_X \in \mathcal{X}$;

$$\begin{aligned} P_X[f]=a. \end{aligned}$$

(95)

We define $\mathcal{M}_a$ by imposing the condition (95) as a special case of (1). The capacity of the channel $W_{Y |X}$ under the cost constraint is given as $\max _{P_X \in \mathcal{M}_a}I(P_X,W_{Y|X})$. That is, we need to solve the minimization $\min _{P_X \in \mathcal{M}_a}-I(P_X,W_{Y|X})$. In this case, the $t+1$-th distribution $P^{(t+1)}$ is given as $\varGamma _{\mathcal{M}_a}^{(e)} [\varPhi [P_X^{(t)}]]$. Since $\varGamma _{\mathcal{M}_a}^{(e)} [\varPhi [P_X^{(t)}]]$ cannot be calculated analytically, we can use Algorithm 2 instead of Algorithm 1. Since conditions (A1) with $\gamma =1$ and (A2) hold, Theorem 4 guarantees the global convergence to the minimum in Algorithm 2.

We can consider the cost constraint (95) for the problems (74) and (81). In these cases, we have a similar modification by considering $\varGamma _{\mathcal{M}_a}^{(e)} [\varPhi [P_X^{(t)}]]$.

4 em problem

We apply our algorithm to the problem setting with the em algorithm [2,3,4], which is a generalization of Boltzmann machines [5]. The em algorithm is implemented by iterative applications of the projection to an exponential family (the m-projection) and the projection to a mixture family (the e-projection). Hence, this algorithm is called the em algorithm. On the other hand, the EM algorithm is implemented by iterative applications of expectation and maximization. Their relation is summarized as follows. In particular, the expectation in the EM algorithm, which is often called E-step, corresponds to the e-projection to a mixture family, which is often called e-step in the em algorithm. Also, the maximization in the EM algorithm, which is often called M-step, corresponds to the m-projection to an exponential family, which is often called m-step in the em algorithm. In this reason, they are essentially the same [2].

For this aim, we consider a pair of an exponential family $\mathcal{E}$ and a mixture family $\mathcal{M}_a$ on $\mathcal{X}$. We denote the m-projection to $\mathcal{E}$ of P by $\varGamma _\mathcal{E}^{(m)}[P]$, which is defined as [1, 2]

$$\begin{aligned} \varGamma _\mathcal{E}^{(m)}[P]:= \mathop {\textrm{argmin}}\limits _{Q\in \mathcal{E}} D(P\Vert Q). \end{aligned}$$

(96)

We consider the following minimization;

$$\begin{aligned}&\min _{P\in \mathcal{M}_a}\min _{Q\in \mathcal{E}} D(P\Vert Q)= \min _{P\in \mathcal{M}_a} D(P\Vert \varGamma _\mathcal{E}^{(m)}[P]) \nonumber \\&\quad = \min _{P\in \mathcal{M}_a} \sum _{x \in \mathcal{X}}P(x) (\log P(x) - \log \varGamma _\mathcal{E}^{(m)}[P](x)). \end{aligned}$$

(97)

We choose the function $\varPsi $ as

$$\begin{aligned} \varPsi _\mathrm{{em}}[P](x):= (\log P(x) - \log \varGamma _\mathcal{E}^{(m)}[P](x)), \end{aligned}$$

(98)

and apply the discussion in Sect. 2. Then, we have

$$\begin{aligned}&\sum _{x \in \mathcal{X}} P^{0}(x) (\varPsi _\mathrm{{em}}[P^{0}](x)- \varPsi _\mathrm{{em}}[Q](x)) \nonumber \\&\quad =\sum _{x \in \mathcal{X}} P^{0}(x) \Big ( (\log P^{0}(x) - \log \varGamma _\mathcal{E}^{(m)}[P^{0}](x)) - (\log Q(x) - \log \varGamma _\mathcal{E}^{(m)}[Q](x)) \Big ) \nonumber \\&\quad = \sum _{x \in \mathcal{X}} P^{0}(x)(\log P^{0}(x)-\log Q(x)) \nonumber \\&\qquad + \sum _{x \in \mathcal{X}} P^{0}(x) \Big ( \log \varGamma _\mathcal{E}^{(m)}[Q](x)) - \log \varGamma _\mathcal{E}^{(m)}[P^{0}](x)) \Big ) \nonumber \\&\quad = D (P^{0}\Vert Q) + D(P^{0} \Vert \varGamma _\mathcal{E}^{(m)}[P^{0}]) -D(P^{0} \Vert \varGamma _\mathcal{E}^{(m)}[Q]) \nonumber \\&\quad = D (P^{0}\Vert Q) -D(\varGamma _\mathcal{E}^{(m)}[P^{0}] \Vert \varGamma _\mathcal{E}^{(m)}[Q]), \end{aligned}$$

(99)

where the final equation follows from (4). The condition (A1) holds with $U(P^0,\infty )=\mathcal{M}_a$ and $\gamma =1$. There is a possibility that the condition (A1) holds with a smaller $\gamma $. Therefore, with $\gamma =1$, Theorem 1 guarantees that Algorithm 1 converges to a local minimum. In addition, when the relation

$$\begin{aligned} D (P^0\Vert Q) \ge D(\varGamma _\mathcal{E}^{(m)}[P^0] \Vert \varGamma _\mathcal{E}^{(m)}[Q]) \end{aligned}$$

(100)

holds for $Q \in U(P^0,\delta )$, the condition (A2) holds with $U(P^0,\delta )$. That is, if the condition (100) holds, Algorithm 1 has the global convergence to the minimizer. The condition (100) is a condition similar to the condition given in [22].

In this case, $\varPhi $ is given as

$$\begin{aligned} \varPhi [Q](x)&=\frac{1}{\kappa _\mathrm{{em}}[Q]}Q(x) \exp \Big ( - \frac{1}{\gamma } (\log Q(x) - \log \varGamma _\mathcal{E}^{(m)}[Q](x))\Big )\nonumber \\&=\frac{1}{\kappa _\mathrm{{em}}[Q]}Q(x)^{\frac{\gamma -1}{\gamma }} \varGamma _\mathcal{E}^{(m)}[Q](x)^{\frac{1}{\gamma }}, \end{aligned}$$

(101)

where the normalizing constant $\kappa _\mathrm{{em}}[Q_X]$ is given as $\kappa _\mathrm{{em}}[Q_X]=\sum _{x \in \mathcal{X}} Q(x)^{\frac{\gamma -1}{\gamma }} \varGamma _\mathcal{E}^{(m)} [Q](x)^{\frac{1}{\gamma }}$. Since $\varPhi [Q] \in \mathcal{M}_a$, $P^{(t+1)}$ is given as $\varGamma _{\mathcal{M}_a}^{(e)}[\varPhi [P^{(t)}]]$. When $\gamma =1$, it coincides with the conventional em-algorithm [2,3,4] because $\varPhi [P^{(t)}]=\varGamma _\mathcal{E}^{(m)}[P^{(t)}]$. The above analysis suggests the choice of $\gamma $ as a smaller value than 1. That is, there is a possibility that a smaller $\gamma $ improves the conventional em-algorithm. In addition, we may use Algorithm 2 instead of Algorithm 1 when the calculation of e-projection is difficult.

Lemma 4

When $\varPsi _\mathrm{{em}}$ is given as (98), the condition (A4) holds.

Therefore, by combining Lemmas 3 and 4, the assumption of Theorem 2 holds in the $\delta $ neighborhood of a local minimizer with sufficiently small $\delta >0$. That is, the convergence speed can be evaluated by Theorem 2.

Proof

Pythagorean theorem guarantees

$$\begin{aligned}&\sum _{x \in \mathcal{X}}P (x) \big ( \log \varGamma _\mathcal{E}^{(m)}[Q](x)- \log \varGamma _\mathcal{E}^{(m)}[P](x) \big ) \nonumber \\&\quad = D(P\Vert \varGamma _\mathcal{E}^{(m)}[P]) - D(P\Vert \varGamma _\mathcal{E}^{(m)}[Q]) = D(\varGamma _\mathcal{E}^{(m)}[P]\Vert \varGamma _\mathcal{E}^{(m)}[Q]) . \end{aligned}$$

(102)

We make the parameterization $P_\eta \in \mathcal{M}_a$ with mixture parameter $\eta $. We denote $\eta (h,i):=(\eta (0)_1, \ldots , \eta (0)_{i-1},\eta (0)_i+h,\eta (0)_{i+1},\ldots , \eta (0)_k)$.

$$\begin{aligned}&\sum _{x \in \mathcal{X}}P_{\eta (0)} (x) \Big (\frac{\partial }{\partial \eta _i}\varPsi [P_\eta ](x)|_{\eta =\eta (0)}\Big ) \nonumber \\&\quad = \sum _{x \in \mathcal{X}}P_{\eta (0)} (x) \Big ( \lim _{h\rightarrow 0}\frac{\varPsi [P_{\eta (h,i)}](x)-\varPsi [P_{\eta (0)}](x)}{h} \Big ) \nonumber \\&\quad = \sum _{x \in \mathcal{X}}P_{\eta (0)} (x) \Big ( \lim _{h\rightarrow 0}\frac{ \log P_{\eta (h,i)}(x)-\log P_{\eta (0)}(x)}{h} \nonumber \\&\qquad -\lim _{h\rightarrow 0}\frac{ \log \varGamma _\mathcal{E}^{(m)}[P_{\eta (h,i)}](x)- \log \varGamma _\mathcal{E}^{(m)}[P_{\eta (0)}](x)}{h} \Big ) \nonumber \\&\quad {\mathop {=}\limits ^{(a)}} \sum _{x \in \mathcal{X}}P_{\eta (0)} (x) \Big ( \lim _{h\rightarrow 0}\frac{ \log P_{\eta (h,i)}(x)-\log P_{\eta (0)}(x)}{h} \Big )\nonumber \\&\qquad - \sum _{x \in \mathcal{X}}\varGamma _\mathcal{E}^{(m)}[P_{\eta (0)}] (x) \Big ( \lim _{h\rightarrow 0} \frac{ \log \varGamma _\mathcal{E}^{(m)}[P_{\eta (h,i)}](x)- \log \varGamma _\mathcal{E}^{(m)}[P_{\eta (0)}](x)}{h} \Big ) \nonumber \\&\quad = \sum _{x \in \mathcal{X}} \frac{\partial }{\partial \eta _i} P_\eta (x)|_{\eta =\eta (0)} - \sum _{x \in \mathcal{X}} \frac{\partial }{\partial \eta _i} \varGamma _\mathcal{E}^{(m)}[P_\eta ](x)|_{\eta =\eta (0)} =0, \end{aligned}$$

(103)

which implies the condition (A4). Here, (a) follows from (102). $\square $

5 Commitment capacity

Using the same notations as Sect. 3, we address the bit commitment via a noisy channel $W_{Y|X}$. When we can guarantee the communication channel is given by $W_{Y|X}$, bit commitment is possible. In this topic, we are interested in the secure transmission rate in the sense of bit commitment per a single use of the noisy channel $W_{Y|X}$. The maximum value of this rate is called the commitment capacity. Given a distribution $P_X$, the Shannon entropy is given as

$$\begin{aligned} H(X)_{P_X}:=-\sum _{x \in \mathcal{X}}P_X(x)\log P_X(x). \end{aligned}$$

(104)

Given a joint distribution $P_{XY}$, the conditional entropy is defined as

$$\begin{aligned} H(X|Y)_{P_{XY}}:= \int _{\mathcal{Y}} H(X)_{W_{X|Y=y}} p_Y(y) \mu (dy). \end{aligned}$$

(105)

The commitment capacity is given as

$$\begin{aligned} C_c(W_{Y|X}):=\max _{P_X} H(X|Y)_{W_{Y|X} \times P_X}. \end{aligned}$$

(106)

This problem setting has several versions. To achieve the bit commitment, the papers [36,37,38] considered interactive protocols with multiple rounds, where each round has one use of the given noisy channel $W_{Y|X}$ and free noiseless communications in both directions. Then, it derived the commitment capacity (106). Basically, the proof is composed of two parts. One is the achievability part, which is often called the direct part and shows the existence of the code to achieve the capacity. The other is the impossibility part, which is often called the converse part and shows the non-existence of the code to exceed the capacity. As the achievability part, they showed that the commitment capacity can be achieved with non-interactive protocol, which has no free noiseless communication during multiple uses of the given noisy channel $W_{Y|X}$. However, as explained in [39], their proof of the impossibility part skips so many steps that it cannot be followed. Later, the paper [40] showed the impossibility part only for non-interactive protocols by applying the wiretap channel. Recently, the paper [41] constructed a code to achieve the commitment capacity by using a specific type of list decoding. Further, the paper showed the achievability of the commitment capacity even in the quantum setting. In addition, the paper [39] showed the impossibility part for interactive protocols by completing the proof by [39]. The proof in [39] covers the impossibility part for a certain class even in the quantum setting.

5.1 Algorithm based on the em-algorithm problem

To calculate the commitment capacity, we consider the following mixture and exponential families;

$$\begin{aligned} \mathcal{M}_a&:=\{W_{Y|X} \times P_X| P_X \in \mathcal{P}(\mathcal{X})\}\nonumber \\&=\bigg \{P_{XY} \in \mathcal{P}(\mathcal{X}\times \mathcal{Y})\bigg | \sum _{y \in \mathcal{Y}}P_{XY}(x,y)=P_X(x)\bigg \} \end{aligned}$$

(107)

$$\begin{aligned} \mathcal{E}&:=\{ Q_Y \times P_{X,Uni} | Q_Y \in \mathcal{P}(\mathcal{Y})\}, \end{aligned}$$

(108)

where $P_{X,Uni}$ is the uniform distribution on $\mathcal{X}$. In (107), the integer k is chosen to be $|\mathcal{X}|-1$, the linear functions $f_1, \ldots , f_k$ are chosen to be $\sum _{y \in \mathcal{Y}}P_{XY}(x_1,y), \ldots ,$ $\sum _{y \in \mathcal{Y}}P_{XY}(x_{k-1},y)$, and the vector a is chosen to be $P_X(x_1), \ldots , P_X(x_k)$. These choices clarify that (107) gives a mixture family.

Since $\varGamma _\mathcal{E}^{(m)}[W_{Y|X} \times P_X]=(W_{Y|X} \cdot P_X) \times P_{X,Uni} $, the commitment capacity is rewritten as

$$\begin{aligned} \log |\mathcal{X}|-C_c(W_{Y|X})&=\min _{P_X} H(X)_{P_{X,Uni}} + H(X)_{W_{Y|X} \cdot P_X} - H(X Y)_{W_{Y|X} \times P_X}\nonumber \\&=\min _{P_X} D( W_{Y|X} \times P_X\Vert (W_{Y|X} \cdot P_X) \times P_{X,Uni} ) \nonumber \\&=\min _{P_X} D( W_{Y|X} \times P_X\Vert \varGamma _\mathcal{E}^{(m)}[W_{Y|X} \times P_X] ). \end{aligned}$$

(109)

Hence, the minimization (109) is a special case of the minimization (97). Since $ \varGamma _\mathcal{E}^{(m)}[W_{Y|X} \times Q_X](x,y) =(W_{Y|X} \cdot Q_X) \times P_{X,Uni} $,

$$\begin{aligned} D(\varGamma _\mathcal{E}^{(m)}[P^*] \Vert \varGamma _\mathcal{E}^{(m)}[Q_X]) =D(W_{Y|X}\cdot P_X\Vert W_{Y|X}\cdot Q_X) \le D (P^*\Vert Q_X) , \end{aligned}$$

(110)

which yields the condition (100). Hence, the global convergence is guaranteed.

By applying (101), $\varPhi $ is calculated as

$$\begin{aligned}&\varPhi [W_{Y|X} \times Q_X](x,y)\nonumber \\&\quad =\frac{1}{\kappa _{W_{Y|X}}^{1}[Q_X]} w_{Y|X}(y|x)^{\frac{\gamma -1}{\gamma }} Q_X(x)^{\frac{\gamma -1}{\gamma }} (w_{Y|X} \cdot Q_X)(y)^{\frac{1}{\gamma }} P_{X,Uni}(x)^{\frac{1}{\gamma }}, \end{aligned}$$

(111)

where $\kappa _{W_{Y|X}}^{1}[Q_X]$ is the normalizer. Then, after a complicated calculation, the marginal distribution of its projection to $\mathcal{M}_a$ is given as

$$\begin{aligned}&\int _{\mathcal{Y}} \varGamma _{\mathcal{M}_a}^{(e)}[\varPhi [W_{Y|X} \times Q_X]](x,y)\mu (dy)\nonumber \\&\quad =\frac{1}{\kappa _{W_{Y|X}}^{2}[Q_X]} Q_X(x)^{1-\frac{1}{\gamma }} \exp ( -\frac{1}{\gamma } D(W_{Y|X=x}\Vert W_{Y|X} \cdot Q_X)) , \end{aligned}$$

(112)

where $\kappa _{W_{Y|X}}^{2}[Q_X]$ is the normalizer. In the algorithm, we update $P_X^{(t+1)}$ as $P_X^{(t+1)}(x):= \int _{\mathcal{Y}} \varGamma _{\mathcal{M}_a}^{(e)}[\varPhi [W_{Y|X} \times P_X^{(t)}]] (x,y)\mu (dy) $.

5.2 Direct application

The update formula (112) requires a complicated calculation, we can derive the same update rule by a simpler derivation as follows. The commitment capacity is rewritten as

$$\begin{aligned} -C_c(W_{Y|X})&=\min _{P_X} I(P_X,W_{Y|X})-H(X)_{P_{X}} \nonumber \\&=\min _{P_X} \sum _{x\in \mathcal{X}}P_X(x) (D(W_{Y|X=x}\Vert W_{Y|X}\cdot P_X )+\log P_X(x)). \end{aligned}$$

(113)

We choose $\mathcal{M}_a$ and $\varPsi $ as $\mathcal{P}(\mathcal{X})$ and

$$\begin{aligned} \varPsi _{c,W_{Y|X}}[P_X](x):= D(W_{Y|X=x}\Vert W_{Y|X}\cdot P_X )+\log P_X(x). \end{aligned}$$

(114)

Then, we have

$$\begin{aligned}&\sum _{x \in \mathcal{X}}P_X(x)( \varPsi [P_X](x)- \varPsi [Q_X](x))\nonumber \\&\quad = D(P_X\Vert Q_X)-D(W_{Y|X}\cdot P_X\Vert W_{Y|X}\cdot Q_X) \ge 0 \end{aligned}$$

(115)

and

$$\begin{aligned} D(P_X\Vert Q_X) \ge D(P_X\Vert Q_X)- D(W_{Y|X}\cdot P_X\Vert W_{Y|X}\cdot Q_X) . \end{aligned}$$

(116)

Since the condition (A1) with $\gamma =1$ and the condition (A2) hold, Algorithm 1 converges with $\gamma =1$. In this case, $\varPhi $ is given as

$$\begin{aligned}&\varPhi [Q_X](x) \nonumber \\&\quad =\frac{1}{\kappa _{W_{Y|X}}^3[Q_X]}Q_X(x) \exp ( - \frac{1}{\gamma } (\log Q_X(x)+ D(W_{Y|X=x}\Vert W_{Y|X} \cdot Q_X))) \nonumber \\&\quad =\frac{1}{\kappa _{W_{Y|X}}^3[Q_X]}Q_X(x)^{1-\frac{1}{\gamma }} \exp ( -\frac{1}{\gamma } D(W_{Y|X=x}\Vert W_{Y|X} \cdot Q_X)) , \end{aligned}$$

(117)

where the normalizing constant $\kappa _{W_{Y|X}}^3[Q_X]$ is given as $\kappa _{W_{Y|X}}^3[Q_X]:=$

$\sum _{x \in \mathcal{X}} Q_X(x)^{1-\frac{1}{\gamma }} \exp ( - \frac{1}{\gamma } D(W_{Y|X=x}\Vert W_{Y|X} \cdot Q_X))$. Since $\varPhi [Q_X] \in \mathcal{M}_a$, $P_X^{(t+1)}$ is given as $\varPhi [P_X^{(t)}]$.

To consider the effect of the acceleration parameter $\gamma $, we made a numerical comparison when the channel with $\mathcal{X}=\{1,2,3,4\}$ and $\mathcal{Y}=\{1,2,3,4\}$ is given as follows.

$$\begin{aligned}&W_{Y|X}(1,1)=0.6, W_{Y|X}(2,1)=0.2, W_{Y|X}(3,1)=0.1, W_{Y|X}(4,1)=0.1, \nonumber \\&W_{Y|X}(1,2)=0.1, W_{Y|X}(2,2)=0.2, W_{Y|X}(3,2)=0.1, W_{Y|X}(4,2)=0.6,\nonumber \\&W_{Y|X}(1,3)=0.1, W_{Y|X}(2,3)=0.2, W_{Y|X}(3,3)=0.15, W_{Y|X}(4,3)=0.55,\nonumber \\&W_{Y|X}(1,4)=0.05, W_{Y|X}(2,4)=0.85, W_{Y|X}(3,4)=0.05, W_{Y|X}(4,4)=0.05 . \end{aligned}$$

(118)

We choose $\gamma $ to be 1, 0.95, and 0.9. Figure 1 shows the numerical result for the iteration of our algorithm when the channel input is limited into $\{1,2,3\}$. A smaller $\gamma $ does not improve the convergence in this case. Figure 2 shows the same numerical result when all elements of $\{1,2,3,4\}$ are allowed as the channel input. In this case, a smaller $\gamma $ improves the convergence.

6 Reverse em problem

6.1 General problem description

In this section, given a pair of an exponential family $\mathcal{E}$ and a mixture family $\mathcal{M}_a$ on $\mathcal{X}$, we consider the following maximization;

$$\begin{aligned}&\max _{P\in \mathcal{M}_a}\min _{Q\in \mathcal{E}} D(P\Vert Q)= \max _{P\in \mathcal{M}_a} D(P\Vert \varGamma _\mathcal{E}^{(m)}[P]) \nonumber \\&\quad = \max _{P\in \mathcal{M}_a} \sum _{x \in \mathcal{X}}P(x) (\log P(x) - \log \varGamma _\mathcal{E}^{(m)}[P](x)) \end{aligned}$$

(119)

while Sect. 4 considers the minimization of the same value. When $\mathcal{M}_a$ is given as (107) and $\mathcal{E}$ is given as $\mathcal{P}(\mathcal{X})\times \mathcal{P}(\mathcal{Y})$, this problem coincides with the channel capacity (58). This problem was firstly studied for the channel capacity in [20], and was discussed with a general form in [21]. To discuss this problem, we choose the function $\varPsi $ as $\varPsi _\mathrm{{rem}}:=-\varPsi _\mathrm{{em}}$, and apply the discussion in Sect. 2. Due to (99), (24) in the condition (A1) is written as

$$\begin{aligned} (\gamma +1) D (P^0\Vert Q) \ge D(\varGamma _\mathcal{E}^{(m)}[P^0] \Vert \varGamma _\mathcal{E}^{(m)}[Q]), \end{aligned}$$

(120)

and (25) in the condition (A2) is written as

$$\begin{aligned} D(\varGamma _\mathcal{E}^{(m)}[P^0] \Vert \varGamma _\mathcal{E}^{(m)}[Q]) \ge D (P^0\Vert Q) . \end{aligned}$$

(121)

Further, due to Lemma 4, the condition (A4) holds.

6.2 Application to wiretap channel

Now, we apply this problem setting to wiretap channel with the degraded case discussed in Sect. 3.4.2. We choose $\mathcal{M}_a$ as $\{ W_{YZ|X}\times P_X | P_X\in \mathcal{P}(\mathcal{X}) \}$ and $\mathcal{E}$ as the set of distributions with the Markov chain condition $X-Z-Y$ [42]. Then, the conditional mutual information $I(X;Y|Z)[P_X,W_{YZ|X}]$ is given as $D( W_{YZ|X}\times P_X\Vert \varGamma _{\mathcal{E}}^{(m)} [W_{YZ|X}\times P_X])$. In this application, we have

$$\begin{aligned} D_{\varPsi _\mathrm{{rem}}}(W_{YZ|X}\times P_X\Vert W_{YZ|X}\times Q_X)&= D_{\varPsi _{W_{YZ|X}}}D( P_X \Vert Q_X) \end{aligned}$$

(122)

$$\begin{aligned} D( W_{YZ|X}\times P_X\Vert W_{YZ|X}\times Q_X)&=D( P_X \Vert Q_X). \end{aligned}$$

(123)

To check the conditions (A1) and (A2) for $\varPsi _{W_{YZ|X}}$, it is sufficient to check them for $\varPsi _\mathrm{{rem}}$ in this application. Since we have

$$\begin{aligned}&D( \varGamma _{\mathcal{E}}^{(m)} [W_{YZ|X}\times P_X]\Vert \varGamma _{\mathcal{E}}^{(m)} [W_{YZ|X}\times Q_X]) \nonumber \\&\quad = D( W_{Z|X} \times P_{X} \Vert Q_{XZ}) + D( W_{YZ|X}\cdot P_X \Vert W_{YZ|X}\cdot Q_X)\nonumber \\&\qquad - D( W_{Z|X}\cdot P_X \Vert W_{Z|X}\cdot Q_X) \nonumber \\&\quad = D( P_X \Vert Q_X) + D( W_{YZ|X}\cdot P_X \Vert W_{YZ|X}\cdot Q_X) \nonumber \\&\qquad - D( W_{Z|X}\cdot P_X \Vert W_{Z|X}\cdot Q_X) \nonumber \\&\quad \le 2D( P_X \Vert Q_X), \end{aligned}$$

(124)

LHS of (24) in the condition (A1) is written as

$$\begin{aligned}&\gamma D (P_X^0\Vert Q_X) - D( W_{YZ|X}\cdot P_X \Vert W_{YZ|X}\cdot Q_X) \nonumber \\&\qquad + D( W_{Z|X}\cdot P_X \Vert W_{Z|X}\cdot Q_X) \nonumber \\&\quad \ge \gamma D (P^0\Vert Q) - D( P_X^0 \Vert Q_X). \end{aligned}$$

(125)

It is not negative when $\gamma \ge 1$. Also, RHS of (25) in the condition (A2) is written as

$$\begin{aligned} D( W_{YZ|X}\cdot P_X \Vert W_{YZ|X}\cdot Q_X) - D( W_{Z|X}\cdot P_X \Vert W_{Z|X}\cdot Q_X) \ge 0. \end{aligned}$$

(126)

Hence, the conditions (A1) and (A2) hold with $\gamma \ge 1$.

7 Information bottleneck

As a method for information-theoretical machine learning, we often consider information bottleneck [43]. Consider two correlated systems $\mathcal{X}$ and $\mathcal{Y}$ and a joint distribution $P_{XY}$ over $\mathcal{X} \times \mathcal{Y}$. The task is to extract an essential information from the space $\mathcal{X}$ to $\mathcal{T}$ with respect to $\mathcal{Y}$. Here, we discuss a generalized problem setting proposed in [44]. For this information extraction, given parameters $\alpha \in [0,1]$ and $\beta \ge \alpha $, we choose a conditional distribution $P_{T|X}^*$ as

$$\begin{aligned} P_{T|X}^*:= \mathop {\textrm{argmin}}\limits _{P_{T|X}} \alpha I(T;X)+(1-\alpha )H(T)-\beta I(T;Y). \end{aligned}$$

(127)

This method is called information bottleneck. To apply our method to this problem, we set $\mathcal{M}_a$ to be $\mathcal{P}(\mathcal{X})$, and define

$$\begin{aligned} {\varPsi }_{\alpha ,\beta }[P_{TX}](t,x)&:= \alpha \log P_{TX}(t,x) -\alpha \log P_{X}(x) +(\beta -1) \log P_{T}(t) \nonumber \\&\quad +\beta \sum _{y\in \mathcal{Y}} P_{Y|X}(y|x) ( \log P_{Y}(y) - \log P_{TY}(t,y) ). \end{aligned}$$

(128)

Then, when the joint distribution $P_{TX}$ is chosen to be $P_{T|X} \times P_X$, the objective function is written as

$$\begin{aligned} \mathcal{G}_{\alpha ,\beta }( P_{TX}):= \sum _{t\in \mathcal{T},x \in \mathcal{X}} P_{TX}(t,x) {\varPsi }_{\alpha ,\beta }[P_{TX}](t,x). \end{aligned}$$

(129)

That is, our problem is reduced to the minimization

$$\begin{aligned} \min _{P_{TX}\in \mathcal{M}(P_X)} \mathcal{G}_{\alpha ,\beta }( P_{TX}), \end{aligned}$$

(130)

where $\mathcal{M}(P_X)$ is the set of joint distributions on $\mathcal{X} \times \mathcal{Y}$ whose marginal distribution on $\mathcal{X}$ is $P_X$. The e-projection $\varGamma ^{(e)}_{\mathcal{M}(P_X)}$ to $\mathcal{M}(P_X)$ is written as

$$\begin{aligned} \varGamma ^{(e)}_{\mathcal{M}(P_X)}[Q_{TX}]= Q_{T|X}\times P_X \end{aligned}$$

(131)

because the relation $D(P_{TX}\Vert Q_{TX})= D(P_{X}\Vert Q_{X})+ D(P_{TX}\Vert Q_{T|X}\times P_X)$ holds for a distribution $P_{TX}\in \mathcal{M}(P_X)$.

When Algorithm 1 is applied to this problem, due to (131), the update rule for the conditional distribution is given as

$$\begin{aligned} P_{T| X}^{(t)}\mapsto P_{T| X}^{(t+1)}(t|x) := \kappa _x P_{T|X}^{(t)}(t|x) \exp ( -\frac{1}{\gamma }\varPsi _{\alpha ,\beta }[ P_{T|X}^{(t)}\times P_X ](t,x)), \end{aligned}$$

(132)

where $\kappa _x$ is a normalizing constant. This update rule is the same as the update rule proposed in Sect. 3 of [45] when the states $\{\rho _{Y|x}\}$ are given as diagonal density matrices, i.e., a conditional distribution $P_{Y|X}$. Also, as shown in [45, (22)], we have the relation

$$\begin{aligned} D_{\varPsi _{\alpha ,\beta }}( P_{TX} \Vert Q_{TX}) \le \alpha D( P_{TX} \Vert Q_{TX}) \end{aligned}$$

(133)

for $P_{TX},Q_{TX} \in \mathcal{M}(P_X)$ as follows. First, we have

$$\begin{aligned} D( P_{T|X} \cdot P_{XY}\Vert Q_{T|X}\cdot P_{XY}) \ge D( P_{T} \Vert Q_{T}), \end{aligned}$$

(134)

where $P_{T|X} \cdot P_{XY}(t,y):= \sum _{x \in \mathcal{X}} P_{T|X}(t|x) P_{XY}(x,y)$. Then, we have

$$\begin{aligned}&D_{\varPsi _{\alpha ,\beta }}( P_{TX} \Vert Q_{TX}) \nonumber \\&\quad = (\beta -1)D( P_{T} \Vert Q_{T}) +\alpha \sum _{x \in \mathcal{X}}P_X(x)D( P_{T|X=x}\Vert Q_{T|X=x})\nonumber \\&\quad -\beta D( P_{T|X} \cdot P_{XY}\Vert Q_{T|X}\cdot P_{XY}) \nonumber \\&\quad \le -D( P_{T} \Vert Q_{T}) +\alpha \sum _{x \in \mathcal{X}}P_X(x)D( P_{T|X=x}\Vert Q_{T|X=x})\nonumber \\&\quad \le \alpha \sum _{x \in \mathcal{X}}P_X(x)D( P_{T|X=x}\Vert Q_{T|X=x}) = \alpha D( P_{TX} \Vert Q_{TX}). \end{aligned}$$

(135)

That is, the condition (A1) holds with $\gamma =\alpha $. Therefore, with $\gamma =\alpha $, Theorem 1 guarantees that Algorithm 1 converges to a local minimum, which was shown as [45, Theorem 3]. This fact shows the importance of the choice of $\gamma $ dependently on the problem setting. That is, it shows the necessity of our problem setting with a general positive real number $\gamma >0$.

The paper [45] also discussed the case when $\mathcal{Y}$ and $\mathcal{T}$ are quantum systems. It numerically compared these algorithms depending on $\gamma $ [45, Fig. 2]. This numerical calculation indicates the following behavior. When $\gamma $ is larger than a certain threshold, a smaller $\gamma $ realizes faster convergence. But, when $\gamma $ is smaller than a certain threshold, the algorithm does not converge.

8 Conclusion

We have proposed iterative algorithms with an acceleration parameter for a general minimization problem over a mixture family. For these algorithms, we have shown convergence theorems in various forms, one of which covers the case with approximated iterations. Then, we have applied our algorithms to various problem settings including the em algorithm and several information theoretical problem settings.

There are two existing studies to numerically evaluate the effect of the acceleration parameter $\gamma $ [19, 45]. They reported improvement in the convergence by modifying the acceleration parameter $\gamma $. For example, in the numerical calculation for information bottleneck in [45, Fig. 2], the case with $\gamma =0.55 $ improves the convergence. Our numerical calculation for the commitment capacity has two cases. In one case, the choices with $\gamma =0.95,0.9$ do not improve the convergence. In another case, the choices with $\gamma =0.95,0.9$ improve the convergence. These facts show that the effect of the acceleration parameter $\gamma $ depends on the parameters of the problem setting. The commitment capacity is considered as a special case of the divergence between a mixture family and an exponential family.

There are several future research directions. The first direction is the evaluation of the convergence speed of Algorithm 3 because we could not derive its evaluation. The second direction is to find various applications of our methods. Although this paper studied several examples, in order to clarify the usefulness of our algorithm, it is needed to find more useful examples for our algorithm. The third direction is the extensions of our results. A typical extension is the extension to the quantum setting [46,47,48]. As a further extension, it is an interesting topic to extend our result to the setting with Bregman divergence. Recently, the Bregman proximal gradient algorithm has been studied for the minimization of a convex function [49,50,51]. Since this algorithm uses Bregman divergence, it might have an interesting relation with the above-extended algorithm. Therefore, it is an interesting study to investigate this relation.

Data availability

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.

Notes

The strong security means that the mutual information between the message and the eavesdropper’s information goes to zero as the number of uses of the channel goes to zero when the message is subject to the uniform distribution. The semantic security means that the maximum of the mutual information by changing the distribution of the message goes to zero as the number of uses of the channel goes to zero.

References

Amari, S.: Information Geometry and Its Applications, Springer Japan (2016)
Amari, S.: Information geometry of the EM and em algorithms for neural networks. Neural Netw. 8(9), 1379–1408 (1995)
Article Google Scholar
Fujimoto, Y., Murata, N.: A modified EM algorithm for mixture models based on Bregman divergence. Ann. Inst. Stat. Math. 59, 3–25 (2007)
Article MathSciNet Google Scholar
Allassonnière, S., Chevallier, J.: A New Class of EM Algorithms. Escaping Local Minima and Handling Intractable Sampling, Computational Statistics & Data Analysis, Elsevier, vol. 159(C), (2019)
Amari, S., Kurata, K., Nagaoka, H.: Information geometry of Boltzmann machines. IEEE Trans. Neural Netw. 3(2), 260–271 (1992)
Article Google Scholar
Amari, S., Nagaoka, H.: Methods of Information Geometry (AMS and Oxford, 2000)
Amari, S.: $\alpha $-Divergence Is Unique, Belonging to Both f-Divergence and Bregman Divergence Classes. IEEE Trans. Inform. Theory 55, 4925–4931 (2009)
Article MathSciNet Google Scholar
Arimoto, S.: An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Trans. Inform. Theory 18(1), 14–20 (1972)
Article MathSciNet Google Scholar
Blahut, R.: Computation of channel capacity and rate-distortion functions. IEEE Trans. Inform. Theory 18(4), 460–473 (1972)
Article MathSciNet Google Scholar
Shannon, C.E.: A Mathematical Theory of Communication, Bell Syst. Tech. J. 27, 379 – 423 and 623 – 656 (1948)
Csiszár, I.: On the computation of rate-distortion functions. IEEE Trans. Inform. Theory 20(1), 122–124 (1974)
Article MathSciNet Google Scholar
Cheng, S., Stankovic, V., Xiong, Z.: Computing the channel capacity and rate-distortion function with two-sided state information. IEEE Trans. Inform. Theory 51(12), 4418–4425 (2005)
Article MathSciNet Google Scholar
Yasui, K., Suko, T., Matsushima, T.: On the Global Convergence Property of Extended Arimoto-Blahut Algorithm, EICE Trans. Fundamentals, J91-A(9), 846-860, (2008). (In Japanese)
Yasui, K., Suko, T., Matsushima, T.: An Algorithm for Computing the Secrecy Capacity of Broadcast Channels with Confidential Messages. In: Proc. 2007 IEEE Int. Symp. Information Theory (ISIT 2007), Nice, France, 24-29 June 2007, pp. 936 – 940
Nagaoka, H.: Algorithms of Arimoto-Blahut type for computing quantum channel capacity. In: Proc. 1998 IEEE Int. Symp. Information Theory (ISIT 1998), Cambridge, MA, USA, 16-21 Aug. (1998), pp. 354
Dupuis, F., Yu, W., Willems, F.: Blahut-Arimoto algorithms for computing channel capacity and rate-distortion with side information. In: Proc. 2014 IEEE Int. Symp. Information Theory (ISIT 2014), Chicago, IL, USA, 27 June-2 July 2004, pp. 179
Sutter, D., Sutter, T., Esfahani, P.M., Renner, R.: Efficient approximation of quantum channel capacities. IEEE Trans. Inform. Theory 62, 578–598 (2016)
Article MathSciNet Google Scholar
Li, H., Cai, N.: A Blahut-Arimoto type algorithm for computing classical-quantum channel capacity. In: Proc. 2019 IEEE Int. Symp. Information Theory (ISIT 2019), Paris, France, 7-12 July (2019), pp. 255–259
Ramakrishnan, N., Iten, R., Scholz, V.B., Berta, M.: Computing quantum channel capacities. IEEE Trans. Inform. Theory 67, 946–960 (2021)
Article MathSciNet Google Scholar
Toyota, S.: Geometry of Arimoto algorithm. Inf. Geom. 3, 183–198 (2020)
Article MathSciNet Google Scholar
Hayashi, M.: Reverse em-problem based on Bregman divergence and its application to classical and quantum information theory, Submitted to Information Geometry; arXiv: 2201.02447 (2022)
Hayashi, M.: Bregman divergence based em algorithm and its application to classical and quantum rate distortion theory. IEEE Trans. Inform. Theory 69, 3460–3492 (2023)
Article MathSciNet Google Scholar
Csiszár, I., Tusnády, G.: Information geometry and alternating minimization procedures. Stat. Decis. Suppl. Issue 1, 205–2377 (1984)
MathSciNet Google Scholar
O’Sullivan, J.A.: Alternating minimization algorithms: From Blahut-Arimoto to expectation-maximization’’. In: Vardy, A. (ed.) Codes, Curves, and Signals, pp. 173–192. Kluwer Academic, Norwell (1998)
Chapter Google Scholar
Gallager, R.G.: Information Theory and Reliable Communication. Wiley, New York (1968)
Google Scholar
Arimoto, S.: On the converse to the coding theorem for discrete memoryless channels. IEEE Trans. Inform. Theory 19, 357–359 (1973)
Article MathSciNet Google Scholar
Hayashi, M.: Quantum wiretap channel with non-uniform random number and its exponent and equivocation rate of leaked information. IEEE Trans. Inform. Theory 61(10), 5595–5622 (2015)
Article MathSciNet Google Scholar
Wyner, A.D.: The wire-tap channel. Bell. Syst. Tech. J. 54, 1355–1387 (1975)
Article MathSciNet Google Scholar
Csiszár, I., Körner, J.: Broadcast channels with confidential messages. IEEE Trans. Inform. Theory 24(3), 339–348 (1978)
Article MathSciNet Google Scholar
Csiszár, I.: Almost independence and secrecy capacity. Probl. Inf. Transm. 32(1), 40–47 (1996)
Google Scholar
Hayashi, M.: General nonasymptotic and asymptotic formulas in channel resolvability and identification capacity and their application to the wiretap channel. IEEE Trans. Inform. Theory 52(4), 1562–1575 (2006)
Article MathSciNet Google Scholar
Hayashi, M.: Exponential decreasing rate of leaked information in universal random privacy amplification. IEEE Trans. Inform. Theory 57(6), 3989–4001 (2011)
Article MathSciNet Google Scholar
Bellare, M., Tessaro, S., Vardy, A.: Semantic security for the wiretap channel. In: Proc. 32nd Annu. Cryptol. Conf. 7417, 294-311 (2012)
Hayashi, M., Matsumoto, R.: Secure Multiplex Coding with Dependent and Non-Uniform Multiple Messages. IEEE Trans. Inform. Theory 62(5), 2355–2409 (2016)
Article MathSciNet Google Scholar
Boyd, S., Vandenberghe, L.: Convex Optimization, Cambridge University Press (2004)
Winter, A., Nascimento, A.C.A., Imai, H.: Commitment Capacity of Discrete Memoryless Channels. In: Proc. 9th IMA International Conferenece on Cryptography and Coding (Cirencester 16-18 December 2003), pp. 35-51, (2003)
Imai, H., Morozov, K., Nascimento, A.C.A., Winter, A.: Commitment Capacity of Discrete Memoryless Channels, https: arXiv:cs/0304014
Imai, H., Morozov, K., Nascimento, A.C.A., Winter, A.: Efficient protocols achieving the commitment capacity of noisy correlations. In: Proc. IEEE International Symposium on Information Theory (ISIT2006), Seattle, Washington, USA July 9 – 14, 1432-1436 (2006)
Hayashi, M., Warsi, N.: Commitment capacity of classical-quantum channels. IEEE Trans. Inform. Theory 69(8), 5083–5099 (2023)
Article MathSciNet Google Scholar
Yamamoto, H., Isami, D.: Multiplex Coding of Bit Commitment Based on a Discrete Memoryless Channel. In: Proc. IEEE ISIT 2007, pp. 721 – 725, June 24-29, (2007)
Hayashi, M.: Secure list decoding and its application to bit-string commitment. IEEE Trans. Inform. Theory 68(6), 3620–3642 (2022)
Article MathSciNet Google Scholar
Toyota, S.: Private communication (2019)
Tishby, N., Pereira, F.C., Bialek. W.: The information bottleneck method, In: The 37th annual Allerton Conference on Communication, Control, and Computing, pages 368 – 377. Univ. Illinois Press, 1999. https://doi.org/10.48550/arXiv.physics/0004057
Strouse, D.J., Schwab, D.J.: The deterministic information Bottleneck. Neural Comput. 29(6), 1611–1630 (2017)
Article MathSciNet Google Scholar
Hayashi, M., Yang, Y.: Efficient algorithms for quantum information bottleneck. Quantum 7, 936 (2023)
Article Google Scholar
Holevo, A.S.: The capacity of the quantum channel with general signal states. IEEE Trans. Inform. Theory 44, 269 (1998)
Article MathSciNet Google Scholar
Schumacher, B., Westmoreland, M.D.: Sending classical information via noisy quantum channels. Phys. Rev. A 56, 131 (1997)
Article Google Scholar
Hayashi, M.: Quantum Information Theory: Mathematical Foundation, Graduate Texts in Physics, Springer-Verlag, (2017)
Chen, G., Teboulle, M.: Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM J. Optim. 3(3), 538–543 (1993)
Article MathSciNet Google Scholar
Teboulle, M.: Convergence of proximal-like algorithms. SIAM J. Optim. 7(4), 1069–1083 (1997)
Article MathSciNet Google Scholar
Zhou, Y., Liang, Y., Shen, L.: A simple convergence analysis of Bregman proximal gradient algorithm. Comput. Optim. Appl. 73(3), 903–912 (2019)
Article MathSciNet Google Scholar
Beck, A.: First-Order Methods in Optimization. SIAM, MOS-SIAM Series on Optimization (2017)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Article MathSciNet Google Scholar
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. Ser. B 140, 125–161 (2013)
Article MathSciNet Google Scholar
Auslender, A., Teboulle, M.: Interior gradient and proximal methods for convex and conic optimization. SIAM J. Optim. 16(3), 697–725 (2006)
Article MathSciNet Google Scholar
Nesterov, Y.: A method for solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Math. - Doklady 27(2), 372–376 (1983)
Google Scholar
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston (2004)
Book Google Scholar
Teboulle, M.: A simplified view of first order methods for optimization. Math. Program. Ser. B 170, 67–96 (2018)
Article MathSciNet Google Scholar

Download references

Acknowledgements

The author was supported in part by the National Natural Science Foundation of China (Grant No. 62171212) and Guangdong Provincial Key Laboratory (Grant No. 2019B121203002). The author is very grateful to Mr. Shoji Toyota for helpful discussions. In addition, he pointed out that the secrecy capacity can be written as the reverse em algorithm in a similar way as the channel capacity [42] under the degraded condition.

Author information

Authors and Affiliations

School of Data Science, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, 518172, Shenzhen, China
Masahito Hayashi
International Quantum Academy (SIQA), Futian District, Shenzhen, 518048, China
Masahito Hayashi
Graduate School of Mathematics, Nagoya University, Chikusa-ku, 464-8602, Nagoya, Japan
Masahito Hayashi

Authors

Masahito Hayashi
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Masahito Hayashi.

Ethics declarations

Conflict of interest

The author states that there is no Conflict of interest.

Additional information

Communicated by Nihat Ay.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Useful lemma

To show various theorems, we prepare the following lemma.

Lemma 5

For any two distributions $Q,Q'\in \mathcal{M}_a$, we have

$$\begin{aligned}&D(P^{0}\Vert Q )- D(P^{0}\Vert Q') \nonumber \\&\quad =\frac{1}{\gamma }J_\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]],Q) -\frac{1}{\gamma }\mathcal{G}(P^{0}) +\frac{1}{\gamma }D_{\varPsi }(P^{0} \Vert Q) \nonumber \\&\qquad -D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]\Vert Q' )\end{aligned}$$

(136)

$$\begin{aligned}&\quad =\frac{1}{\gamma }\mathcal{G}(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]) -\frac{1}{\gamma }\mathcal{G}(P^{0}) + D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]\Vert Q)\nonumber \\&\qquad -\frac{1}{\gamma } D_{\varPsi }(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]] \Vert Q)\nonumber \\&\qquad +\frac{1}{\gamma } D_{\varPsi }(P^{0} \Vert Q) -D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]\Vert Q' ) . \end{aligned}$$

(137)

In addition, when $\varPsi $ is defined for any distribution in $\mathcal{P}(\mathcal{X})$, the above relations holds for any distribution $Q \in \mathcal{P}(\mathcal{X})$.

Proof

We have

$$\begin{aligned} \mathcal{G}(P^{0})&= \sum _{x \in \mathcal{X}} P^{0}(x) \varPsi [P^{0}](x) =J _\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [P^{0}]],P^{0})\nonumber \\&= \gamma (D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [P^{0}]]\Vert \varPhi [P^{0}]) - \log \kappa [P^{0}]). \end{aligned}$$

(138)

Using (138), we have

$$\begin{aligned}&D(P^{0}\Vert Q)- D(P^{0}\Vert Q') =\sum _{x \in \mathcal{X}} P^{0}(x) (\log Q'(x)- \log Q(x))\nonumber \\&\quad =\sum _{x \in \mathcal{X}} P^{0}(x) \Big ( \log Q'(x) -\log \varPhi [Q](x) +\log \varPhi [Q](x) - \log Q(x) \Big ) \nonumber \\&\quad {\mathop {=}\limits ^{(a)}} D(P^{0}\Vert \varPhi [Q]) -D(P^{0}\Vert Q' ) +\sum _{x \in \mathcal{X}} P^{0}(x) \Big ( -\frac{1}{\gamma } \varPsi [Q](x) -\log \kappa [Q] \Big )\nonumber \\&\quad {\mathop {=}\limits ^{(b)}} D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]\Vert \varPhi [Q]) -D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]\Vert Q' ) \nonumber \\&\qquad -\log \kappa [Q] -\frac{1}{\gamma }\sum _{x \in \mathcal{X}} P^{0}(x) \varPsi [Q](x)\nonumber \\&\quad {\mathop {=}\limits ^{(c)}} \frac{1}{\gamma }J_\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]],Q) -\frac{1}{\gamma }\mathcal{G}(P^{0}) +\frac{1}{\gamma }\sum _{x \in \mathcal{X}} P^{0}(x) \Big (\varPsi [P^{0}](x)-\varPsi [Q](x)\Big )\nonumber \\&\quad -D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]\Vert Q' ) \nonumber \\&\quad {\mathop {=}\limits ^{(d)}} \frac{1}{\gamma }\mathcal{G}(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]) -\frac{1}{\gamma }\mathcal{G}(P^{0})+D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]\Vert Q) \nonumber \\&\quad -\frac{1}{\gamma }\sum _{x \in \mathcal{X}} \varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]](x) ( \varPsi [\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]](x)-\varPsi [Q](x))\nonumber \\&\quad +\frac{1}{\gamma }\sum _{x \in \mathcal{X}} P^{0}(x) \Big (\varPsi [P^{0}](x)-\varPsi [Q](x)\Big ) -D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]\Vert Q' ) , \end{aligned}$$

(139)

where each step is shown as follows. (a) follows from the definition of $\varPhi (Q) $. (c) follows from (10) and (138). (d) follows from (17). (b) follows from the relations

$$\begin{aligned} D(P^{0}\Vert \varPhi [Q])&=D(P^{0}\Vert \varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]) +D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]\Vert \varPhi [Q]) \end{aligned}$$

(140)

$$\begin{aligned} D(P^{0}\Vert Q' )&=D(P^{0}\Vert \varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]] )+D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]\Vert Q' ), \end{aligned}$$

(141)

which are shown by the Pythagorean equation. Therefore, considering the definition of $D_\varPsi (P\Vert Q)$, we obtain (136) and (137). $\square $

Proof of Theorem 2 and Corollary 1

Step 1: This step aims to show the following inequalities by assuming that item (i) does not hold and the conditions (A1) and (A2) hold.

$$\begin{aligned}&D(P^{0}\Vert P^{(t+1)}) \le \delta \end{aligned}$$

(142)

$$\begin{aligned}&D(P^{0}\Vert P^{(t)})- D(P^{0}\Vert P^{(t+1)}) \ge \frac{1}{\gamma }\mathcal{G}(P^{(t+1)}) -\frac{1}{\gamma }\mathcal{G}(P^{0}) \end{aligned}$$

(143)

for $t=1, \ldots , t_0-1$. We show these relations by induction.

For any t, by using the relation $ \varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [{P}^{(t)}]]=P^{(t+1)}$, the application of (137) of Lemma 5 to the case with $Q=P^{(t)}$ and $Q'=P^{(t+1)}$ yields

$$\begin{aligned}&D(P^{0}\Vert P^{(t)})- D(P^{0}\Vert P^{(t+1)}) \nonumber \\&\quad =\frac{1}{\gamma }\mathcal{G}( P^{(t+1)}) -\frac{1}{\gamma }\mathcal{G}(P^{0}) +D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [P^{(t)}]]\Vert P^{(t)})\nonumber \\&\qquad -\frac{1}{\gamma } D_\varPsi ( \varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [P^{(t)}]] \Vert \varPsi [P^{(t)}]) +\frac{1}{\gamma } D_\varPsi (P^{0}\Vert P^{(t)}) . \end{aligned}$$

(144)

First, we show the relations (142) and (143) with $t=1$. Since $D(P^{0}\Vert P^{(1)}) \le \delta $, $P^{(1)}$ belongs to $U(P^{0},\delta )$. Hence, the conditions (A1) and (A2) guarantee the following inequality with $t=1$;

$$\begin{aligned} (\text {RHS of }\, 144) \ge \frac{1}{\gamma }\mathcal{G}(P^{(t+1)}) -\frac{1}{\gamma }\mathcal{G}(P^{0}). \end{aligned}$$

(145)

The combination of (144) and (145) implies (143). Since item (i) does not hold, we have

$$\begin{aligned} \frac{1}{\gamma }\mathcal{G}(P^{(t+1)}) -\frac{1}{\gamma }\mathcal{G}(P^{0})\ge 0. \end{aligned}$$

(146)

The combination of (144), (145), and (146) implies (142).

Next, we show the relations (142) and (143) with $t=t'$ by assuming the relations (142) and (143) with $t=t'-1$. Since the assumption guarantees $D(P^{0}\Vert P^{(t')}) \le \delta $, the conditions (A1) and (A2) guarantee (145) with $t=t'$. We obtain (142) and (143) in the same way as $t=1$.

Step 2: This step aims to show (28) by assuming that item (i) does not hold and the conditions (A1) and (A2) hold. Due to (142), the condition (A1) and Lemmas 1 and 2 guarantee that

$$\begin{aligned} \mathcal{G}(P^{(t+1)}) \le \mathcal{G}(P^{(t)}) . \end{aligned}$$

(147)

We have

$$\begin{aligned}&\frac{t_0}{\gamma } \Big ( \mathcal{G}(P^{(t_0+1)}) - \mathcal{G}(P^{0}) \Big ) {\mathop {\le }\limits ^{(a)}} \frac{1}{\gamma }\sum _{t=1}^{t_0} \mathcal{G}(P^{(t+1)}) -\mathcal{G}(P^{0})\nonumber \\&\quad {\mathop {\le }\limits ^{(b)}} \sum _{t=1}^{t_0} D(P^{0}\Vert P^{(t)})- D(P^{0}\Vert P^{(t+1)})\nonumber \\ {}&\quad = D(P^{0}\Vert P^{(1)})- D(P^{0}\Vert P^{(t_0+1)}) \nonumber \\ {}&\quad \le D(P^{0}\Vert P^{(1)}), \end{aligned}$$

(148)

where (a) and (b) follow from (147) and (143), respectively.

Step 3: This step aims to show item (ii) by assuming the conditions (A0) as well as (A1) and (A2). In the discussion of Step 1, since $D(P^{0}\Vert P^{(t)}) \le \delta $, the condition (A0) guarantees (146). We can show item (ii) with assuming that item (i) does not hold. Hence, we obtain Theorem 2.

Step 4: To show Corollary 1, we apply (144) to the case when $P^0=P^*$ and $P^{(t)}=P^*_i$. Then, we have

$$\begin{aligned} 0=\mathcal{G}(P^*_i)-\mathcal{G}(P^*) +\sum _{x \in \mathcal{X}}P^*(x) (\varPsi [P^*](x)-\varPsi [P^*_i](x)), \end{aligned}$$

(149)

which implies (31).

Proof of Theorem 3

We have already shown that $\{P^{(t)}\}_{t=1}^{t_0+1} \subset U(P^{0},\delta )$ when item (i) does not hold. Hence, in the following, we show only (30) by using (A1), (A3), and $\{P^{(t)}\}_{t=1}^{t_0+1} \subset U(P^{0},\delta )$ when item (i) does not hold.

We have

$$\begin{aligned}&J_\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [P^{(t)}]],P^{(t)}) - \mathcal{G}(P^{0}) \end{aligned}$$

(150)

$$\begin{aligned}&\quad {\mathop {=}\limits ^{(a)}}\mathcal{G}( P^{(t+1)}) -\mathcal{G}(P^{0}) +\gamma D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [P^{(t)}]]\Vert P^{(t)})\nonumber \\&\qquad -D_\varPsi (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [P^{(t)}]] \Vert \varPsi [P^{(t)}])\nonumber \\&\quad {\mathop {\ge }\limits ^{(b)}} \mathcal{G}( P^{(t+1)}) -\mathcal{G}(P^{0}) {\mathop {\ge }\limits ^{(c)}} 0, \end{aligned}$$

(151)

where (a) follows from Lemma 2, (b) follows from the condition (A1) and $P^{(t)}\in U(P^{0},\delta )$, and (c) holds because item (i) does not hold.

Since $ \varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [{P}^{(t)}]]=P^{(t+1)}$, the application of (136) of Lemma 5 to the case with $Q=P^{(t)}$ and $Q' =P^{(t+1)}$ yields

$$\begin{aligned}&D(P^{0} \Vert P^{(t)})-D(P^{0} \Vert P^{(t+1)})\nonumber \\&\quad = \frac{1}{\gamma } J_\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [P^{(t)}]],P^{(t)}) - \frac{1}{\gamma }\mathcal{G}(P^{0}) +\frac{1}{\gamma } D_\varPsi (P^{0}\Vert P^{(t)}) \end{aligned}$$

(152)

$$\begin{aligned}&\quad {\mathop {\ge }\limits ^{(a)}} \frac{1}{\gamma } D_\varPsi (P^{0}\Vert P^{(t)}) {\mathop {\ge }\limits ^{(b)}} \frac{\beta }{\gamma }D(P^{0} \Vert P^{(t)}), \end{aligned}$$

(153)

where (a) follows from (151), and (b) follows from (26) in the condition (A3) and $P^{(t)}\in U(P^{0},\delta )$. Hence, we have

$$\begin{aligned} D(P^{0} \Vert P^{(t+1)})\le (1-\frac{\beta }{\gamma })D(P^{0} \Vert P^{(t)}). \end{aligned}$$

(154)

Using the above relations, we have

$$\begin{aligned}&\mathcal{G}(P^{(t+1)}) -\mathcal{G}(P^{0}) {\mathop {\le }\limits ^{(a)}} J_\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [P^{(t)}]],P^{(t)}) -\mathcal{G}(P^{0}) \nonumber \\&\quad {\mathop {=}\limits ^{(b)}} D(P^{0}\Vert P^{(t)})- D(P^{0}\Vert P^{(t+1)}) - \frac{1}{\gamma } D_\varPsi (P^{0}\Vert P^{(t)}) \nonumber \\&\quad {\mathop {\le }\limits ^{(c)}} D(P^{0}\Vert P^{(t)})- D(P^{0}\Vert P^{(t+1)}) - \frac{\beta }{\gamma }D(P^{0}\Vert P^{(t)}) \nonumber \\&\quad {\mathop {\le }\limits ^{(d)}} (1-\frac{\beta }{\gamma }) D(P^{0} \Vert P^{(t)}) {\mathop {\le }\limits ^{(e)}} (1-\frac{\beta }{\gamma })^t D(P^{0} \Vert P^{(1)}), \end{aligned}$$

(155)

where each step is derived as follows. Step (a) follows from (151). Step (b) follows from (152). Step (c) follows from (26) in the condition (A3) and $P^{(t)}\in U(P^{0},\delta )$. Step (d) follows from (153). Step (e) follows from (154). Hence, we obtain (30). Therefore, we have shown item (ii) under the conditions (A1) and (A3) when item (i) does not hold.

When (A0) holds in addition to (A1) and (A3), as shown in Step 1 of the proof of Theorem 2, the relation $\{P^{(t)}\}_{t=1}^{t_0+1} \subset U(P^{0},\delta )$ holds. Hence, item (ii) holds.

Proof of Theorem 4

In this proof, we choose $\bar{P}^{(1)}$ to be ${P}^{(1)}$.

Step 1: This step aims to show the inequality (45). We denote the maximizer in (41) by $\theta '$. The condition (41) implies that

$$\begin{aligned} \phi [\bar{P}^{(t)}](\theta )- \sum _{j=1}^k \theta ^j a_j \le \phi [\bar{P}^{(t)}](\theta ')- \sum _{j=1}^k {\theta '}^j a_j +\epsilon _1 . \end{aligned}$$

(156)

The divergence in the exponential family $\{Q_\theta \}$ can be considered as the Bregmann divergence of the potential function $\phi [\bar{P}^{(t)}](\theta )$. For example, for this fact, see [22, Section III-A]. Hence, we have

$$\begin{aligned} D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(t)}]]\Vert \bar{P}^{(t+1)} )&= \phi [\bar{P}^{(t)}](\theta )- \sum _{j=1}^k \theta ^j a_j\nonumber \\&\quad -\Big (\phi [\bar{P}^{(t)}](\theta ')- \sum _{j=1}^k {\theta '}^j a_j\Big ) \nonumber \\ {}&\le \epsilon _1. \end{aligned}$$

(157)

Step 2: This step aims to show Eq. (46) when the following inequality

$$\begin{aligned} \mathcal{G}( {P}^{(t_2)}) -\gamma D(P^{(t_2)} \Vert \bar{P}^{(t_2)}) {\le } \frac{\gamma }{t_1-1} D(P^{0}\Vert \bar{P}^{(1)}) + \epsilon _1 +\mathcal{G}(P^{0}) \end{aligned}$$

(158)

holds. Eq. (46) is shown as follows;

$$\begin{aligned}&\mathcal{G}( {P}_f^{(t_1)})-\mathcal{G}(P^{0}) {\mathop {=}\limits ^{(a)}} \mathcal{G}( {P}^{(t_2)})-\mathcal{G}(P^{0}) \nonumber \\ {}&\quad {\mathop {\le }\limits ^{(b)}} \frac{\gamma }{t_1-1} D(P^{0}\Vert \bar{P}^{(1)}) +\gamma D(P^{(t_2)} \Vert \bar{P}^{(t_2)}) +\epsilon _1 \nonumber \\&\quad {\mathop {\le }\limits ^{(b)}} \frac{\gamma }{t_1-1} D(P^{0}\Vert \bar{P}^{(1)}) +\gamma \epsilon _2+\epsilon _1, \end{aligned}$$

(159)

where Steps (a), (b), and (c) follow from the definition of ${P}_f^{(t_1)}$, (158), and (42), respectively. Therefore, the remaining task is the proof of (158).

Step 3: We choose $t_4 \in [1, t_1-1]$ as the minimum integer $t \in [1, t_1-1]$ to satisfy the following inequality

$$\begin{aligned} \frac{1}{\gamma }J_\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(t)}]],\bar{P}^{(t)}) \le \frac{1}{\gamma }\mathcal{G}(P^{0})+\epsilon _1. \end{aligned}$$

(160)

If no integer $t \in [1, t_1-1]$ satisfies (160), we set $t_4$ to be $t_1$. This step aims to show the following two facts for $t =1, \ldots , t_4-1$. (i) $D(P^{0}\Vert \bar{P}^{(t+1)}) \le \delta $. (ii) The inequality

$$\begin{aligned} D(P^{0}\Vert \bar{P}^{(t)})- D(P^{0}\Vert \bar{P}^{(t+1)})&{\ge } \frac{1}{\gamma }J_\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(t)}]],\bar{P}^{(t)}) \nonumber \\&-\frac{1}{\gamma }\mathcal{G}(P^{0})-\epsilon _1 \end{aligned}$$

(161)

holds. The above two items are shown by induction for t as follows. It is sufficient to show the case when $t_4 \ge 2$.

We show items (i) and (ii) for $t=1$ as follows. The application of Lemma 5 to the case with $Q=\bar{P}^{(1)}$ and $Q' = \bar{P}^{(2)}$ yields

$$\begin{aligned}&D(P^{0}\Vert \bar{P}^{(1)})- D(P^{0}\Vert \bar{P}^{(2)}) \nonumber \\&\quad =\frac{1}{\gamma }J_\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(1)}]],\bar{P}^{(1)}) -\frac{1}{\gamma }\mathcal{G}(P^{0}) \nonumber \\ {}&\qquad +\frac{1}{\gamma } D_\varPsi (P^{0}\Vert P^{(1)}) -D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(1)}]]\Vert \bar{P}^{(2)} )\nonumber \\&\quad {\mathop {\ge }\limits ^{(a)}} \frac{1}{\gamma }J_\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(1)}]],\bar{P}^{(1)}) -\frac{1}{\gamma }\mathcal{G}(P^{0})-\epsilon _1 {\mathop {\ge }\limits ^{(b)}} 0, \end{aligned}$$

(162)

where (a) follows from (A2+) and (45) because the relation $D(P^{0}\Vert \bar{P}^{(1)}) \le \delta $ follows from the assumption of this theorem. (b) follows from the fact that $t=1$ does not satisfy the condition (160). Hence, $ D(P^{0}\Vert \bar{P}^{(2)}) \le D(P^{0}\Vert \bar{P}^{(1)}) \le \delta $.

Assume that items (i) and (ii) hold with $t=t'-1$. Then, the application of Lemma 5 to the case with $Q=\bar{P}^{(t)}$ and $Q' = \bar{P}^{(t+1)}$ yields

$$\begin{aligned}&D(P^{0}\Vert \bar{P}^{(t')})- D(P^{0}\Vert \bar{P}^{(t'+1)}) \nonumber \\&\quad = \frac{1}{\gamma }J_\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(t')}]],\bar{P}^{(t')}) -\frac{1}{\gamma }\mathcal{G}(P^{0}) \nonumber \\ {}&\qquad +\frac{1}{\gamma } D_\varPsi (P^{0}\Vert \bar{P}^{(t)}) -D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(t')}]]\Vert \bar{P}^{(t'+1)} )\nonumber \\&\quad {\mathop {\ge }\limits ^{(a)}} \frac{1}{\gamma }J_\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(t')}]],\bar{P}^{(t')}) -\frac{1}{\gamma }\mathcal{G}(P^{0})-\epsilon _1{\mathop {\ge }\limits ^{(b)}} 0, \end{aligned}$$

(163)

where (a) follows from (A2+) and (45) because the relation $D(P^{0}\Vert \bar{P}^{(t')}) \le \delta $ follows from the assumption of induction. (b) follows from the fact that $t=t'$ does not satisfy the condition (160). Hence, $ D(P^{0}\Vert \bar{P}^{(t'+1)}) \le D(P^{0}\Vert \bar{P}^{(t')}) \le \delta $.

Step 4: This step aims to show the inequality (158) when $t_4 \le t_1-1$, i.e., there exists an integer $t \in [1,t_1-1]$ to satisfy (160).

Pythagorean theorem guarantees

$$\begin{aligned}&D( {P}^{(t_4+1)} \Vert \varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(t_4)}]] ) \nonumber \\&\quad \le D( {P}^{(t_4+1)} \Vert \varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(t_4)}]] ) + D( \varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(t_4)}]] \Vert \bar{P}^{(t_4+1)}) \nonumber \\&\quad = D(P^{(t_4+1)} \Vert \bar{P}^{(t_4+1)}) \le \epsilon _2 . \end{aligned}$$

(164)

Then, we have

$$\begin{aligned}&\mathcal{G}( {P}^{(t_2)}) -\gamma D(P^{(t_2)} \Vert \bar{P}^{(t_2)}) {\mathop {\le }\limits ^{(a)}} \mathcal{G}( {P}^{(t_4+1)}) -\gamma D(P^{(t_4+1)} \Vert \bar{P}^{(t_4+1)})\nonumber \\&\quad {\mathop {\le }\limits ^{(b)}} J_\gamma ({P}^{(t_4+1)},\bar{P}^{(t_4)}) -\gamma D(P^{(t_4+1)} \Vert \bar{P}^{(t_4+1)})\nonumber \\&\quad {\mathop {=}\limits ^{(c)}} J_\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(t_4)}]],\bar{P}^{(t_4)}) +\gamma D( {P}^{(t_4+1)} \Vert \varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(t_4)}]] ) \nonumber \\ {}&\qquad -\gamma D(P^{(t_4+1)} \Vert \bar{P}^{(t_4+1)})\nonumber \\&\quad {\mathop {\le }\limits ^{(d)}} J_\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(t_4)}]],\bar{P}^{(t_4)}) , \end{aligned}$$

(165)

where each step is derived as follows. Step (a) follows from the relation $t_2=$

$ \mathop {\textrm{argmin}}\limits _{t=2, \ldots , t_1} \mathcal{G}(P^{(t)})- \gamma D(P^{(t)} \Vert \bar{P}^{(t)})$. Step (b) follows from Lemma 2 and the condition (A1+) because (164) holds, and the relation $D(P^{0}\Vert \bar{P}^{(t_4)}) \le \delta $ follows from item (i) with $t=t_4-1$ shown in Step 3. Step (c) follows from (12). Step (d) follows from the equation (164).

Combining (165) and (160), we have

$$\begin{aligned} \mathcal{G}( {P}^{(t_2)}) -\gamma D(P^{(t_2)} \Vert \bar{P}^{(t_2)}) \le \epsilon _1 +\mathcal{G}(P^{0}), \end{aligned}$$

(166)

which implies (158).

Step 5: This step aims to show

$$\begin{aligned} J_\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(t_3)}]],\bar{P}^{(t_3)}) -\mathcal{G}(P^{0}) -\epsilon _1 \le \frac{\gamma }{t_1-1} D(P^{0}\Vert \bar{P}^{(1)}) \end{aligned}$$

(167)

under the choice of $t_3:= \mathop {\textrm{argmin}}\limits _{1\le t \le t_1-1} J_\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(t)}]],\bar{P}^{(t)})$ when $t_4 = t_1$, i.e., there exists no integer $t \in [1,t_1-1]$ to satisfy (160).

Using (161), we have

$$\begin{aligned}&\frac{1}{\gamma }J_\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(t_3)}]],\bar{P}^{(t_3)}) -\frac{1}{\gamma }\mathcal{G}(P^{0}) -\epsilon _1\nonumber \\&\quad \le D(P^{0}\Vert \bar{P}^{(t)})- D(P^{0}\Vert \bar{P}^{(t+1)}) \end{aligned}$$

(168)

for $t \le t_1-1$. Taking the sum for (168), we have

$$\begin{aligned}&\frac{1}{\gamma }J_\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(t_3)}]],\bar{P}^{(t_3)}) -\frac{1}{\gamma }\mathcal{G}(P^{0}) -\epsilon _1\nonumber \\&\quad \le \frac{1}{t_1-1}\sum _{t=1}^{t=t_1-1} D(P^{0}\Vert \bar{P}^{(t)})- D(P^{0}\Vert \bar{P}^{(t+1)}) \nonumber \\&\quad = \frac{1}{t_1-1} ( D(P^{0}\Vert \bar{P}^{(1)})- D(P^{0}\Vert \bar{P}^{(t_1)})) \le \frac{1}{t_1-1} D(P^{0}\Vert \bar{P}^{(1)}). \end{aligned}$$

(169)

Therefore, we obtain (167).

Step 6: This step aims to show the inequality (158) when $t_4 = t_1$, i.e., there exists no integer $t \in [1,t_1-1]$ to satisfy (160). We obtain the following inequality

$$\begin{aligned} \mathcal{G}( {P}^{(t_2)}) -\gamma D(P^{(t_2)} \Vert \bar{P}^{(t_2)}) \le J_\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(t_3)}]],\bar{P}^{(t_3)}) \end{aligned}$$

(170)

in the same way as (165) in Step 4 by changing $t_4$ by $t_3$. Combining (170) and (167), we obtain (158).

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Hayashi, M. Iterative minimization algorithm on a mixture family. Info. Geo. (2024). https://doi.org/10.1007/s41884-024-00140-5

Download citation

Received: 09 May 2022
Revised: 10 May 2024
Accepted: 13 July 2024
Published: 13 August 2024
DOI: https://doi.org/10.1007/s41884-024-00140-5

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Iterative minimization algorithm on a mixture family

Abstract

Similar content being viewed by others

Alternating Least Squares in Generalized Linear Models

A Gaussian Mixture Clustering Approach Based on Extremal Optimization

Clustering via finite nonparametric ICA mixture models

1 Introduction

2 General setting

2.1 Algorithm with exact iteration

Lemma 1

Proof

Lemma 2

Proof

Remark 1

Theorem 1

Proof

Theorem 2

Theorem 3

Corollary 1

Lemma 3

Proof of Lemma 3

Remark 2

2.2 Algorithm with approximated iteration

Theorem 4

Remark 3

2.3 Combination of the gradient method and the Algorithm 1

Theorem 5

Proof

Remark 4

3 Application to information theoretical problems

3.1 Channel capacity

Remark 5

3.2 Reliability function in channel coding

3.3 Strong converse exponent in channel coding

3.4 Wiretap channel capacity

3.4.1 General case

3.4.2 Degraded case

3.5 Capacities with cost constraint

4 em problem

Lemma 4

Proof

5 Commitment capacity

5.1 Algorithm based on the em-algorithm problem

5.2 Direct application

6 Reverse em problem

6.1 General problem description

6.2 Application to wiretap channel

7 Information bottleneck

8 Conclusion

Data availability

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Appendices

Useful lemma

Lemma 5

Proof

Proof of Theorem 2 and Corollary 1

Proof of Theorem 3

Proof of Theorem 4

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation