1 Introduction

Optimization over distributions is an important topic in various areas. For example, the minimum divergence between a mixture family and an exponential family has been studied by using the em algorithm in the areas of machine learning and neural networks [1,2,3,4]. The em algorithm is an iterative algorithm to calculate the above minimization, and it is rooted in the study of Boltzmann machines [5]. In particular, the paper [3] formulated the em algorithm within the framework of Bregman divergence [6, 7]. The topic of the em algorithm has been mainly studied in the communities of machine learning, neural networks, and information geometry. As another iterative algorithm, the Arimoto–Blahut algorithm is known as an algorithm to maximize the mutual information by changing the distribution on the input system [8, 9]. This maximization is needed to calculate the channel capacity [10]. This algorithm has been generalized to various settings including rate-distortion theory [9, 11,12,13], the capacity of a wiretap channel [14], and their quantum extensions [15,16,17,18,19]. In particular, the two papers [13, 19] made very useful generalizations that cover various topics in information theory. The Arimoto–Blahut algorithm and its variants have been mainly studied in the community of information theory.

However, only a limited number of studies have discussed the relation between these two topics, the em algorithm and the Arimoto–Blahut algorithm. The papers [23, 24] pointed out that the Arimoto–Blahut algorithm can be considered as an alternating algorithm in a similar way to the EM and em algorithms. Recently, the paper [20] pointed out that the maximization of the mutual information can be regarded as the maximization of the projected divergence to an exponential family by changing an element of a mixture family. The paper [21] generalized this maximization to the framework of Bregman divergence [6, 7] and applied this setting to various problems in information theory. Also, the recent paper [22] applied the em algorithm to rate-distortion theory, a key topic in information theory.

In this paper, we focus on a generalized problem setting proposed in [19], which is given as an optimization over the set of input quantum states. Unlike earlier algorithms, the algorithm of [19] has an acceleration parameter; by tuning this parameter, we can improve the convergence speed under a certain condition. To obtain wider applicability, we extend their problem setting to the minimization over a general mixture family. Although [19] discussed the convergence speed only when there is no local minimizer, our analysis covers the convergence speed to a local minimizer even when several local minimizers exist. Further, since our setting covers a general mixture family as the set of input variables, our method can be applied to the minimum divergence between a mixture family and an exponential family, which is the objective problem of the em algorithm. That is, this paper presents a general algorithm that includes the em algorithm as well as the Arimoto–Blahut algorithm. This type of relation between the em algorithm and the Arimoto–Blahut algorithm is different from the relation pointed out in the papers [23, 24].

There is a possibility that each iteration can be calculated only approximately. To cover such an approximated case, we evaluate the error of our algorithm with approximated iterations. Since the em algorithm has local minimizers in general, it is essential to cover the convergence to a local minimizer. Since our algorithm has the acceleration parameter, our application to the minimum divergence gives a generalization of the em algorithm. Also, our algorithm can be applied to the maximization of the projected divergence to an exponential family by changing an element of a mixture family.

In addition, our algorithm has various applications that were not discussed in the preceding study [19]. In channel coding, the decoding block error probability goes to zero exponentially under proper random coding when the transmission rate is smaller than the capacity [25]. Also, the probability of correct decoding goes to zero exponentially when the transmission rate is greater than the capacity [26]. These exponential rates are written with the optimization of the so-called Gallager function. Recently, the paper [27] showed that the Gallager function can be written as a minimization of the Rényi divergence. Using this fact, we apply our method to these optimizations. Further, we apply our algorithm to the capacity of a wiretap channel. In addition, since our problem setting allows a general mixture family as the range of input, we apply our algorithm to the channel capacity with a cost constraint. Also, we point out that the calculation of the commitment capacity is given as the minimization of the divergence between a mixture family and an exponential family. Hence, we discuss this application as well.

The remaining part of this paper is organized as follows. Section 2 formulates our minimization problem for a general mixture family. Then, we propose several algorithms to solve the minimization problem and derive various convergence theorems, including the case with approximated iterations. The remaining sections apply our algorithm to various examples; in these sections, examples of objective functions are discussed. Section 3 applies our algorithm to various information-theoretic problems. Then, Sect. 4 demonstrates the application to the minimum divergence between a mixture family and an exponential family. Section 5 shows how to apply our algorithm to the commitment capacity. Section 6 discusses the application of our algorithm to the maximization of the projected divergence to an exponential family by changing an element of a mixture family. Section 7 considers the application to the information bottleneck, which is a powerful method for machine learning. Appendices are devoted to the proofs of the theorems presented in Sect. 2.

2 General setting

2.1 Algorithm with exact iteration

We consider a finite sample space \(\mathcal{X}\) and focus on the set \(\mathcal{P}(\mathcal{X})\) of distributions whose support is \(\mathcal{X}\). Using k linearly independent functions \(f_1, \ldots , f_k\) on \(\mathcal{X}\) and constants \(a=(a_1, \ldots , a_k)\), we define the mixture family \(\mathcal{M}_a\) as follows

$$\begin{aligned} \mathcal{M}_a:= \{P \in \mathcal{P}(\mathcal{X})| P[f_i]=a_i \hbox { for } i=1, \ldots , k \}, \end{aligned}$$
(1)

where \(P[f]:= \sum _{x \in \mathcal{X}} P(x)f(x)\). We add \(l-k\) additional linearly independent functions \(f_{k+1}, \ldots , f_l\), where \(l=|\mathcal{X}|-1\), such that the l functions \(f_{1}, \ldots , f_l\) are linearly independent. Then, the distribution P can be parameterized by the mixture parameter \(\eta =(\eta _1, \ldots , \eta _l)\) as \( \eta _i= P[f_i]\). That is, the above distribution is denoted by \(P_\eta \). Then, we denote the e-projection of P to \(\mathcal{M}_a\) by \(\varGamma ^{(e)}_{\mathcal{M}_a}[P]\). That is, \(\varGamma ^{(e)}_{\mathcal{M}_a}[P]\) is defined as follows [1, 2].

$$\begin{aligned} \varGamma ^{(e)}_{\mathcal{M}_a}[P]:= \mathop {\textrm{argmin}}\limits _{ Q \in \mathcal{M}_a} D(Q\Vert P), \end{aligned}$$
(2)

where the Kullback–Leibler divergence \(D(Q\Vert P)\) is defined as

$$\begin{aligned} D(Q\Vert P):= \sum _{x \in \mathcal{X}}Q(x) (\log Q(x)- \log P(x)). \end{aligned}$$
(3)
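As a concrete numerical illustration of the definitions (1)–(3), the following sketch checks membership in a mixture family and evaluates the divergence; the three-point sample space, the single constraint function \(f_1\), and the level \(a_1\) are hypothetical choices for illustration (so \(k=1\)).

```python
import numpy as np

# hypothetical sample space X = {0, 1, 2} with one constraint (k = 1)
f1 = np.array([0.0, 1.0, 2.0])   # constraint function f_1
a1 = 1.0                         # constraint level a_1

def kl(q, p):
    """Kullback-Leibler divergence D(Q||P) of (3)."""
    return float(np.sum(q * (np.log(q) - np.log(p))))

def in_mixture_family(p, tol=1e-9):
    """Membership in M_a of (1): full support and P[f_1] = a_1."""
    return bool(np.all(p > 0) and abs(p @ f1 - a1) < tol)

p = np.array([0.25, 0.5, 0.25])  # P[f_1] = 0.5 + 0.5 = 1.0, so p lies in M_a
```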

Using the e-projection, we have the following equation for any element \(Q \in \mathcal{M}_a\), which is often called the Pythagorean theorem.

$$\begin{aligned} D(Q\Vert P)= D(Q\Vert \varGamma ^{(e)}_{\mathcal{M}_a}[P]) + D(\varGamma ^{(e)}_{\mathcal{M}_a}[P]\Vert P). \end{aligned}$$
(4)

Given a continuous function \(\varPsi \) from \(\mathcal{M}_a\) to the set of functions on \(\mathcal{X}\), we consider the minimization \(\min _{P \in \mathcal{M}_a} \mathcal{G}(P)\);

$$\begin{aligned} \mathcal{G}(P):= \sum _{x \in \mathcal{X}} P(x) \varPsi [P](x). \end{aligned}$$
(5)

This paper aims to find

$$\begin{aligned} \overline{\mathcal{G}}(a):=\min _{P \in \mathcal{M}_a} \mathcal{G}(P), \quad P_{*,a}:=\mathop {\textrm{argmin}}\limits _{P \in \mathcal{M}_a} \mathcal{G}(P). \end{aligned}$$
(6)

For this aim, we propose an iterative algorithm based on a positive real number \(\gamma >0\). Since the formulation (5) is very general, we can choose the function \(\varPsi \) depending on our objective; that is, different choices of \(\varPsi \) lead to different objective functions.

For a distribution \(Q \in \mathcal{P}(\mathcal{X})\), we define the distribution \(\varPhi [Q] \) as

$$\begin{aligned} \varPhi [Q](x):= \frac{1}{\kappa [Q]}Q(x)\exp ( -\frac{1}{\gamma } \varPsi [Q](x)), \end{aligned}$$
(7)

where \(\kappa [Q]\) is the normalization factor \(\sum _{x \in \mathcal{X}} Q(x)\exp ( -\frac{1}{\gamma } \varPsi [Q](x))\). Then, depending on \(\gamma >0\), we propose Algorithm 1. When \(\varPsi [P]\) and the e-projection can be computed, Algorithm 1 is feasible.

Algorithm 1: Minimization of \(\mathcal{G}(P)\)
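Since the algorithm itself appears only as a figure in the source, a minimal sketch may help fix ideas. It assumes the unconstrained case \(\mathcal{M}_a=\mathcal{P}(\mathcal{X})\) (no linear constraints), where the e-projection is the identity and each iteration reduces to \(P^{(t+1)}=\varPhi [P^{(t)}]\); the choice \(\varPsi [P]=\log P-\log Q_0\), for which \(\mathcal{G}(P)=D(P\Vert Q_0)\), is a hypothetical toy objective.

```python
import numpy as np

def algorithm1(psi, p_init, gamma=2.0, iters=200):
    """Iterate the update (7); with no linear constraints the e-projection
    is the identity, so P^{(t+1)} = Phi[P^{(t)}]."""
    p = np.asarray(p_init, dtype=float)
    for _ in range(iters):
        w = p * np.exp(-psi(p) / gamma)
        p = w / w.sum()                  # division by kappa[Q]
    return p

# toy objective: Psi[P](x) = log P(x) - log Q0(x), so G(P) = D(P||Q0),
# whose unique minimizer over the full simplex is Q0 itself
q0 = np.array([0.5, 0.3, 0.2])
psi = lambda p: np.log(p) - np.log(q0)
p_star = algorithm1(psi, np.ones(3) / 3)   # converges to q0
```

With \(\gamma =2\), each iteration moves \(\log P^{(t)}\) halfway toward \(\log Q_0\), so the iterates converge geometrically, consistent with the role of \(\gamma \) as an acceleration parameter.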

Indeed, Algorithm 1 is characterized as the iterative minimization of the following two-variable function, i.e., the extended objective function;

$$\begin{aligned} J_\gamma (P,Q):=\gamma D(P\Vert Q)+\sum _{x \in \mathcal{X}} P(x) \varPsi [Q](x). \end{aligned}$$
(8)

To see this fact, we define

$$\begin{aligned} \mathcal{F}_1[P] := \mathop {\textrm{argmin}}\limits _{Q \in \mathcal{M}_a} J _\gamma (P,Q) ,\quad \mathcal{F}_2[Q] := \mathop {\textrm{argmin}}\limits _{P \in \mathcal{M}_a} J _\gamma (P,Q) . \end{aligned}$$
(9)

Then, \(\mathcal{F}_2[Q]\) is calculated as follows.

Lemma 1

Under the above definitions, for any positive value \(\gamma >0\), we have \(\mathcal{F}_2[Q] =\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]] \), i.e.,

$$\begin{aligned} \min _{P \in \mathcal{M}_a} J_\gamma (P,Q)&= J _\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]],Q) \nonumber \\&= \gamma D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]\Vert \varPhi [Q]) - \gamma \log \kappa [Q] , \end{aligned}$$
(10)
$$\begin{aligned} J _\gamma (P,Q)&=\min _{P' \in \mathcal{M}_a} J _\gamma (P',Q) +\gamma D(P\Vert \varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]) \end{aligned}$$
(11)
$$\begin{aligned}&=J _\gamma (\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]],Q) +\gamma D(P\Vert \varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]). \end{aligned}$$
(12)

Proof

We have the following relations.

$$\begin{aligned} J _\gamma (P,Q)&=\gamma \sum _{x \in \mathcal{X}} P(x) (\log P(x)- \log Q(x) + \frac{1}{\gamma } \varPsi [Q](x)) \nonumber \\&=\gamma \sum _{x \in \mathcal{X}} P(x) (\log P(x)- \log \varPhi [Q](x)- \log \kappa [Q]) \nonumber \\&=\gamma D(P\Vert \varPhi [Q])-\gamma \log \kappa [Q] \nonumber \\&=\gamma D(P\Vert \varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]) +\gamma D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]\Vert \varPhi [Q]) - \gamma \log \kappa [Q] , \qquad \end{aligned}$$
(13)

where the final equation follows from (4). Then, the minimum is given as (10), and it is realized with \(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]\).

Applying (10) to the final line of (13), we obtain (11). Since the minimum in (11) is realized when \(P'=\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]\), we obtain (12). \(\square \)

We calculate \(\mathcal{F}_1[Q]\). For this aim, we define

$$\begin{aligned} D_{\varPsi }(P \Vert Q):= \sum _{x \in \mathcal{X}} P(x) (\varPsi [P](x)- \varPsi [ Q ](x)). \end{aligned}$$
(14)

Lemma 2

Assume that two distributions \(P,Q \in \mathcal{M}_a\) satisfy the following condition,

$$\begin{aligned} D_{\varPsi }(P \Vert Q) \le \gamma D(P\Vert Q). \end{aligned}$$
(15)

Then, we have \(\mathcal{F}_1[P] =P\), i.e.,

$$\begin{aligned} J _\gamma (P,Q)\ge J _\gamma (P,P) . \end{aligned}$$
(16)

Proof

Inequality (15) guarantees that

$$\begin{aligned}&J _\gamma (P,Q)-J _\gamma (P,P) \nonumber \\&\quad = \gamma D(P\Vert Q)-\sum _{x \in \mathcal{X}} P(x) (\varPsi [P](x)- \varPsi [Q](x)) \ge 0 . \end{aligned}$$
(17)

\(\square \)

Remark 1

The preceding study [19] discussed the minimization of a function defined over the set of density matrices, i.e., the set of quantum states. When the function is given as a function only of the diagonal part of the density matrix, the function is given as a function of probability distribution composed of the diagonal part. That is, the preceding study [19] covers the case when the function is optimized over a set of probability distributions as a special case. The obtained result of this paper covers the case when the function is optimized over a mixture family. That is, the preceding study [19] does not consider the case with linear constraints. In this sense, the obtained result of this paper generalizes the above special case of the result of [19], and Algorithm 1 is a generalization of the algorithm given in [19].

Lemma 3.2 of [19] is composed of several statements. The combination of Lemmas 1 and 2 is a generalization of the above special case of [19, Lemma 3.2]. That is, the classical restriction of [19, Lemma 3.2] is equivalent to the combination of Lemmas 1 and 2 without linear constraints.

Due to Lemmas 1 and 2, when all pairs \((P^{(t+1)},P^{(t)})\) satisfy (15), the relations

$$\begin{aligned} \mathcal {G}(P^{(t)})&=J _\gamma (P^{(t)},P^{(t)})\ge J _\gamma (P^{(t+1)},P^{(t)}) \nonumber \\ {}&\ge J _\gamma (P^{(t+1)},P^{(t+1)})= \mathcal {G}(P^{(t+1)}) \end{aligned}$$
(18)

hold under Algorithm 1. In addition, we have the following theorem.

Theorem 1

When all pairs \((P^{(t+1)},P^{(t)})\) satisfy (15), which holds when the positive number \(\gamma \) is sufficiently large, Algorithm 1 converges to a local minimum.

Proof

Since \(\{\mathcal{G}(P^{(t)}) \}\) is monotonically decreasing in t and bounded from below by \(\overline{\mathcal{G}}(a)\), it converges, which implies

$$\begin{aligned} \lim _{t \rightarrow \infty } \mathcal{G}(P^{(t)})-\mathcal{G}(P^{(t+1)}) =0. \end{aligned}$$
(19)

Using (12), we have

$$\begin{aligned} \mathcal{G}(P^{(t)})&=J_{\gamma }(P^{(t)},P^{(t)}) \nonumber \\&= \gamma D(P^{(t)}\Vert P^{(t+1)}) + J_{\gamma }( P^{(t+1)}, P^{(t)}) \nonumber \\&\ge \gamma D(P^{(t)}\Vert P^{(t+1)}) + \mathcal{G}( P^{(t+1)}) . \end{aligned}$$
(20)

Thus, we have

$$\begin{aligned} \gamma D(P^{(t)}\Vert P^{(t+1)}) \le \mathcal{G}(P^{(t)})-\mathcal{G}(P^{(t+1)}) . \end{aligned}$$
(21)

Due to (19) and (21), we have \(D(P^{(t)}\Vert P^{(t+1)}) \rightarrow 0\), which implies the convergence of the sequence \(\{P^{(t)}\}\). \(\quad \square \)

To discuss the details of Algorithm 1, we focus on the \(\delta \)-neighborhood \(U(P^{0},\delta )\) of \(P^{0}\) defined as

$$\begin{aligned} U(P^{0},\delta ):=\{ P \in \mathcal{M}_a | D(P^{0} \Vert P)\le \delta \}. \end{aligned}$$
(22)

In particular, we denote \(\mathcal{M}_a\) by \(U(P^{0},\infty )\). Then, we address the following conditions for the \(\delta \)-neighborhood \(U(P^{0},\delta )\) of \(P^{0}\);

  1. (A0)

    Any distribution \(Q \in U(P^{0},\delta )\) satisfies the inequality

    $$\begin{aligned} \mathcal{G}(\mathcal{F}_2[Q]) \ge \mathcal{G}(P^{0}). \end{aligned}$$
    (23)
  2. (A1)

    Any distribution \(Q \in U(P^{0},\delta )\) satisfies

    $$\begin{aligned} D_{\varPsi }(\mathcal{F}_2[Q] \Vert Q) \le \gamma D(\mathcal{F}_2[Q] \Vert Q) . \end{aligned}$$
    (24)
  3. (A2)

    Any distribution \(Q \in U(P^{0},\delta )\) satisfies

    $$\begin{aligned} D_{\varPsi }(P^{0} \Vert Q) \ge 0. \end{aligned}$$
    (25)
  4. (A3)

    There exists a positive number \(\beta >0\) such that any distribution \(Q \in U(P^{0},\delta )\) satisfies

    $$\begin{aligned} D_{\varPsi }(P^{0} \Vert Q) =\sum _{x \in \mathcal{X}} P^{0}(x) (\varPsi [P^{0}](x)- \varPsi [Q](x)) \ge \beta D(P^{0} \Vert Q). \end{aligned}$$
    (26)

The condition (A3) is a stronger version of (A2).
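To make these conditions concrete, consider the hypothetical choice \(\varPsi [P]=\log P-\log Q_0\) (so \(\mathcal{G}(P)=D(P\Vert Q_0)\)): the \(\log Q_0\) terms cancel in (14), giving \(D_{\varPsi }(P\Vert Q)=D(P\Vert Q)\ge 0\). Hence (A2) and (A3) hold with \(\beta =1\), and (15) holds whenever \(\gamma \ge 1\). A numerical spot-check of this identity:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(P||Q)."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

# hypothetical objective: Psi[P] = log P - log Q0, i.e. G(P) = D(P||Q0)
q0 = np.array([0.5, 0.3, 0.2])
psi = lambda p: np.log(p) - np.log(q0)

def d_psi(p, q):
    """D_Psi(P||Q) as defined in (14)."""
    return float(np.sum(p * (psi(p) - psi(q))))

# for this Psi, D_Psi(P||Q) = D(P||Q) on random pairs of distributions
rng = np.random.default_rng(0)
gaps = [abs(d_psi(p, q) - kl(p, q))
        for p, q in (rng.dirichlet(np.ones(3), 2) for _ in range(100))]
```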

However, the convergence to the global minimum is not guaranteed. As a generalization of [19, Theorem 3.3], the following theorem discusses the convergence to the global minimum and the convergence speed.

Theorem 2

Assume that the \(\delta \)-neighborhood \(U(P^{0},\delta )\) of \(P^{0}\) satisfies the conditions (A1) and (A2) with \(\gamma \), and \(P^{(1)} \in U(P^{0},\delta )\). Then, Algorithm 1 with \(t_0\) iterations has one of the following two behaviors.

  1. (i)

    There exists an integer \(t_1 \le t_0+1\) such that

    $$\begin{aligned} \mathcal{G}(P^{(t_1)}) < \mathcal{G}(P^{0}). \end{aligned}$$
    (27)
  2. (ii)

    Algorithm 1 satisfies the conditions \(\{P^{(t)}\}_{t=1}^{t_0+1} \subset U(P^{0},\delta )\) and

    $$\begin{aligned} \mathcal{G}(P^{(t_0+1)}) -\mathcal{G}(P^{0}) \le \frac{\gamma D(P^{0}\Vert P^{(1)}) }{t_0}. \end{aligned}$$
    (28)

When the condition (A0) holds additionally, Algorithm 1 with \(t_0\) iterations satisfies (ii).

The above theorem is shown in Appendix 10. Now, we choose an element \(P^* \in \mathcal{M}_a\) to satisfy \(\mathcal{G}(P^*)=\min _{P \in \mathcal{M}_a} \mathcal{G}(P)\). Then, the condition (A0) holds with \(U(P^*,\infty )=\mathcal{M}_a\) and the choice \(P^0=P^*\). When the conditions (A1) and (A2) hold with \(U(P^*,\infty )=\mathcal{M}_a\) and the choice \(P^0=P^*\), Theorem 2 guarantees the convergence to the minimizer \(P^*\) in Algorithm 1. Although Theorem 2 requires the conditions (A1) and (A2), the condition (A2) is essential due to the following reason. When we choose \(\gamma >0\) to be sufficiently large, the condition (A1) holds with the \(\delta \)-neighborhood \(U(P^*,\delta )\) of \(P^*\) because \(U(P^*,\delta )\) is a compact set. Hence, it is essential to check the condition (A2) for Theorem 2.

However, as seen in (28), a larger \(\gamma \) makes the convergence slower. Therefore, it is important to choose \(\gamma \) as small as possible while maintaining the condition (A1). Practically, it is better to decrease \(\gamma \) as the point \(P^{(t)}\) approaches the minimizer \(P^*\). In fact, as a generalization of [19, Proposition 3.6], we have the following exponential convergence under a stronger condition dependent on \(\gamma \). In this sense, the parameter \(\gamma \) is called an acceleration parameter [19, Remark 3.4].

Theorem 3

Assume that the \(\delta \)-neighborhood \(U(P^{0},\delta )\) of \(P^{0}\) satisfies the conditions (A1) and (A3) with \(\gamma \), and \(P^{(1)} \in U(P^{0},\delta )\). Then, Algorithm 1 with \(t_0\) iterations has one of the following two behaviors.

  1. (i)

    There exists an integer \(t_1 \le t_0+1\) such that

    $$\begin{aligned} \mathcal{G}(P^{(t_1)}) < \mathcal{G}(P^{0}). \end{aligned}$$
    (29)
  2. (ii)

    Algorithm 1 satisfies the conditions \(\{P^{(t)}\}_{t=1}^{t_0+1} \subset U(P^{0},\delta )\) and

    $$\begin{aligned} \mathcal{G}(P^{(t_0+1)})-\mathcal{G}(P^{0}) \le (1-\frac{\beta }{\gamma })^{t_0} D(P^{0} \Vert P^{(1)}). \end{aligned}$$
    (30)

When the condition (A0) holds additionally, Algorithm 1 with \(t_0\) iterations satisfies (ii).

The above theorem is shown in Appendix 11. Next, we consider the case when there are several local minimizers \(P^*_1, \ldots , P^*_{n} \in \mathcal{M}_a\) while the true minimizer is \(P^*\). These local minimizers are characterized by the following corollary, which is shown in Appendix 10 as a corollary of Theorem 2.

Corollary 1

$$\begin{aligned} D_{\varPsi }(P^* \Vert P^*_i)= \sum _{x \in \mathcal{X}}P^*(x) (\varPsi [P^*](x)-\varPsi [P^*_i](x))= \mathcal{G}(P^*)-\mathcal{G}(P^*_i)<0. \qquad \end{aligned}$$
(31)

Hence, if there exist local minimizers, the condition (A2) does not hold with \(U(P^*,\infty )=\mathcal{M}_a\) and the choice \(P^0=P^*\). In this case, when the \(\delta \)-neighborhood \(U(P^*_i,\delta )\) of \(P^*_i\) satisfies the conditions (A0), (A1), and (A2), Algorithm 1 converges to the local minimizer \(P^*_i\) with the speed (28) except for the case (i). Since \(P^*_i\) is a local minimizer, the \(\delta \)-neighborhood \(U(P^*_i,\delta )\) of \(P^*_i\) satisfies the conditions (A0) and (A1) with sufficiently small \(\delta >0\). When the following condition (A4) holds, as shown below, the \(\delta \)-neighborhood \(U(P^*_i,\delta )\) of \(P^*_i\) satisfies the condition (A2) with sufficiently small \(\delta >0\). That is, when the initial point belongs to the \(\delta \)-neighborhood \(U(P^*_i,\delta )\), Algorithm 1 converges to \(P^*_i\).

  1. (A4)

    The function \(\eta \mapsto \varPsi [P_\eta ](x)\) is differentiable, and the relation

    $$\begin{aligned} \sum _{x \in \mathcal{X}}P_{\eta } (x) \Big (\frac{\partial }{\partial \eta _i}\varPsi [P_\eta ](x)\Big )= 0 \end{aligned}$$
    (32)

    holds for \(i=k+1, \ldots , l\) and \(P_{\eta } \in \mathcal{M}_a\).

Lemma 3

We consider the following two conditions for a convex subset \(\mathcal{K} \subset \mathcal{M}_a\).

  1. (B1)

    The relation

    $$\begin{aligned} D_{\varPsi }(P \Vert Q)=\sum _{x \in \mathcal{X}} P(x) (\varPsi [P](x)-\varPsi [Q](x)) \ge 0 \end{aligned}$$
    (33)

    holds for \(P,Q \in \mathcal{K}\).

  2. (B2)

    \(\mathcal{G}(P)\) is convex for the mixture parameter in \(\mathcal{K}\).

The condition (B1) implies the condition (B2). In addition, when the condition (A4) holds, the condition (B2) implies the condition (B1).

One can consider different choices of the mixture parameter. Any two such parametrizations are converted to each other by an affine transformation, which preserves convexity. Therefore, the condition (B2) does not depend on the choice of mixture parameter.

When the function \(\eta \mapsto \varPsi [P_\eta ](x)\) is twice differentiable and the Hessian of \(\mathcal{G}(P_\eta )\) is strictly positive definite at a local minimizer \( P^*_i\), this function is convex in the \(\delta \)-neighborhood \(U(P^*_i,\delta )\) of \(P^*_i\) for a sufficiently small \(\delta >0\) because, by continuity, the Hessian of \(\mathcal{G}(P_\eta )\) remains strictly positive definite in this neighborhood.

Then, Lemma 3 guarantees the condition (A2) for the \(\delta \)-neighborhood \(U(P^*_i,\delta )\). Algorithm 1 converges to the local minimizer \(P^*_i\) with the speed (28) except for the case (i). The mathematical symbols introduced in Sect. 2.1 are summarized in Table 1.

Table 1 List of mathematical symbols for Sect. 2.1

Proof of Lemma 3

Assume the condition (B1). Then, for \(\lambda \in [0,1]\), we have

$$\begin{aligned} \varphi (\lambda )&:=\lambda \mathcal{G}(P)+(1-\lambda ) \mathcal{G}(Q) - \mathcal{G}(\lambda P+(1-\lambda )Q)\nonumber \\&=\lambda \sum _{x \in \mathcal{X}} P(x) (\varPsi [P](x)- \varPsi [\lambda P+(1-\lambda )Q](x)) \nonumber \\&\quad +(1-\lambda ) \sum _{x \in \mathcal{X}} Q(x) (\varPsi [Q](x)- \varPsi [\lambda P+(1-\lambda )Q](x))\nonumber \\&\ge 0, \end{aligned}$$
(34)

which implies (B2).

Assume the conditions (A4) and (B2). Since \(\varphi (\lambda )\ge 0\) for \(\lambda \in [0,1]\), we have

$$\begin{aligned} 0&\le \frac{d \varphi (\lambda )}{d \lambda }|_{\lambda =0} \nonumber \\&=\mathcal{G}(P)- \mathcal{G}(Q) -\sum _{x \in \mathcal{X}} (P(x)-Q(x))\varPsi [Q](x)\nonumber \\&\quad -\sum _{x \in \mathcal{X}} Q(x) \frac{d \varPsi [\lambda P+(1-\lambda )Q](x)}{d\lambda }|_{\lambda =0}\nonumber \\&{\mathop {=}\limits ^{(a)}}\mathcal{G}(P)- \mathcal{G}(Q) -\sum _{x \in \mathcal{X}} (P(x)-Q(x))\varPsi [Q](x), \end{aligned}$$
(35)

which implies (B1), where (a) follows from the condition (A4). \(\square \)

Remark 2

The preceding study [19, Theorem 3.3 and Proposition 3.6] considers statements similar to Theorems 2 and 3. As mentioned in Remark 1, the preceding study [19] covers the case when \(\mathcal{M}_a\) is given as \(\mathcal{P}(\mathcal{X})\), and does not cover the case with a general mixture family \(\mathcal{M}_a\). In addition, the preceding study [19, Theorem 3.3 and Proposition 3.6] covers only the case when \(P^0\) and \(U(P^{0},\delta )\) are \(P^*\) and \( \mathcal{P}(\mathcal{X})\), respectively. That is, the preceding study does not cover the case with local minimizers. In this sense, Theorems 2 and 3 are more general under the classical setting.

2.2 Algorithm with approximated iteration

In general, it is not easy to calculate the e-projection \(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]\). We consider the case when it is calculated only approximately. There are two methods to calculate the e-projection. One is based on the minimization over the given mixture family, and the other is based on the minimization over the exponential family orthogonal to the mixture family. In the first method, the e-projection \(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]\) is the minimizer of the following minimization;

$$\begin{aligned} \min _{P \in \mathcal{M}_a} D(P\Vert \varPhi [Q]). \end{aligned}$$
(36)

To describe the second method, using the functions \(f_j\) used in (1), we define the exponential family

$$\begin{aligned} Q_{\theta }(x):= \varPhi [Q](x) e^{\sum _{j=1}^k \theta ^j f_j(x)- \phi [Q](\theta )}, \end{aligned}$$
(37)

where

$$\begin{aligned} \phi [Q](\theta ):= \log \sum _{x \in \mathcal{X}}\varPhi [Q](x) e^{\sum _{j=1}^k \theta ^j f_j(x)}. \end{aligned}$$
(38)

The projected element \(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]\) is the unique element of the intersection \( \{Q_\theta \}\cap \mathcal{M}_a\); for this fact, see, e.g., [22, Lemma 3]. Then, the e-projection \(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [Q]]\) is given as the solution of the following equation;

$$\begin{aligned} \frac{\partial \phi [Q]}{\partial \theta ^j }(\theta ) =\sum _{x\in \mathcal{X}}Q_\theta (x) f_j(x)= a_j \end{aligned}$$
(39)

for \(j=1, \ldots , k\). The solution of (39) is given as the minimizer of the following minimization;

$$\begin{aligned} \min _{\theta \in \mathbb {R}^k} \phi [Q](\theta )- \sum _{j=1}^k \theta ^j a_j. \end{aligned}$$
(40)

We discuss the precision of our algorithm when each step in the above minimization has a certain error. Allowing certain errors in the first method, we propose Algorithm 2 instead of Algorithm 1.

However, the first method requires a minimization with the same number of parameters as the original minimization \(\min _{P \in \mathcal{M}_a} \mathcal{G}(P)\). Hence, it is better to employ the second method. In fact, when \(\mathcal{M}_a\) is given as a subset of \(\mathcal{P}(\mathcal{X})\) with one linear constraint, the minimization (40) is a one-parameter convex minimization. Since any one-parameter convex minimization can be performed by the bisection method, which needs \(O(-\log \epsilon )\) iterations [35] to approximate the minimum of the objective function within error \(\epsilon \), the cost of this minimization is much smaller than that of the original minimization \(\min _{P \in \mathcal{M}_a} \mathcal{G}(P)\). To consider an algorithm based on the minimization (40), we assume that \(\varPsi \) is defined on \(\mathcal{P}(\mathcal{X})\). In the multi-parameter case, we can use the gradient method or the accelerated proximal gradient method [53,54,55,56,57,58].
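For the single-constraint case, the second method can be sketched as follows: solve the moment equation (39) for the one-dimensional parameter \(\theta \) by bisection, which is valid because the map \(\theta \mapsto Q_\theta [f_1]\) is nondecreasing (its derivative is the variance of \(f_1\) under \(Q_\theta \)). The reference distribution standing in for \(\varPhi [Q]\), the constraint function, and the level a below are hypothetical.

```python
import numpy as np

def e_projection_1d(phi_q, f, a, lo=-50.0, hi=50.0, tol=1e-10):
    """e-projection onto {P : P[f] = a}: find theta solving (39) by
    bisection, then return the normalized member Q_theta of (37)."""
    def moment(theta):
        w = phi_q * np.exp(theta * f)    # unnormalized Q_theta
        return (w / w.sum()) @ f         # Q_theta[f]
    while hi - lo > tol:                 # O(-log tol) iterations
        mid = 0.5 * (lo + hi)
        if moment(mid) < a:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    w = phi_q * np.exp(theta * f)
    return w / w.sum()

phi_q = np.array([0.6, 0.3, 0.1])        # stands in for Phi[Q]
f = np.array([0.0, 1.0, 2.0])
p = e_projection_1d(phi_q, f, a=1.0)     # satisfies p @ f = 1.0
```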

Algorithm 2: Minimization of \(\mathcal{G}(P)\) with an error in (36)

To consider the convergence of Algorithm 2, we extend the conditions (A1) and (A2). For this aim, we focus on the \(\delta \)-neighborhood \(\bar{U}(P^{0},\delta )\) of \(P^{0} \in \mathcal{M}_a\) defined as

$$\begin{aligned} \bar{U}(P^{0},\delta ):=\{ P \in \mathcal{P}(\mathcal{X}) | D(P^{0} \Vert P)\le \delta \}. \end{aligned}$$
(43)

Then, we introduce the following conditions for the \(\delta \)-neighborhood \(\bar{U}(P^{0},\delta )\) of \(P^{0}\).

  1. (A1+)

    Any distribution \(Q \in \bar{U}(P^{0},\delta )\cap \mathcal{M}_a ={U}(P^{0},\delta )\) satisfies the following condition with a positive real number \(\epsilon _2 >0\). When a distribution \(P \in \mathcal{M}_a\) satisfies \(D( P \Vert \mathcal{F}_2[Q]) \le \epsilon _2\), we have

    $$\begin{aligned} \sum _{x \in \mathcal{X}} P(x) (\varPsi [P](x)- \varPsi [ Q ](x)) \le \gamma D(P \Vert Q). \end{aligned}$$
    (44)
  2. (A2+)

    Any distribution \(Q \in \bar{U}(P^{0},\delta )\) satisfies (25).

The convergence of Algorithm 2 is guaranteed in the following theorem.

Theorem 4

Assume that the \(\delta \)-neighborhood \(\bar{U}(P^{0},\delta )\) of \(P^{0}\) satisfies the conditions (A1+) and (A2+) with positive real numbers \(\gamma >0\) and \(\epsilon _2>0\), and that \(P^{(1)} \in U(P^{0},\delta )\). Then, for a positive real number \(\epsilon _1>0\), Algorithm 2 satisfies the conditions

$$\begin{aligned} D(\varGamma ^{(e)}_{\mathcal{M}_a}[\varPhi [\bar{P}^{(t)}]]\Vert \bar{P}^{(t+1)} )&\le \epsilon _1 \end{aligned}$$
(45)
$$\begin{aligned} \mathcal{G}(P_f^{(t_1)}) -\mathcal{G}(P^*)&\le \frac{\gamma D(P^*\Vert P^{(1)}) }{t_1-1} + \epsilon _1 +\gamma \epsilon _2. \end{aligned}$$
(46)

The above theorem is shown in Appendix 12. We discussed the convergence of Algorithms 1 and 2 under several conditions. When these conditions do not hold, we cannot guarantee global convergence, but the algorithms still achieve a local minimum. Hence, we need to repeat these algorithms with different initial values. The mathematical symbols introduced in Sect. 2.2 are summarized in Table 2.

Table 2 List of mathematical symbols for Sect. 2.2

Remark 3

To address the minimization with a cost constraint, the paper [13] added a linear penalty term to the objective function. However, this method does not guarantee that the obtained result satisfies the required cost constraint. Our method can be applied to any mixture family, including the distribution family with cost constraint(s). Hence, our method can be applied directly without the above modification, although we need to calculate the e-projection. As explained in this subsection, this e-projection can be obtained by a convex minimization whose number of variables equals the number of constraints defining the mixture family. Hence, the e-projection remains feasible as long as the number of constraints is not too large.

2.3 Combination of the gradient method and Algorithm 1

Although we can use the gradient method to calculate (40) for a general mixture family \(\mathcal{M}_a\), we propose another algorithm that combines the gradient method with Algorithm 1 in order to calculate \(\overline{\mathcal{G}}(a):=\min _{P \in \mathcal{M}_a} \mathcal{G}(P)\) with \(a \in \mathbb {R}^k\). This algorithm avoids the calculation of the e-projection \(\varGamma ^{(e)}_{\mathcal{M}_a}\). For simplicity, we assume that the function \(\overline{\mathcal{G}}(a)\) is convex and \(\mathcal{M}_a\) is not empty. Then, we replace \(f_i(x)\) by \(f_i(x)-a_i\), so that \(\overline{\mathcal{G}}(a)\) is changed to \(\overline{\mathcal{G}}(0)\). In the following, we aim to calculate \(\overline{\mathcal{G}}(0)\), and we denote the expectation of a function f under the distribution P by P[f].

Then, we consider the following functions given by the Legendre transform. For \(b=(b^1, \ldots , b^k)\in \mathbb {R}^k\) and \(c=(c^1, \ldots , c^{l-k})\in \mathbb {R}^{l-k}\), we define

$$\begin{aligned} \mathcal{G}_*(b,c)&:= \sup _{P \in \mathcal{P}(\mathcal{X})} \sum _{i=1}^k b^i P[f_i] +\sum _{j=1}^{l-k} c^j P[f_{k+j}]-\mathcal{G}(P) , \end{aligned}$$
(47)

and

$$\begin{aligned} \overline{\mathcal{G}}_*(b)&:=\mathcal{G}_*(b,0) =\sup _{P \in \mathcal{P}(\mathcal{X})} \sum _{i=1}^k b^i P[f_i]-\mathcal{G}(P) = \sup _{a \in \mathbb {R}^k} \sum _{i=1}^k b^i a_i -\overline{\mathcal{G}}(a). \end{aligned}$$
(48)

In the following, we consider the calculation of \(\overline{\mathcal{G}}(0)\) by assuming that the function \(\eta \mapsto \mathcal{G}(P_\eta ) \) is twice continuously differentiable and convex. Since the Legendre transform of \(\overline{\mathcal{G}}_*(b)\) is \(\overline{\mathcal{G}}(a)\) due to the convexity of \(\overline{\mathcal{G}}(a)\), we have \(\sup _{b \in \mathbb {R}^k} \sum _{i=1}^k b^i a_i -\overline{\mathcal{G}}_*(b) =\overline{\mathcal{G}}(a)\). As a special case, we have

$$\begin{aligned} -\inf _{b \in \mathbb {R}^k} \overline{\mathcal{G}}_*(b) =\overline{\mathcal{G}}(0). \end{aligned}$$
(49)

That is, when we find the minimizer \(b_*:= \mathop {\textrm{argmin}}\limits _{b \in \mathbb {R}^k} \overline{\mathcal{G}}_*(b)\), we can calculate \(\overline{\mathcal{G}}(0)\) as \(\overline{\mathcal{G}}(0)= -\big (\sup _{P \in \mathcal{P}(\mathcal{X})} \sum _{i=1}^k b_*^i P[f_i]-\mathcal{G}(P)\big ) =\inf _{P \in \mathcal{P}(\mathcal{X})} \mathcal{G}(P)-\sum _{i=1}^k b_*^i P[f_i] \).

To find \(b_*\), we denote the gradient vector of a function f on \(\mathbb {R}^k\) by \(\nabla f\). That is, \(\nabla f\) is the vector \( (\frac{\partial }{\partial x^1}f ,\ldots , \frac{\partial }{\partial x^k}f )\). Then, we choose a real number L that is larger than the matrix norm of the Hessian of \(\overline{\mathcal{G}}_*\), which implies the uniform Lipschitz condition;

$$\begin{aligned} \Vert \nabla \overline{\mathcal{G}}_*(b)- \nabla \overline{\mathcal{G}}_*(b')\Vert \le L \Vert b-b'\Vert . \end{aligned}$$
(50)

Then, we apply the following update rule for the minimization of \(\overline{\mathcal{G}}_*(b)\);

$$\begin{aligned} b_{t+1}:= b_t -\frac{1}{L}\nabla \overline{\mathcal{G}}_*(b_t). \end{aligned}$$
(51)

The following precision guarantee holds [52, Chapter 10], [53, 54];

$$\begin{aligned} |\overline{\mathcal{G}}_*(b_t)- \overline{\mathcal{G}}_*(b_*)| \le \frac{L}{2t} \Vert b_*-b_0\Vert ^2. \end{aligned}$$
(52)

We notice that

$$\begin{aligned} \nabla \overline{\mathcal{G}}_*(b)= \mathop {\textrm{argmax}}\limits _{a \in \mathbb {R}^k} \sum _{i=1}^k b^i a_i -\overline{\mathcal{G}}(a)= (Q_b[f_i])_{i=1}^k, \end{aligned}$$
(53)

where

$$\begin{aligned} Q_b&:=\mathop {\textrm{argmax}}\limits _{P \in \mathcal{P}(\mathcal{X})} \sum _{i=1}^k b^i P[f_i]-\mathcal{G}(P) \nonumber \\&=\mathop {\textrm{argmin}}\limits _{P \in \mathcal{P}(\mathcal{X})} \sum _{x \in \mathcal{X}}P(x) \Big (\varPsi [P](x) -\sum _{i=1}^k b^i f_i(x)\Big ) . \end{aligned}$$
(54)

However, the calculation of (54) is computationally expensive. Hence, replacing the update rule (51) with a one-step iteration of Algorithm 1, we propose another algorithm.

Using \(\varPhi ^{b}[Q](x):= \frac{1}{\kappa }Q(x)\exp \Big ( -\frac{1}{\gamma } \Big (\varPsi [Q](x) -\sum _{i=1}^k b^i f_i(x)\Big )\Big )\) with the normalizing constant \(\kappa \), we propose Algorithm 3.

Algorithm 3: Minimization of \(\mathcal{G}(P)\)

It is not easy to evaluate the convergence speed of Algorithm 3. However, when it converges, the convergence point is the true minimizer.
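To make the structure of Algorithm 3 concrete, the following is a minimal numerical sketch under illustrative assumptions that are ours, not the paper's: we take \(\mathcal{G}(P)=D(P\Vert Q_0)\) for a fixed distribution \(Q_0\), so that \(\varPsi [P](x)=\log P(x)-\log Q_0(x)\) as in Sect. 4, with a single constraint function f and target value 0, and we approximate \(\nabla \overline{\mathcal{G}}_*(b)\) in (53) by the current iterate's expectation \(P[f]\).

```python
import numpy as np

# Minimal sketch of Algorithm 3 under illustrative assumptions (ours, not the
# paper's): G(P) = D(P || Q0), so Psi[P](x) = log P(x) - log Q0(x) as in
# Sect. 4, with a single constraint function f and target value 0.
def algorithm3(Q0, f, gamma=1.0, L=5.0, iters=3000):
    P = np.full_like(Q0, 1.0 / len(Q0))
    b = 0.0
    for _ in range(iters):
        b -= (1.0 / L) * float(P @ f)           # gradient step (51), grad ~ P[f]
        psi = np.log(P) - np.log(Q0)            # Psi[P] for G(P) = D(P || Q0)
        P = P * np.exp(-(psi - b * f) / gamma)  # one step of the map Phi^b
        P /= P.sum()
    return b, P

Q0 = np.array([0.7, 0.2, 0.1])
f = np.array([1.0, 0.0, -1.0])
b, P = algorithm3(Q0, f)
```

In this example \(\overline{\mathcal{G}}_*(b)=\log \sum _x Q_0(x)e^{b f(x)}\), and at a convergence point \(P[f]=0\) as stated in Theorem 5; with \(\gamma =1\) the fixed point of \(\varPhi ^b\) is \(Q_0(x)e^{bf(x)}\) up to normalization.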

Theorem 5

When the pair (b, P) is a convergence point, we have \(b=b_*\) and \(P=P_*\).

Proof

Since the pair (b, P) is a convergence point, we have \(P=\varPhi ^b[P]\), which implies

$$\begin{aligned} \sum _{i=1}^k b^i P[f_i]-\mathcal{G}(P) =\sup _{P' \in \mathcal{P}(\mathcal{X})} \sum _{i=1}^k b^i P'[f_i]-\mathcal{G}(P') =\overline{\mathcal{G}}_*(b). \end{aligned}$$
(55)

Since the pair (b, P) is a convergence point, we have \( P[f_i] =0\) for \(i=1, \ldots , k\), i.e., the distribution P satisfies the required condition in (1). The relation (53) implies \(\nabla \overline{\mathcal{G}}_*(b)=0\). Hence, (49) yields \(\overline{\mathcal{G}}_*(b)=\overline{\mathcal{G}}(0)\), which implies \(b=b_*\). Therefore, the relation (55) is rewritten as \(\mathcal{G}(P) =\overline{\mathcal{G}}(0)\), which implies \(P=P_*\). \(\square \)

Remark 4

We compare our algorithm with a general algorithm proposed in [13]. The input of the objective function in [13] forms a mixture family. The function f given in [13, (6)] satisfies the condition of \(\mathcal{G}\) by considering the second line of [13, (6)] as \(\varPsi \). Their algorithm is the same as Algorithm 1 with \(\gamma =1\) when there is no constraint because their extended objective function g defined in [13, (16)] can be considered as \(D(P\Vert Q)+ \sum _{x \in \mathcal{X}}P(x) \varPsi [Q](x)\), where the choice of q in [13] corresponds to the choice of P and the choice of \(Q_1,\ldots , Q_K\) in [13] corresponds to the choice of Q.

Also, we can show that the function f given in [13, (6)] satisfies the condition (A4). Since the condition (A4) holds, the convexity of f is equivalent to the condition (B1). This equivalence, in this case, was shown as [13, Proposition 4.1]. They showed the convergence of their algorithm as [13, Theorem 4.1], which can be considered as a special case of our Theorem 2.

However, our treatment of the constraint is different from theirs. They consider the minimization \(\min _{P \in \mathcal{P}(\mathcal{X})} \mathcal{G}(P)-\sum _{i=1}^k b^i P[f_i] \) without updating the parameter b. Hence, their algorithm cannot achieve the minimum with the desired constraint, while Algorithms 1, 2, and 3 achieve the minimum with the desired constraint. Although their algorithm is similar to Algorithm 3, Algorithm 3 updates the parameter b to achieve the minimum with the desired constraint.

3 Application to information theoretical problems

3.1 Channel capacity

In the same way as the reference [19], we apply our problem setting to channel coding. A channel is given as a conditional distribution \(W_{Y|X}\) on the sample space \(\mathcal{Y}\) conditioned on the sample space \(\mathcal{X}\), where \(\mathcal{Y}\) is a general sample space with a measure \(\mu \) and \(\mathcal{X}\) is a finite sample space. For two distributions \(P_Y\) and \(Q_Y\) on \(\mathcal{Y}\) that are absolutely continuous with respect to \(\mu \), the Kullback–Leibler divergence \(D(P_Y\Vert Q_Y)\) is given as

$$\begin{aligned} D(P_Y\Vert Q_Y):= \int _{\mathcal{Y}}p_Y(y) (\log p_Y(y)-\log q_Y(y))\mu (dy), \end{aligned}$$
(56)

where \(p_Y\) and \(q_Y\) are the probability density functions of \(P_Y\) and \(Q_Y\) with respect to \(\mu \). This quantity is generalized to the Rényi divergence of order \(\alpha >0\) as

$$\begin{aligned} D_\alpha (P_{Y} \Vert Q_Y):= \frac{1}{\alpha -1}\log \int _{\mathcal{Y}} \Big (\frac{p_Y(y)}{q_Y(y)}\Big )^{\alpha -1}p_Y(y) \mu (dy). \end{aligned}$$
(57)

The channel capacity \(C(W_{Y|X})\) is given as the maximization of the mutual information \(I(P_X,W_{Y|X})\) as [10]

$$\begin{aligned} C(W_{Y|X})&:=\max _{P_X} I(P_X,W_{Y|X}) \end{aligned}$$
(58)
$$\begin{aligned} I(P_X,W_{Y|X})&:=\sum _{x \in \mathcal{X}}P_X(x) D(W_{Y|X=x}\Vert W_{Y|X} \cdot P_X) \nonumber \\&= D(W_{Y|X} \times P_X \Vert (W_{Y|X} \cdot P_X) \times P_X), \end{aligned}$$
(59)

where \(W_{Y|X} \cdot P_X\) and \(W_{Y|X} \times P_X\) are defined as the following probability density functions \(w_{Y|X} \cdot P_X\) and \(w_{Y|X} \times P_X\);

$$\begin{aligned} (w_{Y|X} \cdot P_X)(y)&:= \sum _{x \in \mathcal{X}}P_X(x) w_{Y|X=x}(y) \end{aligned}$$
(60)
$$\begin{aligned} (w_{Y|X} \times P_X)(x, y)&:= P_X(x) w_{Y|X=x}(y) . \end{aligned}$$
(61)

The mutual information \(I(P_X,W_{Y|X})\) also has the following form;

$$\begin{aligned} I(P_X, W_{Y|X})= \min _{Q_Y} \sum _{x \in \mathcal{X}}P_X(x) D(W_{Y|X=x}\Vert Q_{Y}). \end{aligned}$$
(62)

When we choose \(\mathcal{M}_a\) and \(\varPsi \) as \(\mathcal{P}(\mathcal{X})\) and

$$\begin{aligned} \varPsi _{W_{Y|X}}[P_X](x):= - D(W_{Y|X=x}\Vert W_{Y|X} \cdot P_X), \end{aligned}$$
(63)

\(-I(P_X,W_{Y|X})\) coincides with \(\mathcal{G}(P_X)\) [19]. Since

$$\begin{aligned} D_{\varPsi }(P_X\Vert Q_X) = D(W_{Y|X} \cdot P_X\Vert W_{Y|X} \cdot Q_X) \ge 0, \end{aligned}$$
(64)

the condition (A2) holds with \(\mathcal{P}(\mathcal{X})\). In addition, since the information processing inequality guarantees that

$$\begin{aligned} D(W_{Y|X} \cdot P_X\Vert W_{Y|X} \cdot Q_X) \le D(P_X\Vert Q_X), \end{aligned}$$
(65)

the condition (A1) holds with \(\gamma =1\) and \(\mathcal{P}(\mathcal{X})\). In this case, \(\varPhi \) is given as

$$\begin{aligned} \varPhi [Q_X](x)=\frac{1}{\kappa _{W_{Y|X}}[Q_X]}Q_X(x) \exp ( \frac{1}{\gamma } D(W_{Y|X=x}\Vert W_{Y|X} \cdot Q_X)), \end{aligned}$$
(66)

where the normalizing constant \(\kappa _{W_{Y|X}}[Q_X]\) is given as

$$\begin{aligned} \kappa _{W_{Y|X}}[Q_X]=\sum _{x \in \mathcal{X}}Q_X(x) \exp ( \frac{1}{\gamma } D(W_{Y|X=x}\Vert W_{Y|X} \cdot Q_X)). \end{aligned}$$
(67)

When \(\gamma =1\), it coincides with the Arimoto–Blahut algorithm [8, 9]. Since \(\varPhi [Q_X] \in \mathcal{P}(\mathcal{X})\), \(P_X^{(t+1)}\) is given as \(\varPhi [P_X^{(t)}]\).
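As a concrete illustration, the update (66) with \(\gamma =1\), i.e., the classical Arimoto–Blahut iteration, can be implemented in a few lines. The following sketch is our own code; it assumes a channel matrix with strictly positive entries to avoid \(\log 0\), and it computes the capacity of a binary symmetric channel.

```python
import numpy as np

# Sketch of the update (66) with gamma = 1, i.e. the classical Arimoto-Blahut
# iteration P(x) <- P(x) exp(D(W_{Y|X=x} || W.P)) / kappa.
# W[x, y] = w(y|x); all entries are assumed strictly positive here.
def arimoto_blahut(W, iters=500):
    P = np.full(W.shape[0], 1.0 / W.shape[0])
    for _ in range(iters):
        q = P @ W                                        # output law W . P
        D = np.sum(W * (np.log(W) - np.log(q)), axis=1)  # D(W_x || W . P)
        P = P * np.exp(D)
        P /= P.sum()                                     # kappa in (67)
    q = P @ W
    D = np.sum(W * (np.log(W) - np.log(q)), axis=1)
    return float(P @ D), P                               # capacity (nats), argmax

# Binary symmetric channel with crossover probability 0.1:
C, P = arimoto_blahut(np.array([[0.9, 0.1], [0.1, 0.9]]))
```

For this symmetric channel, C equals \(\log 2 - h(0.1)\) in nats and the maximizer P is uniform; for an asymmetric channel the converged iterate satisfies the fixed-point condition that \(D(W_{Y|X=x}\Vert W_{Y|X}\cdot P_X)\) is constant on the support of \(P_X\).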

Remark 5

The reference [19] covers the case when \(\mathcal{M}_a\) is given as \(\mathcal{P}(\mathcal{X})\) and presents the algorithm of this subsection in a more general form. Also, they proposed an adaptive choice of \(\gamma \) for this case [19, (22)] and numerically compared it with the case of \(\gamma =1\) [19, Figs. 1–6]. These comparisons show a significant improvement by their adaptive choice.

3.2 Reliability function in channel coding

In channel coding, we consider the reliability function, which was originally introduced by Gallager [25] and expresses the exponential decreasing rate of an upper bound on the decoding block error probability under random coding. To this end, for \(\alpha > 0 \), we define

$$\begin{aligned} I_{\alpha }(P_X,W_{Y|X}):= \frac{\alpha }{\alpha -1}\log \Big ( \int _{\mathcal{Y}} \Big (\sum _{x \in \mathcal{X}} P_X(x) w_{Y|X=x}(y)^{\alpha }\Big )^{\frac{1}{\alpha }} \mu (dy) \Big ). \end{aligned}$$
(68)

Then, when the code is generated with the random coding based on the distribution \(P_X\), the decoding block error probability with coding rate R is upper bounded by the following quantity;

$$\begin{aligned} e^{n \min _{\rho \in [0,1]} \big (\rho R -\rho I_{\frac{1}{1+\rho }}(P_X,W_{Y|X})\big )} \end{aligned}$$
(69)

when we use the channel \(W_{Y|X}\) n times. Notice that \(e^{-\rho I_{\frac{1}{1+\rho }}(P_X,W_{Y|X})}= \int _{\mathcal{Y}} \Big (\sum _{x \in \mathcal{X}} P_X(x) w_{Y|X=x}(y)^{\frac{1}{1+\rho }}\Big )^{1+\rho } \mu (dy)\). That is, the Gallager function [25] is given as \(\rho I_{\frac{1}{1+\rho }}(P_X,W_{Y|X})\), i.e., the parameter \(\alpha \) is different from the parameter \(\rho \) in the Gallager function. Taking the minimum over the choice of \(P_X\), we have

$$\begin{aligned} \min _{P_X} e^{\min _{\rho \in [0,1]} \big (\rho R -\rho I_{\frac{1}{1+\rho }}(P_X,W_{Y|X})\big )} = \min _{\alpha \in [1/2,1]} \Big (e^{-\frac{\alpha -1}{\alpha } R} \min _{P_X} e^{\frac{\alpha -1}{\alpha } I_{\alpha }(P_X,W_{Y|X})}\Big ) \end{aligned}$$
(70)

with \(\alpha =\frac{1}{1+\rho }\in [1/2,1]\). Therefore, we consider the following minimization;

$$\begin{aligned} \min _{P_X} e^{\frac{\alpha -1}{\alpha } I_{\alpha }(P_X,W_{Y|X})} =\min _{P_X} \int _{\mathcal{Y}} \Big (\sum _{x \in \mathcal{X}} P_X(x) w_{Y|X=x}(y)^{\alpha }\Big )^{\frac{1}{\alpha }} \mu (dy) . \quad \end{aligned}$$
(71)

In the following, we discuss the RHS of (71) with \(\alpha \in [1/2,1]\).

To apply our method, as a generalization of (62), we consider another expression of \(I_{\alpha }(P_X,W_{Y|X})\);

$$\begin{aligned} I_{\alpha }(P_X,W_{Y|X})= \min _{Q_Y} D_\alpha (W_{Y|X}\times P_X \Vert Q_Y\times P_X), \end{aligned}$$
(72)

which was shown in [27, Lemma 2]. Using

$$\begin{aligned} Q_{Y|\alpha ,P_X}&:= \mathop {\textrm{argmin}}\limits _{Q_Y} D_\alpha (W_{Y|X}\times P_X \Vert Q_Y\times P_X) \nonumber \\&=\mathop {\textrm{argmax}}\limits _{Q_Y}\sum _{x\in \mathcal{X}}P_X(x)e^{ (\alpha -1)D_\alpha (W_{Y|X=x} \Vert Q_{Y})}, \end{aligned}$$
(73)

we have

$$\begin{aligned} \Big (\min _{P_X} e^{\frac{\alpha -1}{\alpha } I_{\alpha }(P_X,W_{Y|X})}\Big )^\alpha = \min _{P_X} \sum _{x\in \mathcal{X}}P_X(x)e^{ (\alpha -1) D_\alpha (W_{Y|X=x} \Vert Q_{Y|\alpha ,P_X})}. \end{aligned}$$
(74)

The probability density function \(q_{Y|\alpha ,P_X}\) of the minimizer \(Q_{Y|\alpha ,P_X}\) is calculated as

$$\begin{aligned} q_{Y|\alpha ,P_X}(y)= C \Big (\sum _{x \in \mathcal{X}} P_X(x) w_{Y|X=x}(y)^{\alpha }\Big )^{\frac{1}{\alpha }} , \end{aligned}$$
(75)

where C is the normalizing constant [27, Lemma 2].

To solve the minimization (74), we apply our method to the case when we choose \(\mathcal{M}_a\) and \(\varPsi \) as \(\mathcal{P}(\mathcal{X})\) and

$$\begin{aligned} \varPsi _{\alpha ,W_{Y|X}}[P_X](x):= e^{(\alpha -1) D_\alpha (W_{Y|X=x} \Vert Q_{Y|\alpha ,P_X})}. \end{aligned}$$
(76)

Since (73) guarantees that

$$\begin{aligned}&\sum _{x\in \mathcal{X}}P_X(x) (\varPsi _{\alpha ,W_{Y|X}}[P_X](x) -\varPsi _{\alpha ,W_{Y|X}}[Q_X](x)) \nonumber \\&\quad = \sum _{x\in \mathcal{X}}P_X(x) \Big (e^{ (\alpha -1)D_\alpha (W_{Y|X=x} \Vert Q_{Y|\alpha ,P_X})} - e^{ (\alpha -1)D_\alpha (W_{Y|X=x} \Vert Q_{Y|\alpha ,Q_X})}\Big ) \ge 0, \end{aligned}$$
(77)

the condition (A2) holds with \(\mathcal{M}_a=\mathcal{P}(\mathcal{X})\). The condition (A1) can be satisfied with sufficiently large \(\gamma \). In this case, \(\varPhi \) is given as

$$\begin{aligned} \varPhi _\alpha [Q_X](x)=\frac{1}{\kappa _{\alpha ,W_{Y|X}}[Q_X]}Q_X(x) \exp ( -\frac{1}{\gamma } e^{(\alpha -1) D_\alpha (W_{Y|X=x} \Vert Q_{Y|\alpha ,Q_X})}), \end{aligned}$$
(78)

where the normalizing constant \(\kappa _{\alpha ,W_{Y|X}}[Q_X]\) is given as \(\kappa _{\alpha ,W_{Y|X}}[Q_X]=\sum _{x \in \mathcal{X}}Q_X(x) \exp ( -\frac{1}{\gamma } e^{(\alpha -1) D_\alpha ( W_{Y|X=x} \Vert Q_{Y|\alpha ,Q_X})})\). Since \(\varPhi _\alpha [Q_X] \in \mathcal{M}_a\), \(P_X^{(t+1)}\) is given as \(\varPhi _\alpha [P_X^{(t)}]\).
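The iteration of this subsection can be sketched numerically as follows. This is our own illustrative code with \(\gamma =1\) (the text only guarantees (A1) for sufficiently large \(\gamma \), so \(\gamma =1\) is an optimistic choice here); it uses (75) for the minimizer \(Q_{Y|\alpha ,P_X}\) and a finite output alphabet in place of the integral.

```python
import numpy as np

# Sketch of the iteration (78): by (75), q_{Y|alpha,P} is proportional to
# (sum_x P(x) w(y|x)^alpha)^{1/alpha}, and the function (76) becomes
# Psi(x) = e^{(alpha-1) D_alpha(W_x || q)} = sum_y w(y|x)^alpha q(y)^{1-alpha}.
def minimize_objective(W, alpha, gamma=1.0, iters=3000):
    P = np.full(W.shape[0], 1.0 / W.shape[0])
    Wa = W ** alpha
    for _ in range(iters):
        s = P @ Wa                       # s(y) = sum_x P(x) w(y|x)^alpha
        q = s ** (1.0 / alpha)
        q /= q.sum()                     # the density (75)
        psi = Wa @ (q ** (1.0 - alpha))  # Psi_{alpha,W}[P] of (76)
        P = P * np.exp(-psi / gamma)
        P /= P.sum()
    s = P @ Wa
    return float(np.sum(s ** (1.0 / alpha))), P  # value of the RHS of (71)

W = np.array([[0.9, 0.1], [0.2, 0.8]])
F, P = minimize_objective(W, alpha=0.7)
```

For \(\alpha \in [1/2,1]\) the objective is at most 1 (its value at deterministic inputs), and the converged value can be cross-checked against a brute-force search over input distributions.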

3.3 Strong converse exponent in channel coding

In channel coding, we discuss an upper bound of the probability of correct decoding. This probability is upper bounded by the following quantity;

$$\begin{aligned} \max _{P_X} e^{n \min _{\rho \in [0,1]} \Big (-\rho R +\rho I_{\frac{1}{1-\rho }}(P_X,W_{Y|X})\Big )} \end{aligned}$$
(79)

when we use the channel \(W_{Y|X}\) n times and the coding rate is R [26]. Therefore, we consider the following maximization;

$$\begin{aligned} \max _{P_X} e^{\rho I_{\frac{1}{1-\rho }}(P_X,W_{Y|X})} =\max _{P_X} e^{\frac{\alpha -1}{\alpha } I_{\alpha }(P_X,W_{Y|X})} \end{aligned}$$
(80)

with \(\alpha =\frac{1}{1-\rho }>1\). In the following, we discuss the RHS of (80) with \(\alpha >1\).

To apply our method, we consider another expression (72) of \(I_{\alpha }(P_X,W_{Y|X})\). Using (73), we have

$$\begin{aligned} \Big (\max _{P_X} e^{\frac{\alpha -1}{\alpha } I_{\alpha }(P_X,W_{Y|X})}\Big )^\alpha = \max _{P_X} \sum _{x\in \mathcal{X}}P_X(x)e^{ (\alpha -1) D_\alpha (W_{Y|X=x} \Vert Q_{Y|\alpha ,P_X})}. \end{aligned}$$
(81)

The maximization (81) can be solved by choosing \(\mathcal{M}_a\) and \(\varPsi \) as \(\mathcal{P}(\mathcal{X})\) and

$$\begin{aligned} \varPsi _{\alpha ,W_{Y|X}}[P_X](x):= - e^{(\alpha -1) D_\alpha (W_{Y|X=x} \Vert Q_{Y|\alpha ,P_X})}. \end{aligned}$$
(82)

Since (73) guarantees that

$$\begin{aligned}&\sum _{x\in \mathcal{X}}P_X(x) (\varPsi _{\alpha ,W_{Y|X}}[P_X](x) -\varPsi _{\alpha ,W_{Y|X}}[Q_X](x)) \nonumber \\&\quad = \sum _{x\in \mathcal{X}}P_X(x) \Big (-e^{ (\alpha -1)D_\alpha (W_{Y|X=x} \Vert Q_{Y|\alpha ,P_X})} + e^{ (\alpha -1)D_\alpha (W_{Y|X=x} \Vert Q_{Y|\alpha ,Q_X})}\Big ) \ge 0, \end{aligned}$$
(83)

the condition (A2) holds with \(\mathcal{M}_a=\mathcal{P}(\mathcal{X})\). Similarly, the condition (A1) can be satisfied with sufficiently large \(\gamma \). In this case, \(\varPhi \) is given as

$$\begin{aligned} \varPhi _\alpha [Q_X](x)=\frac{1}{\kappa _{\alpha ,W_{Y|X}}[Q_X]}Q_X(x) \exp ( \frac{1}{\gamma } e^{(\alpha -1) D_\alpha ( W_{Y|X=x} \Vert Q_{Y|\alpha ,Q_X})}), \end{aligned}$$
(84)

where the normalizing constant \(\kappa _{\alpha ,W_{Y|X}}[Q_X]\) is given as \(\kappa _{\alpha ,W_{Y|X}}[Q_X]=\sum _{x \in \mathcal{X}}Q_X(x) \exp ( \frac{1}{\gamma } e^{(\alpha -1) D_\alpha (W_{Y|X=x} \Vert Q_{Y|\alpha ,Q_X})})\). Since \(\varPhi _\alpha [Q_X] \in \mathcal{M}_a\), \(P_X^{(t+1)}\) is given as \(\varPhi _\alpha [P_X^{(t)}]\).

3.4 Wiretap channel capacity

3.4.1 General case

Given a channel \(W_{Y|X}\) from \(\mathcal{X}\) to a legitimate user \(\mathcal{Y}\) and a channel \(W_{Z|X}\) from \(\mathcal{X}\) to a malicious user \(\mathcal{Z}\), the wiretap channel capacity is given as [28, 29]

$$\begin{aligned} C(W_{Y|X},W_{Z|X}):=\max _{P_{VX}} I(P_V, W_{Y|X}\cdot P_{X|V})-I(P_V, W_{Z|X}\cdot P_{X|V}) \end{aligned}$$
(85)

with a sufficiently large discrete set \(\mathcal{V}\). Recent papers showed that the above rate can be achieved even with strong security [30,31,32] and semantic security [33, 34]. Furthermore, the paper [34, Appendix D] showed the above even when the output systems are general continuous systems including Gaussian channels. The wiretap capacity (85) can be calculated via the minimization;

$$\begin{aligned} \min _{P_{VX}} -I(P_V, W_{Y|X}\cdot P_{X|V})+I(P_V, W_{Z|X}\cdot P_{X|V}). \end{aligned}$$
(86)

Here, \(\mathcal{V}\) is an additional discrete sample space. When we choose \(\mathcal{M}_a\) and \(\varPsi \) as \(\mathcal{P}(\mathcal{X} \times \mathcal{V})\) and

$$\begin{aligned}&\varPsi _{W_{Y|X},W_{Z|X}}[P_{VX}](v,x)\nonumber \\&\quad := D(W_{Z|X}\cdot P_{X|V=v}\Vert W_{Z|X}\cdot P_{X} ) -D(W_{Y|X}\cdot P_{X|V=v}\Vert W_{Y|X}\cdot P_{X} ), \end{aligned}$$
(87)

\(-I(P_V, W_{Y|X}\cdot P_{X|V})+I(P_V, W_{Z|X}\cdot P_{X|V})\) coincides with \(\mathcal{G}(P_{VX})\). Here, although \(\varPsi _{W_{Y|X},W_{Z|X}}[P_{VX}]\) is a function of (v, x), its value depends only on v. Hence, the general theory in Sect. 2 can be used for the minimization of (86). In this case, it is difficult to clarify whether the conditions (A1) and (A2) hold in general. \(\varPhi \) is given as

$$\begin{aligned}&\varPhi [Q_{VX}](v,x)\nonumber \\&\quad =\frac{1}{\kappa _{W_{Y|X},W_{Z|X}}[Q_{VX}]}Q_{VX}(v,x) \exp \Big ( \frac{1}{\gamma } \Big ( D(W_{Y|X}\cdot P_{X|V=v}\Vert W_{Y|X}\cdot P_{X} )\nonumber \\&\qquad -D(W_{Z|X}\cdot P_{X|V=v}\Vert W_{Z|X}\cdot P_{X} ) \Big ) \Big ), \end{aligned}$$
(88)

where \(\kappa _{W_{Y|X},W_{Z|X}}[Q_{VX}]\) is the normalizing constant. Since \(\varPhi [Q_{VX}] \in \mathcal{M}_a\), \(P_{VX}^{(t+1)}\) is given as \(\varPhi [P_{VX}^{(t)}]\).

3.4.2 Degraded case

However, when there exists a channel \(W_{Z|Y}\) from \(\mathcal{Y}\) to \(\mathcal{Z}\) such that \( W_{Z|Y} \cdot W_{Y|X}=W_{Z|X}\), i.e., the channel \(W_{Z|X}\) is a degraded channel of \(W_{Y|X}\), we can define the joint channel \(W_{YZ|X}\) with the following conditional probability density function

$$\begin{aligned} w_{YZ|X}(yz|x):= w_{Z|Y}(z|y)w_{Y|X}(y|x). \end{aligned}$$
(89)

Then, the maximization (85) is simplified as

$$\begin{aligned} C(W_{YZ|X}):=\max _{P_{X}} I(X;Y|Z)[P_X, W_{YZ|X}] \end{aligned}$$
(90)

where the conditional mutual information is given as

$$\begin{aligned} I(X;Y|Z)[P_X, W_{YZ|X}] := \sum _{x,z} P_{XZ}(x,z) D(P_{Y|X=x,Z=z}\Vert P_{Y|Z=z}), \end{aligned}$$
(91)

where the conditional distributions \(P_{Y|XZ}\) and \(P_{Y|Z}\) are defined from the joint distribution \(W_{YZ|X}\times P_X\). To address (90), we consider the following minimization with a general two-output channel \(W_{YZ|X}\);

$$\begin{aligned} \min _{P_{X}} - I(X;Y|Z)[P_X, W_{YZ|X}] . \end{aligned}$$
(92)

When we choose \(\mathcal{M}_a\) and \(\varPsi \) as \(\mathcal{P}(\mathcal{X})\) and

$$\begin{aligned} \varPsi _{W_{YZ|X}}[P_{X}](x):= - \sum _{z} P_{Z|X=x}(z) D(P_{Y|X=x,Z=z}\Vert P_{Y|Z=z}), \end{aligned}$$
(93)

\(- I(X;Y|Z)[P_X, W_{YZ|X}]\) coincides with \(\mathcal{G}(P_{X})\). Hence, the general theory in Sect. 2 can be used for the minimization of (92). In this case, as shown in Sect. 6.2, the conditions (A1) with \(\gamma =1\) and (A2) hold. \(\varPhi \) is given as

$$\begin{aligned}&\varPhi [Q_{X}](x)\nonumber \\&\quad =\frac{1}{\kappa _{W_{YZ|X}}[Q_{X}]}Q_{X}(x) \exp \Big ( \frac{1}{\gamma } \Big ( \sum _{z} P_{Z|X=x}(z) D(P_{Y|X=x,Z=z}\Vert P_{Y|Z=z}) \Big ) \Big ),\quad \end{aligned}$$
(94)

where \(\kappa _{W_{YZ|X}}[Q_{X}]\) is the normalizing constant. Since \(\varPhi [Q_X] \in \mathcal{M}_a\), \(P_X^{(t+1)}\) is given as \(\varPhi [P_X^{(t)}]\). The above algorithm with \(\gamma =1\) coincides with the algorithm proposed by [14].
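For a binary degraded example, the update (94) with \(\gamma =1\) can be sketched as follows. This is our own illustrative code; `W` and `V` are assumed strictly positive channel matrices for \(w_{Y|X}\) and \(w_{Z|Y}\), and the conditional distributions in (94) are recomputed from the current iterate at each step.

```python
import numpy as np

# Sketch of the update (94) with gamma = 1 (our own illustrative code):
# W[x, y] = w_{Y|X}(y|x) and V[y, z] = w_{Z|Y}(z|y), assumed strictly positive.
def conditional_mi(Q, W, V):
    # I(X;Y|Z) and psi(x) = sum_z P(z|x) D(P_{Y|X=x,Z=z} || P_{Y|Z=z})
    J = Q[:, None, None] * W[:, :, None] * V[None, :, :]  # joint P(x, y, z)
    Pyz = J.sum(axis=0)                                   # P(y, z)
    Py_z = Pyz / Pyz.sum(axis=0, keepdims=True)           # P(y | z)
    Jx = W[:, :, None] * V[None, :, :]                    # P(y, z | x)
    Pz_x = Jx.sum(axis=1)                                 # P(z | x)
    Py_xz = Jx / Jx.sum(axis=1, keepdims=True)            # P(y | x, z)
    D = np.sum(Py_xz * (np.log(Py_xz) - np.log(Py_z[None])), axis=1)
    psi = np.sum(Pz_x * D, axis=1)
    return float(Q @ psi), psi

def maximize_conditional_mi(W, V, iters=3000):
    Q = np.full(W.shape[0], 1.0 / W.shape[0])
    for _ in range(iters):
        Q = Q * np.exp(conditional_mi(Q, W, V)[1])  # the map Phi of (94)
        Q /= Q.sum()
    return conditional_mi(Q, W, V)[0], Q

W = np.array([[0.9, 0.1], [0.2, 0.8]])
V = np.array([[0.8, 0.2], [0.3, 0.7]])
C, Q = maximize_conditional_mi(W, V)
```

Since \(I(X;Y|Z)[P_X,W_{YZ|X}]=\sum _x P_X(x)\psi (x)\), the converged value can be cross-checked by a brute-force search over binary input distributions.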

3.5 Capacities with cost constraint

Next, we consider the case when a cost constraint is imposed. Consider a function f on \(\mathcal{X}\) and the following constraint for a distribution \(P_X \in \mathcal{P}(\mathcal{X})\);

$$\begin{aligned} P_X[f]=a. \end{aligned}$$
(95)

We define \(\mathcal{M}_a\) by imposing the condition (95) as a special case of (1). The capacity of the channel \(W_{Y|X}\) under the cost constraint is given as \(\max _{P_X \in \mathcal{M}_a}I(P_X,W_{Y|X})\). That is, we need to solve the minimization \(\min _{P_X \in \mathcal{M}_a}-I(P_X,W_{Y|X})\). In this case, the \((t+1)\)-th distribution \(P^{(t+1)}\) is given as \(\varGamma _{\mathcal{M}_a}^{(e)} [\varPhi [P_X^{(t)}]]\). Since \(\varGamma _{\mathcal{M}_a}^{(e)} [\varPhi [P_X^{(t)}]]\) cannot be calculated analytically, we can use Algorithm 2 instead of Algorithm 1. Since the conditions (A1) with \(\gamma =1\) and (A2) hold, Theorem 4 guarantees the global convergence to the minimum in Algorithm 2.

We can consider the cost constraint (95) for the problems (74) and (81). In these cases, we have a similar modification by considering \(\varGamma _{\mathcal{M}_a}^{(e)} [\varPhi [P_X^{(t)}]]\).
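As an illustration of this subsection, the following sketch (our own code) iterates \(P_X^{(t+1)}=\varGamma _{\mathcal{M}_a}^{(e)}[\varPhi [P_X^{(t)}]]\) with \(\gamma =1\). Since the e-projection onto the single-constraint family \(\{P_X : P_X[f]=a\}\) has the exponential-tilt form \(P_X(x)\propto \varPhi [P_X^{(t)}](x)e^{\theta f(x)}\), we compute it numerically by bisection over \(\theta \) rather than via Algorithm 2.

```python
import numpy as np

def e_projection(R, f, a, iters=100):
    # argmin_{P : P[f] = a} D(P || R) has the form R(x) e^{theta f(x)} / Z;
    # theta is found by bisection since P[f] is increasing in theta
    lo, hi = -50.0, 50.0
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        P = R * np.exp(theta * f)
        P /= P.sum()
        if P @ f < a:
            lo = theta
        else:
            hi = theta
    return P

def constrained_capacity(W, f, a, iters=2000):
    # alternate the Arimoto-Blahut map (66) with gamma = 1 and the
    # e-projection onto M_a = {P : P[f] = a}
    P = e_projection(np.full(W.shape[0], 1.0 / W.shape[0]), f, a)
    for _ in range(iters):
        q = P @ W
        D = np.sum(W * (np.log(W) - np.log(q)), axis=1)
        R = P * np.exp(D)
        P = e_projection(R / R.sum(), f, a)
    q = P @ W
    D = np.sum(W * (np.log(W) - np.log(q)), axis=1)
    return float(P @ D), P

W = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])
f = np.array([0.0, 1.0, 2.0])
C, P = constrained_capacity(W, f, 0.8)
```

The constraint \(P_X[f]=0.8\) holds at every iterate, and the converged mutual information can be cross-checked by a brute-force search over the one-parameter family \(\mathcal{M}_a\).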

4 em problem

We apply our algorithm to the problem setting with the em algorithm [2,3,4], which is a generalization of Boltzmann machines [5]. The em algorithm is implemented by iterative applications of the projection to an exponential family (the m-projection) and the projection to a mixture family (the e-projection). Hence, this algorithm is called the em algorithm. On the other hand, the EM algorithm is implemented by iterative applications of expectation and maximization. Their relation is summarized as follows. In particular, the expectation in the EM algorithm, which is often called E-step, corresponds to the e-projection to a mixture family, which is often called e-step in the em algorithm. Also, the maximization in the EM algorithm, which is often called M-step, corresponds to the m-projection to an exponential family, which is often called m-step in the em algorithm. For this reason, they are essentially the same [2].

For this aim, we consider a pair of an exponential family \(\mathcal{E}\) and a mixture family \(\mathcal{M}_a\) on \(\mathcal{X}\). We denote the m-projection to \(\mathcal{E}\) of P by \(\varGamma _\mathcal{E}^{(m)}[P]\), which is defined as [1, 2]

$$\begin{aligned} \varGamma _\mathcal{E}^{(m)}[P]:= \mathop {\textrm{argmin}}\limits _{Q\in \mathcal{E}} D(P\Vert Q). \end{aligned}$$
(96)

We consider the following minimization;

$$\begin{aligned}&\min _{P\in \mathcal{M}_a}\min _{Q\in \mathcal{E}} D(P\Vert Q)= \min _{P\in \mathcal{M}_a} D(P\Vert \varGamma _\mathcal{E}^{(m)}[P]) \nonumber \\&\quad = \min _{P\in \mathcal{M}_a} \sum _{x \in \mathcal{X}}P(x) (\log P(x) - \log \varGamma _\mathcal{E}^{(m)}[P](x)). \end{aligned}$$
(97)

We choose the function \(\varPsi \) as

$$\begin{aligned} \varPsi _\mathrm{{em}}[P](x):= (\log P(x) - \log \varGamma _\mathcal{E}^{(m)}[P](x)), \end{aligned}$$
(98)

and apply the discussion in Sect. 2. Then, we have

$$\begin{aligned}&\sum _{x \in \mathcal{X}} P^{0}(x) (\varPsi _\mathrm{{em}}[P^{0}](x)- \varPsi _\mathrm{{em}}[Q](x)) \nonumber \\&\quad =\sum _{x \in \mathcal{X}} P^{0}(x) \Big ( (\log P^{0}(x) - \log \varGamma _\mathcal{E}^{(m)}[P^{0}](x)) - (\log Q(x) - \log \varGamma _\mathcal{E}^{(m)}[Q](x)) \Big ) \nonumber \\&\quad = \sum _{x \in \mathcal{X}} P^{0}(x)(\log P^{0}(x)-\log Q(x)) \nonumber \\&\qquad + \sum _{x \in \mathcal{X}} P^{0}(x) \Big ( \log \varGamma _\mathcal{E}^{(m)}[Q](x) - \log \varGamma _\mathcal{E}^{(m)}[P^{0}](x) \Big ) \nonumber \\&\quad = D (P^{0}\Vert Q) + D(P^{0} \Vert \varGamma _\mathcal{E}^{(m)}[P^{0}]) -D(P^{0} \Vert \varGamma _\mathcal{E}^{(m)}[Q]) \nonumber \\&\quad = D (P^{0}\Vert Q) -D(\varGamma _\mathcal{E}^{(m)}[P^{0}] \Vert \varGamma _\mathcal{E}^{(m)}[Q]), \end{aligned}$$
(99)

where the final equation follows from (4). The condition (A1) holds with \(U(P^0,\infty )=\mathcal{M}_a\) and \(\gamma =1\). There is a possibility that the condition (A1) holds with a smaller \(\gamma \). Therefore, with \(\gamma =1\), Theorem 1 guarantees that Algorithm 1 converges to a local minimum. In addition, when the relation

$$\begin{aligned} D (P^0\Vert Q) \ge D(\varGamma _\mathcal{E}^{(m)}[P^0] \Vert \varGamma _\mathcal{E}^{(m)}[Q]) \end{aligned}$$
(100)

holds for \(Q \in U(P^0,\delta )\), the condition (A2) holds with \(U(P^0,\delta )\). That is, if the condition (100) holds, Algorithm 1 has global convergence to the minimizer. The condition (100) is similar to the condition given in [22].

In this case, \(\varPhi \) is given as

$$\begin{aligned} \varPhi [Q](x)&=\frac{1}{\kappa _\mathrm{{em}}[Q]}Q(x) \exp \Big ( - \frac{1}{\gamma } (\log Q(x) - \log \varGamma _\mathcal{E}^{(m)}[Q](x))\Big )\nonumber \\&=\frac{1}{\kappa _\mathrm{{em}}[Q]}Q(x)^{\frac{\gamma -1}{\gamma }} \varGamma _\mathcal{E}^{(m)}[Q](x)^{\frac{1}{\gamma }}, \end{aligned}$$
(101)

where the normalizing constant \(\kappa _\mathrm{{em}}[Q]\) is given as \(\kappa _\mathrm{{em}}[Q]=\sum _{x \in \mathcal{X}} Q(x)^{\frac{\gamma -1}{\gamma }} \varGamma _\mathcal{E}^{(m)} [Q](x)^{\frac{1}{\gamma }}\). Since \(\varPhi [Q] \in \mathcal{M}_a\), \(P^{(t+1)}\) is given as \(\varGamma _{\mathcal{M}_a}^{(e)}[\varPhi [P^{(t)}]]\). When \(\gamma =1\), it coincides with the conventional em algorithm [2,3,4] because \(\varPhi [P^{(t)}]=\varGamma _\mathcal{E}^{(m)}[P^{(t)}]\). The above analysis suggests choosing \(\gamma \) smaller than 1; that is, a smaller \(\gamma \) may improve the conventional em algorithm. In addition, we may use Algorithm 2 instead of Algorithm 1 when the calculation of the e-projection is difficult.
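As an illustration only (the example, the XOR constraint, and all names below are ours, not from the paper), the following sketch runs the iteration with \(\gamma =1\), so that \(\varPhi [P^{(t)}]=\varGamma _\mathcal{E}^{(m)}[P^{(t)}]\), for \(\mathcal{X}=\{0,1\}^2\) with \(\mathcal{E}\) the independence model, whose m-projection is the product of the two marginals, and \(\mathcal{M}_a\) the mixture family of distributions giving the XOR of the two bits probability 0.3; the e-projection is an exponential tilt computed by bisection.

```python
import numpy as np

# Illustrative em iteration (gamma = 1, so Phi[P] is the m-projection):
# X = {0,1}^2, E = independence model, M_a = {P : P[XOR] = 0.3}.
def m_projection(P):
    # m-projection onto the independence model: product of the two marginals
    M = P.reshape(2, 2)
    return np.outer(M.sum(axis=1), M.sum(axis=0)).reshape(4)

def e_projection(R, f, a):
    # argmin_{P : P[f] = a} D(P || R) = R(x) e^{theta f(x)} / Z, by bisection
    lo, hi = -50.0, 50.0
    for _ in range(100):
        theta = 0.5 * (lo + hi)
        P = R * np.exp(theta * f)
        P /= P.sum()
        if P @ f < a:
            lo = theta
        else:
            hi = theta
    return P

f = np.array([0.0, 1.0, 1.0, 0.0])                  # XOR of the two bits
P = e_projection(np.array([0.4, 0.3, 0.2, 0.1]), f, 0.3)
divs = []
for _ in range(200):
    Q = m_projection(P)                              # m-step
    divs.append(float(np.sum(P * (np.log(P) - np.log(Q)))))
    P = e_projection(Q, f, 0.3)                      # e-step onto M_a
```

Each iteration cannot increase \(D(P^{(t)}\Vert \varGamma _\mathcal{E}^{(m)}[P^{(t)}])\), since both the e-projection and the m-projection are divergence minimizations; this monotonicity is the mechanism behind the convergence to a local minimum.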

Lemma 4

When \(\varPsi _\mathrm{{em}}\) is given as (98), the condition (A4) holds.

Therefore, by combining Lemmas 3 and 4, the assumption of Theorem 2 holds in the \(\delta \) neighborhood of a local minimizer with sufficiently small \(\delta >0\). That is, the convergence speed can be evaluated by Theorem 2.

Proof

The Pythagorean theorem guarantees

$$\begin{aligned}&\sum _{x \in \mathcal{X}}P (x) \big ( \log \varGamma _\mathcal{E}^{(m)}[Q](x)- \log \varGamma _\mathcal{E}^{(m)}[P](x) \big ) \nonumber \\&\quad = D(P\Vert \varGamma _\mathcal{E}^{(m)}[P]) - D(P\Vert \varGamma _\mathcal{E}^{(m)}[Q]) = D(\varGamma _\mathcal{E}^{(m)}[P]\Vert \varGamma _\mathcal{E}^{(m)}[Q]) . \end{aligned}$$
(102)

We use the parameterization \(P_\eta \in \mathcal{M}_a\) with mixture parameter \(\eta \), and we denote \(\eta (h,i):=(\eta (0)_1, \ldots , \eta (0)_{i-1},\eta (0)_i+h,\eta (0)_{i+1},\ldots , \eta (0)_k)\). Then, we have

$$\begin{aligned}&\sum _{x \in \mathcal{X}}P_{\eta (0)} (x) \Big (\frac{\partial }{\partial \eta _i}\varPsi [P_\eta ](x)|_{\eta =\eta (0)}\Big ) \nonumber \\&\quad = \sum _{x \in \mathcal{X}}P_{\eta (0)} (x) \Big ( \lim _{h\rightarrow 0}\frac{\varPsi [P_{\eta (h,i)}](x)-\varPsi [P_{\eta (0)}](x)}{h} \Big ) \nonumber \\&\quad = \sum _{x \in \mathcal{X}}P_{\eta (0)} (x) \Big ( \lim _{h\rightarrow 0}\frac{ \log P_{\eta (h,i)}(x)-\log P_{\eta (0)}(x)}{h} \nonumber \\&\qquad -\lim _{h\rightarrow 0}\frac{ \log \varGamma _\mathcal{E}^{(m)}[P_{\eta (h,i)}](x)- \log \varGamma _\mathcal{E}^{(m)}[P_{\eta (0)}](x)}{h} \Big ) \nonumber \\&\quad {\mathop {=}\limits ^{(a)}} \sum _{x \in \mathcal{X}}P_{\eta (0)} (x) \Big ( \lim _{h\rightarrow 0}\frac{ \log P_{\eta (h,i)}(x)-\log P_{\eta (0)}(x)}{h} \Big )\nonumber \\&\qquad - \sum _{x \in \mathcal{X}}\varGamma _\mathcal{E}^{(m)}[P_{\eta (0)}] (x) \Big ( \lim _{h\rightarrow 0} \frac{ \log \varGamma _\mathcal{E}^{(m)}[P_{\eta (h,i)}](x)- \log \varGamma _\mathcal{E}^{(m)}[P_{\eta (0)}](x)}{h} \Big ) \nonumber \\&\quad = \sum _{x \in \mathcal{X}} \frac{\partial }{\partial \eta _i} P_\eta (x)|_{\eta =\eta (0)} - \sum _{x \in \mathcal{X}} \frac{\partial }{\partial \eta _i} \varGamma _\mathcal{E}^{(m)}[P_\eta ](x)|_{\eta =\eta (0)} =0, \end{aligned}$$
(103)

which implies the condition (A4). Here, (a) follows from (102). \(\square \)

5 Commitment capacity

Using the same notations as in Sect. 3, we address bit commitment via a noisy channel \(W_{Y|X}\). When we can guarantee that the communication channel is given by \(W_{Y|X}\), bit commitment is possible. In this topic, we are interested in the secure transmission rate, in the sense of bit commitment, per single use of the noisy channel \(W_{Y|X}\). The maximum value of this rate is called the commitment capacity. Given a distribution \(P_X\), the Shannon entropy is given as

$$\begin{aligned} H(X)_{P_X}:=-\sum _{x \in \mathcal{X}}P_X(x)\log P_X(x). \end{aligned}$$
(104)

Given a joint distribution \(P_{XY}\), the conditional entropy is defined as

$$\begin{aligned} H(X|Y)_{P_{XY}}:= \int _{\mathcal{Y}} H(X)_{P_{X|Y=y}} p_Y(y) \mu (dy). \end{aligned}$$
(105)

The commitment capacity is given as

$$\begin{aligned} C_c(W_{Y|X}):=\max _{P_X} H(X|Y)_{W_{Y|X} \times P_X}. \end{aligned}$$
(106)

This problem setting has several versions. To achieve bit commitment, the papers [36,37,38] considered interactive protocols with multiple rounds, where each round has one use of the given noisy channel \(W_{Y|X}\) and free noiseless communications in both directions, and they derived the commitment capacity (106). Basically, the proof is composed of two parts. One is the achievability part, which is often called the direct part and shows the existence of a code that achieves the capacity. The other is the impossibility part, which is often called the converse part and shows the non-existence of a code that exceeds the capacity. For the achievability part, they showed that the commitment capacity can be achieved with a non-interactive protocol, which has no free noiseless communication during the multiple uses of the given noisy channel \(W_{Y|X}\). However, as explained in [39], their proof of the impossibility part skips so many steps that it cannot be followed. Later, the paper [40] showed the impossibility part only for non-interactive protocols by applying the wiretap channel. Recently, the paper [41] constructed a code that achieves the commitment capacity by using a specific type of list decoding. Further, that paper showed the achievability of the commitment capacity even in the quantum setting. In addition, the paper [39] showed the impossibility part for interactive protocols by completing the proof of [36,37,38]. The proof in [39] covers the impossibility part for a certain class even in the quantum setting.

5.1 Algorithm based on the em-algorithm problem

To calculate the commitment capacity, we consider the following mixture and exponential families;

$$\begin{aligned} \mathcal{M}_a&:=\{W_{Y|X} \times P_X| P_X \in \mathcal{P}(\mathcal{X})\}\nonumber \\&=\bigg \{P_{XY} \in \mathcal{P}(\mathcal{X}\times \mathcal{Y})\bigg | \sum _{y \in \mathcal{Y}}P_{XY}(x,y)=P_X(x)\bigg \} \end{aligned}$$
(107)
$$\begin{aligned} \mathcal{E}&:=\{ Q_Y \times P_{X,Uni} | Q_Y \in \mathcal{P}(\mathcal{Y})\}, \end{aligned}$$
(108)

where \(P_{X,Uni}\) is the uniform distribution on \(\mathcal{X}\). In (107), the integer k is chosen to be \(|\mathcal{X}|-1\), the linear functions \(f_1, \ldots , f_k\) are chosen to be \(\sum _{y \in \mathcal{Y}}P_{XY}(x_1,y), \ldots ,\) \(\sum _{y \in \mathcal{Y}}P_{XY}(x_{k},y)\), and the vector a is chosen to be \((P_X(x_1), \ldots , P_X(x_k))\). These choices clarify that (107) gives a mixture family.

Since \(\varGamma _\mathcal{E}^{(m)}[W_{Y|X} \times P_X]=(W_{Y|X} \cdot P_X) \times P_{X,Uni} \), the commitment capacity is rewritten as

$$\begin{aligned} \log |\mathcal{X}|-C_c(W_{Y|X})&=\min _{P_X} H(X)_{P_{X,Uni}} + H(Y)_{W_{Y|X} \cdot P_X} - H(X Y)_{W_{Y|X} \times P_X}\nonumber \\&=\min _{P_X} D( W_{Y|X} \times P_X\Vert (W_{Y|X} \cdot P_X) \times P_{X,Uni} ) \nonumber \\&=\min _{P_X} D( W_{Y|X} \times P_X\Vert \varGamma _\mathcal{E}^{(m)}[W_{Y|X} \times P_X] ). \end{aligned}$$
(109)

Hence, the minimization (109) is a special case of the minimization (97). Since \( \varGamma _\mathcal{E}^{(m)}[W_{Y|X} \times Q_X] =(W_{Y|X} \cdot Q_X) \times P_{X,Uni} \),

$$\begin{aligned} D(\varGamma _\mathcal{E}^{(m)}[W_{Y|X} \times P_X] \Vert \varGamma _\mathcal{E}^{(m)}[W_{Y|X} \times Q_X]) =D(W_{Y|X}\cdot P_X\Vert W_{Y|X}\cdot Q_X) \le D (W_{Y|X} \times P_X\Vert W_{Y|X} \times Q_X) , \end{aligned}$$
(110)

which yields the condition (100) because the last inequality is the data-processing inequality. Hence, global convergence is guaranteed.
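The inequality in (110) is an instance of the data-processing inequality for the KL divergence. The following minimal NumPy check illustrates it; the channel matrix and the two input distributions are arbitrary examples, not taken from the paper.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

# Arbitrary 3-input, 3-output channel; column x holds W(.|x).
W = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

# Data processing: D(W.p || W.q) <= D(p || q), as used in (110).
lhs = kl(W @ p, W @ q)
rhs = kl(p, q)
assert lhs <= rhs
print(lhs, rhs)
```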

By applying (101), \(\varPhi \) is calculated as

$$\begin{aligned}&\varPhi [W_{Y|X} \times Q_X](x,y)\nonumber \\&\quad =\frac{1}{\kappa _{W_{Y|X}}^{1}[Q_X]} w_{Y|X}(y|x)^{\frac{\gamma -1}{\gamma }} Q_X(x)^{\frac{\gamma -1}{\gamma }} (w_{Y|X} \cdot Q_X)(y)^{\frac{1}{\gamma }} P_{X,Uni}(x)^{\frac{1}{\gamma }}, \end{aligned}$$
(111)

where \(\kappa _{W_{Y|X}}^{1}[Q_X]\) is the normalizer. Then, after a complicated calculation, the marginal distribution of its projection to \(\mathcal{M}_a\) is given as

$$\begin{aligned}&\int _{\mathcal{Y}} \varGamma _{\mathcal{M}_a}^{(e)}[\varPhi [W_{Y|X} \times Q_X]](x,y)\mu (dy)\nonumber \\&\quad =\frac{1}{\kappa _{W_{Y|X}}^{2}[Q_X]} Q_X(x)^{1-\frac{1}{\gamma }} \exp ( -\frac{1}{\gamma } D(W_{Y|X=x}\Vert W_{Y|X} \cdot Q_X)) , \end{aligned}$$
(112)

where \(\kappa _{W_{Y|X}}^{2}[Q_X]\) is the normalizer. In the algorithm, we update \(P_X^{(t+1)}\) as \(P_X^{(t+1)}(x):= \int _{\mathcal{Y}} \varGamma _{\mathcal{M}_a}^{(e)}[\varPhi [W_{Y|X} \times P_X^{(t)}]] (x,y)\mu (dy) \).

5.2 Direct application

Although the update formula (112) requires a complicated calculation, the same update rule can be derived in a simpler way as follows. The commitment capacity is rewritten as

$$\begin{aligned} -C_c(W_{Y|X})&=\min _{P_X} I(P_X,W_{Y|X})-H(X)_{P_{X}} \nonumber \\&=\min _{P_X} \sum _{x\in \mathcal{X}}P_X(x) (D(W_{Y|X=x}\Vert W_{Y|X}\cdot P_X )+\log P_X(x)). \end{aligned}$$
(113)

We choose \(\mathcal{M}_a\) and \(\varPsi \) as \(\mathcal{P}(\mathcal{X})\) and

$$\begin{aligned} \varPsi _{c,W_{Y|X}}[P_X](x):= D(W_{Y|X=x}\Vert W_{Y|X}\cdot P_X )+\log P_X(x). \end{aligned}$$
(114)

Then, we have

$$\begin{aligned}&\sum _{x \in \mathcal{X}}P_X(x)( \varPsi _{c,W_{Y|X}}[P_X](x)- \varPsi _{c,W_{Y|X}}[Q_X](x))\nonumber \\&\quad = D(P_X\Vert Q_X)-D(W_{Y|X}\cdot P_X\Vert W_{Y|X}\cdot Q_X) \ge 0 \end{aligned}$$
(115)

and

$$\begin{aligned} D(P_X\Vert Q_X) \ge D(P_X\Vert Q_X)- D(W_{Y|X}\cdot P_X\Vert W_{Y|X}\cdot Q_X) . \end{aligned}$$
(116)

Since the condition (A1) with \(\gamma =1\) and the condition (A2) hold, Algorithm 1 converges with \(\gamma =1\). In this case, \(\varPhi \) is given as

$$\begin{aligned}&\varPhi [Q_X](x) \nonumber \\&\quad =\frac{1}{\kappa _{W_{Y|X}}^3[Q_X]}Q_X(x) \exp ( - \frac{1}{\gamma } (\log Q_X(x)+ D(W_{Y|X=x}\Vert W_{Y|X} \cdot Q_X))) \nonumber \\&\quad =\frac{1}{\kappa _{W_{Y|X}}^3[Q_X]}Q_X(x)^{1-\frac{1}{\gamma }} \exp ( -\frac{1}{\gamma } D(W_{Y|X=x}\Vert W_{Y|X} \cdot Q_X)) , \end{aligned}$$
(117)

where the normalizing constant \(\kappa _{W_{Y|X}}^3[Q_X]\) is given as \(\kappa _{W_{Y|X}}^3[Q_X]:= \sum _{x \in \mathcal{X}} Q_X(x)^{1-\frac{1}{\gamma }} \exp ( - \frac{1}{\gamma } D(W_{Y|X=x}\Vert W_{Y|X} \cdot Q_X))\). Since \(\varPhi [Q_X] \in \mathcal{M}_a\), \(P_X^{(t+1)}\) is given as \(\varPhi [P_X^{(t)}]\).
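The update rule (117) and the objective (113) can be sketched in a few lines of NumPy. The matrix below encodes the channel (118) with `W[y-1, x-1]` \(= W_{Y|X}(y,x)\); the 30-iteration budget is an arbitrary choice.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

def update(W, p, gamma):
    """One step of (117):
    p(x) <- p(x)^{1-1/gamma} exp(-(1/gamma) D(W(.|x) || W.p)) / normalizer."""
    out = W @ p                                   # output distribution W . p
    d = np.array([kl(W[:, x], out) for x in range(W.shape[1])])
    new = p ** (1.0 - 1.0 / gamma) * np.exp(-d / gamma)
    return new / new.sum()

def objective(W, p):
    """Objective in (113): sum_x p(x) (D(W(.|x) || W.p) + log p(x))."""
    out = W @ p
    return float(sum(p[x] * (kl(W[:, x], out) + np.log(p[x]))
                     for x in range(len(p))))

# Channel (118): column x-1 holds W_{Y|X}(. | x).
W = np.array([[0.60, 0.10, 0.10, 0.05],
              [0.20, 0.20, 0.20, 0.85],
              [0.10, 0.10, 0.15, 0.05],
              [0.10, 0.60, 0.55, 0.05]])
p = np.full(4, 0.25)
for _ in range(30):
    p = update(W, p, gamma=1.0)
print(-objective(W, p))   # approximates the commitment capacity C_c(W_{Y|X})
```

Passing a smaller `gamma` to `update` reproduces the accelerated variants compared in Figs. 1 and 2.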

To examine the effect of the acceleration parameter \(\gamma \), we made a numerical comparison for the channel with \(\mathcal{X}=\{1,2,3,4\}\) and \(\mathcal{Y}=\{1,2,3,4\}\) given as follows.

$$\begin{aligned}&W_{Y|X}(1,1)=0.6, W_{Y|X}(2,1)=0.2, W_{Y|X}(3,1)=0.1, W_{Y|X}(4,1)=0.1, \nonumber \\&W_{Y|X}(1,2)=0.1, W_{Y|X}(2,2)=0.2, W_{Y|X}(3,2)=0.1, W_{Y|X}(4,2)=0.6,\nonumber \\&W_{Y|X}(1,3)=0.1, W_{Y|X}(2,3)=0.2, W_{Y|X}(3,3)=0.15, W_{Y|X}(4,3)=0.55,\nonumber \\&W_{Y|X}(1,4)=0.05, W_{Y|X}(2,4)=0.85, W_{Y|X}(3,4)=0.05, W_{Y|X}(4,4)=0.05 . \end{aligned}$$
(118)

We choose \(\gamma \) to be 1, 0.95, and 0.9. Figure 1 shows the numerical result for the iteration of our algorithm when the channel input is limited into \(\{1,2,3\}\). A smaller \(\gamma \) does not improve the convergence in this case. Figure 2 shows the same numerical result when all elements of \(\{1,2,3,4\}\) are allowed as the channel input. In this case, a smaller \(\gamma \) improves the convergence.

Fig. 1

Calculation of the commitment capacity for the channel given in (118) with \(\mathcal{X}=\{1,2,3\}\). The lower plot is an enlarged version of the upper plot. The horizontal axis shows the number of iterations; the vertical axis shows the conditional entropy. Red points show the case \(\gamma =1\), green points the case \(\gamma =0.95\), and blue points the case \(\gamma =0.9\). For \(t=5,6,\ldots , 10\), these cases take almost the same value, so their plots cannot be distinguished. At \(t=2,3\), the case \(\gamma =1\) is better than the other cases. Hence, in this case, a smaller \(\gamma \) does not improve the convergence

Fig. 2

Calculation of the commitment capacity for the channel given in (118) with \(\mathcal{X}=\{1,2,3,4\}\). The roles of the colors are the same as in Fig. 1. In this case, a smaller \(\gamma \) improves the convergence

6 Reverse em problem

6.1 General problem description

In this section, given a pair of an exponential family \(\mathcal{E}\) and a mixture family \(\mathcal{M}_a\) on \(\mathcal{X}\), we consider the following maximization:

$$\begin{aligned}&\max _{P\in \mathcal{M}_a}\min _{Q\in \mathcal{E}} D(P\Vert Q)= \max _{P\in \mathcal{M}_a} D(P\Vert \varGamma _\mathcal{E}^{(m)}[P]) \nonumber \\&\quad = \max _{P\in \mathcal{M}_a} \sum _{x \in \mathcal{X}}P(x) (\log P(x) - \log \varGamma _\mathcal{E}^{(m)}[P](x)) \end{aligned}$$
(119)

while Sect. 4 considers the minimization of the same value. When \(\mathcal{M}_a\) is given as (107) and \(\mathcal{E}\) is given as \(\mathcal{P}(\mathcal{X})\times \mathcal{P}(\mathcal{Y})\), this problem coincides with the channel capacity (58). This problem was first studied for the channel capacity in [20] and was discussed in a general form in [21]. To discuss this problem, we choose the function \(\varPsi \) as \(\varPsi _\mathrm{{rem}}:=-\varPsi _\mathrm{{em}}\) and apply the discussion in Sect. 2. Due to (99), (24) in the condition (A1) is written as

$$\begin{aligned} (\gamma +1) D (P^0\Vert Q) \ge D(\varGamma _\mathcal{E}^{(m)}[P^0] \Vert \varGamma _\mathcal{E}^{(m)}[Q]), \end{aligned}$$
(120)

and (25) in the condition (A2) is written as

$$\begin{aligned} D(\varGamma _\mathcal{E}^{(m)}[P^0] \Vert \varGamma _\mathcal{E}^{(m)}[Q]) \ge D (P^0\Vert Q) . \end{aligned}$$
(121)

Further, due to Lemma 4, the condition (A4) holds.

6.2 Application to wiretap channel

Now, we apply this problem setting to the wiretap channel in the degraded case discussed in Sect. 3.4.2. We choose \(\mathcal{M}_a\) as \(\{ W_{YZ|X}\times P_X | P_X\in \mathcal{P}(\mathcal{X}) \}\) and \(\mathcal{E}\) as the set of distributions satisfying the Markov chain condition \(X-Z-Y\) [42]. Then, the conditional mutual information \(I(X;Y|Z)[P_X,W_{YZ|X}]\) is given as \(D( W_{YZ|X}\times P_X\Vert \varGamma _{\mathcal{E}}^{(m)} [W_{YZ|X}\times P_X])\). In this application, we have

$$\begin{aligned} D_{\varPsi _\mathrm{{rem}}}(W_{YZ|X}\times P_X\Vert W_{YZ|X}\times Q_X)&= D_{\varPsi _{W_{YZ|X}}}( P_X \Vert Q_X) \end{aligned}$$
(122)
$$\begin{aligned} D( W_{YZ|X}\times P_X\Vert W_{YZ|X}\times Q_X)&=D( P_X \Vert Q_X). \end{aligned}$$
(123)

To check the conditions (A1) and (A2) for \(\varPsi _{W_{YZ|X}}\), it is sufficient to check them for \(\varPsi _\mathrm{{rem}}\) in this application. Since we have

$$\begin{aligned}&D( \varGamma _{\mathcal{E}}^{(m)} [W_{YZ|X}\times P_X]\Vert \varGamma _{\mathcal{E}}^{(m)} [W_{YZ|X}\times Q_X]) \nonumber \\&\quad = D( W_{Z|X} \times P_{X} \Vert W_{Z|X} \times Q_{X}) + D( W_{YZ|X}\cdot P_X \Vert W_{YZ|X}\cdot Q_X)\nonumber \\&\qquad - D( W_{Z|X}\cdot P_X \Vert W_{Z|X}\cdot Q_X) \nonumber \\&\quad = D( P_X \Vert Q_X) + D( W_{YZ|X}\cdot P_X \Vert W_{YZ|X}\cdot Q_X) \nonumber \\&\qquad - D( W_{Z|X}\cdot P_X \Vert W_{Z|X}\cdot Q_X) \nonumber \\&\quad \le 2D( P_X \Vert Q_X), \end{aligned}$$
(124)

LHS of (24) in the condition (A1) is written as

$$\begin{aligned}&\gamma D (P_X^0\Vert Q_X) - D( W_{YZ|X}\cdot P_X^0 \Vert W_{YZ|X}\cdot Q_X) \nonumber \\&\qquad + D( W_{Z|X}\cdot P_X^0 \Vert W_{Z|X}\cdot Q_X) \nonumber \\&\quad \ge \gamma D (P_X^0\Vert Q_X) - D( P_X^0 \Vert Q_X) = (\gamma -1) D (P_X^0\Vert Q_X). \end{aligned}$$
(125)

This quantity is non-negative when \(\gamma \ge 1\). Also, the RHS of (25) in the condition (A2) is written as

$$\begin{aligned} D( W_{YZ|X}\cdot P_X \Vert W_{YZ|X}\cdot Q_X) - D( W_{Z|X}\cdot P_X \Vert W_{Z|X}\cdot Q_X) \ge 0. \end{aligned}$$
(126)

Hence, the conditions (A1) and (A2) hold with \(\gamma \ge 1\).
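These inequalities can be checked numerically. The sketch below builds a degraded wiretap channel \(W_{YZ|X}(y,z|x)=W_{Y|X}(y|x)W_{Z|Y}(z|y)\) from arbitrary example matrices (not from the paper) and verifies the non-negativity in (126) as well as the bound used in the last step of (124).

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

# Arbitrary degraded wiretap channel X -> Y -> Z; columns are distributions.
Wyx = np.array([[0.8, 0.1, 0.2],    # W_{Y|X}(y|x)
                [0.1, 0.7, 0.2],
                [0.1, 0.2, 0.6]])
Vzy = np.array([[0.6, 0.3, 0.2],    # W_{Z|Y}(z|y)
                [0.2, 0.5, 0.3],
                [0.2, 0.2, 0.5]])
p = np.array([0.5, 0.25, 0.25])
q = np.array([0.2, 0.5, 0.3])

def yz_dist(px):
    """(W_{YZ|X} . px)(y, z) = (W_{Y|X} . px)(y) W_{Z|Y}(z|y), flattened."""
    py = Wyx @ px
    return (py[None, :] * Vzy).ravel()

def z_dist(px):
    """Z-marginal W_{Z|X} . px = W_{Z|Y} . (W_{Y|X} . px)."""
    return Vzy @ (Wyx @ px)

gap = kl(yz_dist(p), yz_dist(q)) - kl(z_dist(p), z_dist(q))
assert gap >= 0            # inequality (126): marginalization loses divergence
assert gap <= kl(p, q)     # data-processing bound behind the "<= 2D" in (124)
```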

7 Information bottleneck

As a method for information-theoretic machine learning, we often consider the information bottleneck [43]. Consider two correlated systems \(\mathcal{X}\) and \(\mathcal{Y}\) and a joint distribution \(P_{XY}\) over \(\mathcal{X} \times \mathcal{Y}\). The task is to extract essential information from the space \(\mathcal{X}\) into \(\mathcal{T}\) with respect to \(\mathcal{Y}\). Here, we discuss a generalized problem setting proposed in [44]. For this information extraction, given parameters \(\alpha \in [0,1]\) and \(\beta \ge \alpha \), we choose a conditional distribution \(P_{T|X}^*\) as

$$\begin{aligned} P_{T|X}^*:= \mathop {\textrm{argmin}}\limits _{P_{T|X}} \alpha I(T;X)+(1-\alpha )H(T)-\beta I(T;Y). \end{aligned}$$
(127)

This method is called information bottleneck. To apply our method to this problem, we set \(\mathcal{M}_a\) to be \(\mathcal{M}(P_X)\), and define

$$\begin{aligned} {\varPsi }_{\alpha ,\beta }[P_{TX}](t,x)&:= \alpha \log P_{TX}(t,x) -\alpha \log P_{X}(x) +(\beta -1) \log P_{T}(t) \nonumber \\&\quad +\beta \sum _{y\in \mathcal{Y}} P_{Y|X}(y|x) ( \log P_{Y}(y) - \log P_{TY}(t,y) ). \end{aligned}$$
(128)

Then, when the joint distribution \(P_{TX}\) is chosen to be \(P_{T|X} \times P_X\), the objective function is written as

$$\begin{aligned} \mathcal{G}_{\alpha ,\beta }( P_{TX}):= \sum _{t\in \mathcal{T},x \in \mathcal{X}} P_{TX}(t,x) {\varPsi }_{\alpha ,\beta }[P_{TX}](t,x). \end{aligned}$$
(129)

That is, our problem is reduced to the minimization

$$\begin{aligned} \min _{P_{TX}\in \mathcal{M}(P_X)} \mathcal{G}_{\alpha ,\beta }( P_{TX}), \end{aligned}$$
(130)

where \(\mathcal{M}(P_X)\) is the set of joint distributions on \(\mathcal{T} \times \mathcal{X}\) whose marginal distribution on \(\mathcal{X}\) is \(P_X\). The e-projection \(\varGamma ^{(e)}_{\mathcal{M}(P_X)}\) to \(\mathcal{M}(P_X)\) is written as

$$\begin{aligned} \varGamma ^{(e)}_{\mathcal{M}(P_X)}[Q_{TX}]= Q_{T|X}\times P_X \end{aligned}$$
(131)

because the relation \(D(P_{TX}\Vert Q_{TX})= D(P_{X}\Vert Q_{X})+ D(P_{TX}\Vert Q_{T|X}\times P_X)\) holds for a distribution \(P_{TX}\in \mathcal{M}(P_X)\).
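The divergence decomposition behind (131) can be verified numerically. The distributions below are arbitrary examples with \(|\mathcal{T}|=2\) and \(|\mathcal{X}|=3\).

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

Px = np.array([0.5, 0.3, 0.2])           # fixed marginal on X
Ptgx = np.array([[0.7, 0.4, 0.1],        # P_{T|X}(t|x); columns sum to 1
                 [0.3, 0.6, 0.9]])
Qtx = np.array([[0.10, 0.05, 0.15],      # arbitrary joint Q_{TX}
                [0.20, 0.25, 0.25]])

Ptx = Ptgx * Px                          # P_{TX} = P_{T|X} x P_X, lies in M(P_X)
Qx = Qtx.sum(axis=0)                     # marginal Q_X
proj = (Qtx / Qx) * Px                   # e-projection (131): Q_{T|X} x P_X

# D(P_{TX} || Q_{TX}) = D(P_X || Q_X) + D(P_{TX} || Q_{T|X} x P_X)
lhs = kl(Ptx.ravel(), Qtx.ravel())
rhs = kl(Px, Qx) + kl(Ptx.ravel(), proj.ravel())
assert abs(lhs - rhs) < 1e-12
```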

When Algorithm 1 is applied to this problem, due to (131), the update rule for the conditional distribution is given as

$$\begin{aligned} P_{T| X}^{(t)}\mapsto P_{T| X}^{(t+1)}(t|x) := \kappa _x P_{T|X}^{(t)}(t|x) \exp ( -\frac{1}{\gamma }\varPsi _{\alpha ,\beta }[ P_{T|X}^{(t)}\times P_X ](t,x)), \end{aligned}$$
(132)

where \(\kappa _x\) is a normalizing constant. This update rule is the same as the update rule proposed in Sect. 3 of [45] when the states \(\{\rho _{Y|x}\}\) are given as diagonal density matrices, i.e., a conditional distribution \(P_{Y|X}\). Also, as shown in [45, (22)], we have the relation

$$\begin{aligned} D_{\varPsi _{\alpha ,\beta }}( P_{TX} \Vert Q_{TX}) \le \alpha D( P_{TX} \Vert Q_{TX}) \end{aligned}$$
(133)

for \(P_{TX},Q_{TX} \in \mathcal{M}(P_X)\) as follows. First, we have

$$\begin{aligned} D( P_{T|X} \cdot P_{XY}\Vert Q_{T|X}\cdot P_{XY}) \ge D( P_{T} \Vert Q_{T}), \end{aligned}$$
(134)

where \(P_{T|X} \cdot P_{XY}(t,y):= \sum _{x \in \mathcal{X}} P_{T|X}(t|x) P_{XY}(x,y)\). Then, we have

$$\begin{aligned}&D_{\varPsi _{\alpha ,\beta }}( P_{TX} \Vert Q_{TX}) \nonumber \\&\quad = (\beta -1)D( P_{T} \Vert Q_{T}) +\alpha \sum _{x \in \mathcal{X}}P_X(x)D( P_{T|X=x}\Vert Q_{T|X=x})\nonumber \\&\quad -\beta D( P_{T|X} \cdot P_{XY}\Vert Q_{T|X}\cdot P_{XY}) \nonumber \\&\quad \le -D( P_{T} \Vert Q_{T}) +\alpha \sum _{x \in \mathcal{X}}P_X(x)D( P_{T|X=x}\Vert Q_{T|X=x})\nonumber \\&\quad \le \alpha \sum _{x \in \mathcal{X}}P_X(x)D( P_{T|X=x}\Vert Q_{T|X=x}) = \alpha D( P_{TX} \Vert Q_{TX}). \end{aligned}$$
(135)

That is, the condition (A1) holds with \(\gamma =\alpha \). Therefore, with \(\gamma =\alpha \), Theorem 1 guarantees that Algorithm 1 converges to a local minimum, which was shown as [45, Theorem 3]. This fact shows the importance of choosing \(\gamma \) depending on the problem setting; that is, it shows the necessity of our problem setting with a general positive real number \(\gamma >0\).
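As an illustration, the update rule (132) with \(\varPsi _{\alpha ,\beta }\) from (128) can be sketched as follows. The joint distribution, the parameter choice \(\alpha =0.5\), \(\beta =1\), and the iteration count are arbitrary examples; \(\gamma =\alpha \) is taken as suggested by the convergence analysis above.

```python
import numpy as np

def ib_psi(Ptgx, Pxy, alpha, beta):
    """Psi_{alpha,beta}[P_{T|X} x P_X](t, x) as in (128).
    Ptgx[t, x] = P_{T|X}(t|x); Pxy[x, y] = P_{XY}(x, y)."""
    Px = Pxy.sum(axis=1)
    Py = Pxy.sum(axis=0)
    Pygx = Pxy / Px[:, None]                  # P_{Y|X}, indexed [x, y]
    Ptx = Ptgx * Px                           # joint P_{TX}
    Pt = Ptx.sum(axis=1)                      # marginal P_T
    Pty = Ptgx @ Pxy                          # P_{TY}(t, y) = sum_x P_{T|X} P_{XY}
    return (alpha * np.log(Ptx) - alpha * np.log(Px)[None, :]
            + (beta - 1.0) * np.log(Pt)[:, None]
            + beta * (np.log(Py)[None, :] - np.log(Pty)) @ Pygx.T)

def ib_update(Ptgx, Pxy, alpha, beta, gamma):
    """One step of the update rule (132)."""
    new = Ptgx * np.exp(-ib_psi(Ptgx, Pxy, alpha, beta) / gamma)
    return new / new.sum(axis=0)              # normalize over t for each x

def ib_objective(Ptgx, Pxy, alpha, beta):
    """The objective G_{alpha,beta} in (129)."""
    Px = Pxy.sum(axis=1)
    return float(((Ptgx * Px) * ib_psi(Ptgx, Pxy, alpha, beta)).sum())

# Arbitrary example with |T| = 2 and |X| = |Y| = 3.
Pxy = np.array([[0.20, 0.05, 0.05],
                [0.05, 0.25, 0.05],
                [0.05, 0.05, 0.25]])
Ptgx = np.array([[0.6, 0.5, 0.4],
                 [0.4, 0.5, 0.6]])
alpha, beta = 0.5, 1.0
for _ in range(30):
    Ptgx = ib_update(Ptgx, Pxy, alpha, beta, gamma=alpha)
print(ib_objective(Ptgx, Pxy, alpha, beta))
```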

The paper [45] also discussed the case when \(\mathcal{Y}\) and \(\mathcal{T}\) are quantum systems. It numerically compared these algorithms for several values of \(\gamma \) [45, Fig. 2]. This numerical calculation indicates the following behavior: when \(\gamma \) is larger than a certain threshold, a smaller \(\gamma \) realizes faster convergence, but when \(\gamma \) is smaller than the threshold, the algorithm does not converge.

8 Conclusion

We have proposed iterative algorithms with an acceleration parameter for a general minimization problem over a mixture family. For these algorithms, we have shown convergence theorems in various forms, one of which covers the case with approximated iterations. Then, we have applied our algorithms to various problems, including the em algorithm and several information-theoretic settings.

There are two existing studies that numerically evaluate the effect of the acceleration parameter \(\gamma \) [19, 45]. They reported improvement in the convergence by modifying the acceleration parameter \(\gamma \). For example, in the numerical calculation for information bottleneck in [45, Fig. 2], the case with \(\gamma =0.55 \) improves the convergence. Our numerical calculation for the commitment capacity has two cases. In one case, the choices \(\gamma =0.95,0.9\) do not improve the convergence, whereas in the other case they do. These facts show that the effect of the acceleration parameter \(\gamma \) depends on the parameters of the problem setting. The commitment capacity can be considered as a special case of the minimization of the divergence between a mixture family and an exponential family.

There are several future research directions. The first direction is the evaluation of the convergence speed of Algorithm 3, because we could not derive such an evaluation. The second direction is to find various applications of our methods. Although this paper studied several examples, more useful applications need to be found in order to clarify the usefulness of our algorithm. The third direction is the extension of our results. A typical extension is the extension to the quantum setting [46,47,48]. As a further extension, it is an interesting topic to extend our result to the setting with Bregman divergence. Recently, the Bregman proximal gradient algorithm has been studied for the minimization of a convex function [49,50,51]. Since this algorithm uses Bregman divergence, it might have an interesting relation with the above-extended algorithm. Therefore, it is an interesting study to investigate this relation.