1 Introduction

Orthogonal nonnegative matrix factorization (ONMF), first proposed by Ding et al. (2006), factorizes a nonnegative matrix into two nonnegative matrices under a one-sided orthogonal constraint imposed on the first factor matrix. That is, ONMF is a minimization problem:

$$\begin{aligned} \begin{array}{l} \min _{{\mathbf {F}},{\mathbf {G}}} \Vert {\mathbf {X}} - {\mathbf {F}}{\mathbf {G}}^{T} \Vert ^{2}_{F},\\ \text{ subject } \text{ to } {\mathbf {F}} \ge 0,\ {\mathbf {G}} \ge 0,\ {\mathbf {F}}^{T}{\mathbf {F}}={\mathbf {I}}, \end{array} \end{aligned}$$
(1)

where \({\mathbf {X}} \in \mathbb {R}^{M \times N}\), \({\mathbf {F}} \in \mathbb {R}^{M \times J}\), \({\mathbf {G}} \in \mathbb {R}^{N \times J}\) (\(J \ll N,M\)) and \({\mathbf {I}}\) is the identity matrix. In addition, T denotes the transpose and \(\Vert \cdot \Vert _{F}^{2}\) denotes the squared Frobenius norm (the sum of squared elements). In this formulation, \({\mathbf {F}}^{T}{\mathbf {F}}={\mathbf {I}}\) is imposed as a hard constraint, but the strict application of both nonnegativity and orthogonality is too restrictive: it forces the columns of \({\mathbf {F}}\) to be unit-norm vectors with mutually disjoint supports, so that each row of \({\mathbf {F}}\) has at most one nonzero element. Therefore, in a practical sense, the optimization problem is stated as

$$\begin{aligned} \min _{{\mathbf {F}},{\mathbf {G}}} \Vert {\mathbf {X}}-{\mathbf {FG}}^{T}\Vert ^{2}_{F}+\lambda \Vert {\mathbf {F}}^{T}{\mathbf {F}}-{\mathbf {I}}\Vert \end{aligned}$$
(2)

with a positive coefficient \(\lambda \). This corresponds to a Lagrangian formulation, as will be shown in the following section.
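For concreteness, the relaxed objective (2) can be evaluated as in the following sketch (the choice of the Frobenius norm for the penalty term and the function name are ours, for illustration only):

```python
import numpy as np

def onmf_objective(X, F, G, lam):
    """Relaxed ONMF objective (2): squared Frobenius residual plus a
    weighted penalty on the deviation of F^T F from the identity."""
    residual = np.linalg.norm(X - F @ G.T, 'fro') ** 2
    penalty = np.linalg.norm(F.T @ F - np.eye(F.shape[1]), 'fro')
    return residual + lam * penalty
```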

To the best of the authors’ knowledge, conventional algorithms for solving ONMF problems are all based on matrix-wise alternating block coordinate descent. However, matrix-wise update algorithms are known to require a relatively large number of iterations to converge, because they do not solve each conditional matrix-wise problem optimally (Cichocki and Anh-Huy 2009; Kim and Park 2011). In NMF without the orthogonal constraint, some state-of-the-art algorithms update \({\mathbf {F}}\) and \({\mathbf {G}}\) column by column or element by element to gain faster convergence. In ONMF, however, it is difficult to incorporate the orthogonal constraint into column-wise or element-wise coordinate descent updates.

In this paper, we propose a fast Hierarchical Alternating Least Squares (HALS) algorithm for ONMF (HALS-ONMF). Our algorithm is based on the column-wise update algorithm for NMF proposed by Cichocki and Anh-Huy (2009). To enable such a column-wise update even in ONMF, we derive a set of column-wise orthogonal constraints that take both nonnegativity and orthogonality into account at the same time. Furthermore, we show that the column-wise orthogonal constraint can also be applied to the column-wise update algorithm called scalar Block Coordinate Descent for Bregman divergence NMF (sBCD-NMF) (Li et al. 2012), in which the Frobenius norm in (1) is replaced with a more general Bregman divergence. The resulting sBCD-ONMF algorithm is the first algorithm to solve ONMF with Bregman divergence.

The rest of this paper is organized as follows. We summarize previously proposed NMF and ONMF algorithms, connecting them to the corresponding optimization criteria, in Sect. 2. Then we review HALS-NMF proposed by Cichocki and Anh-Huy (2009) and propose HALS-ONMF with a newly derived column-wise orthogonal constraint in Sect. 3. In Sect. 4, we incorporate the column-wise orthogonal constraint into sBCD-NMF proposed by Li et al. (2012) in order to propose the sBCD-ONMF algorithm. In Sect. 5, we present the results of experiments with the conventional and proposed algorithms on several real-life datasets. The conclusion is given in Sect. 6.

We will use a bold uppercase letter for a matrix, such as \({\mathbf {X}}\), and an italic lowercase letter for a vector, such as \({\varvec{x}}\). Both \({\mathbf {X}}_{ij}\) and \({\varvec{x}}_{ij}\) stand for the (i, j)th element of a matrix \({\mathbf {X}}\). The vector \({\mathbf {1}}_{J} \in \mathbb {R}^{J}\) denotes the vector whose elements are all one. In this paper, NMF means Frobenius norm NMF unless stated otherwise.

2 Related work

In this section, we provide a brief review of NMF and ONMF algorithms.

2.1 Nonnegative matrix factorization

NMF aims to find a nonnegative matrix \({\mathbf {F}}=[{\varvec{f}}_{1},{\varvec{f}}_{2},\ldots ,{\varvec{f}}_{J}] \in \mathbb {R}^{M \times J}_{+}\) and another nonnegative matrix \({\mathbf {G}}=[{\varvec{g}}_{1},{\varvec{g}}_{2},\ldots ,{\varvec{g}}_{J}] \in \mathbb {R}^{N \times J}_{+}\) whose product approximates a given nonnegative matrix \({\mathbf {X}} \in \mathbb {R}^{M \times N}_{+}\):

$$\begin{aligned} \begin{array}{l} \mathop {\min }\limits _{{{\mathbf {F}},{\mathbf {G}}}} \Vert {\mathbf {X}} - {\mathbf {F}}{\mathbf {G}}^{T} \Vert ^{2}_{F}, \\ \text{ subject } \text{ to } {\mathbf {F}} \ge 0,\ {\mathbf {G}} \ge 0. \end{array} \end{aligned}$$
(3)

Since the NMF problem is not jointly convex in \({\mathbf {F}}\) and \({\mathbf {G}}\), various iterative algorithms have been proposed (Lee and Seung 2000; Cichocki et al. 2009; Kim and Park 2011; Hsieh and Dhillon 2011). They can be categorized according to the unit of update as follows.

2.1.1 Matrix-wise update algorithms

Lee and Seung (2000) proposed a Multiplicative Update (MU) algorithm. The MU algorithm was one of the efficient algorithms for NMF proposed at an early stage, and thus many extensions followed (e.g., Cai et al. 2011; Cichocki et al. 2009). From the viewpoint of convergence, however, these algorithms are not satisfactory (Kim et al. 2014). Lin (2007) proposed a Projected Gradient Descent (PGD) algorithm for NMF. This algorithm solves an NMF problem by alternately solving Nonnegative Least Squares (NLS) problems for \({\mathbf {F}}\) and \({\mathbf {G}}\), and it converges faster than MU algorithms. The difference between these algorithms is that the MU algorithm uses a fixed step size in the gradient descent, while PGD uses a flexible step size.

2.1.2 Vector-wise update algorithms

Cichocki and Anh-Huy (2009) proposed a Hierarchical Alternating Least Squares (HALS) algorithm. The HALS algorithm solves a set of column-wise NLS problems and updates \({\mathbf {F}}\) and \({\mathbf {G}}\) column by column. Since each column-wise NLS problem can be solved accurately and efficiently, HALS converges very fast. Kim and Park (2011) proposed an active-set-like algorithm that also decomposes a matrix NLS problem into a set of column-wise sub-problems. The difference between HALS and the active-set-like method lies in how a column-wise sub-problem is solved: the former uses the gradient, while the latter uses an active-set method. The active-set method consists of two stages: first, it finds a feasible (nonnegative) point, and then it solves the column-wise NLS problem while maintaining feasibility. Li et al. (2012) recently proposed a scalar Block Coordinate Descent (sBCD) algorithm. The sBCD algorithm is applicable not only to NMF with the Frobenius norm but also to NMF with more general Bregman divergences. They used a Taylor series expansion to derive the element-wise problem. Since the sBCD algorithm uses the column-wise residual in its update rule, its complexity is the same as that of column-wise update algorithms; therefore, in this paper, we regard sBCD as a column-wise update algorithm (see Sect. 4.1). All of these vector-wise update algorithms can be regarded as state-of-the-art, because they converge empirically faster than matrix-wise update algorithms. However, adding matrix-based constraints such as \({\mathbf {F}}^{T}{\mathbf {F}} = {\mathbf {I}}\) to such column-wise updates is still challenging.

2.1.3 Element-wise update algorithms

Hsieh and Dhillon (2011) proposed an element-wise update algorithm called Greedy Coordinate Descent (GCD). To the authors’ knowledge, it is the fastest algorithm for NMF. The GCD algorithm takes a greedy strategy to decrease the value of the objective function: it selects and updates the variables that contribute most to the minimization. The low computational cost of GCD comes from the fact that it does not update unnecessary elements. Unfortunately, the GCD algorithm cannot handle a constraint that affects all elements of one column at the same time. One example of such a constraint is the graph-regularization term that appears when we minimize \( \alpha (\text {tr}({\mathbf {F}}^{T}{\mathbf {LF}}))\), where \({\mathbf {L}}\) is a graph Laplacian matrix of \({\mathbf {X}}^{T}{\mathbf {X}}\). GCD relies on the fact that, with a fixed \({\mathbf {G}}\), updating an element \(f_{ij}\) of \({\mathbf {F}}\) changes only the gradients of the elements in the same row \({\varvec{f}}_{i:}\), because the gradient in \({\mathbf {F}}\) is given by \( (-2{\mathbf {XG}}+2{\mathbf {FG}}^{T}{\mathbf {G}})\). In more detail, GCD iteratively selects and updates the most contributing variable \(f_{ij}\) in the ith row. Unfortunately, GCD is not applicable to ONMF because the orthogonality condition requires interaction between different rows.

2.2 Orthogonal NMF

An additional orthogonality constraint, \({\mathbf {F}}^{T}{\mathbf {F}}={\mathbf {I}}\), is imposed in ONMF. We first briefly review the first ONMF algorithm, proposed by Ding et al. (2006), and reveal the problem behind ONMF.

The goal of ONMF is to find a nonnegative orthogonal matrix \({\mathbf {F}}\) and a nonnegative matrix \( {\mathbf {G}}\) minimizing the following objective function with a Lagrange multiplier \(\lambda \),

$$\begin{aligned} L({\mathbf {F}},{\mathbf {G}})= \Vert {\mathbf {X}} - {\mathbf {F}} {\mathbf {G}}^{T} \Vert ^{2}_{F} + \text {Tr}[ \lambda ({\mathbf {F}}^{T}{\mathbf {F}} - {\mathbf {I}})]. \end{aligned}$$
(4)

The KKT complementary condition gives

$$\begin{aligned} (-2 {\mathbf {XG}} + 2 {\mathbf {FG}}^{T}{\mathbf {G}}+ 2 {\mathbf {F}}\lambda )_{nj} {\mathbf {F}}^{2}_{nj}=0,\ n=1,2,\ldots ,N,\ j=1,2,\ldots J. \end{aligned}$$
(5)

Then the update rule of the constrained matrix \({\mathbf {F}}\) is derived as

$$\begin{aligned} {\mathbf {F}}_{nj} \leftarrow {\mathbf {F}}_{nj} \sqrt{\frac{({\mathbf {XG}})_{nj}}{[ {\mathbf {F}}( {\mathbf {G}}^{T}{\mathbf {G}} + \lambda )]_{nj}}}. \end{aligned}$$
(6)

The point is how to determine the value of the Lagrange multiplier \(\lambda \). Since it is not easy to solve this problem for every value of \(\lambda \), Ding et al. (2006) ignored the nonnegativity and relied only on \({\mathbf {F}}^{T}{\mathbf {F}}={\mathbf {I}}\) to obtain approximate values of the off-diagonal elements. By multiplying (5) by \({\mathbf {F}}^{T}\) from the left, we have

$$\begin{aligned} \lambda ={\mathbf {F}}^{T}{\mathbf {X}}{\mathbf {G}}-{\mathbf {G}}^{T}{\mathbf {G}}. \end{aligned}$$

Thus, inserting this \(\lambda \) into (6), we have the final update form as

$$\begin{aligned} {\mathbf {F}}_{nj} \leftarrow {\mathbf {F}}_{nj} \sqrt{\frac{({\mathbf {XG}})_{nj}}{ ({\mathbf {F}}{\mathbf {F}}^{T}{\mathbf {X}}{\mathbf {G}})_{nj}}}. \end{aligned}$$

Note that their formulation with these specific values of \(\lambda \) does not strictly satisfy the orthogonality. Nevertheless, this specification is useful in avoiding the zero-lock problem that appears both in ONMF and in NMF: once an element becomes zero in the middle of the iterations, it can never become nonzero again in the following steps [see the multiplicative update rule (6)]. Besides, when the orthogonality is strictly imposed together with nonnegativity, each row vector of \({\mathbf {F}}\) must have only one non-zero value. That is, any algorithm using a multiplicative update rule easily falls into the zero-lock problem. Therefore, ONMF algorithms put the first priority on the approximation while loosening the degree of orthogonality.
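For illustration, this multiplicative update of \({\mathbf {F}}\) can be written in a few lines of NumPy; the small constant guarding against division by zero is our addition and not part of the original rule:

```python
import numpy as np

def ding_update_F(X, F, G, eps=1e-12):
    """One multiplicative update of the orthogonality-constrained factor F,
    following the rule F_nj <- F_nj * sqrt((XG)_nj / (F F^T X G)_nj)."""
    XG = X @ G
    denom = F @ (F.T @ XG)
    return F * np.sqrt(XG / np.maximum(denom, eps))
```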

As a result of this compromise, an ONMF algorithm can be seen as an algorithm that balances the trade-off between orthogonality and approximation with a weighting parameter \( \lambda \), as in (2). We do not categorize ONMF algorithms by the unit of update because all conventional ONMF algorithms are based only on matrix-wise updates. Rather, those algorithms should be categorized according to whether or not they employ a weighting parameter. If an algorithm minimizes an objective function with a weighting parameter and the value is not chosen appropriately, the algorithm fails to attain an acceptable degree of either approximation or orthogonality. Such failures have often been reported in past experimental results (Li et al. 2010; Mirzal 2014; Pompili et al. 2012).

2.2.1 Without a weighting parameter

Ding et al. (2006) proposed the first ONMF algorithm, based on the MU algorithm (Lee and Seung 2000). This algorithm has no weighting parameter; it approximately solves the Lagrangian (4), as reviewed above. Yoo and Choi (2008) also proposed an MU-based algorithm. They used the gradient on the Stiefel manifold, the set of matrices with orthonormal columns. The gradient on the Stiefel manifold is compatible with the MU algorithm because the manifold constrains every matrix to be orthogonal and the employed MU algorithm guarantees nonnegative values.

2.2.2 With a weighting parameter

Mirzal proposed a convergent algorithm that is also based on the MU algorithm. He proposed two algorithms in Mirzal (2014), one of which is the same as the one by Li et al. (2010). The first algorithm introduces a weighting parameter \(\alpha \) instead of the Lagrange multiplier \(\lambda \) in (4). The second algorithm is a convergent algorithm: its convergence is proved, but its computational cost is high. In this algorithm, the zero-lock problem is forcibly avoided by replacing zero values with a small positive value \(\epsilon \). There are also algorithms that put the first priority on nonnegativity rather than orthogonality. Pompili et al. (2012) tackled the zero-lock problem directly. They employed the Augmented Lagrangian method: they used the gradient on the Stiefel manifold and explicitly introduced a Lagrange multiplier \(\psi \) for nonnegativity. The multiplier is initialized to a small value in order to avoid the zero-lock problem, and its value is increased gradually to strengthen the nonnegativity as the iterations proceed. As a result, the nonnegativity is not strictly guaranteed by the algorithm. In addition, it has three parameters that must be set appropriately, for orthogonality, nonnegativity and step size.

There are mainly two problems to be solved in order to develop fast ONMF algorithms. First, we have to incorporate the matrix-type orthogonality condition \({\mathbf {F}}^{T}{\mathbf {F}}={\mathbf {I}}\) into column-wise or element-wise updating NMF algorithms; this is necessary for efficiency. Second, we need to solve the zero-lock problem; this is necessary to find an appropriate balance between orthogonality and nonnegativity without a weighting parameter. In this paper, we show a way to realize both in ONMF algorithms (Table 1).

Table 1 A summary of the categorization of ONMF algorithms. Frobenius, KL and Bregman denote the distortion measures used for measuring the degree of approximation

3 Hierarchical alternating least squares algorithm for ONMF

In this section, we show a way of utilizing the HALS for ONMF. First, we briefly review the HALS for standard NMF and then describe how to incorporate the orthogonal constraint column-wisely to propose HALS-ONMF.

3.1 Hierarchical alternating least squares for NMF

The key idea of HALS is an efficient decomposition of the residual. Suppose that all of the elements of matrices \({\mathbf {F}}\) and \({\mathbf {G}}\) are fixed except for the jth columns \({\varvec{f}}_{j}\) and \({\varvec{g}}_{j}\). Since \({\mathbf {FG}}^{T}=\sum _{j=1}^{J} {\varvec{f}}_{j}{\varvec{g}}^{T}_{j}\), the objective function (3) can be minimized by finding \({\varvec{f}}_{j}\) and \({\varvec{g}}_{j}\) that are more appropriate than the current ones:

$$\begin{aligned} \min _{{\varvec{f}}_{j},{\varvec{g}}_{j}}J_{j}=\left\| {\mathbf {X}}^{(j)}-{\varvec{f}}_{j}{\varvec{g}}^{T}_{j}\right\| ^{2}_{F}, \end{aligned}$$
(7)

where \({\mathbf {X}}^{(j)}\triangleq {\mathbf {X}} - \sum _{k \ne j} {\varvec{f}}_{k} {\varvec{g}}^{T}_{k}\) is the residual. Since \({\varvec{f}}_{j}\) interacts only with \({\varvec{g}}_{j}\) in (7), HALS alternately minimizes (7) for \(j=1,2,\ldots ,J,1,2,\ldots \), until convergence, keeping the nonnegativity constraints \({\varvec{f}}_{j} \ge 0\) and \({\varvec{g}}_{j} \ge 0\). The objective function (7) with nonnegativity constraints can be considered as a Nonnegative Least Squares (NLS) problem, and HALS solves the set of such NLS problems.

In order to find a stationary point, the gradients of (7) in \({\varvec{f}}_{j}\) and \({\varvec{g}}_{j}\) are calculated:

$$\begin{aligned} {\mathbf {0}}=\frac{\partial J_{j}}{\partial {\varvec{f}}_{j}}= & {} {\varvec{f}}_{j}{\varvec{g}}_{j}^{T}{\varvec{g}}_{j}-{\mathbf {X}}^{(j)}{\varvec{g}}_{j}, \text{ and } \end{aligned}$$
(8)
$$\begin{aligned} {\mathbf {0}}=\frac{\partial J_{j}}{\partial {\varvec{g}}_{j}}= & {} {\varvec{g}}_{j}{\varvec{f}}_{j}^{T}{\varvec{f}}_{j}-{\mathbf {X}}^{(j)T}{\varvec{f}}_{j}. \end{aligned}$$
(9)

Hence, we have the following update rules:

$$\begin{aligned} {\varvec{f_{j}}}\leftarrow & {} \frac{1}{{\varvec{g_{j}}}^{T}{\varvec{g}}_{j}}[{\mathbf {X}}^{(j)}{\varvec{g}}_{j}]_{+}, \end{aligned}$$
(10)
$$\begin{aligned} {\varvec{g_{j}}}\leftarrow & {} \frac{1}{{\varvec{f_{j}}}^{T}{\varvec{f}}_{j}}[{\mathbf {X}}^{(j)T}{\varvec{f}}_{j}]_{+}, \end{aligned}$$
(11)

where \([x]_{+}=\text {max}(\epsilon ,x)\) (\(\epsilon \) being a sufficiently small positive value).

Without loss of generality, we may normalize \({\varvec{f}}_{j}\) so that \( \Vert {\varvec{f}}_{j} \Vert ^{2}_{2}=1\) after each update. With this normalization, the factor \({\varvec{g}}_{j}^{T}{\varvec{g}}_{j}\) in (10) becomes irrelevant (\({\varvec{f}}_{j}\) is renormalized anyway), and \({\varvec{f}}_{j}^{T}{\varvec{f}}_{j}=1\) removes the factor in (11). The update rules (10) and (11) thus become simpler:

$$\begin{aligned} {\varvec{f_{j}}}\leftarrow & {} [{\mathbf {X}}^{(j)}{\varvec{g}}_{j}]_{+},\quad \text{ and, } \text{ after } \text{ re-normalization } \text{ to } \Vert {\varvec{f}}_{j}\Vert ^{2}_{2}=1,\\ {\varvec{g_{j}}}\leftarrow & {} [{\mathbf {X}}^{(j)T}{\varvec{f}}_{j}]_{+}. \end{aligned}$$

Since \({\mathbf {X}}^{(j)}={\mathbf {X}}-\sum _{k \ne j} {\varvec{f}}_{k}{\varvec{g}}_{k}^{T}={\mathbf {X}}-{\mathbf {F}}{\mathbf {G}}^{T}+{\varvec{f}}_{j}{\varvec{g}}_{j}^{T}\), we finally obtain the following column-wise update rules:

$$\begin{aligned} {\varvec{f_{j}}}\leftarrow & {} \left[ ({\mathbf {XG}})_{j}-{\mathbf {F}}({\mathbf {G}}^{T}{\mathbf {G}})_{j}+{\varvec{f}}_{j}{\varvec{g}}_{j}^{T}{\varvec{g}}_{j}\right] _{+}, \text{ and } \\ {\varvec{g_{j}}}\leftarrow & {} \left[ ({\mathbf {X}}^{T}{\mathbf {F}})_{j}-{\mathbf {G}}({\mathbf {F}}^{T}{\mathbf {F}})_{j}+{\varvec{g}}_{j}{\varvec{f}}_{j}^{T}{\varvec{f}}_{j}\right] _{+}. \end{aligned}$$

Note that \({\mathbf {XG}}\) and \({\mathbf {G}}^{T}{\mathbf {G}}\) do not change their values while the vectors \({\varvec{f}}_{j}\) \((j=1,\ldots ,J)\) are updated. Therefore, HALS computes \({\mathbf {XG}}\) and \({\mathbf {G}}^{T}{\mathbf {G}}\) before updating those vectors. Similarly, we pre-compute \({\mathbf {X}}^{T}{\mathbf {F}}\) and \({\mathbf {F}}^{T}{\mathbf {F}}\) before updating \({\varvec{g}}_{j}\) \((j=1,\ldots ,J)\). This is the HALS algorithm for standard NMF.
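A minimal NumPy sketch of one HALS sweep under these rules is given below, assuming dense arrays and the pre-computation strategy just described; the small constant implements the \([x]_{+}\) truncation:

```python
import numpy as np

def hals_nmf_sweep(X, F, G, eps=1e-16):
    """One sweep of column-wise HALS updates for F and G (Frobenius norm NMF)."""
    XG, GtG = X @ G, G.T @ G          # fixed while the columns of F are updated
    for j in range(F.shape[1]):
        f = XG[:, j] - F @ GtG[:, j] + F[:, j] * GtG[j, j]
        f = np.maximum(f, eps)
        F[:, j] = f / np.linalg.norm(f)            # normalize to ||f_j||_2 = 1
    XtF, FtF = X.T @ F, F.T @ F       # fixed while the columns of G are updated
    for j in range(G.shape[1]):
        g = XtF[:, j] - G @ FtF[:, j] + G[:, j] * FtF[j, j]
        G[:, j] = np.maximum(g, eps)
    return F, G
```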

3.2 Column-wise orthogonal constraint

Since \({\varvec{f}}_{j}\) affects the other columns in \({\mathbf {F}}^{T}{\mathbf {F}}\), the orthogonality constraint cannot be introduced directly into the HALS algorithm above. In this paper, we exploit a simple fact: if a sum of nonnegative values is zero, then all of the values are zero. Since the orthogonality condition \({\mathbf {F}}^{T}{\mathbf {F}}={\mathbf {I}}\) means \({\varvec{f}}_{k}^{T}{\varvec{f}}_{j}=0\) for every \(k \ne j\), we can use, for each fixed j, the single condition \(\sum _{k\ne j} {\varvec{f}}_{k}^{T}{\varvec{f}}_{j}=0\) coupled with \({\varvec{f}}_{k}^{T} {\varvec{f}}_{j} \ge 0\), instead of the \(J-1\) conditions \({\varvec{f}}_{k}^{T}{\varvec{f}}_{j}=0\) for every \(k \; (\ne j)\). That is, the one matrix condition \({\mathbf {F}}^T{\mathbf {F}}={\mathbf {I}}\) is equivalently replaced with 2J column-wise constraints, \({\varvec{f}}_{j}^{T}{\varvec{f}}_{j}=1\) and \(\sum _{k\ne j} {\varvec{f}}_{k}^{T}{\varvec{f}}_{j}=0\) for every j. As will be shown, the newly derived column-wise constraint can be evaluated in O(M) time per column (M being the number of rows of \({\mathbf {X}}\) to be factorized).

Now it suffices to impose the conditions

$$\begin{aligned} {\mathbf {F}}^{(j)T}{\varvec{f}}_{j} \triangleq \sum _{k \ne j} {\varvec{f}}_{k}^{T}{\varvec{f}}_{j}=0, \ \ j=1,2,\ldots ,J. \end{aligned}$$
(12)

In addition, we normalize each column vector so that \(\Vert {\varvec{f}}_{j}\Vert ^{2}={\varvec{f}}_{j}^{T}{\varvec{f}}_{j}=1\) to satisfy \({\mathbf {F}}^{T}{\mathbf {F}}={\mathbf {I}}\). Thus, we introduce the constraints \({\mathbf {F}}^{(j)T}{\varvec{f}}_{j}=0\) \((j=1,2,\ldots ,J)\) into (4) as the column-wise orthogonal constraint. The nonnegativity of the elements is preserved with the \(\epsilon \)-truncation function \([\,]_{+}\).

3.3 HALS-ONMF

With the derived column-wise constraint (12), the localized objective function is formulated as a Lagrangian:

$$\begin{aligned} L({\varvec{f}}_{j},{\varvec{g}}_{j},\lambda _{j})= & {} \left\| {\mathbf {X}}^{(j)}-{\varvec{f}}_{j}{\varvec{g}}^{T}_{j}\right\| ^{2}_{F} + \lambda _{j}\left( {\mathbf {F}}^{(j)T}{\varvec{f}}_{j}\right) , \ \ \text{ where } \\&{\mathbf {X}}^{(j)}={\mathbf {X}}- \sum _{k \ne j} {\varvec{f}}_{k} {\varvec{g}}^{T}_{k}, \\&{\mathbf {F}}^{(j)} = \sum _{k \ne j} {\varvec{f}}_{k}, \ \ \lambda _{j} \ge 0. \end{aligned}$$

The gradient is given as

$$\begin{aligned} \frac{\partial L}{\partial {\varvec{f}}_{j}}=-2{\mathbf {X}}^{(j)}{\varvec{g}}_{j}+2{\varvec{f}}_{j}{\varvec{g}}_{j}^{T}{\varvec{g}}_{j} +\lambda _{j}{\mathbf {F}}^{(j)}. \end{aligned}$$
(13)

By solving \(\partial L / \partial {\varvec{f}}_{j} = {\varvec{0}}\) and forcibly keeping the nonnegativity, we obtain the following update rule, where the normalization \({\varvec{f}}_{j}^{T}{\varvec{f}}_{j}=1\) is applied as post-processing:

$$\begin{aligned} {\varvec{f}}_{j} \leftarrow \left[ {\mathbf {X}}^{(j)}{\varvec{g}}_{j} - \frac{\lambda _{j}}{2}{\mathbf {F}}^{(j)}\right] _{+}. \end{aligned}$$
(14)

Unfortunately, how to set the value of \(\lambda _{j}\) still remains a problem. In this study, we take the same approach as Ding et al. (2006). By multiplying (13) by \({\mathbf {F}}^{(j)T}\) from the left and using \({\mathbf {F}}^{(j)T}{\varvec{f}}_{j}=0\), we obtain

$$\begin{aligned} \lambda _{j} = \frac{2{\mathbf {F}}^{(j)T}{\mathbf {X}}^{(j)}{\varvec{g}}_{j}}{{\mathbf {F}}^{(j)T}{\mathbf {F}}^{(j)}}. \end{aligned}$$

Hence, (14) becomes

$$\begin{aligned} {\varvec{f}}_{j} \leftarrow \left[ {\mathbf {X}}^{(j)}{\varvec{g}}_{j}-\frac{{\mathbf {F}}^{(j)T}{\mathbf {X}}^{(j)}{\varvec{g}}_{j}}{{\mathbf {F}}^{(j)T}{\mathbf {F}}^{(j)}}{\mathbf {F}}^{(j)}\right] _{+}. \end{aligned}$$
(15)

Since the orthogonal constraint \({\mathbf {F}}^{(j)T}{\varvec{f}}_{j}=0\) does not affect \({\varvec{g}}_{j}\), we can use the same update rule as HALS-NMF, that is, with (11),

$$\begin{aligned} {\varvec{f}}_{j}\leftarrow & {} \left[ {\mathbf {X}}^{(j)}{\varvec{g}}_{j}-\frac{{\mathbf {F}}^{(j)T}{\mathbf {X}}^{(j)}{\varvec{g}}_{j}}{{\mathbf {F}}^{(j)T}{\mathbf {F}}^{(j)}}{\mathbf {F}}^{(j)}\right] _{+} , \text{ and } \\ {\varvec{g_{j}}}\leftarrow & {} \left[ {\mathbf {X}}^{(j)T}{\varvec{f}}_{j}\right] _{+}. \end{aligned}$$

Using \({\mathbf {X}}^{(j)}={\mathbf {X}}-\sum _{k \ne j} {\varvec{f}}_{k}{\varvec{g}}_{k}^{T}={\mathbf {X}}-{\mathbf {F}}{\mathbf {G}}^{T}+{\varvec{f}}_{j}{\varvec{g}}_{j}^{T}\), we have the final form of updating rules:

$$\begin{aligned} {\varvec{f}}_{j}\leftarrow & {} \left[ {\varvec{h}}-\frac{{\mathbf {F}}^{(j)T}{\varvec{h}}}{{\mathbf {F}}^{(j)T}{\mathbf {F}}^{(j)}}{\mathbf {F}}^{(j)}\right] _{+},\\ {\varvec{f}}_{j}\leftarrow & {} {\varvec{f}}_{j} / \Vert {\varvec{f}}_{j}\Vert _{2}, \ \text { and } \\ {\varvec{g}}_{j}\leftarrow & {} \left[ ({\mathbf {X}}^{T}{\mathbf {F}})_{j}-{\mathbf {G}}({\mathbf {F}}^{T}{\mathbf {F}})_{j}+{\varvec{g}}_{j}{\varvec{f}}_{j}^{T}{\varvec{f}}_{j}\right] _{+},\ \,\text{ where }\\&{\varvec{h}} =({\mathbf {XG}})_{j}-{\mathbf {F}}({\mathbf {G}}^{T}{\mathbf {G}})_{j}+{\varvec{f}}_{j}{\varvec{g}}_{j}^{T}{\varvec{g}}_{j}. \end{aligned}$$

The zero-lock problem is resolved by the \([\ ]_{+}\) operation, as in Mirzal (2014). The proposed HALS-ONMF algorithm is shown in Algorithm 1.

This vector-wise update algorithm is faster than conventional matrix-wise update algorithms for the following reason. The matrix-wise update rule (6) is derived from (5), while the vector-wise update rule (15) of the proposed HALS-ONMF is derived from (13). The former comes from the KKT complementary condition, which is only a necessary condition for a solution minimizing (4); therefore, there is no guarantee that each update is optimal. In the latter, by contrast, the corresponding optimization problem can be solved analytically in closed form, so the update is optimal in every iteration.

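The following is a minimal NumPy sketch of one sweep of the updates derived above (our reading of Algorithm 1, which is given as pseudocode in the original; the function name and small constants are illustrative):

```python
import numpy as np

def hals_onmf_sweep(X, F, G, eps=1e-16):
    """One sweep of the column-wise HALS-ONMF updates."""
    XG, GtG = X @ G, G.T @ G
    for j in range(F.shape[1]):
        h = XG[:, j] - F @ GtG[:, j] + F[:, j] * GtG[j, j]
        Fj = F.sum(axis=1) - F[:, j]               # F^{(j)}: sum of the other columns
        coeff = (Fj @ h) / max(Fj @ Fj, eps)       # lambda_j / 2 in the text
        f = np.maximum(h - coeff * Fj, eps)        # orthogonality-corrected update
        F[:, j] = f / np.linalg.norm(f)            # normalize to ||f_j||_2 = 1
    XtF, FtF = X.T @ F, F.T @ F
    for j in range(G.shape[1]):
        g = XtF[:, j] - G @ FtF[:, j] + G[:, j] * FtF[j, j]
        G[:, j] = np.maximum(g, eps)
    return F, G
```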

4 ONMF with Bregman divergence

In this section, we consider a wider class of ONMF problems in which a Bregman divergence is used instead of the Frobenius norm to measure the degree of approximation. For NMF, Li et al. (2012) already proposed a column-wise update algorithm called scalar Block Coordinate Descent (sBCD) to solve Bregman divergence NMF. In this paper, we develop Bregman divergence ONMF by incorporating the column-wise orthogonal constraint into their sBCD algorithm. We first briefly review the sBCD algorithm (Li et al. 2012) and then explain how our column-wise orthogonal constraint can be incorporated into sBCD.

4.1 Scalar block coordinate descent algorithm (sBCD)

The objective function is now given as

$$\begin{aligned} \begin{array}{l} \min _{{\mathbf {F}},{\mathbf {G}}} D_{\phi }({\mathbf {X}} || {\mathbf {F}}{\mathbf {G}}^{T}),\\ \text{ subject } \text{ to } {\mathbf {F}} \ge 0, {\mathbf {G}} \ge 0, \end{array} \end{aligned}$$
(16)

where \(D_{\phi }({\mathbf {A}}||{\mathbf {B}})\) is a Bregman divergence between matrices \({\mathbf {A}}\) and \({\mathbf {B}}\) using a strictly convex function \(\phi \). The definition of Bregman divergence is as follows.

Definition 1

(Bregman divergence) Let \(\phi :S \subseteq \mathbb {R} \rightarrow \mathbb {R}\) be a strictly convex function with a continuous first derivative \(\nabla \phi \). Then the Bregman divergence corresponding to \(\phi \), \(D_{\phi } : S \times \mathrm {int}(S) \rightarrow \mathbb {R}_{+}\), is defined as \(D_{\phi }(x,y) = \phi (x) - \phi (y) -\nabla \phi (y)(x-y)\), where \(\mathrm {int}(S)\) is the interior of S.

A Bregman divergence for scalars is extended to one for matrices by \(D_{\phi }({\mathbf {A}}||{\mathbf {B}}) = \sum _{m,n}D_{\phi } ({\mathbf {A}}_{mn}||{\mathbf {B}}_{mn})\). Bregman divergences include many well-known distortion measures such as the Frobenius norm and the KL-divergence (Table 2). Recently, Li et al. proposed a column-wise update algorithm for Bregman divergence NMF. The key idea of the update rules is a Taylor series expansion of the Bregman divergence.

Table 2 Examples of Bregman Divergence
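A generic Bregman divergence between two matrices can be evaluated directly from Definition 1, as sketched below; the two example instantiations (squared Euclidean with \(\phi (x)=x^{2}/2\) and Itakura-Saito with \(\phi (x)=-\log x\), the latter used in Sect. 5) assume strictly positive entries where the definition requires them:

```python
import numpy as np

def bregman_divergence(X, Y, phi, grad_phi):
    """Elementwise Bregman divergence D_phi(X || Y) summed over all entries:
    D_phi(x, y) = phi(x) - phi(y) - grad_phi(y) * (x - y)."""
    return np.sum(phi(X) - phi(Y) - grad_phi(Y) * (X - Y))

# Example instantiations (illustrative):
euclidean = lambda X, Y: bregman_divergence(X, Y, lambda v: v**2 / 2, lambda v: v)
itakura_saito = lambda X, Y: bregman_divergence(X, Y, lambda v: -np.log(v), lambda v: -1.0 / v)
```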

Let \(E_{t}(a||b) \triangleq |a-b|^{t}\) and \(E_{t}({\mathbf {X}}||{\mathbf {X}}') \triangleq \sum _{mn}|x_{mn} -x_{mn}'|^{t}\) be the tth power of t-norm distance. Then, for \({\mathbf {X}}'={\mathbf {FG}}^{T}\), we have

$$\begin{aligned} \min _{{\mathbf {X}}'}E_{t}({\mathbf {X}}||{\mathbf {X}}')=\min _{{\mathbf {F}}{\mathbf {G}}}E_{t}({\mathbf {X}}||{\mathbf {FG}}^{T})= \min _{\forall j {\varvec{f}}_{j},{\varvec{g}}_{j}}E_{t}\left( {\mathbf {X}}^{(j)}||{\varvec{f}}_{j}{\varvec{g}}_{j}^{T}\right) . \end{aligned}$$

We want to connect \(E_{t}({\mathbf {X}}^{(j)}||{\varvec{f}}_{j}{\varvec{g}}_{j}^{T})\) with Bregman divergence \(D_{\phi }({\mathbf {X}}||{\mathbf {FG}}^{T})\) to minimize \(E_{t}({\mathbf {X}}^{(j)}||{\varvec{f}}_{j}{\varvec{g}}_{j}^{T})\). In the scalar case, by applying the Taylor series of \(\phi (x)\) at \(x=b\) to \(\phi (a)\), we have

$$\begin{aligned} D_{\phi }(a||b)= & {} \phi (a)-\phi (b) - \nabla \phi (b)(a-b) \nonumber \\= & {} \nabla \phi (b)(a-b) + \sum _{t=2}^{\infty } \frac{\nabla ^{t}\phi (b)}{t!}(a-b)^{t}\nonumber \\&\quad -\,\nabla \phi (b)(a-b)\quad \text {(Taylor series of the first two terms)} \nonumber \\= & {} \sum _{t=2}^{\infty } \frac{\nabla ^{t}\phi (b)}{t!}(a-b)^{t} \nonumber \\= & {} \sum _{t=2}^{\infty } \frac{\nabla ^{t}\phi (b)}{t!}(-\text {sgn}(b-a))^{t} E_{t}(a||b), \end{aligned}$$
(17)

where \(\nabla ^{t}\phi (b)\) is the tth-order derivative of \(\phi (x)\) at \(x=b\). The last equation comes from the relation \( (a-b)^{t}= (\text {sgn}(a-b))^{t} |a-b|^{t}\). Hence, as a natural extension, \(D_{\phi }({\mathbf {X}}||{\mathbf {FG}}^{T})\) can be re-written as

$$\begin{aligned} D_{\phi }({\mathbf {X}} ||{\mathbf {FG}}^{T})=\sum _{mn}\sum _{t=2}^{\infty } \frac{\nabla ^{t}\phi \left( x'_{mn}\right) }{t!}\left( -\text {sgn}\left( x'_{mn}-x_{mn}\right) \right) ^{t} E_{t}\left( x^{(j)}_{mn}||f_{mj}g_{nj}\right) , \end{aligned}$$

where \(x_{mn}^{(j)}=({\mathbf {X}}-\sum _{k \ne j}{\varvec{f}}_{k}{\varvec{g}}_{k}^{T})_{mn}\) and \(x_{mn}'=({\mathbf {FG}}^{T})_{mn}\). Thus, we can use the partial derivative of \(E_{t}(x^{(j)}_{mn}||f_{mj}g_{nj})\) instead of that of \(D_{\phi }({\mathbf {X}} ||{\mathbf {FG}}^{T})\). Since

$$\begin{aligned}&\frac{\partial }{\partial f_{mj}}\left( \frac{\nabla ^{t}\phi \left( x_{mn}'\right) }{t!}\left( -\text {sgn}\left( f_{mj}g_{nj}-x_{mn}^{(j)}\right) \right) ^{t} E_{t}\left( x^{(j)}_{mn}||f_{mj}g_{nj}\right) \right) \\&\qquad =-g_{nj}\frac{\nabla ^{t}\phi \left( x_{mn}'\right) }{(t-1)!}\left( x_{mn}^{(j)}-f_{mj}g_{nj}\right) ^{t-1} + g_{nj}\frac{\nabla ^{t+1}\phi \left( x_{mn}'\right) }{t!}\left( x_{mn}^{(j)}-f_{mj}g_{nj}\right) ^{t}, \end{aligned}$$

with (17), we have

$$\begin{aligned} \frac{\partial D_{\phi }\left( x_{mn}||x_{mn}'\right) }{\partial f_{mj}}= & {} g_{nj}\nabla ^{2}\phi \left( x_{mn}'\right) \left( f_{mj}g_{nj}-x_{mn}^{(j)}\right) \\&+\sum _{t=2}^{\infty }\left( -g_{nj}\frac{\nabla ^{t+1} \phi \left( x_{mn}'\right) }{t!}\left( x_{mn}^{(j)}-f_{mj}g_{nj}\right) ^{t} \right. \\&\left. +\,g_{nj}\frac{\nabla ^{t}\phi \left( x_{mn}'\right) }{t!}\left( x_{mn}^{(j)}-f_{mj}g_{nj}\right) ^{t}\right) \\= & {} g_{nj}\nabla ^{2}\phi \left( x_{mn}'\right) \left( f_{mj}g_{nj}-x_{mn}^{(j)}\right) . \end{aligned}$$

Summing over the columns (\(n=1,\ldots ,N\)), we obtain the gradient of \(D_{\phi }({\mathbf {X}}||{\mathbf {FG}}^{T})\) in \(f_{mj}\):

$$\begin{aligned} \frac{\partial D_{\phi }\left( {\mathbf {X}}||{\mathbf {FG}}^{T}\right) }{\partial f_{mj}}=\sum _{n=1}^{N} g_{nj}\nabla ^{2}\phi \left( x_{mn}'\right) \left( f_{mj}g_{nj}-x_{mn}^{(j)}\right) . \end{aligned}$$
(18)

Finally, the update rule of sBCD is given by

$$\begin{aligned} f_{mj} \leftarrow \left[ \frac{\sum _{n=1}^{N} \nabla ^{2}\phi \left( x_{mn}'\right) x_{mn}^{(j)} g_{nj}}{\sum _{n=1}^{N}\nabla ^{2}\phi \left( x_{mn}'\right) g_{nj}g_{nj}}\right] _{+}. \end{aligned}$$
(19)

This sBCD update (19) needs the column-wise residual \({\mathbf {X}}^{(j)}={\mathbf {X}}-\sum _{k \ne j} {\varvec{f}}_{k}{\varvec{g}}_{k}^{T}\) for \(x_{mn}^{(j)}\). Therefore, instead of the element-wise update (19), we adopt the following column-wise form:

$$\begin{aligned} {\varvec{f}}_{j} \leftarrow \left[ \frac{ (\nabla ^{2}\phi ({\mathbf {FG}}^{T}) \odot {\mathbf {X}}^{(j)}){\varvec{g}}_{j}}{\nabla ^{2}\phi ({\mathbf {FG}}^{T}) {\varvec{g}}^2_{j}}\right] _{+}. \end{aligned}$$
(20)
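A sketch of the column-wise update (20) in NumPy is given below; the function name and the \(\epsilon \) truncation are ours. For the Frobenius norm one would pass `d2phi = lambda B: np.ones_like(B)`, and for the IS-divergence `d2phi = lambda B: 1.0 / B**2`:

```python
import numpy as np

def sbcd_update_f_column(X, F, G, j, d2phi, eps=1e-16):
    """Column-wise sBCD update (20) for f_j with G fixed; d2phi maps the
    current approximation FG^T to the elementwise second derivative of phi."""
    FG = F @ G.T
    W = d2phi(FG)                                # nabla^2 phi(FG^T), elementwise
    Xj = X - FG + np.outer(F[:, j], G[:, j])     # column-wise residual X^{(j)}
    numer = (W * Xj) @ G[:, j]
    denom = W @ (G[:, j] ** 2)
    return np.maximum(numer / np.maximum(denom, eps), eps)
```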

4.2 Bregman divergence ONMF

Now, we introduce the orthogonal constraint into Bregman divergence NMF to have Bregman divergence ONMF. The minimization problem of Bregman divergence ONMF is given by

$$\begin{aligned} \begin{array}{l} \min _{{\mathbf {F}},{\mathbf {G}}} D_{\phi }({\mathbf {X}} || {\mathbf {F}}{\mathbf {G}}^{T}) \\ \text{ subject } \text{ to } {\mathbf {F}} \ge 0, {\mathbf {G}} \ge 0, \ {\mathbf {F}}^{T}{\mathbf {F}} = {\mathbf {I}}. \end{array} \end{aligned}$$
(21)

For the same reason as that stated before, we solve its relaxed version:

$$\begin{aligned} \begin{array}{l} \min _{{\mathbf {F}},{\mathbf {G}}} D_{\phi }({\mathbf {X}} || {\mathbf {F}}{\mathbf {G}}^{T}) + \lambda \Vert {\mathbf {F}}^{T}{\mathbf {F}}- {\mathbf {I}}\Vert \\ \text{ subject } \text{ to } {\mathbf {F}} \ge 0, {\mathbf {G}} \ge 0. \end{array} \end{aligned}$$

This problem can be re-written equivalently and column-wisely as

$$\begin{aligned} O= & {} \min _{\forall j \ {\varvec{f}}_{j},{\varvec{g}}_{j},\lambda _{j}} D_{\phi }\left( {\mathbf {X}} || \left( \sum _{k \ne j} {\varvec{f}}_{k}{\varvec{g}}_{k}^{T}\right) + {\varvec{f}}_{j}{\varvec{g}}_{j}^{T}\right) + \lambda _{j} {\mathbf {F}}^{(j)T}{\varvec{f}}_{j},\nonumber \\&\quad \text{ where }\,\,{\mathbf {F}}^{(j)} = \sum _{k \ne j} {\varvec{f}}_{k},\,\, \lambda _{j} \ge 0. \end{aligned}$$
(22)

Note that the first term on the RHS of (22) is equal to the objective of (21). Hence, we have

$$\begin{aligned} \frac{\partial O}{\partial {\varvec{f_{j}}}}= \left( \nabla ^{2}\phi \left( {\mathbf {FG}}^{T}\right) \odot \left( {\varvec{f}}_{j}{\varvec{g}}_{j}^{T}-{\mathbf {X}}^{(j)}\right) \right) {\varvec{g}}_{j}+ \lambda _{j} {\mathbf {F}}^{(j)}. \end{aligned}$$
(23)

To determine the value of the Lagrange multiplier \(\lambda _{j}\), we again assume \({\mathbf {F}}^{(j)T}{\varvec{f}}_{j}=0\), multiply (23) by \({\mathbf {F}}^{(j)T}\) from the left, and set it to zero. This gives

$$\begin{aligned} \lambda _{j}{\mathbf {F}}^{(j)T}{\mathbf {F}}^{(j)}= -{\mathbf {F}}^{(j)T}\left( {\varvec{f}}_{j}{\varvec{g}}_{j}^{T} \odot \nabla ^{2}\phi ({\mathbf {FG}}^{T})\right) {\varvec{g}}_{j} + {\mathbf {F}}^{(j)T}\left( {\mathbf {X}}^{(j)} \odot \nabla ^{2}\phi ({\mathbf {FG}}^{T})\right) {\varvec{g}}_{j}. \end{aligned}$$

Under nonnegativity \({\varvec{f}}_{j} \ge 0\) and \({\mathbf {F}}^{(j)} \ge 0\), with the assumption \({\mathbf {F}}^{(j)T}{\varvec{f}}_{j}=0\), we have

$$\begin{aligned} {\varvec{f}}_{j},{\mathbf {F}}^{(j)} \ge 0\,\,\text{ and }\,\, {\mathbf {F}}^{(j)T}{\varvec{f}}_{j}=0 \Rightarrow {\mathbf {F}}^{(j)T}\left( {\varvec{f}}_{j}{\varvec{g}}_{j}^{T} \odot \nabla ^{2}\phi ({\mathbf {FG}}^{T})\right) {\varvec{g}}_{j}=0. \end{aligned}$$

This is because the row indices of zero values of \(({\varvec{f}}_{j}{\varvec{g}}_{j}^{T} \odot \nabla ^{2}\phi ({\mathbf {FG}}^{T})){\varvec{g}}_{j}\) are the same as those of \({\varvec{f}}_{j}\). Hence, we may set \(\lambda _{j}\) to

$$\begin{aligned} \lambda _{j} = \frac{{\mathbf {F}}^{(j)T}({\mathbf {X}}^{(j)} \odot \nabla ^{2}\phi ({\mathbf {FG}}^{T})){\varvec{g}}_{j}}{{\mathbf {F}}^{(j)T}{\mathbf {F}}^{(j)}}. \end{aligned}$$

Then the update rule of sBCD-ONMF becomes

$$\begin{aligned} {\varvec{f}}_{j} \leftarrow \left[ \frac{ (\nabla ^{2}\phi ({\mathbf {FG}}^{T}) \odot {\mathbf {X}}^{(j)}){\varvec{g}}_{j} -\frac{{\mathbf {F}}^{(j)T}({\mathbf {X}}^{(j)} \odot \nabla ^{2}\phi ({\mathbf {FG}}^{T})){\varvec{g}}_{j}}{{\mathbf {F}}^{(j)T}{\mathbf {F}}^{(j)}} {\mathbf {F}}^{(j)}}{\nabla ^{2}\phi ({\mathbf {FG}}^{T}) {\varvec{g}}^2_{j}}\right] _{+}. \end{aligned}$$
(24)

If we use \(\phi (x)=x^{2}/2\), corresponding to the Frobenius norm, it is easy to verify that \(\nabla ^{2} \phi ({\mathbf {FG}}^{T})= {\mathbf {1}}\) (the all-ones matrix). This implies that (24) is equivalent to (15) with the post-processing normalization \(\Vert {\varvec{f}}_{j}\Vert ^2_{2}=1\).

This sBCD-ONMF algorithm is an extension of the HALS-ONMF algorithm above, but its convergence is slower because sBCD-ONMF needs to update the column-wise residual in addition to each column, whereas HALS-ONMF does not have to maintain the residual explicitly. The proposed sBCD-ONMF algorithm is shown in Algorithm 2.

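The update (24) can be sketched in the same style as before (our reading of Algorithm 2, which is given as pseudocode in the original; the function name and small constants are illustrative):

```python
import numpy as np

def sbcd_onmf_update_f_column(X, F, G, j, d2phi, eps=1e-16):
    """Column-wise sBCD-ONMF update (24) for f_j with G fixed."""
    FG = F @ G.T
    W = d2phi(FG)                                # nabla^2 phi(FG^T), elementwise
    Xj = X - FG + np.outer(F[:, j], G[:, j])     # column-wise residual X^{(j)}
    Fj = F.sum(axis=1) - F[:, j]                 # F^{(j)}: sum of the other columns
    numer = (W * Xj) @ G[:, j]
    lam = (Fj @ numer) / max(Fj @ Fj, eps)       # lambda_j from the text
    denom = W @ (G[:, j] ** 2)
    return np.maximum((numer - lam * Fj) / np.maximum(denom, eps), eps)
```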

4.3 Relation to Bregman hard clustering

The original ONMF is known to be related to k-means clustering (Ding et al. 2006). In this section, we clarify the relationship between Bregman divergence ONMF and the Bregman hard clustering proposed by Banerjee et al. (2005b). The criterion that Bregman hard clustering minimizes is a natural extension of that of k-means clustering, as shown below:

$$\begin{aligned} \min _{\pi _{j=1,2,\ldots ,J}} \sum _{j=1}^{J} \sum _{ n \in \pi _{j}}D_{\phi }({\varvec{x}}_{n}||{\varvec{\mu }}_{j}), \end{aligned}$$
(25)

where the \(\pi _{j}\) \((j=1,2,\ldots ,J)\) are disjoint clusters and \(\mu _{j} = \sum _{n \in \pi _{j}} \frac{1}{|\pi _{j}|}{\varvec{x}}_{n}\) is the centroid of cluster \(\pi _{j}\). Then we have the following theorem.

Theorem 1

(Equivalence between Bregman divergence ONMF and Bregman hard clustering) The minimization problem of Bregman divergence ONMF (21) is equivalent to that of the Bregman hard clustering defined in (25).

Proof

Let us suppose a given data matrix \({\mathbf {X}}=[{\varvec{x}}_{1},{\varvec{x}}_{2},\ldots ,{\varvec{x}}_{N}]\in \mathbb {R}^{M \times N}\) and impose the orthogonality constraint on \({\mathbf {G}}\) instead of \({\mathbf {F}}\), that is, \({\mathbf {G}}^{T}{\mathbf {G}}={\mathbf {I}}\). We first consider the minimization problem for ONMF with Bregman divergence:

$$\begin{aligned} D_{\phi }({\mathbf {X}} ||{\mathbf {FG}}^{T})=\sum _{n=1}^{N} D_{\phi }({\varvec{x}}_{n} ||({\mathbf {FG}}^{T})_{n}), \end{aligned}$$

where \(({\mathbf {FG}}^{T})_{n}\) denotes the nth column vector of the matrix \({\mathbf {FG}}^{T}\). As stated before, each row vector of the orthogonal nonnegative matrix \({\mathbf {G}}\) has only one non-zero value. In a clustering task, this non-zero value corresponds to the index of the cluster that the datum belongs to. Therefore, we can rewrite the minimization problem as

$$\begin{aligned} \sum _{n=1}^{N} D_{\phi }({\varvec{x}}_{n} ||({\mathbf {FG}}^{T})_{n}) = \sum _{j=1}^{J} \sum _{n: g_{nj}\ne 0} D_{\phi }({\varvec{x}}_{n} ||g_{nj}{\varvec{f}}_{j}). \end{aligned}$$

According to Ding et al. (2008), let us impose the row normalization condition that the non-zero element in each row of \({\mathbf {G}}\) equals one, that is,

$$\begin{aligned} g_{nj}= 1. \end{aligned}$$

Then, it suffices to minimize

$$\begin{aligned} \sum _{j=1}^{J} \sum _{n \in \pi _{j}} D_{\phi }({\varvec{x}}_{n} ||{\varvec{f}}_{j}). \end{aligned}$$

The last thing we need to show is \({\varvec{f}}_{j} = \mu _{j}=\sum _{n \in \pi _{j}} \frac{1}{|\pi _{j}|}{\varvec{x}}_{n}\), but this has already been proved in previous studies (Banerjee et al. 2005a, b): the best predictor in Bregman divergence is the arithmetic mean of the data. Therefore, the optimal solution \({\varvec{f}}_{j}^{*}\) with a fixed \({\mathbf {G}}\) is given by

$$\begin{aligned} {\varvec{f}}^{*}_{j} =\sum _{n: g_{nj} \ne 0 } \frac{1}{|{\varvec{g}}_{j}|}{\varvec{x}}_{n} =\sum _{n \in \pi _{j}} \frac{1}{|\pi _{j}|}{\varvec{x}}_{n}= \mu _{j}. \end{aligned}$$

\(\square \)

Since Bregman hard clustering is applicable to various data types with appropriate choices of \(\phi (x)\) (e.g., text data with KL-divergence and speech data with IS-divergence), Bregman divergence ONMF has a wider variety of applications than does the standard ONMF.
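To make the connection concrete, the sketch below implements the hard clustering criterion (25) for a user-supplied \(\phi \); the initialization, iteration count, and function name are illustrative choices, not part of the original formulation:

```python
import numpy as np

def bregman_hard_clustering(X, J, phi, grad_phi, n_iter=50, seed=0):
    """Bregman hard clustering (25) on the columns of X: assign each x_n to the
    centroid minimizing D_phi(x_n || mu_j), then reset each centroid to the
    arithmetic mean of its cluster (the Bregman-optimal predictor)."""
    rng = np.random.default_rng(seed)
    mu = X[:, rng.choice(X.shape[1], J, replace=False)].astype(float)
    D = lambda x, y: np.sum(phi(x) - phi(y) - grad_phi(y) * (x - y))
    labels = np.zeros(X.shape[1], dtype=int)
    for _ in range(n_iter):
        for n in range(X.shape[1]):
            labels[n] = np.argmin([D(X[:, n], mu[:, j]) for j in range(J)])
        for j in range(J):
            members = X[:, labels == j]
            if members.shape[1] > 0:             # keep the old centroid if empty
                mu[:, j] = members.mean(axis=1)
    return labels, mu
```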

5 Performance evaluation

5.1 Datasets

We compared the performance of the algorithms on six real-life datasets and one artificial dataset. For the artificial dataset, we followed the setting in Li et al. (2012). A summary of the datasets is given in Table 3.

Table 3 Datasets used in the experiments

5.2 Compared algorithms and evaluation measures

In the ONMF problems, we compared the proposed HALS-ONMF with four traditional Frobenius norm ONMF algorithms. In the Bregman divergence ONMF problems, we compared sBCD-ONMF with Li's KL-divergence ONMF. In addition, in order to investigate the trade-off between orthogonality and approximation in Bregman divergence, we also compared sBCD-ONMF with sBCD-NMF, although sBCD-NMF has no orthogonality constraint; this is because sBCD-ONMF is the first and only ONMF algorithm that works with an arbitrary Bregman divergence. We used the IS-divergence (\(\phi (x)=-\log x\)) and the \(\beta \)-divergence (\(\phi (x)=\frac{1}{\beta (\beta +1)}(x^{\beta +1}-(\beta +1)x+\beta )\) with \(\beta =2\)) in this comparison, and we measured the orthogonality and the approximation accuracy. We compared all of the algorithms shown in Table 1 except for Pompili's ONMF (Pompili et al. 2012), because their algorithm attains orthogonality but not nonnegativity, as also reported in their own comparison (Pompili et al. 2012); we also observed that their algorithm was the slowest among all of the compared algorithms. For the weighting parameter \(\alpha \) in Li's ONMF (Li et al. 2010) and Mirzal's ONMF (Mirzal 2014), we used \(\alpha =1\), because this value worked satisfactorily on most datasets.

We employed the same evaluation setting as that in Li et al. (2012). Ten trials with different initial values were conducted, and the average values of the measurements are reported. We fixed the number of iterations to 100 for all of the algorithms. We evaluated the degree of approximation and the degree of orthogonality by

$$\begin{aligned} \text{ Normalized } \text{ Residual } \text{ Value: }&\frac{\parallel {\mathbf {X}}-{\mathbf {F}}{\mathbf {G}}^{T} \parallel ^{2}_{F}}{\parallel {\mathbf {X}} \parallel ^{2}_{F}}\,\,\text {(for ONMF)} \end{aligned}$$
(26)
$$\begin{aligned} \text{ Relative } \text{ Residual } \text{ Value: }&\log _{10} \frac{D_{\phi }({\mathbf {X}}||{\mathbf {F}}{\mathbf {G}}^{T})}{D_{\phi }({\mathbf {X}}||{\mathbf {F}}_{0}{\mathbf {G}}^{T}_{0})}\,\,\text {(for Bregman ONMF)} \end{aligned}$$
(27)
$$\begin{aligned} \text{ Orthogonality: }&\Vert {\mathbf {F}}^{T}{\mathbf {F}}-{\mathbf {I}} \Vert ^{2}_{F} \end{aligned}$$
(28)

Here, \({\mathbf {F}}_{0}\) and \({\mathbf {G}}_{0}\) are the matrices used for initialization. For Frobenius norm ONMF, we evaluated the computation time (seconds), the normalized residual value (26), and the degree of orthogonality (28). For Bregman divergence ONMF, we evaluated the relative residual value (27) and the orthogonality (28), since (26) cannot be normalized appropriately for Bregman divergence.
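These measures translate directly into code, e.g. as in the following sketch (`div` stands for any Bregman divergence function; the helper names are ours):

```python
import numpy as np

def normalized_residual(X, F, G):
    """Normalized residual value (26)."""
    return np.linalg.norm(X - F @ G.T, 'fro') ** 2 / np.linalg.norm(X, 'fro') ** 2

def orthogonality(F):
    """Degree of orthogonality (28); smaller means closer to F^T F = I."""
    return np.linalg.norm(F.T @ F - np.eye(F.shape[1]), 'fro') ** 2

def relative_residual(div, X, F, G, F0, G0):
    """Relative residual value (27) for a Bregman divergence div(X, Y)."""
    return np.log10(div(X, F @ G.T) / div(X, F0 @ G0.T))
```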

5.3 Comparison on ONMF problems

Figure 1 shows the normalized residual values for \(J=30\) (the number of components) on the six real-life datasets. The proposed HALS-ONMF converges faster than the other ONMF algorithms; it converges within 250 seconds on all six datasets. This is because HALS-ONMF needs fewer iterations: it solves the vector-wise problems analytically [(8) and (9)]. Figure 2 shows the degrees of orthogonality attained. HALS-ONMF achieves almost the highest degree of orthogonality among the algorithms in the early stage, though its final degree of orthogonality is slightly worse than that of the others. On the ORL dataset (Fig. 2e), a dense dataset, only HALS-ONMF succeeded in achieving an acceptable degree of orthogonality.

Fig. 1 Comparison of five ONMF algorithms in the degree of approximation on six real-life datasets. The proposed HALS-ONMF algorithm converges faster than the other four conventional algorithms. a 20Newsgroup. b TDT. c Reuter. d RCV. e ORL. f MNIST

Fig. 2 Orthogonality attained by the compared ONMF algorithms. The proposed HALS algorithm converges faster than the other four conventional algorithms, but the final state is worse than some of the others. a 20Newsgroup. b TDT. c Reuter. d RCV. e ORL. f MNIST

To assess the speed of convergence, we defined the stopping criterion of the iterations in the same way as in conventional studies (Pompili et al. 2012; Kim and Park 2008):

$$\begin{aligned} \frac{\parallel {\mathbf {X}}-{\mathbf {F}}^{t-1}{\mathbf {G}}^{t-1T} \parallel ^{2}_{F} -\parallel {\mathbf {X}}-{\mathbf {F}}^{t}{\mathbf {G}}^{tT} \parallel ^{2}_{F}}{\parallel {\mathbf {X}} \parallel ^{2}_{F}} < \epsilon , \end{aligned}$$
(29)

where \(\epsilon \) is a threshold and \({\mathbf {G}}^{t}\) and \({\mathbf {F}}^{t}\) are the matrices after the tth update. In this paper, we set the threshold \(\epsilon \) to \(10^{-4}\) for all datasets.
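Criterion (29) is checked after each update, e.g. as in the following sketch (the function name is ours):

```python
import numpy as np

def converged(X, F_prev, G_prev, F, G, eps=1e-4):
    """Stopping criterion (29): the relative decrease of the squared residual
    between two consecutive iterations falls below the threshold eps."""
    prev = np.linalg.norm(X - F_prev @ G_prev.T, 'fro') ** 2
    curr = np.linalg.norm(X - F @ G.T, 'fro') ** 2
    return (prev - curr) / np.linalg.norm(X, 'fro') ** 2 < eps
```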

Table 4 The results when the stopping criterion (29) with \(\epsilon =10^{-4}\) is applied on four datasets

Table 4 shows the results on four datasets when we terminated the calculation with the stopping criterion (29). The proposed HALS-ONMF is the fastest, with the smallest number of iterations: it converged about 1.6 to 4.1 times faster than the others while keeping comparable approximation accuracy and orthogonality.

5.4 Comparison on Bregman divergence ONMF problems

In the Bregman divergence ONMF problems, we compared sBCD-ONMF with Li's KL-divergence ONMF (Li et al. 2010) and with sBCD-NMF using the KL-divergence. In addition, we compared sBCD-ONMF with sBCD-NMF for the IS- and \(\beta \)-divergences. Unfortunately, since the KL-divergence and the IS-divergence do not allow zero values (\( 0 \notin dom_{\phi }\)), most datasets were not suitable for this comparison. Besides, neither sBCD-NMF nor sBCD-ONMF scales well because of its high computational cost [see, for example, Step (11) and Step (17) in Algorithm 2]. Therefore, we dealt with only one artificial dataset of \({\mathbf {X}} \in \mathbb {R}^{2000 \times 1000}\).

Fig. 3 Comparison of KL-divergence, IS-divergence and \(\beta \)-divergence NMF on an artificial dataset (\({\mathbf {X}} \in \mathbb {R}^{2000\times 1000}\)). The first row shows the relative residual value and the second row shows the degree of orthogonality. a KL-divergence. b IS-divergence. c \(\beta \)-divergence. d KL-divergence. e IS-divergence. f \(\beta \)-divergence

The results are shown in Fig. 3. As expected, the sBCD-NMF algorithm without an orthogonality constraint achieved better approximation than the algorithms with orthogonality constraints, Li's ONMF and sBCD-ONMF, while the latter two achieved a higher degree of orthogonality. In terms of convergence speed, sBCD-ONMF is almost the same as sBCD-NMF or even faster, and Li's ONMF is inferior to sBCD-ONMF.

In summary, sBCD-ONMF is a fast algorithm for finding a solution of Bregman divergence ONMF problems with a sufficient degree of orthogonality, at the expense of a small degradation in approximation.

5.5 Clustering experiments

As stated before, ONMF is more suitable for clustering tasks than standard NMF, because the constrained matrix \({\mathbf {F}}\) can be regarded as an indicator matrix in ONMF. Let \({\mathbf {X}}\) be an \(instance \times feature\) matrix factorized as \({\mathbf {FG}}^{T}\). Then the ith row of \({\mathbf {F}}\) can be regarded as the membership vector of instance i to the J groups (latent features). In particular, a solution of ONMF is expected to yield a crisp membership concentrated on a single group. We assign the ith instance to the kth cluster by

$$\begin{aligned} k = \mathop {\hbox {argmax}}\limits _{j} {\mathbf {F}}_{ij}. \end{aligned}$$
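In code, this assignment is a single argmax over each row of \({\mathbf {F}}\):

```python
import numpy as np

def cluster_assignments(F):
    """Assign each instance (row of F) to the cluster with the largest membership."""
    return np.argmax(F, axis=1)
```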

We compared the proposed HALS-ONMF and sBCD-ONMF with one standard NMF algorithm (HALS-NMF) and with conventional ONMF algorithms [Ding's ONMF (2006) and Yoo's ONMF (2008)]. We set the number of iterations to 30, which was sufficient for convergence in the previous experiments in Sects. 5.3 and 5.4. In addition, we ran the k-means algorithm as a baseline method. We used four TREC document classification datasets (see Table 5 for the details). Since these datasets have class labels, we hid them during clustering and then evaluated the difference between the true clustering induced by the class labels and the obtained clustering.

We measured Normalized Mutual Information (NMI) defined as

$$\begin{aligned} \text {NMI}: \frac{I(\hat{C};C)}{(H(\hat{C})+H(C))/2}, \end{aligned}$$

where \(\hat{C}\) is the predicted clustering and C is the ground truth. Here, \(H(\cdot )\) is the Shannon entropy and \(I(\cdot ;\cdot )\) is the mutual information. We averaged the results over ten trials with different initial points.
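NMI can be computed from the contingency table of the two labelings, as sketched below (this follows the definition above; libraries such as scikit-learn provide an equivalent `normalized_mutual_info_score`):

```python
import numpy as np

def nmi(labels_pred, labels_true):
    """Normalized Mutual Information with arithmetic-mean normalization."""
    pred, true = np.asarray(labels_pred), np.asarray(labels_true)
    n = len(true)
    # Joint distribution P(c_hat, c) and its marginals.
    P = np.array([[np.sum((pred == a) & (true == b)) for b in np.unique(true)]
                  for a in np.unique(pred)], dtype=float) / n
    pa, pb = P.sum(axis=1), P.sum(axis=0)
    nz = P > 0
    mi = np.sum(P[nz] * np.log((P / np.outer(pa, pb))[nz]))
    entropy = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return mi / ((entropy(pa) + entropy(pb)) / 2)
```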

Table 5 Datasets used in the clustering experiments

Unfortunately, the general advantage of ONMF over NMF was not confirmed as long as the algorithms use the Frobenius norm. Nevertheless, the proposed HALS-ONMF achieved the best NMI score among them. Rather, we confirmed the advantage of the KL-divergence over the Frobenius norm and the IS-divergence. This is not an unexpected result, because document data are known to be well explained by multinomial distribution models, and minimizing the KL-divergence corresponds to maximum likelihood estimation under a multinomial model (Banerjee et al. 2005b; Li et al. 2012). The best choice is therefore one of sBCD-NMF with KL-divergence, sBCD-ONMF with KL-divergence, and HALS-ONMF with the Frobenius norm (Table 6).

Table 6 The clustering results evaluated by NMI (the higher, the better) on four real-life datasets

6 Conclusion

In this paper, we have proposed fast algorithms for solving one-sided orthogonal nonnegative matrix factorization problems in the Frobenius norm and in Bregman divergence. Orthogonal NMF algorithms proposed so far have suffered from slow convergence, mainly due to their matrix-wise updates. By decomposing the matrix-type orthogonality condition into a set of column-wise orthogonality conditions, we succeeded in speeding up the convergence. One of the proposed algorithms is the first algorithm to solve a Bregman divergence NMF problem with an orthogonality constraint. In addition, we showed that the Bregman divergence ONMF problem is equivalent to Bregman hard clustering. Experiments on six real-life datasets and an artificial dataset demonstrated that the proposed algorithms indeed converge faster than state-of-the-art algorithms while keeping a satisfactory level of orthogonality. In the best case, the proposed algorithm converged more than four times faster than the state-of-the-art algorithms.