
1 Introduction

For a data matrix \(\textbf{X}\) of size \(m\times n\), \(\textbf{WH} \approx \textbf{X}\) (where \(\textbf{W}\) is of size \(m \times k\), \(\textbf{H}\) is of size \(k \times n\)) is considered as a low rank approximation (\(k \ll n\)) of the data matrix \(\textbf{X}\). Low rank approximations are essential in machine learning applications and especially in natural language processing and topic modelling where the data matrix is constructed over a collection of words from a vocabulary and a usually large collection of documents [10, 21, 24, 34].

Singular value decomposition (SVD) [28] is an early approach for computing such a low rank approximation of data. SVD minimizes the Frobenius norm and the spectral norm simultaneously; moreover, the columns of \(\textbf{W}\) are orthogonal, and the rows of \(\textbf{H}\) are also orthogonal. However, the entries in \(\textbf{W, H}\) may be negative, which reduces the utility of SVD for a data matrix \(\textbf{X}\) with positive entries, as the factors in \(\textbf{W}\) then lack an intuitive interpretation. Non-negative matrix factorization (NMF), \(\textbf{WH} \approx \textbf{X}\) with \(\textbf{X, W, H} \ge \textbf{0}\), was introduced by Paatero and Tapper [20] to overcome this difficulty in interpreting the factors. Computing an NMF was shown to be NP-hard by Vavasis [27]. NMF does not require the columns of \(\textbf{W}\) to be orthogonal, which is considered a severe drawback in some applications because the columns (factors) of \(\textbf{W}\) are not separated by a large angle. Keeping this limitation in mind, Ding et al. [34] introduced orthogonality constraints in NMF, \(\textbf{X} \approx \textbf{WH}\) with \(\textbf{X, W, H} \ge \textbf{0}\) and the rows of \(\textbf{H}\) orthogonal, and demonstrated that this is an effective approach for clustering documents.

We consider the following problem: given an \(m \times n\) data matrix \(\textbf{X}\), we wish to represent \(\textbf{X}\) as a product of two matrices \(\textbf{W, H}\) of dimensions \(m \times k\) and \(k \times n\), respectively, with the following restrictions: the entries in \(\textbf{W}\) are positive, the entries in \(\textbf{H}\) are either 0 or 1, \(\mathbf{H}\mathbf{H}^T = I\), and the norm \(\Vert \mathbf{X-WH}\Vert _F^2\) is minimized. Additionally, we want k to be small compared to m and n. The columns of the data matrix \(\textbf{X}\) can be thought of as the n samples, and the columns of the low rank factor \(\textbf{W}\) represent the latent features. We call this problem the binary orthogonal non-negative matrix factorization (BONMF) problem.

1.1 Contributions

This paper gives a new method (Algorithm 1) for computing a binary orthogonal NMF using a two-phase iterative approach. In the first phase, we use a known update rule [16] to compute the factor \(\textbf{W}\). In the second phase, we use the observation that the binary constraints on \(\textbf{H}\) have a geometric interpretation, which yields an efficient rule for updating \(\textbf{H}\) in each iteration (Eq. (12)). The entries in \(\textbf{H}\) are binary, and they are computed column-wise. If \(\textbf{H}\) were stored as a dense matrix, O(nk) space would be needed. However, \(\textbf{H}\) is binary and its rows are orthogonal, so each column has at most one non-zero entry and only O(n) space is needed. Since the entries of \(\textbf{H}\) are computed column-wise, the intermediate states also need only O(n) space. The computation for each column of \(\textbf{H}\) takes O(mk) steps. Therefore, the method is space efficient.
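As a concrete illustration of the space claim (an illustration of ours, not code from the paper), a binary \(\textbf{H}\) with orthogonal rows can be stored as a single index per column:

```python
import numpy as np

k, n = 3, 6
# A small binary H with orthogonal rows: exactly one 1 per column.
H = np.array([[1, 0, 0, 1, 0, 0],
              [0, 1, 0, 0, 0, 1],
              [0, 0, 1, 0, 1, 0]])
h_idx = np.argmax(H, axis=0)                   # O(n) storage: one index per column
H_back = np.eye(k, dtype=H.dtype)[:, h_idx]    # recover the dense k x n matrix if needed
assert (H_back == H).all()
```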

We evaluate the method’s performance (in Sect. 4) for training and testing on reference data sets from the OpenML repository. The experiments demonstrate that both the training and the classification phases are efficient (Table 2). The method is accurate and outperforms state-of-the-art methods (Table 2). It uses k dot products of m-element vectors to update each column of the coefficient matrix \(\textbf{H}\), where k is the number of classes in the data set. This is a significant reduction in the computation needed in the classification phase compared to the algorithms of [15, 32, 34]. The method is also space efficient, as \(\textbf{H}\) is sparse.

2 Related Work

We begin with NMF and the related background needed to describe our algorithm. Given a non-negative matrix \({\textbf{X}}\in \mathbb {R}^{m\times n}\), a non-negative matrix factorization of \(\textbf{X}\) finds two non-negative matrices \({\textbf{W}}\in \mathbb {R}^{m\times k}\) and \({\textbf{H}}\in \mathbb {R}^{k\times n}\) with \(k\ll \min (m,n)\) such that:

$$\begin{aligned} \textbf{X}\approx \textbf{WH}, \end{aligned}$$

and the entries in \(\textbf{W, H}\) are non-negative. The factorization has a natural interpretation [15] and can be computed using various unsupervised machine learning methods. Due to its intuitive interpretation, NMF has found numerous applications such as data consolidation [8], image clustering [6], topic modelling [2], community detection [29], recommender systems [22], and gene expression profiling [33].

BONMF differs from NMF in that the entries of \(\textbf{H}\) are restricted to be binary. Since the rows of \(\textbf{H}\) are orthogonal, each column of \(\textbf{H}\) has at most one non-zero entry, and the columns can be used to cluster the data. Therefore, BONMF has several interesting applications [31].

Orthogonal NMF (ONMF), in which \(\textbf{X} \approx \textbf{WH}\), \(\textbf{W, H} \ge \textbf{0}\) and \(\mathbf{H}\mathbf{H}^T = I\), was defined by Ding et al. [34], who gave an algorithm based on solving a Lagrangian relaxation. The entries of \(\textbf{H}\) in ONMF are not required to be binary. The use of NMF for data clustering was popularized by Seung and Lee [23]. One of the first notable applications of orthogonal NMF to document clustering is in [30], which gave improved algorithms and showed that ONMF performs better at document clustering than NMF. K-means [19] is one of the most widely used algorithms for unsupervised learning. Bauckhage [3] showed that the objective function of K-means can be rewritten as ONMF if the entries in \(\textbf{H}\) are binary and the following condition holds:

$$\begin{aligned} \sum _i {{\textbf {H}}}_{ij} = 1 \quad \forall j \end{aligned}$$
(1)

Therefore, under condition (1), BONMF is equivalent to K-means clustering. BONMF was also studied by Zdunek [31] and differs from the well-studied non-negative matrix factorization (NMF). Lee et al. [16] studied BONMF without condition (1) on \(\textbf{H}\) and gave an algorithm for computing such a factorization. However, applications of BONMF to classification are few. In this paper, we study BONMF for its use in prediction and classification, including clustering.

2.1 NMF

NMF can be formulated as the following optimization problem that minimizes the square of the Frobenius norm:

$$\begin{aligned} F( {\textbf{W,H}})= \min _{ {\mathbf {W,H\ge 0}}}\frac{1}{2}\Vert {\mathbf {X-WH}}\Vert _F^2. \end{aligned}$$
(2)

Most of the methods for computing NMF are based on iterative update rules. A popular set of update rules, given below, is due to Lee and Seung [15]; the iteration number appears as a superscript.

$$\begin{aligned} {\textbf{W}}^{t+1}_{ia} = {\textbf{W}}^{t}_{ia}\frac{({\textbf{X}}{{\textbf{H}}^{t}}^{T})_{ia}}{({\textbf{W}}^{t}{\textbf{H}}^{t}{{\textbf{H}}^{t}}^{T})_{ia}}, \qquad \forall i,a; \end{aligned}$$
(3)
$$\begin{aligned} {\textbf{H}}^{t+1}_{bj} = {\textbf{H}}^{t}_{bj}\frac{({{\textbf{W}}^{t+1}}^{T}{\textbf{X}})_{bj}}{({{\textbf{W}}^{t+1}}^{T}{\textbf{W}}^{t+1}{\textbf{H}}^{t})_{bj}}, \qquad \forall b,j. \end{aligned}$$
(4)
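For concreteness, the multiplicative updates (3) and (4) can be implemented in a few lines of NumPy. The sketch below is illustrative only; the small constant eps added to the denominators is our own safeguard against division by zero and is not part of [15].

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-10, seed=0):
    """Lee-Seung multiplicative updates for min ||X - WH||_F^2 with W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update rule (3)
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update rule (4)
    return W, H
```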

For many more variations on such update rules, see [11]. Optimization approaches such as block-coordinate descent, projected gradient descent, and alternating non-negative least squares (ANLS) [18] have also been used for NMF. ANLS transforms the problem in (2) into two convex optimization problems:

$$\begin{aligned} {\textbf{W}}_{t+1} = \min _{{\mathbf {W\ge 0}}}f({\textbf{W,H}}_t) = \min _{{\mathbf {W\ge 0}}}\frac{1}{2}\Vert {\mathbf {X-WH}}_t\Vert _{F}^2, \end{aligned}$$
(5)
$$\begin{aligned} {\textbf{H}}_{t+1} = \min _{{\mathbf {H\ge 0}}}f({\textbf{W}}_{t+1},{\textbf{H}}) = \min _{{\mathbf {H\ge 0}}}\frac{1}{2}\Vert {\mathbf {X-W}}_{t+1}{\textbf{H}}\Vert _{F}^2. \end{aligned}$$
(6)
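As an illustration of the ANLS scheme, the two sub-problems (5) and (6) each decompose into independent non-negative least squares problems (one per row of \(\textbf{W}\) and one per column of \(\textbf{H}\)). The sketch below, built on scipy.optimize.nnls, is only one possible realization under this decomposition and is not the implementation used in the papers cited here.

```python
import numpy as np
from scipy.optimize import nnls

def anls_step(X, W, H):
    """One ANLS sweep: solve (5) row-wise for W, then (6) column-wise for H."""
    m, n = X.shape
    # (5): each row of W solves min_{w >= 0} ||H^T w - X[i, :]^T||_2
    W_new = np.vstack([nnls(H.T, X[i, :])[0] for i in range(m)])
    # (6): each column of H solves min_{h >= 0} ||W_new h - X[:, j]||_2
    H_new = np.column_stack([nnls(W_new, X[:, j])[0] for j in range(n)])
    return W_new, H_new
```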

We can solve the optimization problems given by (5) and (6) in a few ways. The Rank-one Residue Iteration (RRI) algorithm for computing NMF was given in [12, 13]. This algorithm was also independently proposed by Cichocki et al. [5], where it is called the Hierarchical Alternating Least Squares (HALS) algorithm. The solutions to (5) and (6) in HALS/RRI are given by explicit formulas, which makes for an easy implementation. Kim et al. [14] used Newton and quasi-Newton methods to solve (5) and (6) and showed that their method has faster convergence. However, these methods require determining a suitable active set of the constraints in each iteration [4]. Two efficient algorithms for approximately orthogonal NMF were given by Li et al. [17]. An asymmetric NMF with Beta-divergences approach was studied by Lee et al. [16].

When \(\textbf{H}\) is restricted to be binary, the corresponding sub-problem is a quadratic Boolean optimization problem, so it can also be solved using the Quantum Simulated Annealing (QSA) approach of Farhi et al. [7]. Recently, Golden and O’Malley [9] used a combination of forward and reverse annealing in quantum annealing to obtain improved performance of QSA for NMF.

2.2 Binary Orthogonal NMF

Given a non-negative matrix \({\textbf{X}}\in \mathbb {R}^{m\times n}\), a BONMF of \({\textbf{X}}\) finds a non-negative matrix \({\textbf{W}}\in \mathbb {R}^{m\times k}\) and a binary matrix \({\textbf{H}}\in \{0,1\}^{k\times n}\) with \(k\ll \min (m,n)\). BONMF can be written as the following optimization problem:

$$\begin{aligned} \min _{{\mathbf {W\ge 0}},\, {\textbf{H}}\in \{0,1\}^{k\times n}}\frac{1}{2}\Vert {\mathbf {X-WH}}\Vert _{F}^2. \end{aligned}$$
(7)

Using the ANLS approach [18] we can transform (7) into the following sub-problems:

$$\begin{aligned} {\textbf{W}}_{t+1} = \min _{{\mathbf {W\ge 0}}}\frac{1}{2}\Vert {\mathbf {X-WH}}_t\Vert _{F}^2, \end{aligned}$$
(8)
$$\begin{aligned} {\textbf{H}}_{t+1} = \min _{{\textbf{H}}\in \{0,1\}^{k\times n}}\frac{1}{2}\Vert {\mathbf {X-W}}_{t+1}{\textbf{H}}\Vert _{F}^2. \end{aligned}$$
(9)

Problem (8) can be solved using the update rule (3) of [15]. Sub-problem (9) has been solved in two different ways in the literature. Zhang et al. [32] update each row of the matrix \(\textbf{H}\) using the following strategy:

$$\begin{aligned} {\textbf{h}}=sgn\left( {\textbf{X}}^T{\textbf{z}}-\frac{1}{2}{\textbf{Iz}}^T{\textbf{z}}-{\textbf{H}}'^T{\textbf{W}}'^T{\textbf{z}}\right) , \end{aligned}$$
(10)

where

$$\begin{aligned} sgn(x)= \left\{ \begin{array}{ll} 1,&{}\qquad \text {if } x>0, \\ 0,&{}\qquad \text {otherwise,} \end{array} \right. \end{aligned}$$

where \({\textbf{z}}\) is the k-th column of \({\textbf{W}}\), \({\textbf{W}}'\) is the matrix \({\textbf{W}}\) with \({\textbf{z}}\) removed, \({\textbf{h}}^T\) is the k-th row of \({\textbf{H}}\), and \({\textbf{H}}'\) is the matrix \({\textbf{H}}\) with \({\textbf{h}}^T\) removed. In addition, \({\textbf{I}} \in \mathbb {R}^n\) is the vector whose entries are all one. Zdunek [31] presented another method for updating \(\textbf{H}\) under the assumption that \(\textbf{H}\) is orthogonal; it uses simulated annealing. Since it follows a different approach, we do not describe it in detail.
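A minimal NumPy sketch of the row update (10) follows; the function name and the in-place update are our own choices that merely illustrate the rule, they are not taken from [32].

```python
import numpy as np

def zhang_row_update(X, W, H, r):
    """Update row r of the binary matrix H using rule (10)."""
    z = W[:, r]                          # r-th column of W
    W_rest = np.delete(W, r, axis=1)     # W', i.e. W without the column z
    H_rest = np.delete(H, r, axis=0)     # H', i.e. H without row r
    h = X.T @ z - 0.5 * np.dot(z, z) - H_rest.T @ (W_rest.T @ z)
    H[r, :] = (h > 0).astype(H.dtype)    # sgn(.) as defined above
    return H
```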

3 The Algorithm

This section describes our approach. The method solves the optimization problems given by (8) and (9). To solve (8), we use the update rule (3) of [15], where \(\textbf{W}\) is computed using

$$ {\textbf{W}}^{t+1}_{ia} \leftarrow {\textbf{W}}^{t}_{ia}\frac{( {\textbf{X}}{{\textbf{H}}^{t}}^{T})_{ia}}{( {\textbf{W}}^{t} {\textbf{H}}^{t}{ {\textbf{H}}^{t}}^{T})_{ia}}, \qquad \forall i,a.$$

Given \( {\textbf{X}}\) and \( {\textbf{W}}\), to solve (9) we write the problem as:

$$\begin{aligned} F( {\textbf{H}})=\min _{ {\textbf{H}}\in \{0,1\}^{k \times n}}\Vert {\mathbf {X-WH}}\Vert _F^2. \end{aligned}$$
(11)

Each column of the matrix \( {\textbf{H}}\) is computed in two steps as follows:

  • In the first step, we calculate the cosine of the angle between column i of \( {\textbf{X}}\) and column j of matrix \( {\textbf{W}}\) to obtain \( {\textbf{H}}_{j,i}\):

    $$\begin{aligned} {\textbf{H}}_{j,i}= \frac{\langle {\textbf{X}}_{:,i}, {\textbf{W}}_{:,j}\rangle }{\Vert {\textbf{X}}_{:,i}\Vert \Vert {\textbf{W}}_{:,j}\Vert }, \end{aligned}$$
    (12)

    where \( {\textbf{X}}_{:,i}\) denotes the i-th column of matrix \( {\textbf{X}}\) and \(\langle .,.\rangle \) is the inner product.

  • In the second step, the maximum value in each column of \( {\textbf{H}}\) (ties broken arbitrarily) is changed to 1, and the other values are changed to 0. The process can be summarized as follows:

    $$\begin{aligned} \mathbf{{H}}_{j,i}= \left\{ \begin{array}{ll} 1,&{}\qquad \text {if } \mathbf{{H}}_{j,i}=\max \mathbf{{H}}_{:,i}, \\ 0,&{}\qquad \text {otherwise.} \end{array} \right. \end{aligned}$$
    (13)

The pseudo-code for the method is in Algorithm 1. These steps are executed column by column for \( \textbf{H}.\)

Algorithm 1. The proposed BONMF method (pseudo-code).
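For readers who prefer code to pseudo-code, a minimal NumPy sketch of the two-phase iteration (the multiplicative \(\textbf{W}\) update followed by the column-wise \(\textbf{H}\) update of Eqs. (12) and (13)) is given below. It is a simplified illustration under our own choices of initialization and stopping rule, not the exact implementation evaluated in Sect. 4.

```python
import numpy as np

def bonmf(X, k, n_iter=100, eps=1e-10, seed=0):
    """Two-phase BONMF: W >= 0 real-valued, H binary with one 1 per column."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k))
    H = np.zeros((k, n))
    H[rng.integers(0, k, size=n), np.arange(n)] = 1.0     # random binary start
    for _ in range(n_iter):
        # Phase 1: multiplicative update (3) for W.
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        # Phase 2: cosine similarity of every sample to every column of W, Eq. (12).
        S = (W.T @ X) / (np.outer(np.linalg.norm(W, axis=0),
                                  np.linalg.norm(X, axis=0)) + eps)
        # Eq. (13): keep only the largest entry in each column of H.
        H = np.zeros((k, n))
        H[np.argmax(S, axis=0), np.arange(n)] = 1.0
    return W, H
```

Each column of \(\textbf{H}\) thus encodes a single cluster index, which is what makes the O(n) storage claim in Sect. 1.1 possible.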

4 Empirical Evaluation

This section examines three characteristics of Algorithm 1. We study the time needed for classification, the accuracy, and the time required for computing the factorization (training time) for the data sets shown in Table 1. The data sets are representative of the varying complexity of machine learning problems: some are easy (digits), some are hard (diabetes), and some have a large number of features (ORL). These are popular data sets from the OpenML repository [26]. They have multiple single-label classes and serve as a good testbed for evaluating unsupervised learning algorithms, including in deep learning.

Table 1. Data Sets

We compare the performance of Algorithm 1 with the algorithms for orthogonal matrix factorization [34], non-negative matrix factorization [15], and semi-binary non-negative matrix factorization [32]. We examine the relative performance of these algorithms for accuracy and classification time. We use two methods based on ONMF to classify a new data point j (a column vector of \(\textbf{X}\)). Typically, ONMF uses the index i of the maximum entry in column \( {\textbf{H}}_{:,j}\) for classification, which gives the cluster to which data point j belongs. The data points in cluster i may have different labels, and the label of j is taken to be the label of the point in cluster i that is closest (distance-wise). We refer to this default scheme for determining the label as ONMF in Table 2. In the second scheme, the cluster to which data point j belongs is again computed from the angles to the columns of \(\textbf{W}\): the closest column of \(\textbf{W}\) determines the cluster, and the point \(i'\) in that cluster that forms the smallest angle with the data point (the vector for j) determines the label, i.e., the label of \(i'\) is used to classify point j. This second scheme is ONMF-cos in Table 2. The other two algorithms used for comparison are (i) the popular and foundational algorithm of [15] for NMF, labelled “Lee and Seung”, and (ii) the algorithm of [32] for NMF with the constraint that the entries in \(\textbf{H}\) are binary, labelled “Zhang et al.”. We restrict the evaluation to the matrix factorization algorithms closest to K-means. As part of a future study, it would be interesting to see how these algorithms perform against a highly optimized implementation of K-means.

We report on experiments run on a laptop (i5-7200U, 12 GB of RAM). The algorithms of [15, 32] were coded in Python 3.1. We used the number of classes as the rank in the factorization. Eighty percent of the data was used for training, and the remainder was used for testing the accuracy. We use the Python library (ionmf.factorization.onmf) for ONMF [25]. Initialization of \(\textbf{W, H}\) is done using the following scheme: we sort the columns of the matrix \(\textbf{X}\) based on their norms. To determine the \(i^{th}\) column of \(\textbf{W}\), we use the average of ten randomly chosen columns from the first thirty columns of \(\textbf{X}\), as in [1]. The initial matrix \( {\textbf{H}}^0\) is computed using \( {\textbf{H}}^0=( {\textbf{W}}^T {\textbf{W}})^{-1} {\textbf{W}}^T {\textbf{X}}.\) Since the initial values of \(\textbf{W}\) are random, we run the algorithm thirty times and report the averages in Table 2. The first thing to note is that Algorithm 1 needs extra computation to convert \(\textbf{H}\) to binary in each iteration. This computation increases the time needed for factorization relative to ONMF and NMF and is linear in the size of \(\textbf{H}\). However, given the factorization, the classification phase is more efficient, and \(\textbf{H}\) is sparse.
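A sketch of this initialization, reflecting our reading of the scheme above (details such as the sort order and sampling without replacement are assumptions on our part), is:

```python
import numpy as np

def init_factors(X, k, seed=0):
    """Initialize W from averages of high-norm columns of X, then H0 = (W^T W)^{-1} W^T X."""
    rng = np.random.default_rng(seed)
    # Sort the columns of X by norm (descending order is assumed here).
    order = np.argsort(-np.linalg.norm(X, axis=0))
    top = X[:, order[:30]]                       # the first thirty columns after sorting
    cols = []
    for _ in range(k):
        pick = rng.choice(top.shape[1], size=10, replace=False)
        cols.append(top[:, pick].mean(axis=1))   # average of ten randomly chosen columns
    W = np.column_stack(cols)
    H0 = np.linalg.solve(W.T @ W, W.T @ X)       # H^0 = (W^T W)^{-1} W^T X
    return W, H0
```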

4.1 Classification

In the basic NMF approach given by update rules (3) and (4) (as in [15]), the number of steps needed to classify new data (a column vector of \(\textbf{X}\)) is proportional to the number of columns of \(\textbf{H}\): we need to calculate the angle between the coefficient vector for the new data and all the columns of \(\textbf{H}\) (as many as the columns of \(\textbf{X}\)) to determine the label. Algorithm 1 does not share this disadvantage. We compute the angle of the sample to every column of \(\textbf{W}\) (a low-rank matrix) and use the closest column to determine the label. This observation is reflected in the data in the row labelled “TT (s)” in Table 2.
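A minimal sketch of this classification step is shown below; the cluster_labels mapping from basis columns to class labels is our own hypothetical helper, not part of the paper's notation.

```python
import numpy as np

def classify(x, W, cluster_labels, eps=1e-10):
    """Assign a label to sample x using the angles to the k columns of W."""
    sims = (W.T @ x) / (np.linalg.norm(W, axis=0) * np.linalg.norm(x) + eps)
    return cluster_labels[int(np.argmax(sims))]   # label of the closest column of W
```

Each prediction therefore costs k dot products of m-element vectors, matching the discussion in Sect. 1.1.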

4.2 Accuracy

The accuracy of the five methods is presented in Table 2. The entries in bold font indicate that a particular method had the highest accuracy. Six of the eight data sets (all except pendigits and banking) show improved classification accuracy when \(\cos \) angles are used to measure similarity, so the utility of using the angle (12) is evident. The method presented here has the best accuracy on six of the eight data sets (all except optdigits and phishing). Regarding classification time, it performs best on seven of the eight data sets. Note that the other competitive method, ONMF-cos, uses O(nk) space to store \(\textbf{H}\), whereas Algorithm 1 uses O(n) space even in the intermediate stages of the calculations.

Table 2. Numerical Results

4.3 Running Time and Space

Table 2 shows the running time for the training and classification phases of the five algorithms. The entries in bold signify the smallest running time. The proposed Algorithm 1 is as fast and accurate as the other best method in the table, the modified version of ONMF in which angles are used for classification. However, we cannot directly compare the factorizations returned by the two algorithms, as ONMF returns an \(\textbf{H}\) with orthogonal rows whose entries are real, whereas the method proposed here returns a sparse \(\textbf{H}\) that is binary and has orthogonal rows. Algorithm 1 is space efficient compared to the ONMF-cos method, which is an essential consideration for large data sets.

Based on the data in Table 2, we can conclude that Algorithm 1 has competitive accuracy and leads to better clustering with a natural interpretation and a sparse representation. It also classifies new data faster.

5 Conclusion

This paper gives a new geometric approach for binary orthogonal non-negative matrix factorization. The proposed method is space efficient. We also compared the proposed algorithm with three other methods [15, 32, 34] for accuracy and time on representative machine learning data sets. Our experiments show that the method is fast and accurate on the data sets tested.