Abstract
We propose a method for computing binary orthogonal non-negative matrix factorization (BONMF) for clustering and classification. The method is tested on several representative real-world data sets. The numerical results confirm that the method has improved accuracy compared to the related techniques. The proposed method is fast for training and classification and space efficient.
Access provided by Autonomous University of Puebla. Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
For a data matrix \(\textbf{X}\) of size \(m\times n\), \(\textbf{WH} \approx \textbf{X}\) (where \(\textbf{W}\) is of size \(m \times k\), \(\textbf{H}\) is of size \(k \times n\)) is considered as a low rank approximation (\(k \ll n\)) of the data matrix \(\textbf{X}\). Low rank approximations are essential in machine learning applications and especially in natural language processing and topic modelling where the data matrix is constructed over a collection of words from a vocabulary and a usually large collection of documents [10, 21, 24, 34].
Singular value decomposition (SVD) [28] is an early approach for computing such a low rank approximation of data. SVD minimizes the Frobenius norm and the spectral norm simultaneously; not only that, the columns of \(\textbf{W}\) are orthogonal, and the rows of \(\textbf{H}\) are also orthogonal. However, the entries in \(\textbf{W, H}\) may be negative, which reduces the utility of SVD for data matrix \(\textbf{X}\) in which the entries are positive as the factors in \(\textbf{W}\) do not have an intuitive explanation. Non-negative matrix factorization (NMF), \(\textbf{WH} \approx \textbf{X}\) and \(\textbf{X, W,H} \ge \textbf{0},\) was introduced by Paatero and Tapper [20] to overcome this difficulty of interpretation of the factors. NMF was shown to be NP-complete by Vavasis [27]. NMF does not require the columns of \(\textbf{W}\) to be orthogonal, and this is considered a severe drawback in some applications as the columns (factors) of \(\textbf{W}\) are not separable by a large angle. Keeping this limitation in mind Ding et al. [34] introduced orthogonality constraints in NMF, \(\textbf{X} \approx \textbf{WH}\) and \(\textbf{X, W, H} \ge \textbf{0,}\) the rows of \(\textbf{H}\) are orthogonal and demonstrated that is an effective approach for clustering of documents.
We consider the following problem: given a \(m \times n\) data matrix \(\textbf{X}\), we wish to represent \(\textbf{X}\) as a product of two matrices \(\textbf{W, H}\) with dimensions \(m \times k\) and \(k \times n\) respectively with the following restrictions: entries in \(\textbf{W}\) are positive, the entries in \(\textbf{H}\) are either 0 or 1, \(\mathbf{{ H H}}^T = I\) and the norm \(|| \mathbf{{ X - WH }}||^2\) is minimized. Additionally, we want k to be small compared to n, m. The columns of the data matrix \(\textbf{X}\) can be thought of as the n samples. Low rank \(\textbf{W}\) represents the latent features. We call this problem the binary orthogonal nonnegative matrix factorization problem (BONMF).
1.1 Contributions
This paper gives a new method (Algorithm 1) for computing a binary orthogonal NMF using the two-phase iterative approach. In the first phase, we use a known update rule [16] to compute the factor \(\textbf{W}\). In the second phase, we use the observation that the binary constraints on \(\textbf{H}\) have a geometric interpretation. This gives an efficient rule to update \(\textbf{H}\) in each iteration (Eq. (12)). The entries in \(\textbf{H}\) are binary, and they are computed column-wise. If all the entries in \(\textbf{H}\) are non-zero, then O(nk) space is needed. However, \(\textbf{H}\) is binary, and the rows of \(\textbf{H}\) are orthogonal. Therefore, only O(n) space is needed. If we compute the entries of \(\textbf{H}\) columns-wise, intermediate states also need O(n) space. The computation for each column of \(\textbf{H}\) takes \(O(n^2k)\) steps. Therefore, the method is space efficient.
We evaluate the method’s performance (in Sect. 4) for training and testing on reference data sets from the ML repository. The experiments demonstrate that the training and the classification phase are efficient (Table 2). The method is accurate and outperforms the state of art methods (Table 2). This method uses k dot products of m element vectors to update each column of the coefficient matrix \(\textbf{H}\) where k is the number of classes in the data set. This is a significant reduction in the computation needed compared to the algorithms of [15, 32, 34] in the classification phase. The method is also space efficient as \(\textbf{H}\) is sparse.
2 Related Work
We begin with NMF and the related background needed to describe our algorithm. Given a non-negative matrix , a non-negative matrix factorization of \(\textbf{X}\) finds two non-negative matrices and with \(k\ll \min (m,n)\) such that:
and the entries in \(\textbf{W, H}\) are positive. The factorization has a natural interpretation [15] and can be computed using various unsupervised machine learning methods. Due to its intuitive interpretation, NMF has found numerous applications such as data consolidation [8], image clustering [6], topic modelling [2], community detection [29], recommender systems [22], and gene expression profiling [33].
BONMF is different from NMF. The entries in \(\textbf{H}\) are restricted to binary. If the columns of \(\textbf{H}\) are orthogonal, then the columns can be used to cluster the data. Therefore, BONMF factorization has several exciting applications [31].
Orthogonal NMF (ONMF) in which \(\textbf{X} \approx \textbf{WH}\) and \(\textbf{W, H} \ge \textbf{0}\) and \(\mathbf{{ H H}}^T = I\) was defined by Ding et al. [34] who gave an algorithm based on solving the Lagrangian relaxation. The entries in \(\textbf{H}\) in ONMF are not required to be binary. ONMF use for data clustering was popularized by Seung and Lee [23]. One of the first notable applications of orthogonal NMF to document clustering is in [30] who gave improved algorithms and showed that ONMF performed better at document clustering than NMF. K-means [19] is one of the most widely used algorithms for unsupervised learning. Bauckhage [3] showed that the objective function of K-means can be rewritten as ONMF if the entries in \(\textbf{H}\) are binary, and the following condition holds:
Therefore, BONMF is equivalent to K-means clustering. BONMF was also studied by Zdunek [31] and differs from the well-studied non-negative matrix factorization (NMF). Lee et al. [16] studied BONMF without the condition (1) on \(\textbf{H}\) and gave an algorithm for determining such a factorization. However, applications to classification are not many. In this paper, we study BONMF for its use in prediction and classifying data, including clustering.
2.1 NMF
NMF can be formulated as the following optimization problem that minimizes the square of the Frobenius norm:Footnote 1
Most of the methods for computing NMF are based on iterative update rules. A popular set of update rules given below is due to Lee and Seung [15], the iteration number is in superscript.
For many more variations on such update rules, see [11]. Optimization approaches such as block-coordinate descent, projected gradient descent, and alternating non-negative least squares (ANLS) [18] have also been used for NMF. ANLS transforms the problem in (2) into two convex optimization problems:
We can solve the optimization problems given by (5) and (6) in a few ways. [12, 13] gave the Rank-one Residue Iteration (RRI) algorithm for computing NMF. This algorithm was also independently proposed by Cichocki et al. [5], which is called the Hierarchical Alternating Least Squares (HALS) algorithm. The solution to (5) and (6) in HALS/RRI is given by explicit formulas, which make for easy implementation. Kim et al. [14] used Newton and quasi-Newton methods to solve (5), (6) and showed that their method has faster convergence. However, these methods require determining a suitable active set of the constraints in each iteration [4]. Two efficient algorithms for approximately orthogonal NMF were given by Li et al. [17]. Asymmetric NMF with Beta-divergences approach was studied by Lee et al. [16].
NMF is a quadratic boolean optimization problem, so it can also be solved using the Quantum Simulated Annealing (QSA) approach of Farhi et al. [7]. Recently, Golden and O’Malley [9] used a combination of forward and reverse annealing in the quantum annealing to obtain improved performance of QSA for NMF.
2.2 Binary Orthogonal NMF
Given a non-negative matrix \({\textbf{X}}\in \mathbb {R}^{m\times n}\), a BONMF of \({\textbf{X}}\) finds the non-negative matrix \({\textbf{W}}\in \mathbb {R}^{m\times k}\) and a binary \({\textbf{H}}\in \{0,1\}^{k\times n}\) with \(k\ll \min (m,n)\). The BONMF can be written as the following optimization problem:
Using the ANLS approach [18] we can transform (7) into the following sub-problems:
The problem (8) can be solved using the update rule (3) of [15]. Sub-problem (9) is solved in two different ways in the following papers. Zhang et al. [32] update each row of the matrix H using the following strategy:
where
and \({\textbf{z}}\) is the k-th column of \({\textbf{W}}\), and \({\textbf{W}}'\) is the matrix of \({\textbf{W}}\) excluding \({\textbf{z}}\); \({\textbf{h}}^T\) is the k-th row of \({\textbf{H}}\) and \({\textbf{H}}'\) is the matrix of \({\textbf{H}}\) excluding \({\textbf{h}}^T\). In addition, \({\textbf{I}} \in \mathbb {R}^n\) is a vector whose entries are all one. Zdunek [31] presented another method for updating \(\textbf{H}\) under the assumption that \(\textbf{H}\) is orthogonal, which uses simulated annealing. Since they use a different approach, we don’t describe it in detail.
3 The Algorithm
This section describes our approach. The method solves the optimization problems given by (8) and (9). To solve (8), we use the update rule given by equation (3) [15] where \(\textbf{W}\) is computed using
Given \( {\textbf{X, H}}\), to solve (9) we write the problem as:
Each column of the matrix \( {\textbf{H}}\) is computed in two steps as follows:
-
In the first step, we calculate the angular distance between column i of \( {\textbf{X}}\) and column j of matrix \( {\textbf{W}}\) to obtain \( {\textbf{H}}_{j,i}\).
$$\begin{aligned} {\textbf{H}}_{j,i}= \frac{\langle {\textbf{X}}_{:,i}, {\textbf{W}}_{:,j}\rangle }{\Vert {\textbf{X}}_{:,i}\Vert \Vert {\textbf{W}}_{:,j}\Vert }, \end{aligned}$$(12)where \( {\textbf{X}}_{:,i}\) denotes the i-th column of matrix \( {\textbf{X}}\) and \(\langle .,.\rangle \) is the inner product.
-
In the second step, the maximum value (any) in each matrix column \( {\textbf{H}}\) is changed to 1, and other values are changed to 0. The process can be summarized as follows:
$$\begin{aligned} \mathbf{{H}}_{j,i}= \left\{ \begin{array}{ll} 1,&{}\qquad if ~ \mathbf{{H}}_{j,i}=\max \mathbf{{H}}_{:,i} \\ 0,&{}\qquad otherwise. \end{array} \right. \end{aligned}$$(13)
The pseudo-code for the method is in Algorithm 1. These steps are executed column by column for \( \textbf{H}.\)
4 Empirical Evaluation
This section examines three characteristics of Algorithm 1. We study the time needed for classification, the accuracy, and the time required for computing the factorization (training time) for the data sets shown in Table 1. The data sets are representative of the varying complexity of machine learning; some are easy (digits), some are hard (diabetes), and some have a significant number of features (ORL). These are popular datasets from the OpenML repository [26]. These data sets have multiple single label classes and serve as a nice testbed for evaluating unsupervised learning algorithms, even in deep learning.
We compare the performance of Algorithm 1 with the algorithms for orthogonal matrix factorization [34], non-negative matrix factorization [15], and semi-binary non negative matrix factorization [32]. We examine the relative performance of these algorithms for accuracy and classification time. We use two methods for ONMF to classify a new data point j (column vector of \(\textbf{X}\)). Typically, ONMF uses the index i of the maximum entry in column \( {\textbf{H}}_{:,j}\) for classification, which gives the cluster to which data j belongs. The data points in cluster i may have different labels, and the label of j is the label of the point in cluster i that is closest (distance-wise). We refer to this default scheme for determining the label as ONMF in Table 2. The second scheme we use to determine the label uses the label on \(i'\), which is the point in cluster i that forms the smallest angle data point (vector for j); then, the label of \(i'\) is used to classify point j. The cluster to which data point j belongs is again computed based on the angles to the columns of \(\textbf{W}\), the closest column of \(\textbf{W}\) determines the cluster, and the closest point in the cluster (angle-wise) determines the label. The second scheme is ONMF-cos in Table 2. The other two algorithms that we used to compare are i) the popular and the foundational algorithm of [15] for NMF, labelled “Lee and Seung” and ii) the algorithm of [32] for NMF with the constraint that the entries in \(\textbf{H}\) are binary (labelled “Zhang et al."). We use only the matrix-based method factorization algorithms closest to the K-means for evaluation. As part of a future study, it would be interesting to see how these algorithms perform against a highly optimized implementation of K-means.
We report on experiments that were run on a laptop (i5-7200U, 12GB of RAM). The algorithms [15, 32] were coded in Python 3.1. We used the number of classes as the rank in factorization. Eighty percent of the data was used for training, and the remaining was used for testing the accuracy. We use the python library (ionmf.factorization.onmf) for ONMF [25]. Initialization of \(\textbf{W, H}\) is done using the following scheme: we sort the columns of the matrix \(\textbf{X}\) based on its norm. To determine the \(i^{th}\) column of \(\textbf{W}\), use the average of ten randomly chosen columns from the first thirty columns of \(\textbf{X}\) as in [1]. Initial matrix \( {\textbf{H}}^0\) is computed using \( {\textbf{H}}^0=( {\textbf{W}}^T {\textbf{W}})^{-1} {\textbf{W}}^T {\textbf{X}}.\) Since the initial values of \(\textbf{W}\) are random, we run the algorithm thirty times and report the averages in Table 2. The first thing to note is that in Algorithm 1 extra computation is needed to convert \(\textbf{H}\) to binary in each iteration. This computation increases the time needed for factorization relative to ONMF and NMF and is linear in the size of \(\textbf{H}\). However, given the factorization, the classification phase is more efficient, and \(\textbf{H}\) is sparse.
4.1 Classification
In the basic NMF approach given by update rules (3) and (4) (as in [15]), the number of steps needed for the classification of new data (column vector of \(\textbf{X}\)) is proportional to the number of columns in the factorization \(\mathbf{W, H.}\) We need to calculate the angle between the coefficient vector for the new data and all the columns of \(\textbf{H}\) (as many as the columns in \(\textbf{X}\)) to determine the label for the data. Algorithm 1 does not share this disadvantage. We can compute the angle of the sample to every column of \(\textbf{W}\) (a low-rank matrix) and use the closest column to determine the label. This observation is reflected in data in the row labelled “TT (s)" in Table 2.
4.2 Accuracy
The accuracy of the five methods is presented in Table 2. The entries in bold font indicate that a particular method had the most accuracy). Six of eight data sets (except pendigits and banking) have improved accuracy for classification when \(\cos \) angles are used to measure similarity, and the utility of using the angle (12) is evident. The method presented here has the best accuracy on six of the eight data sets (expect optdigits, phishing). Regarding classification time, it performs best on seven of the eight datasets. Note that the other method ONMF+\(\cos \) which is as competitive as Algorithm 1 uses O(nk) space to store \(\textbf{H}\) whereas Algorithm 1 uses O(n) space even in the intermediate stages of the calculations.
4.3 Running Time and Space
Table 2 shows the running time for the training and classification phase of the five algorithms. The entries in bold signify that the running time is the smallest. The proposed Algorithm 1 is as fast and accurate as the other best method in the table, which is the modified version of ONMF in which angles are used for classification. However, we cannot directly compare the factorization returned by the two algorithms, as ONMF returns \(\textbf{H}\) with orthogonal rows in which entries are real. In contrast, the method proposed here returns a sparse \(\textbf{H}\), which is binary and has orthogonal rows. Algorithm 1 is space efficient compared to the ONMF+cos method, which is an essential consideration for large data sets.
Based on the data in Table 2, we can conclude that Algorithm 1 has competitive accuracy and leads to better clustering with a natural interpretation and a sparse representation. It also classifies new data faster.
5 Conclusion
This paper gives a new geometric approach for binary orthogonal non-negative matrix factorization. The proposed method is space efficient. We also compared the proposed Algorithm with three other methods [15, 32, 34] for accuracy and time on representative datasets in machine learning. Our experiments show that the method is fast and accurate on the data sets tested.
Notes
- 1.
\(|| {{\textbf {A}}}||_F = \sqrt{tr( {{\textbf {A}}^{\textbf {T}} \times {\textbf {A}}})} = \sum _{i,j} |a_{ij}|^2\).
References
Albright, R., et al.: Algorithms, initializations, and convergence for the nonnegative matrix factorization. Technical report. 919. NCSU Technical Report Math 81706 (2006). http://meyer.math.ncsu.edu/Meyer/Abstracts/Publications.html
Arora, S., et al.: A practical algorithm for topic modeling with provable guarantees. In: International Conference on Machine Learning, pp. 280–288. PMLR (2013)
Bauckhage, C.: K-means clustering is matrix factorization. arXiv preprint arXiv:1512.07548 (2015)
Bertsekas, D.P.: Projected Newton methods for optimization problems with simple constraints. SIAM J. Control. Optim. 20(2), 221–246 (1982)
Cichocki, A., Zdunek, R., Amari, S.: Hierarchical ALS algorithms for nonnegative matrix and 3D tensor factorization. In: Davies, M.E., James, C.J., Abdallah, S.A., Plumbley, M.D. (eds.) ICA 2007. LNCS, vol. 4666, pp. 169–176. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74494-8_22
Fu, X., et al.: Nonnegative matrix factorization for signal and data analytics: identifiability, algorithms, and applications. IEEE Signal Process. Mag. 36(2), 59–80 (2019)
Farhi, E., et al.: Quantum computation by adiabatic evolution. arXiv preprint quant-ph/0001106 (2000)
Gao, M., et al.: Feature fusion and non-negative matrix factorization based active contours for texture segmentation. Signal Process. 159, 104–118 (2019)
Golden, J., O’Malley, D.: Reverse annealing for nonnegative/binary matrix factorization. PLoS ONE 16(1), e0244026 (2021)
Haddock, J., et al.: Semi-supervised nonnegative matrix factorization for document classification. In: 2021 55th Asilomar Conference on Signals, Systems, and Computers, pp. 1355–1360. IEEE (2021)
Han, L., Neumann, M., Prasad, U.: Alternating projected Barzilai-Borwein methods for nonnegative matrix factorization. Electron. Trans. Numer. Anal. 36(6), 54–82 (2009)
Ho, N.-D.: Nonnegative matrix factorization algorithms and applications. Ph.D. thesis. Citeseer (2008)
Ho, N.-D., Van Dooren, P., Blondel, V.D.: Descent methods for nonnegative matrix factorization. In: Van Dooren, P., Bhattacharyya, S., Chan, R., Olshevsky, V., Routray, A. (eds.) Numerical Linear Algebra in Signals, Systems and Control. Lecture Notes in Electrical Engineering, vol. 80, pp. 251–293 (2011). https://doi.org/10.1007/978-94-007-0602-6_13
Kim, D., Sra, S., Dhillon, I.S. Fast Newton-type methods for the least squares nonnegative matrix approximation problem. In: Proceedings of the 2007 SIAM International Conference on Data Mining, pp. 343–354. SIAM (2007)
Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791 (1999)
Lee, H., Yoo, J., Choi, S.: Semi-supervised nonnegative matrix factorization. IEEE Signal Process. Lett. 17(1), 4–7 (2009)
Li, B., Zhou, G., Cichocki, A.: Two efficient algorithms for approximately orthogonal nonnegative matrix factorization. IEEE Signal Process. Lett. 22(7), 843–846 (2014)
Lin, C.-J.: Projected gradient methods for nonnegative matrix factorization. Neural Comput. 19(10), 2756–2779 (2007)
MacQueen, J.: Classification and analysis of multivariate observations. In: 5th Berkeley Symp. Math. Statist. Probability, pp. 281–297 (1967)
Paatero, P., Tapper, U.: Positive matrix factorization: a nonnegative factor model with optimal utilization of error estimates of data values. Environmetrics 5(2), 111–126 (1994)
Pauca, V.P., et al.: Text mining using non-negative matrix factorizations. In: Proceedings of the 2004 SIAM International Conference on Data Mining, pp. 452–456. SIAM (2004)
Ran, X., et al.: A differentially private nonnegative matrix factorization for recommender system. Inf. Sci. 592, 21–35 (2022)
Seung, D., Lee, L.: Algorithms for non-negative matrix factorization. Adv. Neural. Inf. Process. Syst. 13, 556–562 (2001)
Shahnaz, F., et al.: Document clustering using nonnegative matrix factorization. Inf. Process. Manage. 42(2), 373–386 (2006)
Stražar, M., et al.: Orthogonal matrix factorization enables integrative analysis of multiple RNA binding proteins. Bioinformatics 32(10), 1527–1535 (2016)
Vanschoren, J., et al.: OpenML: networked science in machine learning. ACM SIGKDD Explor. Newsl. 15(2), 49–60 (2014)
Vavasis, S.A.: On the complexity of nonnegative matrix factorization. SIAM J. Optim. 20(3), 1364–1377 (2010)
Wold, S., Esbensen, K., Geladi, P.: Principal component analysis. Chemometr. Intell. Lab. Syst. 2(1–3), 37–52 (1987)
Ye, Z., et al.: CDCN: a new NMF-based community detection method with community structures and node attributes. Wireless Commun. Mobile Comput. 2021, 1–12 (2021)
Yoo, J.H., Choi, S.J.: Nonnegative matrix factorization with orthogonality constraints. J. Comput. Sci. Eng. 4(2), 97–109 (2010)
Zdunek, R.: Data clustering with semi-binary nonnegative matrix factorization. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2008. LNCS (LNAI), vol. 5097, pp. 705–716. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-69731-2_68
Zhang, M., et al.: Non-negative matrix factorization for binary space learning. In: 2021 13th International Conference on Advanced Computational Intelligence (ICACI), pp. 215–219. IEEE (2021)
Zhu, Y.-L., et al.: Ensemble adaptive total variation graph regularized NMF for single cell RNA-seq data analysis. Curr. Bioinform. 16(8), 1014–1023 (2021)
Ding, C., et al.: Orthogonal nonnegative matrix t-factorizations for clustering. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 126–135 (2006)
Acknowledgements
The authors would like to thank Chirag Wadhwa for discussion and comments.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Fathi Hafshejani, S., Gaur, D., Hossain, S., Benkoczi, R. (2023). Binary Orthogonal Non-negative Matrix Factorization. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Communications in Computer and Information Science, vol 1792. Springer, Singapore. https://doi.org/10.1007/978-981-99-1642-9_3
Download citation
DOI: https://doi.org/10.1007/978-981-99-1642-9_3
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-1641-2
Online ISBN: 978-981-99-1642-9
eBook Packages: Computer ScienceComputer Science (R0)