1 Introduction

The nonnegative matrix factorization (NNMF) has become a very important tool in fields such as document clustering, data mining, machine learning, data analysis, image analysis, audio source separation and bioinformatics [16]. NNMF consists of approximating a matrix \( A \in \mathbb {R}^{m \times n} \) with nonnegative elements by the product of two matrices \(W\) and \(H\), where \( W \in \mathbb {R}^{m \times k} \) and \( H \in \mathbb {R}^{k \times n}\), with \( k \le \min (m,n)\), are two lower rank matrices with nonnegative elements too, such that \( A \approx WH \). The problem is posed as the computation of two matrices \(W_{0}\), \(H_{0}\) such that

$$\begin{aligned} \Vert W_{0}H_{0} - A \Vert _{F} = \min _{W, H\ge 0} \Vert WH - A \Vert _{F}. \end{aligned}$$
(1)

Other norms can be used instead of the Frobenius norm (see, for instance, [7], where the NNMF is also defined in terms of the Kullback–Leibler divergence). Also, many algorithms have been proposed for NNMF computation (see [5, 7–11]).

The relevance of this factorization lies in the dimensionality reduction it achieves, which effectively works as a compression tool for many applications, because the matrix \(A\) is approximated by the product of two lower rank matrices. Though useful, NNMF is also a computationally demanding task, which encourages the development of efficient routines capable of reducing its high execution time.

Moreover, modern computer architectures have evolved from CPUs with a moderate number of cores (multi-core) to heterogeneous systems where CPUs are aided by hardware accelerators with a huge number of computational cores (many-core). Therefore, these complex (heterogeneous) architectures are an essential tool for tackling the NNMF when dealing with large-scale matrices.

Previous results on the parallelization of some algorithms for computing the NNMF can be found in [12]. In [13] a first approach to a library for NNMF computation, called NnmfPack, was presented. NnmfPack is an efficient numerical library conceived for shared memory heterogeneous parallel systems, and it supports, from its conception, both conventional multi-core processors and (many-core) coprocessors such as the Intel Xeon Phi and CUDA compatible Graphics Processing Units (GPUs). Its routines are also invocable from MATLAB/Octave through MEX interfaces, which increases NnmfPack's usability. However, only some initial aspects of the library were described in [13].

Although the development of the library was initially inspired by a wide variety of audio signal processing applications (see, for example, [3, 14, 15]), our interest is to provide a general tool (not constrained to audio problems) which can be used in any other field where the NNMF decomposition is required. For example, in digital image processing (face recognition, optical character recognition, content based image retrieval, etc.), each monochrome digital image is a rectangular array of pixels, and each pixel is represented by its light intensity. Since the light intensity is measured by a nonnegative value, we can represent each image as a nonnegative matrix, where each element is a pixel. Colour images can be coded in the same way but with several nonnegative matrices.

Therefore, in this article the library is extended to cope with different metric options (\(\beta \)-divergence), and some other parallel algorithms have been added and tested. As a result, we include an evaluation of the performance of the new routines on several parallel architectures (GPU, multi-core, etc.).

The remainder of the article is organized as follows. In Sect. 2 we define the different metrics used in the library and present the computational cost of the implemented algorithms. The experimental performance of the implemented algorithms is analysed on different architectures in Sect. 3, together with an example related to image reconstruction. The article ends with a section devoted to conclusions.

2 Approximation and cost functions

To find an approximate factorization \( A \approx WH \), it is necessary to define cost functions that quantify the quality of the approximation. The factorization is usually sought through the minimization problem

$$\begin{aligned} \min _{W, H \ge 0} D(A | WH), \end{aligned}$$
(2)

where \(D(A | WH)\) is a cost function defined by

$$\begin{aligned} D(A | WH)= \sum _{i=1}^m \sum _{j=1}^n d([A]_{ij} | [WH]_{ij}), \end{aligned}$$
(3)

and \(d(x | y)\) is a scalar cost function.

Lee and Seung proposed in [7] the use of the Euclidean distance

$$\begin{aligned} d(x|y)=\frac{1}{2}(x-y)^2 \end{aligned}$$
(4)

and the Kullback–Leibler divergence defined by

$$\begin{aligned} d(x|y)=x \log \frac{x}{y} -x +y. \end{aligned}$$
(5)

Thus, for example, considering the Euclidean distance and using gradient descent algorithms to solve the minimization problem, they obtain the following update rules

$$\begin{aligned} H \leftarrow H \cdot \frac{W^T A}{(W^T W) H}, \quad W \leftarrow W \cdot \frac{A H^T}{W (H H^T)}, \end{aligned}$$
(6)

where the symbol \(\cdot \) and the fraction bar denote entrywise matrix product and division, respectively.
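As an illustration only (this is a minimal sketch, not part of NnmfPack, whose optimized kernels are described in [13]), the update rules (6) can be written in a few lines of NumPy; the constant eps (to avoid divisions by zero), the fixed number of iterations and the random initialization are assumptions of the sketch:

```python
import numpy as np

def nnmf_euclidean(A, k, iters=100, eps=1e-12, seed=0):
    """Minimal sketch of the multiplicative updates (6) for the Euclidean cost."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))          # nonnegative random initialization
    H = rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)    # H <- H . (W^T A) / ((W^T W) H)
        W *= (A @ H.T) / (W @ (H @ H.T) + eps)  # W <- W . (A H^T) / (W (H H^T))
    return W, H
```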

In our previous work (see [12]), the multiplicative algorithms used the Frobenius norm exclusively as the measure of the goodness of the approximation of \(A\) by \(WH\). In many applications of the NNMF it is more useful to employ other metrics to measure the closeness between these matrices. It can be empirically checked that the \(\beta \)-divergence metrics (introduced by Eguchi and Minami, see [16]) provide more accurate results for specific problems, while maintaining, under certain conditions, their convergence.

The new built-in algorithms are a generalization of those presented in [12], and they are parameterized and optimized to use different types of metrics depending on the value of the parameter \(\beta \): the Frobenius norm (\(\beta =2\)), the Kullback–Leibler divergence (\(\beta =1\)), the Itakura–Saito divergence (\(\beta =0\)), or any other metric obtained for different values of \(\beta \).

In the NnmfPack library we have implemented subroutines that use as cost function the \(\beta \)-divergence, which can be defined as (see, e.g. [17])

$$\begin{aligned} d_{\beta } (x | y ) := \left\{ \begin{array}{ll} \frac{1}{\beta (\beta -1)} \left( x^{\beta } + (\beta -1) y^{\beta } - \beta x y^{\beta -1}\right), & \quad \mathrm{if } \; \beta \in \mathbb {R}{\setminus }\{0,1\}, \\ x (\log x - \log y )+ (y -x), & \quad \mathrm{if } \; \beta =1, \\ \frac{x}{y} - \log \frac{x}{y} -1, & \quad \mathrm{if } \; \beta =0. \end{array} \right. \end{aligned}$$
(7)

The previous cost function is defined for all real values of \(\beta \), but in applications \(\beta \) usually takes values between 0 and 2. Thus, in our experiments we only tested values of \(\beta \) in this range.

Taking into account [17] and using the gradient criterion, it is possible to obtain the following rules to update the matrices \(H\) and \(W\):

$$\begin{aligned} H \leftarrow H \cdot \frac{W^T((W H)^{.\beta -2} \cdot A)}{W^T(W H)^{.\beta -1}}, \quad W \leftarrow W \cdot \frac{((W H)^{.\beta -2} \cdot A) H^T}{(W H)^{.\beta -1} H^T}, \end{aligned}$$
(8)

where \(X^{.n}\) denotes the matrix with entries \(([X]_{ij})^n\).

In the routines implemented in the library, the parameter \(\beta \) can be chosen by the user, thus allowing the use of different metrics.
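For illustration purposes, the generic \(\beta \)-divergence updates (8) can be sketched in NumPy as follows; this sketch mirrors the mathematical formulation, not NnmfPack's optimized implementation, and the constant eps (to avoid divisions by zero and zero bases raised to negative powers) and the random initialization are assumptions:

```python
import numpy as np

def nnmf_beta(A, k, beta=1.5, iters=100, eps=1e-12, seed=0):
    """Minimal sketch of the multiplicative beta-divergence updates (8)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        WH = W @ H + eps
        H *= (W.T @ (WH ** (beta - 2) * A)) / (W.T @ WH ** (beta - 1) + eps)
        WH = W @ H + eps                      # recompute with the updated H
        W *= ((WH ** (beta - 2) * A) @ H.T) / (WH ** (beta - 1) @ H.T + eps)
    return W, H
```

Note that for \(\beta =2\) these updates reduce to (6), while \(\beta =1\) and \(\beta =0\) correspond to the Kullback–Leibler and Itakura–Saito cases, respectively.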

2.1 Computational costs

The algorithm described in Sect. 2 is a multiplicative algorithm with a uniform cost per iteration of approximately

$$\begin{aligned} 2(6mnk + 3mn +mk +nk) \end{aligned}$$
(9)

flops per iteration if \(\beta \in \mathbb {R}{\setminus }\{0,1\}\), where \(n\), \(m\) and \(k\) are the dimensions of the input matrix \( A \in \mathbb {R}^{m \times n} \) and of the two lower rank matrices \( W \in \mathbb {R}^{m \times k} \) and \( H \in \mathbb {R}^{k \times n} \), with \( k \le \min (m,n)\). If the value of the parameter \(\beta \) is \(2\) (corresponding to the Frobenius norm), the overall computational cost in flops per iteration is

$$\begin{aligned} 2(6mnk + mk + nk). \end{aligned}$$
(10)

The cost of NnmfPack's MLSA algorithm (see [13]), an efficient implementation of the Lee and Seung algorithm [7], is

$$\begin{aligned} 4kmn + 2k(m+n)(2k+1), \end{aligned}$$
(11)

flops per iteration, which is lower than (10).

This cost can be obtained by assuming the flop definition given in [18] and considering that, in expression (6), the costs of basic operations such as matrix–matrix multiplication (\(2mnk\) flops for \(X=Y Z\), with \(X \in \mathbb {R}^{m \times n}\), \(Y \in \mathbb {R}^{m \times k}\), \(Z \in \mathbb {R}^{k \times n}\)) or entrywise multiplication/division (\(2mn\) flops for \(X=Y \cdot Z/T\), with \(X, Y, Z, T \in \mathbb {R}^{m \times n}\)) are known.

Thus, in the update (6) the following operations must be performed:

$$\begin{aligned} (W^T W) \;&\rightarrow \; 2mk^2 \; \mathrm{flops}\end{aligned}$$
(12)
$$\begin{aligned} (W^T W)H \;&\rightarrow \; 2k^2n \; \mathrm{flops}\end{aligned}$$
(13)
$$\begin{aligned} W^T A \;&\rightarrow \; 2mnk \; \mathrm{flops}\end{aligned}$$
(14)
$$\begin{aligned} H \cdot (W^T A)/((W^T W)H) \;&\rightarrow \; 2kn \; \mathrm{flops}\end{aligned}$$
(15)
$$\begin{aligned} (H H^T) \;&\rightarrow \; 2nk^2 \; \mathrm{flops}\end{aligned}$$
(16)
$$\begin{aligned} W(H H^T) \;&\rightarrow \; 2k^2m \; \mathrm{flops}\end{aligned}$$
(17)
$$\begin{aligned} A H^T \;&\rightarrow \; 2mnk \;\mathrm{flops}\end{aligned}$$
(18)
$$\begin{aligned} W \cdot (A H^T)/(W(H H^T)) \;&\rightarrow \; 2mk \; \mathrm{flops} \end{aligned}$$
(19)

The overall cost per iteration, (11), is obtained by adding all these partial costs.
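Indeed, the two matrix–matrix products involving \(A\), (14) and (18), contribute \(4kmn\) flops; the products involving \(k \times k\) matrices, (12), (13), (16) and (17), contribute \(4k^2(m+n)\) flops; and the entrywise updates (15) and (19) contribute \(2k(m+n)\) flops, so that

$$\begin{aligned} 4kmn + 4k^2(m+n) + 2k(m+n) = 4kmn + 2k(m+n)(2k+1), \end{aligned}$$

which is exactly (11).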

Although MLSA and \(\beta \)-divergence with \(\beta =2\) are mathematically equivalent, MLSA’s cost is lower due to a rearrangement of its operations. Therefore, when \(\beta \) is equal to \(2\), MLSA is used instead of the \(\beta \)-divergence version.

Finally, if \(\beta \) equals \(1\), some matrix operations can be replaced by vector operations and, therefore, the computational cost can be approximated by

$$\begin{aligned} 2(4mnk + 2mk + mn+ nk) \end{aligned}$$
(20)

flops per iteration.
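For quick reference, the per-iteration flop counts (9)–(11) and (20) can be encoded as a small Python helper (a convenience sketch for comparing costs, not part of NnmfPack):

```python
def flops_beta_general(m, n, k):
    """Eq. (9): beta-divergence, beta not in {0, 1}."""
    return 2 * (6 * m * n * k + 3 * m * n + m * k + n * k)

def flops_beta2(m, n, k):
    """Eq. (10): beta-divergence with beta = 2 (Frobenius norm)."""
    return 2 * (6 * m * n * k + m * k + n * k)

def flops_mlsa(m, n, k):
    """Eq. (11): NnmfPack's MLSA algorithm."""
    return 4 * k * m * n + 2 * k * (m + n) * (2 * k + 1)

def flops_beta1(m, n, k):
    """Eq. (20): beta-divergence with beta = 1 (Kullback-Leibler)."""
    return 2 * (4 * m * n * k + 2 * m * k + m * n + n * k)

# Example: for m = n = 10000 and k = 1000, MLSA needs roughly 40 % of the
# flops per iteration required by the beta = 2 variant.
print(flops_mlsa(10_000, 10_000, 1_000) / flops_beta2(10_000, 10_000, 1_000))
```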

3 Experimental results

In [13], NnmfPack's design principles and a general outline of its functionality are given. In addition, [13] shows the installation procedure and gives some examples of use. A complete description of NnmfPack can be found on its website [19]. The algorithms presented in this work are included in NnmfPack according to the specifications given in [13, 19].

Furthermore, for a better comparison with previous empirical results, the experiments carried out in this article were run on the same test-bed system, namely an ASUS server with:

  • Two Intel Xeon E5-2650 CPUs @ 2.0 GHz with 64 GB of RAM. Their peak performance in double precision is 256 GFLOPS.

  • One NVIDIA Tesla K40m GPU with 2880 cores @ 745 MHz, 12 GB of GDDR5 RAM and a peak performance of 1.43 TFLOPS in double precision.

  • One Intel Xeon Phi 5110P coprocessor with 60 cores @ 1.053 GHz, 8 GB RAM and a peak performance of 1.01 TFLOPS in double precision.

  • CPU and Xeon Phi codes were compiled with the Intel C compiler, version 14.0.2.144, and Intel MKL, version 11.1.2.

  • For the Tesla GPU, the NVIDIA NVCC compiler, version 6.0.37, and MAGMA, version 1.4.1, have been used.

  • The MATLAB version used is 8.1.0.604 (R2013a), 64-bit, and the GNU Octave version is 3.4.3, 64-bit. Both environments have been executed on the ASUS server cited above.

Since the aim of this work is to analyse the behaviour of the new algorithms and not to compare the efficiency of all NnmfPack's kernels, the experiments focus on the performance of the general case (\(\beta \in \mathbb {R}{\setminus }\{0,1\}\)) of the \(\beta \)-divergence algorithms (\(\beta \)Div from now on).

The analysis has been carried out using as input matrix \(A \in \mathbb {R}^{m \times n} \) a uniformly distributed random positive square matrix (\( m=n \)). For a more comprehensive study, for each \(m\) (or \(n\)), the value of \(k\) (the inner dimension of the approximate factorization) ranges from 10 to 100 % of \(m\) (or \(n\)) in steps of 10 %.

Figure 1 shows the performance, in terms of GFLOPS, of the \(\beta \)Div algorithm, and Table 1 presents the percentage of the theoretical peak performance achieved by the empirical results on the different architectures for some values of \(m\), \(n\) and \(k\).

Fig. 1 NnmfPack performance for the \(\beta \)Div algorithm

Table 1 Theoretical performance percentage achieved by empirical results

Figure 1 and Table 1 show that, when using the 16 cores available in the testing machine (CPUx16 in Fig. 1), the CPU implementation achieves about 90 % of its peak performance (i.e. with \(m=n=k=10{,}000\)), while the sequential algorithm (CPUx1 legend in Fig. 1) achieves more than 95 % of it. The Xeon Phi's performance exceeds 70 % of its peak performance, reaching more than 700 GFLOPS. In a similar vein, the Tesla K40m reaches up to 1240 GFLOPS, more than 85 % of its peak performance. These percentages are consistent with those presented in [13], being slightly lower due to the greater number of non Level 3 BLAS operations per iteration compared with NnmfPack's MLSA algorithm.

It is noteworthy that, when establishing a comparison between the MATLAB/Octave functions and \(\beta \)Div, an initial adjustment of the MATLAB/Octave input parameters is needed so that fair conditions are guaranteed. In this way, MATLAB/Octave's mult NNMF driver (call nnmf(A, k, 'alg', 'mult')) is equivalent to NnmfPack's MLSA kernel. Table 2 shows the results obtained for three test examples when the number of iterations is the same for all platforms. As can be seen, Octave times are extremely high. MATLAB is outperformed by NnmfPack's CPU implementation (CPUx16), even though MATLAB uses the intrinsic MKL parallelism and the computational cost per iteration of mult (MLSA) is lower than that of \(\beta \)Div (see Sect. 2.1).

Table 2 MATLAB/Octave and NnmfPack execution time for \(\beta \)Div algorithm (seconds)

Next we show some accuracy results for the implementations. The tests used a greyscale picture of \(1{,}536\times 2{,}304\) pixels as the problem matrix \(A\). The matrix was NNMF-factored using MATLAB and NnmfPack and reconstructed afterwards. Table 3 shows the factorization error computed as

$$\begin{aligned} err_F = \frac{\Vert A-WH\Vert _{F}}{\sqrt{mn}}, \; err_{\beta } = \frac{\sqrt{2D_{\beta }(A|WH)}}{\sqrt{mn}}. \end{aligned}$$
(21)

These two expressions measure the error in terms of the Frobenius norm (\(err_F\)) of \(A-WH\) and in terms of the \(\beta \)Div (\(err_{\beta }\)) between \(A\) and \(WH\), respectively.
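As an illustrative sketch (independent of NnmfPack's actual routines), the two error measures of (21) can be computed in NumPy as follows, with \(d_{\beta }\) taken from definition (7); the sketch assumes strictly positive entries when \(\beta =0\) or \(\beta =1\):

```python
import numpy as np

def d_beta(x, y, beta):
    """Elementwise beta-divergence, definition (7)."""
    if beta == 0:
        return x / y - np.log(x / y) - 1
    if beta == 1:
        return x * (np.log(x) - np.log(y)) + (y - x)
    return (x ** beta + (beta - 1) * y ** beta
            - beta * x * y ** (beta - 1)) / (beta * (beta - 1))

def reconstruction_errors(A, W, H, beta):
    """err_F and err_beta as defined in (21)."""
    m, n = A.shape
    WH = W @ H
    err_F = np.linalg.norm(A - WH, "fro") / np.sqrt(m * n)
    err_beta = np.sqrt(2 * np.sum(d_beta(A, WH, beta))) / np.sqrt(m * n)
    return err_F, err_beta
```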

Table 3 reports these errors for MATLAB's mult NNMF function and for NnmfPack using the \(\beta \)Div with \(\beta =0, 1, 1.5\) and \(2\). MATLAB's mult NNMF is adjusted in the same way as for the previous comparison (Table 2) and always executes 100 iterations. \(\beta \)Div is executed with several iteration counts (\(iter = 50, 100, 150\) and \(200\)) and several values of the factorization inner dimension, \(k = 154, 307\) and \(768\), where \(k=\mathrm{round} (\min (m,n)/d)\), with \(d=10, 5\) and \(2\).

For the sake of completeness we offer results both for the Frobenius norm error (\(err_F\)) and for the \(\beta \)Div error (\(err_{\beta }\)). Obviously, measuring the error in terms of the \(\beta \)Div gives better results than in terms of the Frobenius norm, since the algorithm used specifically minimizes the \(\beta \)Div between \(A\) and \(WH\).

The results show that the error in NnmfPack is lower than in the MATLAB version when the number of iterations exceeds a certain threshold, independently of the inner dimension \(k\). The difference in the error values between MATLAB's mult NNMF and any NnmfPack \(\beta \)-divergence variant grows as the number of iterations increases, as expected.

Table 3 Error in MATLAB’s NNMF and NnmfPack by using the Frobenius norm error (\(err_F\)) and the \(\beta \)Div error (\(err_{\beta }\))

Figure 2 shows the original picture (Fig. 2a) and the reconstructions from the factorization \(A_{\text {reconstructed}}=WH\approx A_{\text {original}}\) using MATLAB (Fig. 2b) and NnmfPack (Fig. 2c–h); a subjectively better reconstruction is observed for NnmfPack when \(\beta \) is lower.

Fig. 2 Original picture and reconstructions using MATLAB's mult NNMF and NnmfPack

The results for \(\beta =2\) using \(\beta \)Div with 100 iterations and MATLAB's mult NNMF (100 iterations) are subjectively and objectively similar, as shown in Fig. 2b, f and Table 3. This must be so because \(\beta \)Div with \(\beta =2\) is the same algorithm as NnmfPack's MLSA and as MATLAB's mult NNMF with the baseline adjustment made. Potential perceived differences come from the way MATLAB and NnmfPack initialize the matrices \(W\) and \(H\). When the number of iterations doubles (\(iter=200\)), a higher quality in the reconstruction is noticeable, with better results for \(\beta =0\) (Fig. 2g) than for the other extreme, \(\beta =2\) (Fig. 2h).

Note that the optimal value of \(\beta \) for a concrete problem, represented by the matrix \(A\), is problem-dependent. For the image reconstruction case used in this paper, the \(\beta \)Div approach with \(\beta =0\) provides the best results since, as can be seen in Table 3, the errors corresponding to \(\beta =0\) are lower than those corresponding to other values of \(\beta \). This may be different for other types of problems represented by different types of matrices, or even for random matrices.

4 Conclusions

We have presented an improved numerical library, NnmfPack, that provides efficient algorithms to compute the NNMF. More specifically, different metrics (\(\beta \)-divergence) to assess the quality of the approximation of matrices have been incorporated into NnmfPack.

The design of the algorithms presented in this work has been made according to the specifications of the NnmfPack library, that is, guided by an efficiency target on current parallel computers. The available features make the library an attractive alternative for computing the NNMF on current multi-core and many-core architectures, providing some interesting performance figures from a computational point of view.

Although we have presented a simple image reconstruction application as an accuracy example, the library should be considered a generic tool which can be used in any field where the NNMF decomposition is required.

It is also worth noting that this work will lead to future versions extending the current features, such as more efficient algorithms and the treatment of sparse or structured matrices.