1 Introduction

Even a cursory reading of the literature confirms that high-dimensional models appear increasingly often in data mining, in the current age of social networks, bioinformatics, digital communications, and quantum computing. This fact underlines the need to diversify the strategies for overcoming the difficulties that arise when working with complex, massive data sets.

A well-known approach to handling high-dimensional models centers on compact representations of the input data sets [16]. In this regard, data reduction principally targets decreasing the size of the data sets while preserving the important information, sometimes through data encoding procedures [20]. When the data sets are given in matrix form, classic tools of linear algebra such as nonnegative matrix factorization (NMF) can be particularly helpful [10, 17]. A wide range of real-world data sets are inherently nonnegative, so the generation of negative entries should be ruled out while managing and processing such data. Nowadays, NMF is routinely applied in practical studies such as pattern recognition [11], recommendation systems [21], and face detection [29].

In a common framework, NMF techniques take a matrix with nonnegative entries as input and deliver two lower-dimensional matrices with nonnegative entries as output [16], such that the product of the output matrices accurately approximates the input matrix. Keeping the intermediate approximations of the factorization elements well-conditioned can markedly enhance computational stability [30] and, as a result, lead to more appropriate output matrices.

Researchers have also recently pushed to devise memoryless versions of classic algorithms as another way to handle high-dimensional optimization models. To contrive a memoryless technique for a general minimization model, one should exploit the differential features of the cost function as well as the constraints. Meanwhile, the algorithmic steps should be simple to perform, neither time-consuming nor labor-intensive, while keeping the accuracy at an acceptable level and ensuring the convergence of the solution trajectory. These features can be found together in the conjugate gradient (CG) algorithms, which have traditionally been formulated in vector form [28]. In particular, the Dai–Liao (DL) method is nowadays regarded as an efficient CG algorithm because it flexibly incorporates both conjugacy and quasi–Newton aspects in general circumstances [8, 13].

Here, we address model revisions as well as algorithmic modifications of some classic strategies for managing large-scale data sets. The study is organized as follows. In Section 2, we deal with a revised form of the classic measure function proposed by Dennis and Wolkowicz [14], to be embedded in the optimization model of the NMF problem by penalizing ill-conditioned intermediate approximations of the factorization elements. In Section 3, we derive adaptive formulas for the DL parameter as the solutions of a least-squares model formulated in terms of the ellipsoid vector norm [28]. In Section 4, we carry out numerical tests on the CUTEr problems [18] as well as on a set of randomly generated NMF cases to demonstrate the value of our theoretical developments. Finally, we summarize our conclusions in Section 5.

2 A revised model for the nonnegative matrix factorization problem

Dimensionality reduction methodologies are naturally understood as influential approaches for analyzing large data sets. High-dimensional data analysis has become an integral part of the digital era due to recent developments in sensor technology. As mentioned in Section 1, NMF is one such technique that has caught researchers’ imagination thanks to its interpretability, simplicity, flexibility, and generality [11, 21, 24, 27, 29].

NMF owes its popularity to its ability to extract hidden, important features from data: the data matrix is approximated by the product of two matrices, usually much smaller than the original data matrix. All the input and output matrices of NMF should be componentwise nonnegative. Mathematically speaking, for a given componentwise nonnegative matrix \(\textbf{A}\in \mathbb {R}^{m\times n}\) (or \(\textbf{A}\ge 0\) for short) and a positive integer \(r\ll \min \{m,n\}\), NMF entails finding componentwise nonnegative matrices \(\textbf{W}\in \mathbb {R}^{m\times r}\) and \(\textbf{Z}\in \mathbb {R}^{r\times n}\) (or \(\textbf{W}\ge 0\) and \(\textbf{Z}\ge 0\) for short) by solving the following minimization problem:

$$\begin{aligned} \min _{\textbf{W}\ge 0,\ \textbf{Z}\ge 0} \mathfrak {F}(\textbf{W},\textbf{Z})=\dfrac{1}{2}\Vert \textbf{A}-\textbf{W}\textbf{Z}\Vert _F^{2}, \end{aligned}$$
(2.1)

where \(\Vert .\Vert _F\) stands for the Frobenius norm. As an efficient approach to addressing (2.1), the alternating nonnegative least-squares (ANLS) technique alternately solves the following two subproblems [22]:

$$\begin{aligned} {\textbf{Z}}^{k+1}=\arg \displaystyle \min _{\textbf{Z}\ge 0}\mathfrak {F}(\textbf{W}^{k},\textbf{Z}), \end{aligned}$$
(2.2)
$$\begin{aligned} {\textbf{W}}^{k+1}=\arg \displaystyle \min _{\textbf{W}\ge 0}\mathfrak {F}(\textbf{W},\textbf{Z}^{k+1}), \end{aligned}$$
(2.3)

for all \(k\in \mathbb {Z}^+=\mathbb {N}\bigcup \{0\}=\{0,1,2,\dots \}\).
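
To make the alternating scheme concrete, the following minimal Python sketch (an illustrative assumption on our part, not the solver used in the experiments of Section 4) solves the subproblems (2.2)–(2.3) exactly, column by column, with SciPy's nonnegative least-squares routine:

```python
# Minimal ANLS sketch: each subproblem (2.2)/(2.3) separates over the
# columns of Z (resp. the rows of W) and is solved with scipy.optimize.nnls.
import numpy as np
from scipy.optimize import nnls

def anls(A, r, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, r))
    Z = rng.random((r, n))
    for _ in range(iters):
        # Z-update (2.2): min_{Z >= 0} ||A - W Z||_F^2, column by column
        Z = np.column_stack([nnls(W, A[:, j])[0] for j in range(n)])
        # W-update (2.3): min_{W >= 0} ||A - W Z||_F^2, row by row via A^T ~ Z^T W^T
        W = np.vstack([nnls(Z.T, A[i, :])[0] for i in range(m)])
    return W, Z
```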

In computational and analytical studies of matrix spaces, a great deal of attention is devoted to the matrix condition number, an influential factor that is directly connected to the collinearity between the rows or the columns of a matrix [30]. Experimental evidence in the literature shows that ill-conditioning may significantly derail the solution process and yield misleading results. So, it is a classic matter of routine to devise a plan for controlling the condition number of the matrices that are iteratively generated in an algorithmic procedure.

A glance at the NMF literature reveals little analytical or structural attention to the well-conditioning of the NMF outputs. It should be noted that various modified NMF models mainly target the orthogonality or symmetry of the decomposition elements [17], which is helpful in special data mining applications such as sparse recovery and clustering. Such extensions of the classic NMF model have been devised by imposing extra constraints to push the solution path toward the desired outputs. As a result, the solution process of these models can be challenging and, at times, computationally heavy.

To illustrate the effect of ill-conditioning on the NMF model, we report the outputs of the MATLAB function ‘nnmf’ on the well-known Hilbert matrix. Defined by

$$\begin{aligned} \mathcal {H}_{ij} = \dfrac{1}{i+j-1},\ i,j=1,2,...,n, \end{aligned}$$

the Hilbert matrix \(\mathcal {H}\in \mathbb {R}^{n\times n}\) is a classic example of an ill-conditioned matrix; it is also (symmetric) positive definite. Setting \(n=20\) and \(r=6\), and examining the NMF outputs on \(\mathcal {H}\) obtained from 10000 different runs of the MATLAB function ‘nnmf’, we observed that in more than 46% of the runs, at least three columns of \(\textbf{W}\) (and rows of \(\textbf{Z}\)) were equal to zero. This means that in more than 46% of the runs the outputs for \(r=4,5,6\) were effectively the same. In such situations, NMF cannot serve as a reliable tool in a recommender system, for which filling the zero entries (empty positions) is of great importance. Moreover, in at least 34% of the runs the relative error exceeded one. These observations motivated us to deal with collinearity in the NMF model.
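
The experiment can be reproduced in spirit with the following sketch, which uses scikit-learn's NMF as a stand-in for MATLAB's ‘nnmf’ (an assumption: the two implementations differ, so the exact percentages need not match):

```python
# Counting degenerate factorizations of the 20x20 Hilbert matrix with r = 6.
import numpy as np
from scipy.linalg import hilbert
from sklearn.decomposition import NMF

n, r, trials = 20, 6, 100          # 100 runs here instead of the 10000 in the text
H = hilbert(n)
degenerate = 0
for seed in range(trials):
    model = NMF(n_components=r, init='random', random_state=seed, max_iter=500)
    W = model.fit_transform(H)
    Z = model.components_
    # a run is 'degenerate' if at least three columns of W (and rows of Z) vanish
    if min(np.sum(np.all(W == 0, axis=0)), np.sum(np.all(Z == 0, axis=1))) >= 3:
        degenerate += 1
print(f"degenerate runs: {degenerate}/{trials}")
```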

Combating the collinearity between the columns of \(\textbf{W}\) or the rows of \(\textbf{Z}\), with the aim of taking a computational-stability perspective on the NMF model, prompted us to plug the condition numbers of the \(r\times r\) matrices \(\mathcal {W}=\textbf{W}^T\textbf{W}\) and \(\mathcal {Z}=\textbf{Z}\textbf{Z}^T\) into the model (2.1). Note that sufficient (numerical) linear independence between the columns of \(\textbf{W}\) or the rows of \(\textbf{Z}\) makes the matrices \(\mathcal {W}\) and \(\mathcal {Z}\) acceptably well-conditioned and positive definite, whereas collinearity pushes \(\mathcal {W}\) and \(\mathcal {Z}\) toward ill-conditioning and positive semidefiniteness. So, to guard against such troubling issues, the following revised version of the NMF model (2.1) can be proffered:

$$\begin{aligned} \mathfrak {\hat{F}}(\textbf{W},\textbf{Z})= & \dfrac{1}{2}\Vert \textbf{A}-\textbf{W}\textbf{Z}\Vert _F^{2}+ \lambda _1\kappa (\textbf{W}^T\textbf{W})+\lambda _2\kappa (\textbf{Z}\textbf{Z}^T)\\ \nonumber= & \dfrac{1}{2}\Vert \textbf{A}-\textbf{W}\textbf{Z}\Vert _F^{2}+ \lambda _1\kappa (\mathcal {W})+\lambda _2\kappa (\mathcal {Z}) \\ \nonumber= & \dfrac{1}{2}\Vert \textbf{A}-\textbf{W}\textbf{Z}\Vert _F^{2}+ \lambda _1\dfrac{\text {maxmag}(\mathcal {W})}{\text {minmag}(\mathcal {W})}+ \lambda _2\dfrac{\text {maxmag}(\mathcal {Z})}{\text {minmag}(\mathcal {Z})}, \end{aligned}$$
(2.4)

where \(\lambda _1\ge 0\) and \(\lambda _2\ge 0\) are the penalty parameters [25, 26] and the maximum magnification (maxmag) and the minimum magnification (minmag) by an arbitrary matrix \(P\in \mathbb {R}^{m\times n}\) are respectively defined in Watkins [30] as

$$\begin{aligned} \text {maxmag}(P)=\max _{x\ne 0}\dfrac{\Vert Px\Vert }{\Vert x\Vert }, \text{ and } {\text {minmag}}(P)=\min _{x\ne 0}\dfrac{\Vert Px\Vert }{\Vert x\Vert }. \end{aligned}$$
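
Concretely, \(\text {maxmag}(P)\) and \(\text {minmag}(P)\) are the largest and smallest singular values of \(P\), so the ratios appearing in (2.4) are spectral condition numbers; a small NumPy check (purely illustrative):

```python
# maxmag/minmag of a matrix coincide with its extreme singular values.
import numpy as np

P = np.random.default_rng(1).random((5, 3))
svals = np.linalg.svd(P, compute_uv=False)
maxmag, minmag = svals.max(), svals.min()
# For full-column-rank P, their ratio is the spectral condition number kappa(P).
assert np.isclose(maxmag / minmag, np.linalg.cond(P))
```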

As seen, ill-conditioned choices for \(\textbf{W}\) and \(\textbf{Z}\) impose a meaningful penalty on the model. Note, however, that although it seldom occurs in practice, \(\mathfrak {\hat{F}}(\textbf{W},\textbf{Z})\) is not well-defined when \(\textbf{W}\) or \(\textbf{Z}\) is rank deficient.

In the model (2.4), well-conditioning is promoted by embedding penalty terms directly into the cost function. Since this keeps the solution process free of the complications that an extra set of constraints would introduce, finding approximate solutions of the model may be less challenging. However, we should not overlook the computational burden of the spectral condition number in the model, especially in large-scale cases. Carrying out calculations with high-dimensional dense matrices increases CPU time and may amplify numerical errors as well. For this reason, developing sparse approximations of such matrices has recently attracted special attention in data analysis [25, 26].

Fundamental sparse structures for symmetric matrices include diagonal and (banded) symmetric tridiagonal matrices [7], as well as symmetric rank-one or rank-two updates of the (scaled) identity matrix [28]. In essence, a cost-benefit analysis should guide the choice of a sparse structure suitable for the application at hand. Since the spectral condition number in the augmented model (2.4) is directly linked to the eigenvalues of the matrix, and since computing with \(\mathcal {W}\) and \(\mathcal {Z}\) can be time-consuming, it is preferable to use diagonal approximations of \(\mathcal {W}\) and \(\mathcal {Z}\) in the model (2.4), given by

$$\begin{aligned} & \mathcal {W}\approx \hat{\textbf{D}}=\text {diag}(\hat{\textbf{D}}_1^*,\hat{\textbf{D}}_2^*,...,\hat{\textbf{D}}_r^*),\\ & \mathcal {Z}\approx \check{\textbf{D}}=\text {diag}(\check{\textbf{D}}_1^*,\check{\textbf{D}}_2^*,...,\check{\textbf{D}}_r^*), \end{aligned}$$

where

$$\begin{aligned} \hat{\textbf{D}}^*_j =\sum _{i=1}^m\textbf{W}_{ij}^2,\ j=1,2,...,r, \text{ and } \check{\textbf{D}}^*_i =\sum _{j=1}^n\textbf{Z}_{ij}^2,\ i=1,2,...,r. \end{aligned}$$

Notably, the above diagonal estimations are derived from

$$\begin{aligned} \hat{\textbf{D}}^*=\arg \displaystyle \min _{\mathcal {D}\in \textbf{D}^+}\Vert \mathcal {W}-\mathcal {D}\Vert _F^2, \text{ and } \check{\textbf{D}}^*=\arg \displaystyle \min _{\mathcal {D}\in \textbf{D}^+}\Vert \mathcal {Z}-\mathcal {D}\Vert _F^2, \end{aligned}$$

where \(\textbf{D}^+\) denotes the collection of all diagonal matrices in \(\mathbb {R}^{r\times r}\) with nonnegative elements.
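
In other words, the Frobenius-closest nonnegative diagonal approximation of \(\mathcal {W}\) (resp. \(\mathcal {Z}\)) simply keeps its diagonal, i.e., the squared column norms of \(\textbf{W}\) (resp. the squared row norms of \(\textbf{Z}\)). A brief NumPy sketch (illustrative only; the dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((100, 6))            # hypothetical factor, m = 100, r = 6
Z = rng.random((6, 200))            # hypothetical factor, r = 6, n = 200

D_hat = np.sum(W**2, axis=0)        # \hat{D}^*_j = sum_i W_ij^2
D_check = np.sum(Z**2, axis=1)      # \check{D}^*_i = sum_j Z_ij^2

# These vectors are exactly the diagonals of W^T W and Z Z^T.
assert np.allclose(D_hat, np.diag(W.T @ W))
assert np.allclose(D_check, np.diag(Z @ Z.T))
```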

As known, measure functions provide helpful tools to evaluate and analyze the well-conditioning of square matrices, often by targeting the distribution of the matrix eigenvalues [25]. Among them, in a fundamental study on the scaling and sizing of quasi–Newton updates, Dennis and Wolkowicz [14] proposed the following measure function:

$$\begin{aligned} \psi (\textbf{A})=\dfrac{\text {tr}(\textbf{A})}{r\root r \of {\text {det}(\textbf{A})}}, \end{aligned}$$
(2.5)

for an arbitrary positive definite matrix \(\textbf{A}\in \mathbb {R}^{r\times r}\). As a gauge of well-conditioning, \(\psi (\textbf{A})\) takes all the eigenvalues of \(\textbf{A}\) into account, rather than only the extreme eigenvalues, as the spectral condition number does [30]. So, by employing \(\psi (.)\) instead of \(\kappa (.)\) in (2.4), it is more likely that NMF elements with well-distributed eigenvalues will be obtained. However, the determinant in the denominator of (2.5) makes the function costly to handle.

Mathematical inequalities have long been employed to turn dense or complicated formulas into manageable ones, with the primary aim of raising the interpretability of the targeted formula or model. Here, for the sake of a well-planned simplicity, a crucial issue in high-dimensional data analysis, we make use of the first part of the mean inequality chain, which relates the harmonic, geometric, arithmetic, and quadratic means. To proceed, first note that \(\text {det}(\textbf{A})=\displaystyle \prod _{i=1}^{r}\zeta _i\), in which \(\{\zeta _i\}_{i=1}^r\) is the set of the eigenvalues of \(\textbf{A}\). Therefore, bearing in mind the relation between the geometric and the harmonic means, here in the sense of

$$\begin{aligned} \dfrac{r}{\displaystyle \sum _{i=1}^{r}\dfrac{1}{\zeta _i}}\le \root r \of {\displaystyle \prod _{i=1}^{r}\zeta _i}, \end{aligned}$$

and noting that the trace of a (square) matrix is equal to the sum of its eigenvalues, the following simple bound for \(\psi (\textbf{A})\) can be obtained:

$$\begin{aligned} \psi (\textbf{A})\le \varphi (\textbf{A})=\dfrac{1}{r^2}\text {tr}(\textbf{A}) \text {tr}(\textbf{A}^{-1}). \end{aligned}$$

This provides compelling motivation to employ \(\varphi (.)\) instead of \(\kappa (.)\) in (2.4), in the hope of obtaining NMF elements with well-distributed eigenvalues. So, the modified model is given by

$$\begin{aligned} \mathfrak {\breve{F}}(\textbf{W},\textbf{Z})= & \dfrac{1}{2}\Vert \textbf{A}-\textbf{W}\textbf{Z}\Vert _F^{2} +\lambda _1\varphi (\hat{\textbf{D}})+\lambda _2\varphi (\check{\textbf{D}}) \\= & \dfrac{1}{2}\Vert \textbf{A}-\textbf{W}\textbf{Z}\Vert _F^{2} +\lambda _1\dfrac{1}{r^2}\text {tr}(\hat{\textbf{D}}) \text {tr}(\hat{\textbf{D}}^{-1})+\lambda _2\dfrac{1}{r^2}\text {tr}(\check{\textbf{D}}) \text {tr}(\check{\textbf{D}}^{-1})\nonumber \\= & \dfrac{1}{2}\Vert \textbf{A}-\textbf{W}\textbf{Z}\Vert _F^{2} +\dfrac{\lambda _1}{r^2}\left( \sum _{j=1}^r\sum _{i=1}^m\textbf{W}_{ij}^2\right) \left( \sum _{j=1}^r\dfrac{1}{\sum _{i=1}^m\textbf{W}_{ij}^2}\right) \nonumber \\ & + \dfrac{\lambda _2}{r^2}\left( \sum _{i=1}^r\sum _{j=1}^n\textbf{Z}_{ij}^2\right) \left( \sum _{i=1}^r\dfrac{1}{\sum _{j=1}^n\textbf{Z}_{ij}^2}\right) .\nonumber \end{aligned}$$
(2.6)

Inherited from the measure function (2.5), the penalty terms of the model (2.6) control the condition number by engaging all the diagonal elements of the relevant matrices, not only the extreme ones. Also, the emerging polynomial terms make the model easier to handle when determining the gradient of the cost function.
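
As a quick numerical sanity check (illustrative only, not part of the experiments of Section 4), the measure (2.5), the bound \(\varphi \), and the inequality \(\psi (\textbf{A})\le \varphi (\textbf{A})\) can be evaluated as follows:

```python
import numpy as np

def psi(A):
    """Dennis-Wolkowicz measure (2.5), for symmetric positive definite A."""
    r = A.shape[0]
    return np.trace(A) / (r * np.linalg.det(A) ** (1.0 / r))

def phi(A):
    """Upper bound on psi used in the penalty terms of (2.6)."""
    r = A.shape[0]
    return np.trace(A) * np.trace(np.linalg.inv(A)) / r**2

M = np.random.default_rng(2).random((6, 6))
A = M @ M.T + 6 * np.eye(6)         # a synthetic positive definite test matrix
assert psi(A) <= phi(A) + 1e-12
```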

The major defect of the cost function of the model (2.6) is that it is not differentiable everywhere, due to the extra penalty terms. In particular, if \(\textbf{W}\) has a zero column or \(\textbf{Z}\) has a zero row, then \(\mathfrak {\breve{F}}\) in (2.6) is not well-defined. Moreover, columns of \(\textbf{W}\) or rows of \(\textbf{Z}\) with small magnitudes are computationally troublesome. Backed by these arguments and in favor of simplicity, our revised ANLS (RANLS) method is founded upon the following modified version of the model (2.6):

$$\begin{aligned} \mathfrak {\tilde{F}}(\textbf{W},\textbf{Z})= & \dfrac{1}{2}\Vert \textbf{A}-\textbf{W}\textbf{Z}\Vert _F^{2} +\dfrac{\lambda _1}{r^2}\left( \sum _{j=1}^r\sum _{i=1}^m\textbf{W}_{ij}^2\right) \left( \sum _{j=1}^r\dfrac{1}{\gamma +\sum _{i=1}^m\textbf{W}_{ij}^2}\right) \\ & + \dfrac{\lambda _2}{r^2}\left( \sum _{i=1}^r\sum _{j=1}^n\textbf{Z}_{ij}^2\right) \left( \sum _{i=1}^r\dfrac{1}{\gamma +\sum _{j=1}^n\textbf{Z}_{ij}^2}\right) ,\nonumber \end{aligned}$$
(2.7)

with some constant \(\gamma >0\). As seen, \(\mathfrak {\tilde{F}}\) is well-defined and differentiable everywhere. Thus, the following revised versions of the least-squares models (2.2) and (2.3) should be solved alternately:

$$\begin{aligned} {\textbf{Z}}^{k+1}= & \arg \displaystyle \min _{\textbf{Z}\ge 0}\mathfrak {\tilde{F}}(\textbf{W}^{k},\textbf{Z}),\end{aligned}$$
(2.8)
$$\begin{aligned} {\textbf{W}}^{k+1}= & \arg \displaystyle \min _{\textbf{W}\ge 0}\mathfrak {\tilde{F}}(\textbf{W},\textbf{Z}^{k+1}),\end{aligned}$$
(2.9)

for all \(k\in \mathbb {Z}^+\).
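
For reference, a direct evaluation of the smoothed objective (2.7) may be sketched as follows (the default parameter values are placeholders; \(\lambda _1=\lambda _2=1\) and \(\gamma =10^{-10}\) are the settings later used in Section 4):

```python
import numpy as np

def f_tilde(A, W, Z, lam1=1.0, lam2=1.0, gamma=1e-10):
    """Smoothed RANLS objective (2.7)."""
    r = W.shape[1]
    fit = 0.5 * np.linalg.norm(A - W @ Z, 'fro') ** 2
    w_col = np.sum(W**2, axis=0)        # squared column norms of W
    z_row = np.sum(Z**2, axis=1)        # squared row norms of Z
    pen_w = (lam1 / r**2) * w_col.sum() * np.sum(1.0 / (gamma + w_col))
    pen_z = (lam2 / r**2) * z_row.sum() * np.sum(1.0 / (gamma + z_row))
    return fit + pen_w + pen_z
```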

3 Adaptive optimal choices for the Dai–Liao parameter based on the ellipsoid norm

Among the fundamental techniques for solving the unconstrained minimization problem \(\displaystyle \min _{x\in \mathbb {R}^n} f(x)\), the CG methods are iteratively defined by

$$\begin{aligned} x_{k+1} = x_k + \alpha _k d_k,\ d_{k+1}=-g_{k+1}+\beta _kd_k,\ \forall k\in \mathbb {Z}^+, \end{aligned}$$
(3.1)

starting from some \(x_0\in \mathbb {R}^n\) and \(d_0=-g_0\), in which \(g_k=\nabla f(x_k)\) and \(\beta _k\in \mathbb {R}\) is the CG parameter [3]. Also, the scalar \(\alpha _k>0\), called the step length, is customarily determined as the output of an approximate line search, typically so as to satisfy the (strong) Wolfe conditions [28]. Here, we assume that the cost function f is smooth and its gradient is analytically available. Also, \(\Vert .\Vert \) signifies the \(\ell _2\) (Euclidean) norm, and our analysis is carried out under the Wolfe conditions, which guarantee \(s_k^Ty_k>0\), where \(s_k=x_{k+1}-x_k=\alpha _k d_k\).

Early in the current century, the influential study of Dai and Liao [13] drew considerable attention to CG techniques along various lines [4]. Recently, Babaie–Kafaki [8] conducted an expository review of the DL method to provide a better understanding of its capabilities from several standpoints. In its original form, the DL method sets \(\beta _k\) to

$$\begin{aligned} \beta _k^{\text {DL}} = \dfrac{g_{k+1}^Ty_k}{d_k^Ty_k}-t\dfrac{g_{k+1}^Ts_k}{d_k^Ty_k}, \end{aligned}$$
(3.2)

with \(y_k=g_{k+1}-g_k\), where the scalar \(t>0\) is called the DL parameter. It is worth noting that if

$$\begin{aligned} t\ge t_k^{(\eta )} :=\eta \dfrac{\Vert y_k\Vert ^2}{s_k^Ty_k}, \text{ with } \text{ the } \text{ constant } \eta >\dfrac{1}{4}, \end{aligned}$$
(3.3)

then the DL directions satisfy the sufficient descent condition, an important ingredient of the convergence analysis [5].
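
For illustration, one DL search-direction update (3.1)–(3.2), with the DL parameter safeguarded according to (3.3), might be coded as follows (a sketch under the assumption that the gradients, the previous direction, and the step come from the surrounding iteration):

```python
import numpy as np

def dl_direction(g_new, g_old, d, s, t, eta=0.26):
    """One Dai-Liao direction update (3.2), enforcing the safeguard (3.3)."""
    y = g_new - g_old                       # gradient difference
    t = max(t, eta * (y @ y) / (s @ y))     # t >= eta*||y||^2/(s^T y), eta > 1/4
    beta = (g_new @ y - t * (g_new @ s)) / (d @ y)
    return -g_new + beta * d
```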

Among the analytical attempts to find an appropriate formula for t, a classic open problem [4, 8], Babaie–Kafaki and Ghanbari [9] proposed the least-squares model

$$\begin{aligned} \min _{t>0}{\Vert ts_k-y_k\Vert }^{2}, \end{aligned}$$
(3.4)

obtained by pushing the DL direction toward the direction of the three-term CG method proposed by Zhang, Zhou and Li (ZZL) [31]. As known, the ZZL directions satisfy a strong form of the sufficient descent condition. Moreover, they include the gradient difference vector \(y_k\) as an element of the search direction, in a linear combination together with the vectors \(g_{k+1}\) and \(d_k\), whereas the DL directions are linear combinations of \(g_{k+1}\) and \(d_k\) only. As a result, Babaie–Kafaki and Ghanbari [9] obtained the following formula for t:

$$\begin{aligned} t:=t_k^{\text {ZZL}}=\dfrac{s_k^Ty_k}{\Vert s_k\Vert ^2}. \end{aligned}$$
(3.5)

Here, we make use of the ellipsoid vector norm to diversify the adaptive choices for the DL parameter. As an extension of the \(\ell _2\) norm, defined by

$$\begin{aligned} \Vert x\Vert _\mathcal {M}=\sqrt{x^T\mathcal {M}x}, \end{aligned}$$

where \(\mathcal {M}\in \mathbb {R}^{n\times n}\) is a (symmetric) positive definite matrix, the ellipsoid norm has been pivotally employed to analyze the convergence of the steepest descent and quasi–Newton methods and, in particular, to devise scaled trust region algorithms [28]. In our strategy, we set several choices for \(\mathcal {M}\) using quasi–Newton updating formulas [28].

Quasi–Newton methods have traditionally been devised to estimate the (inverse) Hessian in order to determine the search direction in iterative continuous optimization techniques. The matrix approximations, which are mostly positive definite, classically satisfy the (standard) secant equation \(\textbf{B}_{k+1}s_k=y_k\), where \(\textbf{B}_{k+1}\approx \nabla ^2f(x_{k+1})\) [3, 28]. These methods are flexible enough to effectively address large-scale models. To this end, the memoryless versions of the well-known BFGS and DFP quasi–Newton updating formulas can be applied [6]; that is,

$$\begin{aligned}\textbf{H}_{k+1}^{\text {MLBFGS}}=\textbf{I}-\dfrac{y_ks_k^T+s_ky_k^T}{s_k^Ty_k}+\left( 1+\dfrac{y_k^Ty_k}{s_k^Ty_k}\right) \dfrac{s_ks_k^T}{s_k^Ty_k}, \end{aligned}$$

and

$$\begin{aligned}\textbf{H}^{\text {MLDFP}}_{k+1}=\textbf{I}+\dfrac{s_ks_k^T}{s_k^Ty_k}-\dfrac{y_ky_k^T}{y_k^Ty_k}, \end{aligned}$$

both being positive definite approximations of \(\nabla ^2f(x_{k+1})^{-1}\) for all \(k\in \mathbb {Z}^+\). Here, MLBFGS and MLDFP are shorthand for the ‘memoryless BFGS’ and the ‘memoryless DFP’, respectively.

Existing algorithms are continually reformed to answer the need for diversity of approaches, and there is ample evidence in the literature that such efforts have evolved algorithmic schemes over time; hybrid CG algorithms are no exception. With this in mind, here we consider the following ellipsoid extension of the least-squares model (3.4):

$$\begin{aligned}\min _{t>0}{\Vert ts_k-y_k\Vert }^{2}_\mathcal {M}, \end{aligned}$$

which yields

$$\begin{aligned}t:=t_k^{\mathcal {M}}=\dfrac{s_k^T\mathcal {M}y_k}{s_k^T\mathcal {M}s_k}. \end{aligned}$$

So, \(t_k^{\text {ZZL}}\) given by (3.5) is recovered by setting \(\mathcal {M}\) equal to the identity matrix. Also, if we let \(\mathcal {M}=\textbf{B}_{k+1}\) be a quasi–Newton update for the Hessian, then the standard secant equation gives \(s_k^T\textbf{B}_{k+1}y_k=y_k^Ty_k\) and \(s_k^T\textbf{B}_{k+1}s_k=s_k^Ty_k\), and hence

$$\begin{aligned} t:=t_k^{\text {DK}}=\dfrac{\Vert y_k\Vert ^2}{s_k^Ty_k}, \end{aligned}$$
(3.6)

which is an effective formula already suggested by Dai and Kou (DK) [12]. This salient fact underlines the effectiveness of the extended least-squares model. The setting \(t=t_k^{\text {DK}}\) in the DL method satisfies (3.3), which directly leads to the sufficient descent property. Moreover, if we let \(\mathcal {M}=\textbf{H}_{k+1}^{\text {MLBFGS}}\) or \(\mathcal {M}=\textbf{H}_{k+1}^{\text {MLDFP}}\), then we respectively obtain

$$\begin{aligned} t:=t_k^{\text {MLBFGS}}= \left( {\dfrac{\Vert s_k\Vert ^2}{s_k^Ty_k}+\dfrac{\Vert y_k\Vert ^2\Vert s_k\Vert ^2}{(s_k^Ty_k)^2}-1}\right) ^{-1}, \end{aligned}$$
(3.7)

or

$$\begin{aligned} t:=t_k^{\text {MLDFP}}= \left( {\dfrac{\Vert s_k\Vert ^2}{s_k^Ty_k}-\dfrac{(s_k^Ty_k)^2}{\Vert y_k\Vert ^2\Vert s_k\Vert ^2}+1}\right) ^{-1}. \end{aligned}$$
(3.8)

From the Cauchy–Schwarz inequality, it can be seen that \(t_k^{\text {MLBFGS}}>0\) and \(t_k^{\text {MLDFP}}>0\). To gain the sufficient descent property in light of (3.3), here we propose the following restricted versions of (3.7) and (3.8):

$$\begin{aligned} t_k^{\text {MLBFGS}}\leftarrow \max \left\{ t_k^{\text {MLBFGS}},t_k^{(\eta )}\right\} , \end{aligned}$$
(3.9)

and

$$\begin{aligned} t_k^{\text {MLDFP}}\leftarrow \max \left\{ t_k^{\text {MLDFP}}, t_k^{(\eta )}\right\} . \end{aligned}$$
(3.10)

As a result, global convergence of the DL method with the given formulas for t can be proved following the analysis of [2, 13].
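
The closed forms (3.7) and (3.8) can be cross-checked against the defining ratio \(t_k^{\mathcal {M}}=s_k^T\mathcal {M}y_k/s_k^T\mathcal {M}s_k\) with \(\mathcal {M}\) the memoryless BFGS or DFP matrix; a small NumPy verification (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
s = rng.standard_normal(50)
y = rng.standard_normal(50)
if s @ y < 0:                   # the Wolfe conditions guarantee s^T y > 0
    y = -y

sy, ss, yy = s @ y, s @ s, y @ y
I = np.eye(50)
H_bfgs = I - (np.outer(y, s) + np.outer(s, y)) / sy \
         + (1 + yy / sy) * np.outer(s, s) / sy          # memoryless BFGS
H_dfp = I + np.outer(s, s) / sy - np.outer(y, y) / yy   # memoryless DFP

t_bfgs = (ss / sy + yy * ss / sy**2 - 1) ** -1           # formula (3.7)
t_dfp = (ss / sy - sy**2 / (yy * ss) + 1) ** -1          # formula (3.8)

assert np.isclose(t_bfgs, (s @ H_bfgs @ y) / (s @ H_bfgs @ s))
assert np.isclose(t_dfp, (s @ H_dfp @ y) / (s @ H_dfp @ s))
```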

4 Computational experiments

We offer here some computational evidence for our theoretical analyses, starting with numerical tests on 96 problems from the CUTEr library [18] with \(n\ge 50\). All tests were performed in MATLAB 7.14.0.739 (R2012a) on a CentOS 6.2 Linux server with an AMD FX–9800P RADEON R7 processor (12 compute cores 4C+8G, 2.70 GHz) and 8 GB of RAM. The effectiveness of the parametric choices (3.6), (3.9), (3.10) and the Hager–Zhang (HZ) formula [19], i.e.

$$\begin{aligned} t:=t_k^{\text {HZ}}=2\dfrac{\Vert y_k\Vert ^2}{s_k^Ty_k}, \end{aligned}$$
(4.1)

is appraised for the DL+ method with

$$\begin{aligned} \beta _k^{\text {DL+}}=\max \left\{ \dfrac{g_{k+1}^Ty_k}{d_k^Ty_k},0\right\} -t\dfrac{g_{k+1}^Ts_k}{d_k^Ty_k}, \end{aligned}$$
(4.2)

proposed for establishing convergence for general cost functions [13]. In our tests, DK+, DL–BFGS+, DL–DFP+ and HZ+ stand for the iterative method (3.1) with the CG parameter (4.2), in which t is computed by (3.6), (3.9), (3.10) and (4.1), respectively. Since in rare iterations the DL+ method may fail to generate a descent direction, a restart (along the negative gradient) has also been employed, as suggested by Dai and Liao [13].

For the algorithms, we used the approximate Wolfe conditions of Hager and Zhang [19] with similar settings, and adopted the stopping criteria \(k>10000\) or \(\Vert g_k\Vert <10^{-6}(1+|f_k|)\). Also, we set \(\eta =0.26\) in (3.9) and (3.10) to enhance the likelihood of employing (3.7) and (3.8). To visually assess the algorithmic results, we applied the Dolan–Moré performance profile [15], comparing the methods by the TNFGE and CPUT metrics, acronyms for the ‘total number of function and gradient evaluations’ (as outlined in Hager and Zhang [19]) and the ‘CPU time’, respectively. Figure 1 presents the results, which show that DL–BFGS+ and DL–DFP+ are generally preferable to DK+ and HZ+, especially with respect to TNFGE. With respect to CPUT, DK+ and HZ+ are at times competitive with DL–BFGS+ and DL–DFP+. This observation is mainly related to the structure of the formulas (3.7) and (3.8), which are somewhat more complex than (3.6) and (4.1). Also, since DL–DFP+ is slightly preferable to DL–BFGS+, we can conclude that the setting (3.8) for the DL+ method is more effective than the setting (3.7).

Fig. 1 Performance profile plots for DK+, DL–BFGS+, DL–DFP+ and HZ+ based on TNFGE (A) and CPUT (B)

To examine the validity of the revised NMF model, in this part of our computational experiments we investigate the efficiency of DL–DFP+ within ANLS, by solving the least-squares subproblems (2.2)–(2.3) of the minimization model (2.1), and within RANLS, by solving the least-squares subproblems (2.8)–(2.9) of the minimization model (2.7). To handle the nonnegativity constraints in the subproblems, we followed the suggestion of Li et al. [22] and employed a proximal scheme that sets the negative entries of the iterative outputs equal to zero. For RANLS, we set \(\lambda _1=\lambda _2=1\) and \(\gamma =10^{-10}\) in (2.7), and for both ANLS and RANLS we adopted the termination condition of Liu and Li [23], namely

$$\begin{aligned} & \Vert [\nabla _{\textbf{Z}} \mathcal {F}(\textbf{W}^{k},\textbf{Z}^{k}),\nabla _{\textbf{W}} \mathcal {F}(\textbf{W}^{k},\textbf{Z}^{k})]\Vert _F \\ & \hspace{3cm}\le \nu \Vert [\nabla _{\textbf{Z}} \mathcal {F}(\textbf{W}^{0},\textbf{Z}^{0}),\nabla _{\textbf{W}} \mathcal {F}(\textbf{W}^{0},\textbf{Z}^{0})]\Vert _F, \end{aligned}$$

with \(\mathcal {F}=\mathfrak {F}\) and \(\mathcal {F}=\tilde{\mathfrak {F}}\), respectively, and \(\nu =10^{-2}\). The test matrices of various dimensions, together with the initial estimates of the NMF elements, were generated randomly from the uniform distribution, as described in Ahookhosh et al. [1]. The outputs are reported in Table 1, including the spectral condition number (Cond) and the relative error (RelErr), calculated by

$$\begin{aligned} \text {RelErr}=\dfrac{\Vert \textbf{A}-\textbf{W}\textbf{Z}\Vert _F}{\Vert \textbf{A}\Vert _F}. \end{aligned}$$
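
The reported quantities can be computed as in the following sketch (an assumption on our part: the ‘Cond’ column is taken here to refer to \(\mathcal {W}\) and \(\mathcal {Z}\)):

```python
import numpy as np

def report(A, W, Z):
    """Relative error and spectral condition numbers of W^T W and Z Z^T (sketch)."""
    rel_err = np.linalg.norm(A - W @ Z, 'fro') / np.linalg.norm(A, 'fro')
    return rel_err, np.linalg.cond(W.T @ W), np.linalg.cond(Z @ Z.T)
```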

To recapitulate the results, RANLS and ANLS are roughly competitive with respect to accuracy, while from the condition number viewpoint, which is the main target of this study, RANLS is generally preferable to ANLS. Hence, the capability of delivering well-conditioned NMF elements with satisfactory accuracy can be considered a success of RANLS.

Table 1 The outputs of DL–DFP+ for NMF

5 Conclusions

We have mainly concentrated on modifying a classic optimization model of the nonnegative matrix factorization problem, which arises in a wide range of practical fields. The desire to avoid ill-conditioning in the decomposition results motivated us to revise the model by embedding a measure of the condition numbers of diagonalized forms of the output matrices. The well-conditioning (penalty) term was derived from the Dennis–Wolkowicz measure function [14]. Then, based on a least-squares model formulated in an ellipsoid norm, some optimal choices for the Dai–Liao parameter were suggested. Driven by the great need for algorithmic tools with low memory consumption, the ellipsoid norms were built on the memoryless BFGS and DFP formulas. The resulting parameter choices tend the Dai–Liao search direction toward that of a well-functioning three-term conjugate gradient algorithm. To examine the performance of the Dai–Liao method equipped with the given parametric settings, computational tests were performed on the CUTEr functions, and the findings were evaluated with the well-known Dolan–Moré benchmark. The results demonstrated the positive impact of our suggestions for the Dai–Liao parameter. Furthermore, the quality of the revised nonnegative matrix factorization model was assessed on several random cases; the results showed that the revised model can produce better-conditioned factorization elements with reasonable relative errors. Thus, in practical terms, the computational experiments support our theoretical assertions.