1 Introduction

In the last decade, estimating low-rank matrices has attracted increasing attention in the machine learning community owing to its successful applications in a wide range of fields including subspace clustering (Liu et al. 2010), collaborative filtering (Foygel et al. 2012) and robust dimensionality reduction (Candès et al. 2011), to name a few. Suppose that we are given an observed data matrix Z in \(\mathbb {R}^{p\times n}\), i.e., n observations in p ambient dimensions. We aim to learn a prediction matrix X with a low-rank structure so as to approximate the observation. This problem, together with its many variants, typically involves minimizing a weighted combination of the residual error and a penalty for the matrix rank.

Generally speaking, it is intractable to optimize a matrix rank (Recht et al. 2010). To tackle this challenge, researchers suggested alternative convex relaxations to the matrix rank. The two most widely used convex surrogates are the nuclear norm (Recht et al. 2010) and the max-norm (a.k.a. \(\gamma _2\)-norm) (Srebro et al. 2004). The nuclear norm is defined as the sum of the matrix singular values. Like the \(\ell _1\) norm in the vector case that induces sparsity, the nuclear norm was proposed as a rank minimization heuristic and can be formulated as a semi-definite programming (SDP) problem (Fazel et al. 2001). By combining the SDP formulation and the matrix factorization technique, Srebro et al. (2004) showed that the collaborative filtering problem can be effectively solved by optimizing a soft-margin based program. Another interesting work on the nuclear norm comes from the data compression community. In real-world applications, due to possible sensor failure and background clutter, the underlying data can easily be corrupted. In this case, estimates produced by Principal Component Analysis (PCA) may deviate far from the true subspace (Jolliffe 2005). To handle the (gross) corruption, in their seminal work, Candès et al. (2011) proposed a new formulation termed Robust PCA (RPCA), and proved that under mild conditions, solving a convex optimization problem consisting of a nuclear norm regularization and a weighted \(\ell _1\) norm penalty can exactly recover the low-rank component of the underlying data even if a constant fraction of the entries are arbitrarily corrupted.

The max-norm variant was developed as another convex relaxation to the rank function (Srebro et al. 2004), where Srebro et al. formulated the max-norm regularized problem as an SDP and empirically showed its superiority to the nuclear norm. The main theoretical study on the max-norm comes from Srebro and Shraibman (2005), who considered collaborative filtering as an example and proved that the max-norm scheme enjoys a lower generalization error than the nuclear norm. Following these theoretical foundations, Jalali and Srebro (2012) improved the error bound for the clustering problem. Another important contribution from Jalali and Srebro (2012) is that they partially characterized the subgradient of the max-norm, which is a hard mathematical entity that is not fully understood to date. However, since SDP solvers are not scalable, there is a large gap between the theoretical progress and the practical applicability of the max-norm. To bridge the gap, a number of follow-up works attempted to design efficient algorithms to solve max-norm regularized or constrained problems. For example, Rennie and Srebro (2005) devised a gradient-based optimization method and empirically showed promising results on large collaborative filtering datasets. Lee et al. (2010) presented large-scale optimization methods for max-norm constrained and max-norm regularized problems and showed convergence to a stationary point.

Nevertheless, algorithms presented in prior works (Srebro et al. 2004; Rennie and Srebro 2005; Lee et al. 2010; Orabona et al. 2012) require access to all the data when the objective function involves a max-norm regularization. In the large-scale setting, the applicability of such batch optimization methods is hindered by the memory bottleneck. In this paper, we therefore propose an online algorithm to solve max-norm regularized problems. The main advantage of online algorithms is that the memory cost is independent of the sample size, which makes them a good fit for the big data era.

To be more specific, we are interested in a general max-norm regularized matrix decomposition (MRMD) problem. Suppose that the observed data matrix Z can be decomposed into a low-rank component X and some structured noise E. We aim to estimate the two components simultaneously and accurately by solving the following convex program:

$$\begin{aligned} \text {(MRMD)}\quad \min _{X, E}\quad \frac{1}{2} \left\| Z-X-E \right\| _{F}^2 + \frac{\lambda _1}{2} \left\| X \right\| _{\max }^2 + \lambda _2 h(E). \end{aligned}$$
(1.1)

Here, \(\left\| \cdot \right\| _{F}\) denotes the Frobenius norm which is a commonly used metric for evaluating the residual, \(\left\| \cdot \right\| _{\max }\) is the max-norm (which promotes low-rankness), and \(\lambda _1\) and \(\lambda _2\) are two non-negative parameters. h(E) is some (convex) regularizer that can be adapted to various kinds of noise. We require that it can be represented as a summation of column norms. Formally, there exists some regularizer \(\tilde{h}(\cdot )\), such that

$$\begin{aligned} h(E) = \sum _{i=1}^{n} \tilde{h}(\varvec{e}_i), \end{aligned}$$
(1.2)

where \(\varvec{e}_i\) is the ith column of E. Classical examples include:

  • \(\left\| E \right\| _{1}\). That is, the \(\ell _1\) norm of the matrix E seen as a long vector, which is used to handle sparse corruption. In this case, \(\tilde{h}(\cdot )\) is the \(\ell _1\) vector norm. Note that when equipped with this norm, the above problem reduces to the well-known RPCA formulation (Candès et al. 2011), but with the nuclear norm being replaced by the max-norm.

  • \(\left\| E \right\| _{2,1}\). This is defined as the summation of the \(\ell _2\) column norms, which is effective when a small fraction of the samples are contaminated (recall that each column of Z is a sample). The matrix \(\ell _{2,1}\) norm is typically used to handle outliers and interestingly, the above program becomes Outlier PCA (Xu et al. 2013) in this case.

  • \(\left\| E \right\| _{F}^2\) or \(E=0\). The formulation of (1.1) works as a large-margin based program, with the hinge loss replaced by the squared loss (Srebro et al. 2004).

Hence, (MRMD) (1.1) is general enough and our algorithmic and theoretical results hold for such a general form, covering important problems including max-norm regularized RPCA, max-norm regularized Outlier PCA and maximum margin matrix factorization. Furthermore, with a careful design, the above formulation (1.1) can be extended to address the matrix completion problem (Candès and Recht 2009), as we will show in Sect. 5.
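To make the column-wise requirement (1.2) concrete, the following NumPy sketch evaluates the three regularizers listed above both directly and as a sum over columns. The helper names (`h_l1`, `h_l21`, `column_sum`) and the random test matrix are illustrative assumptions and do not come from the original formulation.

```python
import numpy as np

def h_l1(E):
    """Entrywise l1 norm of E, i.e. E seen as one long vector."""
    return np.abs(E).sum()

def h_l21(E):
    """l_{2,1} norm: sum of the l2 norms of the columns of E."""
    return np.linalg.norm(E, axis=0).sum()

def column_sum(E, h_tilde):
    """Evaluate h(E) through (1.2) by summing h_tilde over the columns of E."""
    return sum(h_tilde(E[:, i]) for i in range(E.shape[1]))

rng = np.random.default_rng(0)
E = rng.standard_normal((5, 8))  # hypothetical sizes: p = 5, n = 8

# Each regularizer, rebuilt from a per-column function h_tilde.
l1_via_columns = column_sum(E, lambda e: np.abs(e).sum())
l21_via_columns = column_sum(E, lambda e: np.linalg.norm(e))
fro_via_columns = column_sum(E, lambda e: e @ e)
```

In each case the per-column function plays the role of \(\tilde{h}(\cdot )\), which is what later allows every sample to be processed on its own.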

Considering the connection between max-norm and nuclear norm, one might be interested in an alternative formulation as follows:

$$\begin{aligned} \min _{X, E}\quad \frac{1}{2} \left\| Z-X-E \right\| _{F}^2 + \frac{\lambda _1'}{2} \left\| X \right\| _{\max } + \lambda _2 h(E). \end{aligned}$$
(1.3)

First, we would like to point out that the above formulation is equivalent to (1.1), in the sense that if we choose a proper parameter \(\lambda _1'\) for (1.3) and some parameter \(\lambda _1\) for (1.1), they produce the same solutions. To see this, we note that (1.3) is equivalent to the following constrained program:

$$\begin{aligned} \min _{X, E}\ \frac{1}{2}\left\| Z- X -E \right\| _{F}^2 + \lambda _2 h(E), \quad \mathrm {s.t.}\,\ \left\| X \right\| _{\max } \le \kappa , \end{aligned}$$

for some parameter \(\kappa \). Taking the square on both sides of the inequality constraint gives

$$\begin{aligned} \min _{X, E}\ \frac{1}{2}\left\| Z- X -E \right\| _{F}^2 + \lambda _2 h(E), \quad \mathrm {s.t.}\,\ \left\| X \right\| _{\max }^2 \le \kappa ^2. \end{aligned}$$

Again, we know that for some proper choice of \(\lambda _1\), the above program is equivalent to (1.1). The reason we choose (1.1) is for a convenient computation of the solution. We defer a more detailed discussion to Sect. 3.

1.1 Contributions

In summary, our main contributions are two-fold: (1) We are the first to develop an online algorithm to solve a family of max-norm regularized problems (1.1), which admits a wide range of applications in machine learning. We also show that our approach can be used to solve other popular max-norm regularized problems such as matrix completion. (2) We prove that the sequence of solutions produced by our algorithm converges to a stationary point of the expected loss function asymptotically (see Sect. 4).

Compared to our earlier work (Shen et al. 2014), the formulation (1.1) considered here is more general and a complete proof is provided. In addition, we illustrate by an extensive study on the subspace recovery task that the max-norm always performs better than the nuclear norm in terms of convergence rate and robustness.

1.2 Related works

Here we discuss some relevant works in the literature. Most previous works on max-norm focused on showing that it is empirically superior to the nuclear norm in real-world problems, such as collaborative filtering (Srebro et al. 2004), clustering (Jalali and Srebro 2012) and Hamming embedding (Neyshabur et al. 2014). Other works, for instance, Salakhutdinov and Srebro (2010), studied the influence of data distribution with the max-norm regularization and observed good performance even when the data are sampled non-uniformly. There are also interesting works which investigated the connection between the max-norm and the nuclear norm. A comprehensive study on this problem, in the context of collaborative filtering, can be found in Srebro and Shraibman (2005), which established and compared the generalization bounds for the nuclear norm regularization and the max-norm, showing that the latter results in a tighter bound. More recently, Foygel et al. (2012) attempted to unify them to gain an insightful perspective.

Also in line with this work is matrix decomposition. As we mentioned, when we penalize the noise E with the \(\ell _1\) matrix norm, it reduces to the well-known RPCA formulation (Candès et al. 2011). The only difference is that Candès et al. (2011) analyzed the RPCA problem with the nuclear norm, while (1.1) employs the max-norm. Owing to the explicit form of the subgradient of the nuclear norm, Candès et al. (2011) established a dual certificate for the success of their formulation, which facilitates their theoretical analysis. In contrast, the max-norm is a much harder mathematical entity (even its subgradient has not been fully characterized). Hence, it still remains challenging to understand the behavior of the max-norm regularizer in the general setting (1.1). Studying the conditions for the exact recovery of MRMD is out of the scope of this paper. We leave this as future work.

From a high level, the goal of this paper is similar to that of Feng et al. (2013). Motivated by the celebrated RPCA problem (Candès et al. 2011; Xu et al. 2013, 2012), Feng et al. (2013) developed an online implementation for the nuclear-norm regularized matrix decomposition. Yet, since the max-norm is a more complicated mathematical entity, new techniques and insights are needed in order to develop online methods for the max-norm regularization. For example, after converting the max-norm to its matrix factorization form, the data are still coupled and we propose to transform the problem to a constrained one for stochastic optimization.

The main technical contribution of this paper is converting max-norm regularization to an appropriate matrix factorization problem that is amenable to online implementation. Compared to Mairal et al. (2010) which also studies online matrix factorization, our formulation contains an additional structured noise that brings the benefit of robustness to contamination. Some of our proof techniques are also different. For example, to prove the convergence of the dictionary and to well define their problem, Mairal et al. (2010) assumed that the magnitude of the learned dictionary is constrained. In contrast, we prove that the optimal basis is uniformly bounded, and hence our problem is naturally well-defined.

Our algorithm can be viewed as a majorization-minimization scheme, for which Mairal (2013) derived a general analysis on the convergence behavior. However, we find that Algorithm 1 in Mairal (2013) requires the knowledge of the Lipschitz constant to obtain a surrogate function. In our work, we use a suboptimal solution to derive the surrogate function (see Step 3 and Step 5 in our Algorithm 1 to be introduced). Due to such a different mechanism, it remains an open question whether one can apply their algorithm and theoretical analysis to the problem considered here. It is also worth mentioning that in order to establish their theoretical results, Mairal (2013) assumed that the iterates and the empirical loss function are uniformly bounded (see Assumption (C) and Assumption (D) therein). For our problem, we can virtually prove this property (see Proposition 3 and Corollary 1 to follow). Finally, we note that our algorithm is different from block coordinate descent, see, e.g., Wang and Banerjee (2014). In fact, block coordinate descent randomly and independently picks a mini-batch of samples and updates a block variable, whereas we in each iteration update only the variables associated with the revealed sample. Another key difference is that Wang and Banerjee (2014) considered a strongly convex objective function, while we are working with a non-convex case.

1.3 Roadmap

The rest of the paper is organized as follows. Section 2 begins with the problem setting, followed by a reformulation of the MRMD problem that is amenable to online optimization. Section 3 then elaborates the online implementation of MRMD and Sect. 4 establishes the convergence guarantee under some mild assumptions. In Sect. 5, we show that our framework can easily be extended to other max-norm regularized problems, such as matrix completion. Numerical performance of the proposed algorithm is presented in Sect. 6. Finally, we conclude this paper in Sect. 7. All the proofs are deferred to the “Appendix”.

1.4 Notation

Before delivering the algorithm and the analysis, let us first introduce the notation used throughout the paper. We use bold lowercase letters, e.g., \(\varvec{v}\), to denote a column vector. The \(\ell _1\) norm and \(\ell _2\) norm of a vector \(\varvec{v}\) are denoted by \(\left\| \varvec{v} \right\| _{1}\) and \(\left\| \varvec{v} \right\| _{2}\), respectively. Capital letters, such as M, are used to denote matrices. In particular, the letter \(I_n\) is reserved for the n-by-n identity matrix. For a matrix M, the ith row and jth column are written as \(\varvec{m}(i)\) and \(\varvec{m}_j\), respectively, and the (i, j)th entry is denoted by \(m_{ij}\). There are four matrix norms that will be heavily used in the paper: \(\left\| M \right\| _{F}\) for the Frobenius norm, \(\left\| M \right\| _{1}\) for the \(\ell _1\) matrix norm seen as a long vector, \(\left\| M \right\| _{\max }\) for the max-norm induced by the product of the \(\ell _{{2, \infty }}\) norms on the factors of M, and \(\left\| M \right\| _{2,\infty }\) for the \(\ell _{{2, \infty }}\) norm, defined as the maximum \(\ell _2\) row norm. The trace of a square matrix M is denoted as \(\mathrm{Tr}\,(M)\). Finally, for a positive integer n, we use [n] to denote the integer set \(\{1, 2, \ldots , n\}\).

2 Problem setup

We are interested in developing an online algorithm for the MRMD problem (1.1) so as to mitigate the memory issue. To this end, we utilize the following definition of the max-norm (Srebro et al. 2004):

$$\begin{aligned} \left\| X \right\| _{\max } \mathop {=}\limits ^{\text {def}}\min _{L, R}\ \Big \{ \left\| L \right\| _{2,\infty } \cdot \left\| R \right\| _{2,\infty }:\ X = L R^{\top }, L \in \mathbb {R}^{p\times d}, R \in \mathbb {R}^{n\times d}\Big \}, \end{aligned}$$
(2.1)

where d is an upper bound on the intrinsic dimension of the underlying data. Plugging the above back to (1.1), we obtain an equivalent form:

$$\begin{aligned} \min _{L, R, E}\quad \frac{1}{2} \left\| Z-LR^{\top }-E \right\| _{F}^2 + \frac{\lambda _1}{2} \left\| L \right\| _{2,\infty }^2 \left\| R \right\| _{2,\infty }^2 + \lambda _2 h(E). \end{aligned}$$
(2.2)

In this paper, if not specified, “equivalent” means we do not change the optimal value of the objective function. Intuitively, the variable L serves as a (possibly overcomplete) basis for the clean data while correspondingly, the variable R works as a coefficient matrix with each row being the coefficients for one sample (recall that we organize the observed samples in a column-wise manner). In order to make the new formulation (2.2) equivalent to MRMD (1.1), d should be sufficiently large due to (2.1).
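As a small illustration of definition (2.1), the NumPy sketch below evaluates \(\left\| L \right\| _{2,\infty } \cdot \left\| R \right\| _{2,\infty }\) for one particular factorization \(X = LR^{\top }\). Since (2.1) takes a minimum over all factorizations, this value is only an upper bound on the max-norm of X. The function names and the random factors are assumptions made for the example.

```python
import numpy as np

def l2inf_norm(M):
    """l_{2,inf} norm: the maximum l2 norm over the rows of M."""
    return np.linalg.norm(M, axis=1).max()

def factorization_bound(L, R):
    """||L||_{2,inf} * ||R||_{2,inf} for one factorization X = L R^T."""
    return l2inf_norm(L) * l2inf_norm(R)

rng = np.random.default_rng(0)
p, n, d = 6, 10, 3  # hypothetical sizes
L = rng.standard_normal((p, d))
R = rng.standard_normal((n, d))
X = L @ R.T

bound = factorization_bound(L, R)

# Each entry of X is an inner product between a row of L and a row of R,
# so by Cauchy-Schwarz its magnitude cannot exceed the bound.
max_entry = np.abs(X).max()

# The factorization is not unique: rescaling the factors leaves X and the
# bound unchanged.
rescaled_bound = factorization_bound(2.0 * L, R / 2.0)
```

The non-uniqueness shown in the last lines is the freedom that the reformulation in Proposition 1 exploits.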

At first sight, the problem can only be optimized in a batch manner, for which the memory cost is prohibitive. To see this, note that we are considering the regime of \(d < p \ll n\) and the size of the coefficient matrix R is proportional to n. In order to optimize the above program over the variable R, we have to compute the gradient with respect to it. Recall that the \(\ell _{{2, \infty }}\) norm picks out the largest \(\ell _2\) row norm of R, hence coupling all the samples (each row of R is associated with a sample).

Fortunately, we have the following proposition that alleviates the inter-dependency among the rows of R, hence facilitating an online algorithm where the rows of R can be optimized sequentially.

Proposition 1

Problem (2.2) is equivalent to the following constrained program:

$$\begin{aligned} \min _{{L}, {R}, E}\ \frac{1}{2} \left\| Z-LR^{\top }-E \right\| _{F}^2 + \frac{\lambda _1}{2} \left\| L \right\| _{2,\infty }^2 + \lambda _2 h(E),\quad \mathrm {s.t.}\,\ \left\| R \right\| _{2,\infty }^2 \le 1. \end{aligned}$$
(2.3)

Moreover, there exists an optimal solution \((L^*, R^*, E^*)\) attained at the boundary of the feasible set, i.e., \(\left\| R^* \right\| _{2,\infty }^2\) is equal to one.

Remark 1

Proposition 1 is crucial for the online implementation. It states that our primal MRMD problem (1.1) can be transformed to an equivalent constrained program (2.3) where the coefficients of each individual sample (i.e., a row of the matrix R) are uniformly and separately constrained.
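The rescaling idea behind this kind of equivalence can be checked numerically. The sketch below is an illustration only and does not reproduce the proof in the Appendix. Given any factors with non-zero R, it moves the scale of R into L, so that the new R has unit \(\ell _{2,\infty }\) norm, the product \(LR^{\top }\) is unchanged, and the penalty of (2.2) matches the penalty of (2.3). The helper `rescale` is a hypothetical name.

```python
import numpy as np

def l2inf_norm(M):
    """Maximum l2 norm over the rows of M."""
    return np.linalg.norm(M, axis=1).max()

def rescale(L, R):
    """Move ||R||_{2,inf} into L so that the new R has unit l_{2,inf} norm.

    Assumes R is not the zero matrix.
    """
    c = l2inf_norm(R)
    return c * L, R / c

rng = np.random.default_rng(0)
L = rng.standard_normal((6, 3))
R = rng.standard_normal((10, 3))

L_new, R_new = rescale(L, R)

# Penalty appearing in (2.2), evaluated at the original factors.
penalty_product = l2inf_norm(L) ** 2 * l2inf_norm(R) ** 2
# Penalty appearing in (2.3), evaluated at the rescaled factors.
penalty_constrained = l2inf_norm(L_new) ** 2
```

After rescaling, every row of `R_new` has \(\ell _2\) norm at most one, which is the per-sample constraint used in the online formulation.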

Consequently, we can, equipped with Proposition 1, rewrite the original problem in an online fashion, with each sample being separately processed:

$$\begin{aligned} \min _{L, R, E}\ \frac{1}{2} \sum _{i=1}^n \left\| \varvec{z}_i - L\varvec{r}_i - \varvec{e}_i \right\| _{2}^2 + \frac{\lambda _1}{2} \left\| L \right\| _{2,\infty }^2 + \lambda _2 \sum _{i=1}^n \tilde{h}(\varvec{e}_i),\ \mathrm {s.t.}\,\left\| \varvec{r}_i \right\| _{2}^2 \le 1,\ \forall \ i \in [n], \end{aligned}$$
(2.4)

where \(\varvec{z}_i\) is the ith observation, \(\varvec{r}_i\) is the corresponding coefficient vector and \(\varvec{e}_i\) is some structured error penalized by the (convex) regularizer \(\tilde{h}(\cdot )\) (recall that we require that h(E) can be decomposed column-wise). Merging the first and third terms above gives a compact form:

$$\begin{aligned} \min _{L}\ \min _{R,E}\ \sum _{i=1}^n \tilde{\ell }(\varvec{z}_i, L, \varvec{r}_i, \varvec{e}_i) + \frac{\lambda _1}{2} \left\| L \right\| _{2,\infty }^2,\quad \mathrm {s.t.}\,\ \left\| \varvec{r}_i \right\| _{2}^2 \le 1,\ \forall i \in [n], \end{aligned}$$
(2.5)

where

$$\begin{aligned} \tilde{\ell }(\varvec{z}, L, \varvec{r}, \varvec{e}) \mathop {=}\limits ^{\text {def}}\frac{1}{2} \left\| \varvec{z}- L \varvec{r}- \varvec{e} \right\| _{2}^2 + \lambda _2 \tilde{h}(\varvec{e}). \end{aligned}$$
(2.6)

This is indeed equivalent to optimizing (i.e., minimizing) the empirical loss function:

$$\begin{aligned} \min _L\ f_n(L), \end{aligned}$$
(2.7)

where

$$\begin{aligned} f_n(L) \mathop {=}\limits ^{\text {def}}\frac{1}{n} \sum _{i=1}^n \ell ( \varvec{z}_i, L) + \frac{\lambda _1}{2n} \left\| L \right\| _{2,\infty }^2, \end{aligned}$$
(2.8)

and

$$\begin{aligned} \ell (\varvec{z}, L ) = \min _{\varvec{r}, \varvec{e}, \left\| \varvec{r} \right\| _{2}^2 \le 1} \tilde{\ell }(\varvec{z}, L, \varvec{r}, \varvec{e}). \end{aligned}$$
(2.9)

Note that by Proposition 1, as long as d is sufficiently large, the program (2.7) is equivalent to the primal formulation (1.1), in the sense that both of them attain the same minimum. Compared to MRMD (1.1), which is solved in a batch manner by prior works, the formulation (2.7) paves the way for a stochastic optimization procedure since all the samples are decoupled.
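The decoupling can be verified numerically on a toy instance. The following sketch compares the matrix form of the objective in (2.4) with the sum of per-sample losses (2.6) used in (2.5), for arbitrary (not optimized) L, R and E. The \(\ell _1\) choice of \(\tilde{h}\), the parameter values and all function names are assumptions for the example.

```python
import numpy as np

def sample_loss(z, L, r, e, lam2, h_tilde):
    """Per-sample loss (2.6) for one observation z."""
    resid = z - L @ r - e
    return 0.5 * resid @ resid + lam2 * h_tilde(e)

def batch_objective(Z, L, R, E, lam1, lam2, h_tilde):
    """Objective of (2.4) written with matrix norms."""
    fit = 0.5 * np.linalg.norm(Z - L @ R.T - E, "fro") ** 2
    reg_L = 0.5 * lam1 * np.linalg.norm(L, axis=1).max() ** 2
    reg_E = lam2 * sum(h_tilde(E[:, i]) for i in range(E.shape[1]))
    return fit + reg_L + reg_E

def decoupled_objective(Z, L, R, E, lam1, lam2, h_tilde):
    """Same objective as a sum of per-sample losses, as in (2.5)."""
    n = Z.shape[1]
    total = sum(sample_loss(Z[:, i], L, R[i], E[:, i], lam2, h_tilde)
                for i in range(n))
    return total + 0.5 * lam1 * np.linalg.norm(L, axis=1).max() ** 2

rng = np.random.default_rng(0)
p, n, d = 5, 12, 2  # hypothetical sizes
Z = rng.standard_normal((p, n))
L = rng.standard_normal((p, d))
R = rng.standard_normal((n, d))   # row i holds the coefficients of sample i
E = rng.standard_normal((p, n))
l1 = lambda e: np.abs(e).sum()    # h_tilde chosen as the l1 norm

batch_value = batch_objective(Z, L, R, E, 0.5, 0.1, l1)
decoupled_value = decoupled_objective(Z, L, R, E, 0.5, 0.1, l1)
```

Both values agree, so each sample contributes one independent term apart from the shared penalty on L.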

3 Algorithm

Based on the derivation in the preceding section, we are now ready to present our online algorithm to solve the MRMD problem (1.1). The implementation is outlined in Algorithm 1. Here we briefly explain the underlying intuition. We optimize the coefficients \(\varvec{r}\), the structured noise \(\varvec{e}\) and the basis L in an alternating manner, with only the basis L and two accumulation matrices being kept in memory. At the tth iteration, given the basis \(L_{t-1}\) produced by the previous iteration, we can optimize (2.9) by examining the Karush–Kuhn–Tucker (KKT) conditions. To obtain a new iterate \(L_t\), we then minimize the following objective function:

$$\begin{aligned} g_t(L) \mathop {=}\limits ^{\text {def}}\frac{1}{t}\sum _{i=1}^t \tilde{\ell }(\varvec{z}_i, L, \varvec{r}_i, \varvec{e}_i) + \frac{\lambda _1}{2t} \left\| L \right\| _{2,\infty }^2, \end{aligned}$$
(3.1)

where \(\{\varvec{r}_i\}_{i=1}^t\) and \(\{\varvec{e}_i\}_{i=1}^t\) are already on hand. It can be verified that (3.1) is a surrogate function of the empirical loss \(f_t(L)\) (2.8), since the obtained \(\varvec{r}_i\)’s and \(\varvec{e}_i\)’s are suboptimal. Interestingly, instead of recording all the past \(\varvec{r}_i\)’s and \(\varvec{e}_i\)’s, we only need to store two accumulation matrices whose sizes are independent of n, as shown in Algorithm 1. In the sequel, we elaborate each step.

figure a

3.1 Update the coefficients and noise

Given a sample \(\varvec{z}\) and a basis L, we are able to estimate the optimal coefficients \(\varvec{r}\) and the noise \(\varvec{e}\) by minimizing \(\tilde{\ell }(\varvec{z}, L, \varvec{r}, \varvec{e})\). That is, we are to solve the following program:

$$\begin{aligned} \min _{\varvec{r}, \varvec{e}}\ \frac{1}{2} \left\| \varvec{z}- L \varvec{r}- \varvec{e} \right\| _{2}^2 + \lambda _2 \tilde{h}(\varvec{e}),\quad \mathrm {s.t.}\,\ \left\| \varvec{r} \right\| _{2} \le 1. \end{aligned}$$
(3.2)

We notice that the constraint only involves the variable \(\varvec{r}\), and in order to optimize \(\varvec{r}\), we only need to consider the residual term in the objective function. This motivates us to employ a block coordinate descent algorithm. Namely, we alternately optimize one variable with the other fixed, until some stopping criterion is fulfilled. In our implementation, the algorithm terminates and returns the current iterate when the difference between the current and the previous iterate is smaller than \(10^{-6}\), or the number of iterations exceeds 100.

3.1.1 Optimize the coefficients \(\varvec{r}\)

Now it remains to show how to compute a new iterate for one variable when the other one is fixed. According to Bertsekas (1999), when the objective function is strongly convex with respect to (w.r.t.) each block variable, we are guaranteed that the block coordinate minimization algorithm converges. In our case, we observe that such a condition holds for \(\varvec{e}\) but not necessarily for \(\varvec{r}\). In fact, the strong convexity w.r.t. \(\varvec{r}\) holds if and only if the basis L has full column rank. When L is not full rank, we may compute the Moore-Penrose pseudoinverse to solve for \(\varvec{r}\). However, for computational efficiency, we append a small jitter \(\frac{\epsilon }{2}\left\| \varvec{r} \right\| _{2}^2\) to the objective if necessary, so as to guarantee the convergence (\(\epsilon =0.01\) in our experiments). In this way, we obtain a potentially admissible iterate for \(\varvec{r}\) as follows:

$$\begin{aligned} \varvec{r}_0^{} = (L^{\top }L + \epsilon I_d )^{-1} L^{\top }(\varvec{z}- \varvec{e}). \end{aligned}$$
(3.3)

Here, \(\epsilon \) is set to be zero if and only if L is full rank.

Next, we examine if \(\varvec{r}_0^{}\) violates the inequality constraint in (3.2). If it happens to be a feasible solution, i.e., \(\left\| \varvec{r}_0^{} \right\| _{2} \le 1\), we have found the new iterate for \(\varvec{r}\). Otherwise, we conclude that the optimum of \(\varvec{r}\) must be attained on the boundary of the feasible set, i.e., \(\left\| \varvec{r} \right\| _{2}=1\), for which the minimizer can be computed by the method of Lagrangian multipliers:

$$\begin{aligned} \max _{\eta } \min _{\varvec{r}}\ \frac{1}{2} \left\| \varvec{z}- L \varvec{r}- \varvec{e} \right\| _{2}^2 + \frac{\eta }{2}\left( \left\| \varvec{r} \right\| _{2}^2 - 1 \right) ,\quad \mathrm {s.t.}\,\ \eta > 0,\quad \left\| \varvec{r} \right\| _{2} = 1. \end{aligned}$$
(3.4)

By differentiating the objective function with respect to \(\varvec{r}\), we have

$$\begin{aligned} \varvec{r}= \left( L^{\top }L + \eta I_{d} \right) ^{-1} L^{\top }(\varvec{z}- \varvec{e}). \end{aligned}$$
(3.5)

The following result helps us to search efficiently for the optimal solution.

Proposition 2

Let \(\varvec{r}\) be given by (3.5), where L, \(\varvec{z}\) and \(\varvec{e}\) are assumed to be fixed. Then, the \(\ell _2\) norm of \(\varvec{r}\) is strictly monotonically decreasing with respect to the quantity of \(\eta \).

Proof

For simplicity, let us denote

$$\begin{aligned} \varvec{r}(\eta ) = \left( L^{\top }L + \eta I_{d} \right) ^{-1} \varvec{b}, \end{aligned}$$

where \(\varvec{b}= L^{\top }(\varvec{z}- \varvec{e})\) is a fixed vector. Suppose we have a full singular value decomposition (SVD) on \(L=USV^{\top }\), where the singular values \(\{s_1, s_2, \ldots , s_p\}\) (i.e., the diagonal elements in S) are arranged in a decreasing order and at most d number of them are non-zero. Substituting L with its SVD, we obtain the squared \(\ell _2\) norm for \(\varvec{r}(\eta )\):

$$\begin{aligned} \left\| \varvec{r}(\eta ) \right\| _{2}^2 = \varvec{b}^{\top }\left( VS^2V^{\top }+ \eta I_{d} \right) ^{-2} \varvec{b}= \varvec{b}^{\top }V S_{\eta } V^{\top }\varvec{b}, \end{aligned}$$

where \(S_{\eta }\) is a diagonal matrix whose ith diagonal element equals \((s_i^2 + \eta )^{-2}\).

For any two values \(\eta _1 > \eta _2\), it is easy to see that the matrix \(S_{\eta _1} - S_{\eta _2}\) is negative definite. Hence, whenever \(\varvec{b}\) is non-zero, it holds that

$$\begin{aligned} \left\| \varvec{r}(\eta _1) \right\| _{2}^2 - \left\| \varvec{r}(\eta _2) \right\| _{2}^2 = \varvec{b}^{\top }V (S_{\eta _1} - S_{\eta _2}) V^{\top }\varvec{b}< 0, \end{aligned}$$

which concludes the proof.
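The monotonicity stated in Proposition 2 is easy to observe numerically. The sketch below evaluates \(\left\| \varvec{r}(\eta ) \right\| _{2}\) from (3.5) on a grid of \(\eta \) values for a random full-rank basis. The grid, the sizes and the function name `r_of_eta` are assumptions for this illustration.

```python
import numpy as np

def r_of_eta(L, z, e, eta):
    """Evaluate (3.5): r = (L^T L + eta * I)^{-1} L^T (z - e)."""
    d = L.shape[1]
    return np.linalg.solve(L.T @ L + eta * np.eye(d), L.T @ (z - e))

rng = np.random.default_rng(0)
L = rng.standard_normal((8, 3))   # hypothetical basis, p = 8, d = 3
z = rng.standard_normal(8)
e = np.zeros(8)

etas = np.linspace(0.1, 10.0, 50)
norms = np.array([np.linalg.norm(r_of_eta(L, z, e, eta)) for eta in etas])

# True when every step along the grid strictly reduces the norm.
is_strictly_decreasing = bool(np.all(np.diff(norms) < 0))
```

Because the norm decreases strictly, the equation \(\left\| \varvec{r}(\eta ) \right\| _{2} = 1\) has at most one solution, which is what makes a one-dimensional search over \(\eta \) well defined.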

The above proposition offers an efficient computation scheme, i.e., bisection method, for searching the optimal \(\varvec{r}\) as well as the dual variable \(\eta \). To be more detailed, we can maintain a lower bound \(\eta _1\) and an upper bound \(\eta _2\), such that \(\left\| \varvec{r}(\eta _1) \right\| _{2} \ge 1\) and \(\left\| \varvec{r}(\eta _2) \right\| _{2} \le 1\). According to the monotonic property shown in Proposition 2, the optimal \(\eta \) must fall into the interval \([\eta _1, \eta _2]\). By evaluating the value of \(\left\| \varvec{r} \right\| _{2}\) at the middle point \((\eta _1 + \eta _2)/2\), we can sequentially shrink the interval until \(\left\| \varvec{r} \right\| _{2}\) is close or equal to one. Note that we can initialize \(\eta _1\) with zero (since \(\left\| \varvec{r}_0^{} \right\| _{2} > 1\) implies the optimal \(\eta ^* > \epsilon \ge 0\)). The bisection routine is summarized in Algorithm 2.

figure b
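A minimal sketch of the bisection search described above is given below. It is not the authors' Algorithm 2. The doubling step used to find an upper bound, the tolerance, the iteration cap and the name `solve_r` are all assumptions, and the default `eps=0.0` presumes that L has full column rank.

```python
import numpy as np

def solve_r(L, z, e, eps=0.0, tol=1e-8, max_iter=200):
    """Minimize 0.5 * ||z - L r - e||^2 subject to ||r||_2 <= 1.

    Returns (r, eta), where eta is the multiplier of the norm constraint
    (0.0 when the constraint is inactive).
    """
    d = L.shape[1]
    gram = L.T @ L
    b = L.T @ (z - e)

    def r_of(shift):
        return np.linalg.solve(gram + shift * np.eye(d), b)

    r0 = r_of(eps)                      # candidate from (3.3)
    if np.linalg.norm(r0) <= 1.0:
        return r0, 0.0                  # already feasible

    # Bracket the multiplier: ||r(lo)|| > 1 and ||r(hi)|| <= 1.
    lo, hi = eps, max(1.0, 2.0 * eps)
    while np.linalg.norm(r_of(hi)) > 1.0:
        hi *= 2.0

    eta = hi
    for _ in range(max_iter):
        eta = 0.5 * (lo + hi)
        norm = np.linalg.norm(r_of(eta))
        if abs(norm - 1.0) <= tol:
            break
        # The norm decreases in eta, so a norm above 1 means eta is too small.
        lo, hi = (eta, hi) if norm > 1.0 else (lo, eta)
    return r_of(eta), eta

rng = np.random.default_rng(0)
L = rng.standard_normal((8, 3))
e = np.zeros(8)
# Constraint active: the unconstrained solution is (3, 0, 0).
r_active, eta_active = solve_r(L, L @ np.array([3.0, 0.0, 0.0]), e)
# Constraint inactive: the unconstrained solution is (0.1, 0, 0).
r_inactive, eta_inactive = solve_r(L, L @ np.array([0.1, 0.0, 0.0]), e)
```

In the active case the returned pair satisfies the stationarity condition \((L^{\top }L + \eta I_d)\varvec{r} = L^{\top }(\varvec{z}- \varvec{e})\) by construction, with \(\left\| \varvec{r} \right\| _{2}\) equal to one up to the tolerance.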

3.1.2 Optimize the noise \(\varvec{e}\)

We have clarified the technique used for solving \(\varvec{r}\) in Problem (3.2) when \(\varvec{e}\) is fixed. Now let us turn to the phase where \(\varvec{r}\) is fixed and we want to find the optimal \(\varvec{e}\). Since \(\varvec{e}\) is an unconstrained variable, generally speaking, it is much easier to solve, although one may employ different strategies for various regularizers \(\tilde{h}(\cdot )\). Here, we discuss the solutions for popular choices of the regularizer.

  1. \(\tilde{h}(\varvec{e}) = \left\| \varvec{e} \right\| _{1}\). The \(\ell _1\) regularizer results in a closed-form solution for \(\varvec{e}\) as follows:

    $$\begin{aligned} \varvec{e}= \mathcal {S}_{\lambda _2}[\varvec{z}- L\varvec{r}], \end{aligned}$$
    (3.6)

    where \(\mathcal {S}_{\lambda _2}[\cdot ]\) is the soft-thresholding operator (Donoho 1995).

  2. \(\tilde{h}(\varvec{e}) = \left\| \varvec{e} \right\| _{2}\). The solution in this case can be characterized as follows (see, for example, Liu et al. 2010):

    $$\begin{aligned} \varvec{e}= \left\{ \begin{array}{l@{\quad }l} \frac{\left\| \varvec{z}- L\varvec{r} \right\| _{2} - \lambda _2}{\left\| \varvec{z}- L\varvec{r} \right\| _{2}} (\varvec{z}- L\varvec{r}),&{}\text {if}\ \lambda _2 < \left\| \varvec{z}- L\varvec{r} \right\| _{2},\\ \mathbf {0}, &{}\text {otherwise}. \end{array}\right. \end{aligned}$$
    (3.7)
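Both noise updates are standard shrinkage operators applied to the residual \(\varvec{z}- L\varvec{r}\). The sketch below implements them in NumPy, with the argument `v` standing for that residual. The function names are assumptions for this illustration.

```python
import numpy as np

def soft_threshold(v, lam):
    """Entrywise shrinkage: argmin_e 0.5 * ||v - e||^2 + lam * ||e||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def block_soft_threshold(v, lam):
    """Vector shrinkage: argmin_e 0.5 * ||v - e||^2 + lam * ||e||_2."""
    norm = np.linalg.norm(v)
    if norm <= lam:
        return np.zeros_like(v)
    # Shrink the length of v by lam while keeping its direction.
    return (norm - lam) / norm * v

e_l1 = soft_threshold(np.array([3.0, -0.5, 1.2]), 1.0)
e_l2 = block_soft_threshold(np.array([3.0, 4.0]), 1.0)
e_zero = block_soft_threshold(np.array([3.0, 4.0]), 6.0)
```

The \(\ell _1\) version zeroes small entries individually, whereas the \(\ell _2\) version either keeps the whole residual (shortened by \(\lambda _2\)) or discards it entirely, which matches its use for sample-wise outliers.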

Finally, for completeness, we summarize the routine for updating the coefficients and the noise in Algorithm 3. The readers may refer to the preceding paragraphs for details.

figure c

3.2 Update the basis

With all the past filtration \(\mathcal {F}_t = \{ \varvec{z}_i, \varvec{r}_i, \varvec{e}_i \}_{i=1}^{t}\) on hand, we are able to compute a new basis \(L_t\) by minimizing the surrogate function (3.1). That is, we are to solve the following program:

$$\begin{aligned} \min _L\quad \frac{1}{t}\sum _{i=1}^t \tilde{\ell }(\varvec{z}_i, L, \varvec{r}_i, \varvec{e}_i) + \frac{\lambda _1}{2t} \left\| L \right\| _{2,\infty }^2. \end{aligned}$$
(3.8)

By a simple expansion, for any \(i \in [t]\), we have

$$\begin{aligned} \tilde{\ell }(\varvec{z}_i, L, \varvec{r}_i, \varvec{e}_i) = \frac{1}{2} \mathrm{Tr}\,\left( L^{\top }L \varvec{r}_i \varvec{r}_i^{\top }\right) - \mathrm{Tr}\,\left( L^{\top }(\varvec{z}_i - \varvec{e}_i ) \varvec{r}_i^{\top }\right) + \frac{1}{2} \left\| \varvec{z}_i - \varvec{e}_i \right\| _{2}^2 + \lambda _2 \tilde{h}(\varvec{e}_i). \end{aligned}$$
(3.9)

Substituting back into (3.8), putting \(A_t = \sum _{i=1}^{t} \varvec{r}_i \varvec{r}_i^{\top }\), \(B_t = \sum _{i=1}^{t} (\varvec{z}_i - \varvec{e}_i ) \varvec{r}_i^{\top }\) and removing constant terms, we obtain

$$\begin{aligned} L_t = \mathop {\hbox {arg min}}\limits _L \frac{1}{t} \left( \frac{1}{2} \mathrm{Tr}\,\left( L^{\top }L A_t \right) - \mathrm{Tr}\,\left( L^{\top }B_t \right) \right) + \frac{\lambda _1}{2t} \left\| L \right\| _{2,\infty }^2. \end{aligned}$$
(3.10)
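The role of the two accumulation matrices can be illustrated as follows. The sketch updates \(A_t\) and \(B_t\) one sample at a time and checks that the L-dependent part of the surrogate computed from them matches the full sum of squared residuals up to a constant that does not depend on L. The synthetic data and function names are assumptions for the example.

```python
import numpy as np

def fit_from_samples(L, Z, R, E):
    """Sum over samples of 0.5 * ||z_i - L r_i - e_i||^2, using all samples."""
    return 0.5 * np.linalg.norm(Z - L @ R.T - E, "fro") ** 2

def fit_from_accumulators(L, A, B):
    """L-dependent part of the same sum, using only A_t (d x d) and B_t (p x d)."""
    return 0.5 * np.trace(L.T @ L @ A) - np.trace(L.T @ B)

rng = np.random.default_rng(0)
p, d, t = 6, 3, 20  # hypothetical sizes
Z = rng.standard_normal((p, t))
R = rng.standard_normal((t, d))   # row i holds r_i
E = rng.standard_normal((p, t))

A = np.zeros((d, d))
B = np.zeros((p, d))
for i in range(t):
    r, z, e = R[i], Z[:, i], E[:, i]
    A += np.outer(r, r)           # rank-one update of A_t
    B += np.outer(z - e, r)       # rank-one update of B_t

# Term dropped in (3.10) because it does not involve L.
const = 0.5 * np.linalg.norm(Z - E, "fro") ** 2

L_test = rng.standard_normal((p, d))
full_value = fit_from_samples(L_test, Z, R, E)
compact_value = fit_from_accumulators(L_test, A, B) + const
```

Once a sample has been folded into `A` and `B`, its \(\varvec{z}_i\), \(\varvec{r}_i\) and \(\varvec{e}_i\) are no longer needed for the basis update.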

In order to derive the optimal solution, firstly, we need to characterize the subgradient of the squared \(\ell _{{2, \infty }}\) norm. In fact, let Q be a positive semi-definite diagonal matrix, such that \(\mathrm{Tr}\,(Q)=1\). Denote the set of row index which attains the maximum \(\ell _2\) row norm of L by \(\mathcal {I}\). In this way, the subgradient of \(\frac{1}{2}\left\| L \right\| _{2,\infty }^2\) is given by

$$\begin{aligned} \partial \left( \frac{1}{2} \left\| L \right\| _{2,\infty }^2 \right) = QL,\ Q_{ii} \ne 0\ \text {if and only if}\ i \in \mathcal {I},\ Q_{ij} = 0\ \text {for}\ i \ne j. \end{aligned}$$
(3.11)
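As a numerical illustration of (3.11), the sketch below builds one element of the form QL and checks the subgradient inequality against random perturbations. Placing uniform weights on the rows of maximal norm is an assumption made for this example, since (3.11) only requires a suitable diagonal Q with unit trace. The function names and the tie tolerance are likewise illustrative.

```python
import numpy as np

def half_sq_l2inf(L):
    """0.5 * ||L||_{2,inf}^2, with the norm taken over the rows of L."""
    return 0.5 * np.linalg.norm(L, axis=1).max() ** 2

def one_subgradient(L, tol=1e-12):
    """One element Q L of (3.11), using uniform weights on the maximal rows."""
    row_norms = np.linalg.norm(L, axis=1)
    active = row_norms >= row_norms.max() - tol   # index set I
    q = active / active.sum()                     # diagonal of Q, sums to one
    return q[:, None] * L

rng = np.random.default_rng(0)
L = rng.standard_normal((5, 3))
G = one_subgradient(L)

# Subgradient inequality: f(L2) >= f(L) + <G, L2 - L> for every L2.
gaps = []
for _ in range(100):
    L2 = L + rng.standard_normal(L.shape)
    gaps.append(half_sq_l2inf(L2) - half_sq_l2inf(L) - np.sum(G * (L2 - L)))
min_gap = min(gaps)
```

Only the rows attaining the maximal norm receive a non-zero entry in `G`, so a step along the negative subgradient shrinks exactly those rows.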

Equipped with the subgradient, we may apply block coordinate descent to update each column of L sequentially. We assume that the objective function (3.10) is strongly convex w.r.t. L, implying that the block coordinate descent scheme can always converge to the global optimum (Bertsekas 1999).

We summarize the update procedure in Algorithm 4. In practice, we find that after revealing a large number of samples, performing a one-pass update for each column of L is sufficient to guarantee a desirable accuracy, which matches the observation in Mairal et al. (2010).

figure d
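Since the listing of Algorithm 4 is not reproduced in the text here, the following is only a sketch of a column-wise update of the kind described above, assuming NumPy. The update rule \(\varvec{l}_j \leftarrow \varvec{l}_j - \frac{1}{A_{jj}}\left( L \varvec{a}_j - \varvec{b}_j + \lambda _1 \varvec{u}_j\right) \), with U a subgradient from (3.11), is an assumption read off the first-order condition of (3.10) and may differ from the exact steps of Algorithm 4. All function names are hypothetical.

```python
import numpy as np

def subgradient_half_sq_l2inf(L, tol=1e-12):
    """One subgradient of 0.5 * ||L||_{2,inf}^2 (uniform weights on max rows)."""
    row_norms = np.linalg.norm(L, axis=1)
    active = row_norms >= row_norms.max() - tol
    return (active / active.sum())[:, None] * L

def one_pass_column_update(L, A, B, lam1):
    """Sweep once over the columns of L (assumed update rule, see lead-in).

    Requires A[j, j] > 0 for every j, which holds when A is positive definite.
    """
    L = L.copy()
    for j in range(L.shape[1]):
        U = subgradient_half_sq_l2inf(L)              # recomputed after each change
        grad_j = L @ A[:, j] - B[:, j] + lam1 * U[:, j]
        L[:, j] -= grad_j / A[j, j]
    return L

rng = np.random.default_rng(0)
p, d, t = 6, 4, 200
R = rng.normal(size=(d, t))
A = R @ R.T                                           # positive definite here
B = rng.normal(size=(p, d))
L_new = one_pass_column_update(np.zeros((p, d)), A, B, lam1=0.1)
```

With \(\lambda _1 = 0\) each column step is an exact minimization over that column, so repeated sweeps reduce to Gauss–Seidel iterations converging to \(B A^{-1}\); with \(\lambda _1 > 0\) the step uses a subgradient and the sketch makes no claim beyond that.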

As we discussed in Sect. 1, one may prefer the formulation (1.3) to (1.1), although in some sense they are equivalent. It is worth mentioning that our algorithm can easily be tailored to solve (1.3) by modifying Step 5 of Algorithm 1 as follows:

$$\begin{aligned} L_t = \mathop {\hbox {arg min}}\limits _L\ \frac{1}{t}\left( \frac{1}{2}\mathrm{Tr}\,\left( L^{\top }L A_t \right) - \mathrm{Tr}\,\left( L^{\top }B_t \right) \right) + \frac{\lambda _1}{2t} {\left\| L \right\| _{2,\infty }}. \end{aligned}$$
(3.12)

Again, we are required to derive the optimal solution by examining the subgradient of the last term, which is given by

$$\begin{aligned} \partial \left\| L \right\| _{2,\infty } = QW,\ Q_{ii} \ne 0\ \text {if and only if}\ i \in \mathcal {I},\ Q_{ij} = 0\ \text {for}\ i\ne j, \end{aligned}$$
(3.13)

where each row of W is as follows:

$$\begin{aligned} \varvec{w}(i) = \frac{1}{\left\| \varvec{l}(i) \right\| _{2}} \varvec{l}(i),\ \forall \ 1\le i \le p. \end{aligned}$$
(3.14)
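The unsquared case (3.13)–(3.14) can be sketched in the same way, assuming NumPy. As before, Q is taken uniform on \(\mathcal {I}\) with \(\mathrm{Tr}\,(Q)=1\), which is an assumption of the example. The all-zero matrix, where (3.14) is undefined, is handled by returning zero, which is a valid subgradient of a norm at the origin.

```python
import numpy as np

def l2inf(L):
    """||L||_{2,inf}: the maximum l2 norm over the rows of L."""
    return np.max(np.linalg.norm(L, axis=1))

def subgradient_l2inf(L, tol=1e-12):
    """Return Q @ W, where W holds the normalized rows of L as in (3.14)."""
    row_norms = np.linalg.norm(L, axis=1)
    G = np.zeros_like(L, dtype=float)
    if row_norms.max() == 0.0:
        return G                                   # 0 is a subgradient at L = 0
    active = row_norms >= row_norms.max() - tol    # index set I
    q = active / active.sum()                      # diagonal of Q, uniform on I
    G[active] = q[active, None] * L[active] / row_norms[active, None]
    return G

rng = np.random.default_rng(2)
L = rng.normal(size=(5, 3))
G = subgradient_l2inf(L)
```

Only the rows in \(\mathcal {I}\) are normalized, so zero rows elsewhere in L cause no division by zero.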

3.3 Memory and computational cost

As one of the main contributions of this paper, our OMRMD algorithm (i.e., Algorithm 1) is appealing for large-scale problems (the regime \(d < p \ll n\)) since the memory cost is independent of n. To see this, note that when computing the optimal coefficients and noise, only \(\varvec{z}_t\) and \(L_{t-1}\) are accessed, which costs O(pd) memory. To store the accumulation matrix \(A_t\), we need \(O(d^2)\) memory while that for \(B_t\) is O(pd). Finally, we find that only \(A_t\) and \(B_t\) are needed for the computation of the new iterate \(L_t\). Therefore, the total memory cost of OMRMD is O(pd), i.e., independent of n. In contrast, the SDP formulation introduced by Srebro et al. (2004) requires \(O((p+n)^2)\) memory, while the local-search heuristic algorithm (Rennie and Srebro 2005) needs \(O(d(p+n))\) and comes with no convergence guarantee. Even a recently proposed algorithm (Lee et al. 2010) has to store the entire data matrix, and thus its memory cost is O(pn).
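To make the comparison concrete, the following back-of-the-envelope sketch counts stored floating-point numbers. The itemization (one sample, the basis, the two accumulators, and the per-sample variables) is an assumption about a straightforward implementation, and the sizes \(p=400\), \(d=40\) are borrowed from the experimental section purely for illustration.

```python
def omrmd_floats(p, d):
    """Floats held by a straightforward online implementation (assumed layout)."""
    sample = p            # current z_t
    basis = p * d         # L_{t-1}
    acc_a = d * d         # A_t
    acc_b = p * d         # B_t
    coeff_noise = d + p   # r_t and e_t
    return sample + basis + acc_a + acc_b + coeff_noise

def batch_floats(p, n):
    """Floats needed just to hold the full p x n data matrix."""
    return p * n

p, d = 400, 40
for n in (5_000, 50_000, 500_000):
    print(f"n={n}: online={omrmd_floats(p, d)}, batch data matrix={batch_floats(p, n)}")
```

The online count stays fixed as n grows, whereas merely storing the data matrix grows linearly in n.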

In terms of computational efficiency, our algorithm can be fast. One may have noticed that the computation is dominated by solving Problem (3.2). Computing (3.5) involves inverting a \(d\times d\) matrix followed by a matrix-matrix and a matrix-vector multiplication, for a total cost of \(O(pd^2)\). For the basis update, obtaining a subgradient of the squared \(\ell _{{2, \infty }}\) norm is O(pd) since we need to calculate the \(\ell _2\) norm for all rows of L followed by a multiplication with a diagonal matrix (see (3.11)). A one-pass update for the columns of L, as shown in Algorithm 4, costs \(O(pd^2)\). Note that the quadratic dependency on d is acceptable in the low-rank setting.

4 Theoretical analysis and proof sketch

In this section we present our main theoretical result regarding the validity of the proposed algorithm. We first discuss some necessary assumptions.

4.1 Assumptions

(A1):

The observed samples are independent and identically distributed (i.i.d.) with a compact support \(\mathcal {Z}\). This is a very common scenario in real-world applications.

(A2):

The surrogate functions \(g_t(L)\) in (3.1) are strongly convex. In particular, we assume that the smallest singular value of the positive semi-definite matrix \(\frac{1}{t}A_t\) defined in Algorithm 1 is not smaller than some positive constant \(\beta _1\).

(A3):

The minimizer for (2.9) is unique. Notice that \(\tilde{\ell }(\varvec{z}, L, \varvec{r}, \varvec{e})\) is strongly convex w.r.t. \(\varvec{e}\) and convex w.r.t. \(\varvec{r}\). We can enforce this assumption by adding a jitter \(\frac{\epsilon }{2} \Vert \varvec{r}\Vert _2^2\) to the objective function, where \(\epsilon \) is a small positive constant.

4.2 Main results

It is easy to see that Algorithm 1 is devised to optimize the empirical loss function (2.8). In stochastic optimization, we are mainly interested in the expected loss function, which is defined as the averaged loss incurred when the number of samples goes to infinity. If we assume that each sample is independently and identically distributed (i.i.d.), we have

$$\begin{aligned} f(L) \mathop {=}\limits ^{\text {def}}\lim _{n \rightarrow \infty }f_n(L) = \mathbb {E}_{\varvec{z}}[\ell ( \varvec{z}, L)]. \end{aligned}$$
(4.1)

The main theoretical result of this work is stated as follows.

Theorem 1

(Convergence to a stationary point of the expected loss function) Let \(\{L_t\}_{t=1}^{\infty }\) be the sequence of solutions produced by Algorithm 1. Then, the sequence converges to a stationary point of the expected loss function (4.1) when t tends to infinity.

Remark 2

The theorem establishes the validity of our algorithm. Note that on one hand, the transformation (2.1) facilitates an amenable way for the online implementation of the max-norm. On the other hand, due to the non-convexity of our new formulation (2.3), it is generally hard to guarantee a local or a global minimizer (Bertsekas 1999). Although Burer and Monteiro (2005) showed that any local minimum of an SDP is also the global optimum under some conditions (note that the max-norm problem can be transformed to an SDP (Srebro et al. 2004)), it is not clear how to determine whether a solution is a local optimum or a stationary point. Very recently, Bhojanapalli et al. (2016) showed that global convergence is possible for a family of batch methods. Yet, it is not clear how to apply their results in the stochastic setting. From the empirical study in Sect. 6, we find that the solutions produced by our algorithm always converge to the global optimum when the samples are drawn from an i.i.d. Gaussian distribution.

4.3 Proof outline

The essential tools for our analysis are from stochastic approximation (Bottou 1998) and asymptotic statistics (Vaart 2000). There are four key stages in our proof and one may find the full proof in “Appendix”.

Stage I We first show that all the stochastic variables \(\{ L_t, \varvec{r}_t, \varvec{e}_t \}_{t=1}^{\infty }\) are uniformly bounded. The property is crucial because it justifies that the problem we are solving is well-defined. Also, the uniform boundedness will be heavily used for deriving subsequent important results, e.g., the Lipschitz property of the surrogate function.

Proposition 3

(Uniform bound of all stochastic variables) Let \(\{\varvec{r}_t, \varvec{e}_t, L_t\}_{t=1}^{\infty }\) be the sequence of the solutions produced by Algorithm 1. Then,

  1. For any \(t > 0\), the optimal solutions \(\varvec{r}_t\) and \(\varvec{e}_t\) are uniformly bounded.

  2. For any \(t > 0\), the accumulation matrices \(\frac{1}{t}A_t\) and \(\frac{1}{t}B_t\) are uniformly bounded.

  3. There exists a compact set \(\mathcal {L}\), such that for any \(t > 0\), we have \(L_t \in \mathcal {L}\).

Proof

(Sketch) The uniform bound of \(\varvec{e}_t\) follows by constructing a trivial solution \((\mathbf {0}, \mathbf {0})\) for (2.6), which results in an upper bound for the optimum of the objective function. Notably, the upper bound here only involves a quantity on \(\left\| \varvec{z}_t \right\| _{2}\), which is assumed to be uniformly bounded. Since the \(\ell _2\) norm of \(\varvec{r}_t\) is always bounded by one, the first claim follows. The second claim follows immediately by combining the first claim and Assumption (A1). In order to show that \(L_t\) is uniformly bounded, we utilize the first order optimality condition of the surrogate (3.1). Since \(\frac{1}{t}A_t\) is positive definite, we can represent \(L_t\) in terms of \(\frac{1}{t}B_t\), \(U_t\) and the inverse of \(\frac{1}{t}A_t\), where \(U_t\) is the subgradient, whose Frobenius norm is in turn bounded by that of \(L_t\). Hence, it follows that \(L_t\) can be uniformly bounded.

Remark 3

Note that Mairal et al. (2010) and Feng et al. (2013) assumed that the dictionary (or basis) is uniformly bounded. Here, we prove that such a condition naturally holds in our case.

Corollary 1

(Uniform bound and Lipschitz of the surrogate) Following the notation in Proposition 3, we have for all \(t > 0\),

  1. \(\tilde{\ell }\left( \varvec{z}_t, L_t, \varvec{r}_t, \varvec{e}_t\right) \) (2.6) and \(\ell \left( \varvec{z}_t, L_t\right) \) (2.9) are both uniformly bounded.

  2. The surrogate function, i.e., \(g_t(L)\) defined in (3.1) is uniformly bounded over \(\mathcal {L}\).

  3. Moreover, \(g_t(L)\) is uniformly Lipschitz over the compact set \(\mathcal {L}\).

Stage II We next show that the positive stochastic process \(\{g_t(L_t)\}_{t=1}^{\infty }\) converges almost surely. To establish the convergence, we verify that \(\{g_t(L_t)\}_{t=1}^{\infty }\) is a quasi-martingale (Bottou 1998) that converges almost surely. To this end, we illustrate that the expectation of the discrepancy of \(g_{t+1}(L_{t+1})\) and \(g_t(L_t)\) can be upper bounded by a family of functions \(\ell (\cdot , L)\) indexed by \(L \in \mathcal {L}\). Then we show that the family of the functions is P-Donsker (Vaart 2000), the summands of which concentrate around its expectation within an \(O(1/\sqrt{n})\) ball almost surely. Therefore, we conclude that \(\{g_t(L_t)\}_{t=1}^{\infty }\) is a quasi-martingale and converges almost surely.

Proposition 4

Let \(L \in \mathcal {L}\) and denote the minimizer of \(\tilde{\ell }(\varvec{z}, L, \varvec{r}, \varvec{e})\) as:

$$\begin{aligned} \{\varvec{r}^*, \varvec{e}^*\} = \mathop {\hbox {arg min}}\limits _{\varvec{r}, \varvec{e}, \left\| \varvec{r} \right\| _{2} \le 1} \frac{1}{2} \left\| \varvec{z}- L \varvec{r}- \varvec{e} \right\| _{2}^2 + \lambda _2 \tilde{h}(\varvec{e}). \end{aligned}$$

Then, the function \(\ell (\varvec{z}, L)\) defined in Problem (2.9) is continuously differentiable and

$$\begin{aligned} \nabla _L \ell (\varvec{z}, L) = (L\varvec{r}^* + \varvec{e}^* - \varvec{z}) \varvec{r}^{*\top }. \end{aligned}$$

Furthermore, \(\ell (\varvec{z}, \cdot )\) is uniformly Lipschitz over the compact set \(\mathcal {L}\).

Proof

The gradient of \(\ell (\varvec{z}, \cdot )\) follows from Lemma 2. Since each term of \(\nabla _L \ell (\varvec{z}, L)\) is uniformly bounded, we conclude the uniform Lipschitz property of \(\ell (\varvec{z}, L)\) w.r.t. L.

Corollary 2

(Uniform bound and Lipschitz of the empirical loss) Let \(f_t(L)\) be the empirical loss function defined in (2.8). Then \(f_t(L)\) is uniformly bounded and Lipschitz over the compact set \(\mathcal {L}\).

Corollary 3

(P-Donsker of \(\ell (\varvec{z}, L)\) ) The set of measurable functions \(\{ \ell (\varvec{z}, L),\ L \in \mathcal {L} \}\) is P-Donsker (see definition in Lemma 1).

Proposition 5

(Concentration of the empirical loss) Let \(f_t(L)\) and f(L) be the empirical and expected loss functions we defined in (2.8) and (4.1). Then we have

$$\begin{aligned} \mathbb {E}[\sqrt{t}\left\| f_t - f \right\| _{\infty }] = O(1). \end{aligned}$$

Proof

Since \(\ell (\varvec{z}, L)\) is uniformly upper bounded (Corollary 1) and is always non-negative, its square is uniformly upper bounded, hence its expectation. Together with Corollary 3, Lemma 1 applies.

Theorem 2

(Convergence of the surrogate) The sequence \(\{g_t(L_t)\}_{t=1}^{\infty }\) we defined in (3.1) converges almost surely, where \(\{L_t\}_{t=1}^{\infty }\) is the solution produced by Algorithm 1. Moreover, the infinite summation \(\sum _{t=1}^{\infty } \left| \mathbb {E}[g_{t+1}(L_{t+1}) - g_t(L_t) \mid \mathcal {F}_t] \right|\) is bounded almost surely.

Proof

The theorem follows by showing that the sequence of \(\{ g_t(L_t) \}_{t=1}^{\infty }\) is a quasi-martingale, and hence converges almost surely. To see this, we note that for any \(t > 0\), the expectation of the difference \(g_{t+1}(L_{t+1}) - g_t(L_t)\) conditioned on the past information \(\mathcal {F}_t\) is bounded by \(\sup _L (f(L) - f_t(L)) / (t+1)\), which is of order \(O(1/(\sqrt{t}(t+1)))\) due to Proposition 5. Hence, Lemma 3 applies.

Stage III Now we prove that the sequence of the empirical loss function, \(\{f_t(L_t)\}_{t=1}^{\infty }\) defined in (2.8), converges almost surely to the same limit as its surrogate \(\{g_t(L_t)\}_{t=1}^{\infty }\). According to the central limit theorem, we assert that \(f_t(L_t)\) also converges almost surely to the expected loss \(f(L_t)\) defined in (4.1), implying that \(g_t(L_t)\) and \(f(L_t)\) converge to the same limit almost surely.

We first establish the numerical convergence of the basis sequence \(\{L_t\}_{t=1}^{\infty }\), based on which we show the convergence of \(\{ f_t(L_t) \}_{t=1}^{\infty }\) by applying Lemma 4.

Proposition 6

(Numerical convergence of the basis component) Let \(\{L_t\}_{t=1}^{\infty }\) be the basis sequence produced by Algorithm 1. Then, for any \(t > 0\), we have

$$\begin{aligned} \left\| L_{t+1} - L_t \right\| _{F} = O\left( \frac{1}{t}\right) . \end{aligned}$$
(4.2)

Theorem 3

(Convergence of the empirical and expected loss) Let \(\{f(L_t)\}_{t=1}^{\infty }\) be the sequence of the expected loss where \(\{L_t\}_{t=1}^{\infty }\) is the sequence of the solutions produced by Algorithm 1. Then, we have

  1. The sequence of the empirical loss \(\{f_t(L_t)\}_{t=1}^{\infty }\) converges almost surely to the same limit as the surrogate.

  2. The sequence of the expected loss \(\{f(L_t)\}_{t=1}^{\infty }\) converges almost surely to the same limit as the surrogate.

Proof

Let \(b_t = g_t(L_t) - f_t(L_t)\). We show that the infinite series \(\sum _{t=1}^{\infty } b_t / (t+1)\) is bounded by applying the central limit theorem to \(f(L_t) - f_t(L_t)\) and the result of Theorem 2. We further prove that \(\left| b_{t+1} - b_t \right|\) can be bounded by O(1 / t), due to the uniform boundedness and Lipschitz continuity of \(g_t(L_t)\), \(f_t(L_t)\) and \(\ell (\varvec{z}_t, L_t)\). According to Lemma 4, we conclude the convergence of \(\{b_t\}_{t=1}^{\infty }\) to zero, which gives the first claim. The second claim follows immediately owing to the central limit theorem.

Final stage According to Claim 2 of Theorem 3 and the fact that \(\mathbf {0}\) belongs to the subgradient of \(g_t(L)\) evaluated at \(L = L_t\), we are to show that the gradient of f(L) taken at \(L_t\) vanishes as t tends to infinity, which establishes Theorem 1. To this end, we note that since \(\{L_t\}_{t=1}^{\infty }\) is uniformly bounded, the non-differentiable term \(\frac{1}{2t} \left\| L \right\| _{2,\infty }^2\) vanishes as t goes to infinity, implying the differentiability of \(g_{\infty }(L_{\infty })\), i.e. \(\nabla g_{\infty }(L_{\infty })=\mathbf {0}\). On the other hand, we show that the gradient of f(L) and that of \(g_t(L)\) are always Lipschitz on the compact set \(\mathcal {L}\), implying the existence of their second order derivative even when \(t \rightarrow \infty \). Thus, by taking a first order Taylor expansion and letting t go to infinity, we establish the main theorem.

5 Connection to matrix completion

While we mainly focus on the matrix decomposition problem, our method can be extended to the matrix completion (MC) problem (Cai et al. 2010; Candès and Recht 2009) with max-norm regularization (Cai and Zhou 2013, 2016)—another popular topic in machine learning and signal processing. We focus on the max-norm regularized MC problem with squared Frobenius loss widely considered in the literature, which can be described as follows:

$$\begin{aligned} \min _X\ \frac{1}{2}\left\| \mathcal {P}_{\varOmega }\left( Z - X\right) \right\| _{F}^2 + \frac{\lambda }{2} \left\| X \right\| _{\max }^2, \end{aligned}$$

where \(\varOmega \) is the set of indices of observed entries in Z and \(\mathcal {P}_{\varOmega }(M)\) is the orthogonal projection onto the span of matrices vanishing outside of \(\varOmega \) so that the (i, j)th entry of \(\mathcal {P}_{\varOmega }(M)\) is equal to \(M_{ij}\) if \((i, j) \in \varOmega \) and zero otherwise. Interestingly, the max-norm regularized MC problem can be cast into our framework. To see this, let us introduce an auxiliary matrix M, with \(M_{ij} = c > 0\) if \((i, j) \in \varOmega \) and \(M_{ij}=1/c\) otherwise. The reformulated MC problem,

$$\begin{aligned} \min _{X, E}\ \frac{1}{2} \left\| Z-X-E \right\| _{F}^2 + \frac{\lambda }{2} \left\| X \right\| _{\max }^2 + \left\| M \circ E \right\| _{1}, \end{aligned}$$
(5.1)

where “\(\circ \)” denotes the entry-wise product, is similar to our MRMD formulation (1.1). And it is easy to show that when c tends to infinity, the reformulated problem converges to the original MC problem.

Online implementation We now derive a stochastic implementation for the max-norm regularized MC problem. Note that the only difference between Problem (5.1) and Problem (1.1) is the \(\ell _1\) regularization on E, which results in a new penalty on \(\varvec{e}\) for \(\tilde{\ell }(\varvec{z}, L, \varvec{r}, \varvec{e})\) (originally defined in (2.6)):

$$\begin{aligned} \tilde{\ell }(\varvec{z}, L, \varvec{r}, \varvec{e}) = \frac{1}{2} \left\| \varvec{z}- L\varvec{r}- \varvec{e} \right\| _{2}^2 + \left\| \varvec{m}\circ \varvec{e} \right\| _{1}. \end{aligned}$$
(5.2)

Here, \(\varvec{m}\) is a column of the matrix M in (5.1). According to the definition of M, \(\varvec{m}\) is a vector with element value being either c or 1 / c. Let us define two support sets as follows:

$$\begin{aligned}&\varOmega _1 \mathop {=}\limits ^{\text {def}}\ \left\{ i \mid m_i = c, 1 \le i \le p \right\} ,\\&\varOmega _2 \mathop {=}\limits ^{\text {def}}\ \left\{ i \mid m_i = 1/c, 1 \le i \le p\right\} , \end{aligned}$$

where \(m_i\) is the ith element of vector \(\varvec{m}\). In this way, the newly defined \(\tilde{\ell }(\varvec{z}, L, \varvec{r}, \varvec{e})\) can be written as

$$\begin{aligned} \tilde{\ell }(\varvec{z}, L, \varvec{r}, \varvec{e})= & {} \left( \frac{1}{2} \left\| \varvec{z}^{}_{\varOmega _1}- (L\varvec{r})^{}_{\varOmega _1} - \varvec{e}^{}_{\varOmega _1} \right\| _{2}^2 + c \left\| \varvec{e}^{}_{\varOmega _1} \right\| _{1} \right) \nonumber \\&+ \left( \frac{1}{2} \left\| \varvec{z}^{}_{\varOmega _2}- (L\varvec{r})^{}_{\varOmega _2} - \varvec{e}^{}_{\varOmega _2} \right\| _{2}^2 + \frac{1}{c} \left\| \varvec{e}^{}_{\varOmega _2} \right\| _{1} \right) . \end{aligned}$$
(5.3)

Notably, as \(\varOmega _1\) and \(\varOmega _2\) are disjoint, given \(\varvec{z}\), L and \(\varvec{r}\), the variable \(\varvec{e}\) in (5.3) can be optimized by soft-thresholding in a separate manner:

$$\begin{aligned} \varvec{e}^{}_{\varOmega _1} = \mathcal {S}^{}_{c}\left[ \varvec{z}^{}_{\varOmega _1}- (L\varvec{r})^{}_{\varOmega _1}\right] ,\quad \varvec{e}^{}_{\varOmega _2} = \mathcal {S}^{}_{1/c}\left[ \varvec{z}^{}_{\varOmega _2}- (L\varvec{r})^{}_{\varOmega _2}\right] . \end{aligned}$$
(5.4)
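The separable update (5.4) can be sketched as follows, assuming NumPy and the standard soft-thresholding operator \(\mathcal {S}_{\theta }[x] = \mathrm {sign}(x)\max (|x| - \theta , 0)\) applied entry-wise. The function names and the boolean mask `observed`, marking the entries in \(\varOmega _1\), are hypothetical.

```python
import numpy as np

def soft_threshold(x, thr):
    """Entry-wise soft-thresholding: sign(x) * max(|x| - thr, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)

def update_e(z, L, r, observed, c):
    """Minimize (5.3) over e for fixed z, L, r.

    observed[i] is True when m_i = c (index in Omega_1) and False when
    m_i = 1/c (index in Omega_2).
    """
    residual = z - L @ r
    e = np.empty_like(residual)
    e[observed] = soft_threshold(residual[observed], c)
    e[~observed] = soft_threshold(residual[~observed], 1.0 / c)
    return e

rng = np.random.default_rng(0)
p, d = 8, 2
L = rng.normal(size=(p, d))
r = rng.normal(size=d)
z = rng.normal(size=p)
observed = np.array([True, False] * (p // 2))
e = update_e(z, L, r, observed, c=1e6)
```

For large c the observed entries of \(\varvec{e}\) are thresholded to zero, so the fit must come from \(L\varvec{r}\), while the unobserved entries of \(\varvec{e}\) absorb almost the entire residual. This mirrors the limiting argument that recovers the original MC problem.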

Hence, we obtain Algorithm 5 for the online max-norm regularized matrix completion (OMRMC) problem. The update principle for \(\varvec{r}\) is the same as we described in Algorithm 3 and that for \(\varvec{e}\) is given by (5.4). Note that we can use Algorithm 4 to update L as usual.

figure e

\(\ell _{\infty }\)-norm constrained variant In some matrix completion applications, one may have to take another \(\ell _{\infty }\)-norm constraint into account, i.e.,

$$\begin{aligned} \left\| X \right\| _{\infty } \le \tau ,\ \text {for } \text {some}\ \tau > 0. \end{aligned}$$
(5.5)

For example, the rating value of the Netflix dataset is not greater than 5. In the 1-bit setting, the entries of a matrix can either be 1 or \(-1\) (Davenport et al. 2014). Other examples can be found in, e.g., Klopp (2014). Interestingly, Algorithm 5 can be adjusted to such a constraint.

To see this, we observe that the constraint \(\left\| X \right\| _{\infty } \le \tau \) amounts to restricting

$$\begin{aligned} \left|x_{ij} \right| \le \tau \end{aligned}$$

for all entries \(x_{ij}\) of X. Due to the matrix factorization \(X = LR^{\top }\), we know that it requires

$$\begin{aligned} \left|\varvec{l}(i) \varvec{r}(j)^{\top } \right| \le \tau ,\ \forall \ i \in \ [p],\ \forall \ j \in [n], \end{aligned}$$
(5.6)

where we recall that \(\varvec{l}(i)\) and \(\varvec{r}(j)\) are the ith row of L and the jth row of R, respectively. Proposition 1 already ensures

$$\begin{aligned} \left\| \varvec{r}(j) \right\| _{2} \le 1,\ \forall \ j \in [n]. \end{aligned}$$

Since \(\left|\varvec{l}(i) \varvec{r}(j)^{\top } \right| \le \left\| \varvec{l}(i) \right\| _{2} \cdot \left\| \varvec{r}(j) \right\| _{2}\), we obtain a sufficient condition for (5.6):

$$\begin{aligned} \left\| \varvec{l}(i) \right\| _{2} \le \tau ,\ \forall \ i \in [p]. \end{aligned}$$

That is,

$$\begin{aligned} \left\| L \right\| _{2,\infty } \le \tau , \end{aligned}$$

which can easily be fulfilled by an orthogonal projection onto the \(\ell _2\) ball with radius \(\tau \), i.e., if \(\left\| L_t \right\| _{2,\infty } > \tau \), we set \(L_t \leftarrow \frac{\tau }{\left\| L_t \right\| _{2,\infty }} L_t\).
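The rescaling step and the sufficient condition behind it can be checked with a short sketch, assuming NumPy (the function name is hypothetical). It applies the rule \(L_t \leftarrow \frac{\tau }{\left\| L_t \right\| _{2,\infty }} L_t\) stated above and then forms \(X = LR^{\top }\) with rows of R of norm at most one.

```python
import numpy as np

def rescale_to_l2inf_ball(L, tau):
    """Scale L so that its maximum row norm is at most tau (no-op if already so)."""
    max_row_norm = np.max(np.linalg.norm(L, axis=1))
    if max_row_norm <= tau:
        return L
    return (tau / max_row_norm) * L

rng = np.random.default_rng(0)
p, n, d, tau = 10, 15, 3, 5.0
L = 10.0 * rng.normal(size=(p, d))
R = rng.normal(size=(n, d))
# Enforce ||r(j)||_2 <= 1 for every row of R.
R = R / np.maximum(1.0, np.linalg.norm(R, axis=1))[:, None]

L_scaled = rescale_to_l2inf_ball(L, tau)
X = L_scaled @ R.T
```

By the Cauchy–Schwarz inequality every entry of X is then bounded by \(\tau \) in absolute value. Clipping each offending row of L separately would also satisfy the constraint; the sketch follows the global rescaling given in the text.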

Other types of loss functions In this paper, we focus on the squared Frobenius loss for the max-norm regularized problems. There is also solid theoretical analysis for other formulations, e.g., logistic regression and probit regression (Cai and Zhou 2013). Unfortunately, it seems that one cannot trivially extend the proposed online algorithms to a general loss function. To be more precise, for the Frobenius (or \(\ell _2\)) loss, we have the nice property that minimizing the surrogate (3.8) is equivalent to solving (3.10), for which only O(pd) memory is needed. For general models, such a property does not hold and we conjecture that further techniques are needed to find a good approximation to (3.8).

6 Experiments

In this section, we report numerical results on synthetic data to demonstrate the effectiveness and robustness of our online max-norm regularized matrix decomposition (OMRMD) algorithm. Some experimental settings are used throughout this section, as elaborated below.

Data generation The simulation data are generated by following a procedure similar to that in Candès et al. (2011). The clean data matrix X is produced by \(X = UV^{\top }\), where \(U\in \mathbb {R}^{ p \times d}\) and \(V\in \mathbb {R}^{ n \times d}\). The entries of U and V are i.i.d. sampled from the normal distribution \(\mathcal {N}(0, 1)\). We choose sparse corruption in the experiments, and introduce a parameter \(\rho \) to control the sparsity of the corruption matrix E, i.e., a \(\rho \)-fraction of the entries are non-zero, with locations sampled uniformly and magnitudes following a uniform distribution over \([-1000, 1000]\). Finally, the observation matrix Z is produced by \(Z=X+E\).
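The generation procedure can be sketched in a few lines, assuming NumPy. The function name, the seed handling and the rounding of \(\rho pn\) to an integer number of corrupted entries are assumptions of this example, not details taken from the experiments.

```python
import numpy as np

def generate_data(p, n, d, rho, magnitude=1000.0, seed=0):
    """Return (Z, X, E, U) with Z = X + E, X = U V^T and E sparse."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((p, d))
    V = rng.standard_normal((n, d))
    X = U @ V.T                                   # clean rank-d matrix

    E = np.zeros((p, n))
    k = int(round(rho * p * n))                   # number of corrupted entries
    idx = rng.choice(p * n, size=k, replace=False)  # uniform locations
    E.flat[idx] = rng.uniform(-magnitude, magnitude, size=k)
    return X + E, X, E, U

Z, X, E, U = generate_data(p=40, n=100, d=4, rho=0.1)
```

Each column of Z can then be fed to an online algorithm one at a time, with U kept aside as the ground-truth basis for evaluation.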

Baselines We mainly compare with two methods: Principal Component Pursuit (PCP) and online robust PCA (OR-PCA). PCP is the state-of-the-art batch method for subspace recovery, which was presented as a robust formulation of PCA in Candès et al. (2011). OR-PCA is an online implementation of PCP, which also achieves state-of-the-art performance among the online subspace recovery algorithms. To illustrate robustness, we sometimes also report the results of online PCA (Artač et al. 2002), which incrementally learns the principal components without taking the noise into account.

Evaluation metric Our goal is to estimate the correct subspace for the underlying data. Here, we evaluate the fitness of our estimated subspace basis L and the ground truth basis U by the Expressed Variance (EV) (Xu et al. 2010):

$$\begin{aligned} \text {EV}(U, L) \mathop {=}\limits ^{\text {def}}\frac{\mathrm{Tr}\,(L^{\top }UU^{\top }L)}{\mathrm{Tr}\,(UU^{\top })}. \end{aligned}$$
(6.1)

The values of EV range in [0, 1] and a higher value indicates a more accurate recovery.
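A minimal sketch of the EV computation in (6.1) follows, assuming NumPy. It orthonormalizes the estimated basis with a QR factorization before applying the formula; this normalization is an assumption of the example (it is what keeps the value within [0, 1]) and the function name is hypothetical.

```python
import numpy as np

def expressed_variance(U, L):
    """EV(U, L) as in (6.1), with the columns of L orthonormalized first."""
    Q, _ = np.linalg.qr(L)                        # orthonormal basis of span(L)
    # Tr(Q^T U U^T Q) = ||U^T Q||_F^2 and Tr(U U^T) = ||U||_F^2.
    return np.linalg.norm(U.T @ Q) ** 2 / np.linalg.norm(U) ** 2

rng = np.random.default_rng(0)
U = rng.standard_normal((30, 4))

# Same subspace, different basis: EV should be (numerically) one.
ev_same = expressed_variance(U, U @ rng.standard_normal((4, 4)))

# A basis orthogonal to span(U): EV should be (numerically) zero.
Q_full, _ = np.linalg.qr(U, mode="complete")
ev_orth = expressed_variance(U, Q_full[:, 4:8])
```

The two extreme cases illustrate that EV depends only on the subspace spanned by L and not on the particular basis chosen.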

Other settings Throughout the experiments, we set the ambient dimension \(p=400\), the total number of samples \(n=5000\) and pick the value of d as the true rank unless otherwise specified. We fix the tunable parameter \(\lambda _1 = \lambda _2 = 1 / \sqrt{p}\), and use default parameters for all baselines we compare with. Each experiment is repeated 10 times and we report the averaged EV as the result.

6.1 Robustness

We first study the robustness of OMRMD, measured by the EV value of its output after accessing the last sample, and compare it to the nuclear norm based OR-PCA and the batch algorithm PCP. In order to make a detailed examination, we vary the true rank from 0.02p to 0.5p, with a step size 0.04p, and the corruption fraction \(\rho \) from 0.02 to 0.5, with a step size 0.04.

The general results are illustrated in Fig. 1 where a brighter color means a higher EV (hence better performance). We observe that for easy tasks (i.e., little corruption and low rank), OMRMD and OR-PCA perform comparably. However, for more difficult cases, OMRMD outperforms OR-PCA. In order to further investigate this phenomenon, we plot the EV curve against the fraction of corruption under a given matrix rank. In particular, we group the results into two parts, one with relatively low rank (Fig. 2) and the other with middle level of rank (Fig. 3). Figure 2 indicates that when manipulating a low-rank matrix, OR-PCA works as well as OMRMD under a low level of noise. For instance, the EV produced by OR-PCA is close to that of OMRMD for rank less than 40 and \(\rho \) no more than 0.26. However, when the rank becomes larger, OR-PCA degrades quickly compared to OMRMD. This is possibly because the max-norm is a tighter approximation to the matrix rank. Since PCP is a batch formulation and accesses all the data in each iteration, it always achieves the best recovery performance.

Fig. 1
figure 1

Performance of subspace recovery under different rank and corruption fraction. Brighter color means better performance. As we observed, the max-norm based algorithm OMRMD always performs comparably to or better than OR-PCA, which is based on the nuclear norm formulation. Since PCP is a batch method, it always achieves the best recovery performance

Fig. 2
figure 2

EV value against corruption fractions when the matrix has a relatively low rank (note that the ambient dimension p is 400). The EV value is computed for the obtained basis after accessing the last sample. When the rank is extremely low (\(\hbox {rank} = 8\)), OMRMD and OR-PCA work comparably. In other cases, OMRMD is always better than OR-PCA when addressing a large fraction of corruption

Fig. 3
figure 3

EV value against corruption fractions when the matrix has a middle level of rank (note that the ambient dimension p is 400). The EV value is computed for the basis after accessing the last sample. In these cases, OR-PCA degrades as soon as the corruption is tuned to be higher than 0.02

Fig. 4
figure 4

EV value against number of samples under different corruption fractions. PCP outperforms all the online algorithms before they converge since PCP accesses all the data to estimate the basis. The performance of Online PCA is significantly degraded even when there is little corruption. For hard tasks (\(\rho \) equal to 0.3 or higher), we again observe the superiority of the max-norm over the nuclear norm

Fig. 5
figure 5

EV value against number of samples under different ambient dimensions. The intrinsic dimension \(d = 0.1p\) and the corruption fraction \(\rho = 0.3\)

6.2 Convergence rate

We next study the convergence of OMRMD by plotting the EV curve against the number of samples. Besides OR-PCA and PCP, we also add online PCA (Artač et al. 2002) as a baseline algorithm. The results are illustrated in Fig. 4 where we set \(p=400\) and the true rank as 80. As expected, PCP achieves the best performance since it is a batch method and needs to access all the data during optimization. Online PCA degrades significantly even with low corruption (Fig. 4a). OMRMD is comparable to OR-PCA when the corruption is low (Fig. 4a), and converges significantly faster when the data is grossly corrupted (Fig. 4c and 4d). This observation agrees with Fig. 1, and again suggests that in the noisy scenario, max-norm may be a better fit than the nuclear norm.

Indeed, OMRMD converges much faster even in large scale problems. In Fig. 5, we compare the convergence rate of OMRMD and OR-PCA under different ambient dimensions. The rank of the data is set to 0.1p, indicating a low-rank structure of the underlying data. Again, we assume the rank is known so \(d = 0.1p\). The corruption fraction \(\rho \) is fixed to 0.3, which is a difficult task for recovery. We observe that for high dimensional cases (\(p = 1000\) and \(p = 3000\)), OMRMD significantly outperforms OR-PCA. For example, in Fig. 5b, OMRMD achieves an EV value of 0.8 after accessing only about 2000 samples, whereas OR-PCA needs to reveal 60,000 samples to obtain the same accuracy.

Fig. 6
figure 6

EV value against time under different ambient dimensions. The intrinsic dimension d is set as 0.1p and the corruption fraction \(\rho \) equals 0.3

6.3 Computational complexity

We note that OMRMD is slightly inferior to OR-PCA in terms of computation per iteration, as our algorithm may solve a dual problem to optimize \(\varvec{r}\) (see Algorithm 3) if the initial solution \(\varvec{r}_0\) violates the constraint. We plot the EV curve with respect to the running time in Fig. 6. It shows that OR-PCA is about 3 times faster than OMRMD when processing a data point. However, our emphasis here is on the convergence rate, that is, given an EV value, how much time the algorithm takes to achieve it. In Fig. 6c, for example, OMRMD takes 50 minutes to achieve the EV value of 0.6, while OR-PCA uses nearly 900 minutes. From Figs. 5 and 6, it is safe to conclude that OMRMD is superior to OR-PCA in terms of convergence rate at the price of a little more computation per sample.

6.4 Influence of d

Finally, we remark that it is important to pick a sufficiently large value for d. As Burer and Monteiro (2005) suggested, d should be chosen no smaller than the true rank. In the simulation studies, we always pick d as the rank of the underlying data. Here we examine the influence of d in Fig. 7, where we set the ambient dimension \(p=400\), the sample size \(n = 5000\) and the true rank to 40. As expected, if the value of d is smaller than the true rank, there is no hope of recovering the subspace.

Fig. 7
figure 7

Influence of the choice of d. The true matrix rank is 40. We observe that as long as d is no smaller than the true rank, the algorithm always recovers the subspace

7 Conclusion

In this paper, we have developed an online algorithm for the max-norm regularized matrix decomposition problems. Using the matrix factorization form of the max-norm, we converted the original problem to a constrained one which facilitates an online implementation for solving the batch problem. We have established theoretical guarantees that the sequence of the solutions converges to a stationary point of the expected loss function asymptotically. Moreover, we empirically compared our proposed algorithm with OR-PCA, which is a recently proposed online algorithm for nuclear-norm based matrix decomposition. The simulation results have suggested that the proposed algorithm is more robust than OR-PCA, in particular for hard tasks (i.e., when a large fraction of entries are corrupted). We also have investigated the convergence rate for both OMRMD and OR-PCA, and have shown that OMRMD converges much faster than OR-PCA even in large-scale problems. When acquiring sufficient samples, we observed that our algorithm converges to the batch method PCP, which is a state-of-the-art formulation for subspace recovery. Our experiments, to an extent, suggest that the max-norm might be a tighter relaxation of the rank function compared to the nuclear norm.