
1 Introduction

High prediction accuracy and the discovery of relevant predictive variables are two fundamental problems in statistical learning. Variable selection is particularly important when the underlying model has a sparse representation, especially in high-dimensional and massive data analysis. It has been argued by [1] that a good estimator should have the oracle property, namely, the estimator

  • correctly selects covariates with nonzero coefficients with probability converging to one, as the sample size goes to infinity, and

  • has the same asymptotic distribution as if the zero coefficients were known in advance.

Consider the linear regression model \(\varvec{y} = \varvec{X\beta } + \varvec{\epsilon }\), where \(\mathbf X \in R^{n\times l_n}\) is a design matrix, \(\varvec{\beta }\in R^{l_n}\) is the vector of unknown coefficients, and \(\varvec{\epsilon }\in R^n\) is a vector of i.i.d. random variables with mean zero and finite variance \(\sigma ^2\). Note that \(l_n\), the length of \(\varvec{\beta }\), depends on the sample size n and may go to infinity as \(n\rightarrow \infty \). Without loss of generality, we assume that the response vector \(\varvec{y}\in R^n\) and the covariates are centered so that the intercept term can be excluded. In many situations we wish to recover from the observation \(\varvec{y}\) the sparsest \(\varvec{\beta }\), that is, the \(\varvec{\beta }\) with the fewest nonzero components. A direct approach is to formulate this problem as \(\min _{\varvec{\beta }\in R^{l_n}}||\varvec{\beta }||_0\text { such that }\varvec{y} = \varvec{X\beta } + \varvec{\epsilon }\), which can be transformed into \(\min _{\varvec{\beta }\in R^{l_n}}||\varvec{y}- \varvec{X\beta }||^2+\lambda ||\varvec{\beta }||_0\), the so-called \(L_0\) regularization problem, where \(||\varvec{\beta }||_0\) is the number of nonzero components of \(\varvec{\beta }\) and \(\lambda \) is the regularization parameter. Indeed, this method can recover sparse solutions even when \(l_n\gg n\); in fact, it can perfectly recover every sparse \(\varvec{\beta }\) obeying \(||\varvec{\beta }||_0\le n/2\). However, it is of little practical use, since solving an \(L_0\) regularization problem generally requires an intractable number of combinatorial searches. To overcome this difficulty, several approximations to the \(L_0\) problem have been proposed, such as \(L_1\) regularization [2,3,4,5], the adaptive Lasso [6], \(L_p\) \((0<p<1)\) regularization [7, 8] and adaptive \(L_p\) \((0<p<1)\) regularization [9].

Among the techniques above, \(L_1\) regularization (the Lasso) avoids the huge computational cost of the \(L_0\) problem on large instances, but it may produce inconsistent estimates [6] and extra bias [10]. The adaptive Lasso and \(L_p\) regularization resolve these problems, and their oracle properties have been established in both low- and high-dimensional scenarios [6,7,8, 11]. It has also been claimed that \(L_p\) \((0<p<1)\) regularization yields sparser solutions than both the Lasso and the adaptive Lasso [12, 13], although this sparsity can sometimes lead to unstable estimation; [9] therefore proposed the adaptive \(L_p\) (\(0<p<1\)) regularization, \(\min _{\varvec{\beta }\in R^{l_n}}||\varvec{y}-\varvec{X\beta }||^2 + \lambda \sum _{j=1}^{l_{n}}\omega _j|\beta _j|^p\), and proved its oracle property when the number of covariates is fixed.
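For concreteness, the following is a minimal NumPy sketch of the adaptive \(L_p\) objective. The function names are ours, and the weight rule \(\omega _j=|\tilde{\beta }_j|^{-\gamma }\) built from an initial estimate is one common choice of adaptive weights rather than necessarily the exact rule of [9].

```python
import numpy as np

def adaptive_lp_objective(beta, X, y, lam, weights, p=0.5):
    """Adaptive L_p objective: ||y - X beta||^2 + lam * sum_j w_j * |beta_j|^p."""
    residual = y - X @ beta
    return residual @ residual + lam * np.sum(weights * np.abs(beta) ** p)

def adaptive_weights(beta_init, gamma=1.0, eps=1e-6):
    """Adaptive weights from an initial estimate, w_j = |beta_init_j|^{-gamma};
    eps guards against division by zero for components estimated as exactly 0."""
    return (np.abs(beta_init) + eps) ** (-gamma)
```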

In this paper, we continue to investigate adaptive \(L_{p}\) (\(0<p<1\)) regularization when the number of covariates depends on the sample size and may go to infinity as the sample size does. We prove that, under a series of mild conditions, the adaptive \(L_p\) (\(0<p<1\)) estimator enjoys the oracle property in high-dimensional settings even when \(l_n\gg n\), and we propose an algorithm that solves the adaptive \(L_p\) problem efficiently. Finally, we demonstrate the superior performance of the adaptive \(L_p\) in variable selection, signal recovery and image shape reconstruction through a series of numerical experiments, in comparison with the \(L_{1}\) estimator, the adaptive \(L_{1}\) estimator and the \(L_{1/2}\) estimator.

2 Preliminaries

The symbol \(\rightarrow \) stands for convergence in the ordinary sense, \(\rightarrow _p\) for convergence in probability, and \(\rightarrow _d\) for convergence in distribution. \(\mathbb P(\cdot )\) denotes probability. \(X_n=O_p(1)\) means that the sequence \(X_n\) is stochastically bounded, and \(X_n=o_p(1)\) means \(X_n\rightarrow _p 0 \text { as }n\rightarrow \infty \). We write \(\varvec{\beta }_0=[\varvec{\beta }_{01}^{\top }, \varvec{\beta }_{00}^{\top }]^{\top }\in R^{l_n}\), where \(\varvec{\beta }_{01}\in R^{k_n}\) consists of the nonzero components of \(\varvec{\beta }_0\) and \(\varvec{\beta }_{00}\in R^{m_n}\) of the zero ones, so that \(k_n+m_n=l_n\). We center the response vector \(\varvec{y}=[y_1, \cdots , y_n]^{\top }\) and standardize the design matrix \(X=(x_{ij})_{n\times l_n}\) so that \( \sum _{i=1}^n y_i=0,\text { } \sum _{i=1}^n x_{ij}=0,\text { } \frac{1}{n}\sum _{i=1}^nx_{ij}^2=1, \text { }j=1,\cdots ,l_n\).
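As a concrete illustration, here is a minimal NumPy sketch of this centering and standardization step; the function and variable names are ours.

```python
import numpy as np

def center_and_standardize(X, y):
    """Center y and the columns of X, and rescale each column of X so that
    (1/n) * sum_i x_ij^2 = 1, matching the normalization above."""
    n = X.shape[0]
    y_centered = y - y.mean()
    X_centered = X - X.mean(axis=0)
    scale = np.sqrt((X_centered ** 2).sum(axis=0) / n)
    return X_centered / scale, y_centered
```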

Let \(J_{1n}=\{j\;|\;\beta _{0j}\ne 0\}\) and \(J_{0n}=\{j\;|\;\beta _{0j}= 0\}\), and set \(\varvec{X}_n^{\top }=[\varvec{x}_1, \varvec{x}_2, \cdots , \varvec{x}_n]\in R^{l_n\times n}\), \(\varvec{X}_{n1}=[\varvec{x}_{11}, \varvec{x}_{12}, \cdots , \varvec{x}_{1n}]\in R^{k_n\times n}\); accordingly we define \(\varSigma _n=\frac{1}{n}\varvec{X}_n^{\top }\varvec{X}_n,\text { }\varSigma _{n1}=\frac{1}{n}\varvec{X}_{n1} \varvec{X}_{n1}^{\top }\). Let \(\rho _{1n}\) and \(\tau _{1n}\) denote the smallest eigenvalues of \(\varSigma _n\) and \(\varSigma _{n1}\), respectively, and \(\rho _{2n}\) and \(\tau _{2n}\) the largest. We consider the oracle property proposed by [1].

Definition 2.1

(Oracle Property). Let \(\hat{\varvec{\beta }}_n=[\hat{\varvec{\beta }}_{n1}^{\top }, \hat{\varvec{\beta }}_{n0}^{\top }]^{\top }\) be an estimator of the true parameter \(\varvec{\beta }_0=[\varvec{\beta }_{01}^{\top }, \varvec{\beta }_{00}^{\top }]^{\top }\). Then \(\hat{\varvec{\beta }}_n\) is said to possess the oracle property if the two conditions below are satisfied: (1) (Consistency) \(\lim _{n\rightarrow \infty }\mathbb {P}(\hat{\varvec{\beta }}_{n0}=\varvec{0})=1\); (2) (Asymptotic Normality) let \(s_n^2=\sigma ^2\varvec{\alpha }_n^{\top }\varSigma _{n1}^{-1}\varvec{\alpha }_n\), where \(\varvec{\alpha }_n\) is any \(k_n\times 1\) vector satisfying \(\Vert \varvec{\alpha }_n\Vert _2\le 1\); then

$$\begin{aligned} n^{\frac{1}{2}}s_n^{-1}\varvec{\alpha }_n^{\top }(\hat{\varvec{\beta }}_{n1}-\varvec{\beta }_{01} )=n^{-\frac{1}{2}}s_n^{-1}\sum _{i=1}^n\epsilon _i\varvec{\alpha }_n^{\top } \varSigma _{n1}^{-1}\varvec{x}_{i1}+o_p(1)\rightarrow _dN(0,1). \end{aligned}$$
(2.1)

Let \(b_{1n}=\min _{j\in {J_{1n}}}\{|\beta _{0j}|\},\text { } b_{2n}=\max _{j\in {J_{1n}}}\{|\beta _{0j}|\}\).

Definition 2.2

(Zero Consistency). The estimator \(\tilde{\varvec{\beta }}_n\) is said to be zero-consistent if it satisfies the two conditions: (1) \(\max _{j\in J_{0n}}|\tilde{\beta }_{nj}|=o_p(1)\); (2) there exists a constant \(c>0\) such that, for any \(\epsilon >0\) and sufficiently large n, the following inequality holds:

$$\begin{aligned} \mathbb {P}(\min _{j\in J_{1n}}|\tilde{\beta }_{nj}|\ge cb_{1n})>1-\epsilon . \end{aligned}$$
(2.2)

where \(\tilde{\beta }_{nj}\) is the marginal regression coefficient [11]. Furthermore, if for some constant \(C>0\),

$$\begin{aligned} \mathbb {P}(R_n\max _{j\in J_{0n}}|\tilde{\beta }_{nj}|>C)\rightarrow 0,\quad \text {as }n\rightarrow \infty \end{aligned}$$
(2.3)

where \(\lim _{n\rightarrow \infty }R_n=\infty \), then \(\tilde{\varvec{\beta }}_n\) is said to be zero-consistent with rate \(R_n\).

3 Methods

We now present the conditions for the oracle property of the adaptive \(L_p\) regularization. Due to space limitations, we omit the proofs of all theorems; they will appear in a forthcoming paper. We consider the estimator [9]

$$\begin{aligned} U_n(\varvec{\beta })=\sum _{i=1}^n\Big (Y_i-\sum _{j=1}^{l_n}x_{ij}\beta _j\Big )^2+\lambda _n\sum _{j=1}^{l_n}\omega _{nj}|\beta _j|^p, \end{aligned}$$
(3.1)

where \(0<p<1\). Let \(\bar{\varvec{\beta }}_n=\mathrm {arg}\min _{\varvec{\beta }}U_n({\varvec{\beta }})= [\bar{\varvec{\beta }}_{n1}^{\top }, \bar{\varvec{\beta }}_{n0}^{\top }]^{\top }\), where \(\bar{\varvec{\beta }}_{n1}\) and \(\bar{\varvec{\beta }}_{n0}\) correspond to the estimates of the nonzero and zero coefficients of the true parameter \(\varvec{\beta }_0\), respectively. We make the following assumptions.

  1. (B1)

(i) \(\{\epsilon _i\}_{i=1}^n\) is a sequence of i.i.d. random variables with mean 0 and finite variance \(\sigma ^2\); (ii) \(\epsilon _i\) is sub-Gaussian, that is, \( \mathbb {P}(|\epsilon _i|>x)\le K\exp (-Cx^2)\) for all \(i\in \mathbb {N}\) and some \(K>0\) and \(C>0\).

  2. (B2)

The initial estimator \(\tilde{\varvec{\beta }}_n\) of Definition 2.2 is zero-consistent with rate \(R_n\).

  3. (B3)

(i) There exists a constant \(c_9>0\) such that \( \left| n^{-\frac{1}{2}}\sum _{i=1}^nx_{ij}x_{ik}\right| \le c_9\) for all \(j\in J_{0n}\), all \(k\in J_{1n}\) and all sufficiently large n; (ii) let \(\xi _{nj}=n^{-1}\mathrm {E}(\sum _{i=1}^nY_ix_{ij})=n^{-1}\sum _{i=1}^n(\varvec{x}_{i1}^{\top }\varvec{\beta }_{01}x_{ij})\); there exists \(\xi _0>0\) such that \( \min _{j\in J_{1n}}|\xi _{nj}|>2\xi _0b_{1n}>0\).

  4. (B4)

    (i) \(\frac{\lambda _n}{n}\rightarrow 0\), \(\lambda _nn^{-\frac{p}{2}}R_n^{\alpha }k_n^{p-2}\rightarrow \infty \); (ii) \(\log (m_n)=o(1)(\lambda _nn^{-\frac{p}{2}}R_n^{-\alpha })^{\frac{2}{2-p}}.\)

  5. (B5)

(i) There exist constants \(0<b_1<b_2<\infty \) such that \(b_1<b_{1n}<b_{2n}<b_2\); (ii) \(k_n\exp (-Cn)\rightarrow 0\) as \(n\rightarrow \infty \).

Theorem 3.3

Suppose the conditions (B1)–(B5) hold. Then \(\bar{\varvec{\beta }}_n\) is consistent in variable selection, namely

$$\begin{aligned} \mathbb {P}(\bar{\varvec{\beta }}_{n0}=\varvec{0})\rightarrow 1,\quad \mathbb {P}(\bar{\varvec{\beta }}_{n1j}\ne 0,\,j\in J_{1n})\rightarrow 1, \quad \text {as }n\rightarrow \infty . \end{aligned}$$
(3.2)

It can be seen from Theorem 3.3 that, under appropriate conditions, the adaptive \(L_p\) (\(0<p<1\)) correctly selects the nonzero covariates with probability converging to one. To establish the oracle property, we restrict attention to the covariates selected as nonzero by \(\bar{\varvec{\beta }}_n\) and consider optimizing the following objective function

$$\begin{aligned} \tilde{U}_n(\varvec{\beta }_1)=\sum _{i=1}^n(Y_i-\varvec{x}_{i1}^{\top }\varvec{\beta }_1)^2+\lambda _n^{*}\sum _{j=1}^{k_n}\omega _{nj}|\beta _{1j}|^p, \end{aligned}$$
(3.3)

where \(k_n\) is the number of nonzero components of \(\bar{\varvec{\beta }}_n\).

Let \(\hat{\varvec{\beta }}_{n1}\) denote the nonzero components of \(\hat{\varvec{\beta }}_n=\mathrm {arg}\min _{\varvec{\beta }}\tilde{U}_n (\varvec{\beta })\). We further impose the following conditions.

  1. (B6)

(i) There exist constants \(0<\tau _1<\tau _2<\infty \) such that \(0<\tau _1<\tau _{1n}<\tau _{2n}<\tau _2\); (ii) \(n^{-\frac{1}{2}}\max _{1\le i\le n}\varvec{x}_{i1}^{\top }\varvec{x}_{i1}\rightarrow 0\).

  2. (B7)

    (i) \(k_n(1+\lambda _n^{*})/n\rightarrow 0,\quad \)(ii) \(\lambda _n^{*}(k_n/l_n\sqrt{n})^{\frac{1}{2}}\rightarrow 0.\)

We then have the following result.

Theorem 3.4

Let \(\hat{\varvec{\beta }}_{n1}\) be the estimate of the true nonzero parameter \(\varvec{\beta }_{01}\). Suppose conditions (B1)–(B7) hold. Then

$$\begin{aligned} n^{\frac{1}{2}}s_n^{-1}\varvec{\alpha }_n^{\top }(\hat{\varvec{\beta }}_{n1}- \varvec{\beta }_{01})=n^{-\frac{1}{2}}s_n^{-1}\sum _{i=1}^n\epsilon _i\varvec{\alpha }_n^{\top }\varSigma _{n1}^{-1} \varvec{x}_{i1}+o_p(1)\rightarrow _d\mathrm {N}(0,1), \end{aligned}$$

where \(s_n^2=\sigma ^2\varvec{\alpha }_n^{\top }\varSigma _{n1}^{-1}\varvec{\alpha }_n\) and \(\varvec{\alpha }_n\) is an arbitrary \(k_n\times 1\) vector with \(||\varvec{\alpha }_n||_2\le 1\).

The assumption that \(\tilde{\varvec{\beta }}_n\) is zero-consistent with rate \(R_n\) is critical in establishing the oracle property of the adaptive \(L_p\) \((0<p<1)\) regularizer. [11] points out that when \(l_n\) is fixed or of order \(o(\sqrt{n})\), the OLS estimator \(\tilde{\varvec{\beta }}_{ols} = (\varvec{X}^{\top }\varvec{X})^{-1}\varvec{X}^{\top }\varvec{y}\) is feasible as \(\tilde{\varvec{\beta }}_n\), but when \(l_n\) grows faster than \(\sqrt{n}\), the OLS estimator is no longer zero-consistent. Here we follow the construction of [11], with the additional requirement that there exists \(b_1>0\) such that \(b_{1n}>b_1>0\), to obtain the initial estimator. We refer to Sect. 3 of [11] for the remaining details of this discussion, and to [9] for the case \(l_{n}<n\).
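As a rough illustration only (not the authors' exact construction), the sketch below computes an initial estimator: ordinary least squares when it is well defined, and otherwise marginal regression coefficients, which under the standardization of Sect. 2 reduce to \(\tilde{\beta }_{nj}=n^{-1}\varvec{x}_j^{\top }\varvec{y}\). The branching rule and function name are ours.

```python
import numpy as np

def initial_estimator(X, y):
    """Illustrative initial estimator beta_tilde for the adaptive weights.
    - If l_n < n, use the OLS estimator (X'X)^{-1} X'y, as discussed in [11].
    - Otherwise fall back to marginal regression coefficients, which for
      standardized columns reduce to beta_tilde_j = (1/n) * x_j' y."""
    n, l_n = X.shape
    if l_n < n:
        beta_tilde, *_ = np.linalg.lstsq(X, y, rcond=None)
    else:
        beta_tilde = X.T @ y / n
    return beta_tilde
```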

However, for the case \(l_n>n\), OLS is no longer feasible as an initial estimator. Following [11] and Theorem 3.3, we first perform variable selection to identify the nonzero \(\beta _j\), which leads to Algorithm 1 below.

Algorithm 1. The adaptive \(L_p\) \((0<p<1)\) regularization procedure (pseudocode figure).
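The pseudocode of Algorithm 1 is not reproduced here. As a minimal sketch of the idea stated in Sect. 4, that solving the adaptive \(L_p\) \((0<p<1)\) problem amounts to solving a sequence of adaptive \(L_1\) problems, the code below majorizes the \(L_p\) penalty by a weighted \(L_1\) penalty at the current iterate and solves the resulting adaptive Lasso via column rescaling. It uses scikit-learn's Lasso purely for illustration (the paper's experiments use SPAMS), and the initialization, surrogate weights and iteration count are our choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lp_via_reweighted_l1(X, y, lam, weights, p=0.5,
                                  n_iter=10, eps=1e-6):
    """Sketch: minimize ||y - X b||^2 + lam * sum_j w_j |b_j|^p by solving
    a sequence of weighted-L1 (adaptive Lasso) subproblems."""
    n, l_n = X.shape
    # Start from a least-squares (minimum-norm if l_n > n) estimate.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        # Majorize |b_j|^p at the current iterate (|b|^p is concave in |b|):
        # |b_j|^p <= |beta_j|^p + p * |beta_j|^(p-1) * (|b_j| - |beta_j|),
        # which leaves a weighted-L1 penalty with weights c_j.
        c = weights * p * (np.abs(beta) + eps) ** (p - 1)
        X_scaled = X / c                       # substitute b_j = z_j / c_j
        # sklearn's Lasso minimizes (1/(2n)) * ||y - Xz||^2 + alpha * ||z||_1,
        # so alpha = lam / (2n) reproduces the surrogate's penalty level.
        lasso = Lasso(alpha=lam / (2.0 * n), fit_intercept=False,
                      max_iter=10000)
        lasso.fit(X_scaled, y)
        beta = lasso.coef_ / c
    return beta
```

In practice the starting point would be the zero-consistent initial estimator \(\tilde{\varvec{\beta }}_n\) discussed above, and the iterations would be stopped once successive iterates are sufficiently close.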

4 Results

In this section, we present three application examples: variable selection, signal recovery and image shape reconstruction. We note that, by Algorithm 1, solving the adaptive \(L_p\) \((0<p<1)\) problem is equivalent to solving a series of adaptive \(L_1\) problems, which is very fast on a modern computer. We take \(p=1/2\) in the experiments, as recommended by [12]. To assess the performance of the present algorithm, we compare it with the \(L_{1}\) (Lasso), adaptive \(L_{1}\) and (nonadaptive) \(L_{1/2}\) regularized estimators. In practice, we use the algorithm proposed in [6] to compute the adaptive Lasso and apply the iterative \(L_1\) algorithm of [12] to the \(L_{1/2}\) regularization. The parameter \(\lambda _n\) in the \(L_{1/2}\) regularization is selected by the generalized cross-validation (GCV) method described in [1, 16].

For the adaptive \(L_{1}\) and adaptive \(L_{1/2}\) regularizers, we use two-dimensional cross-validation and select \(\gamma \) from \(\{0.5, 1.0, 1.5\}\). For fixed \(\gamma \) and \(\lambda \), we apply Algorithm 1 to obtain a numerical solution \(\hat{\beta }=\hat{\beta }_{\gamma , \lambda }\). Note that \(\hat{\beta }_{\gamma , \lambda }\) is the minimizer of the adaptive \(L_{1/2}\) regularization problem [12] and satisfies \(\hat{\beta }_{\gamma , \lambda } = (X^{\top }X+\lambda _n WD^{*})^{-1}X^{\top }y\), where W is the diagonal matrix with elements \(\omega _{nj}\), D = diag\(\{|\hat{\beta }_j|^{3/2}\}\), and \(D^{*}\) is the generalized inverse of D. The number of nonzero components of \(\hat{\beta }_{\gamma , \lambda }\) can be approximated by \(P_{\gamma , \lambda } = \text {tr}(X(X^{\top }X+\lambda WD^{*})^{-1}X^{\top })\) (see [16, 17]). Thus the generalized cross-validation statistic is given by \(\text {GCV}_{\gamma , \lambda } = \frac{1}{l_n}\frac{\text {RSS}_{\gamma , \lambda }}{(1-\text {P}_{\gamma , \lambda }/l_n)^2}\), where RSS denotes the residual sum of squares \(\sum _{i=1}^n\big (y_i-\varvec{x}_i^{\top }\hat{\beta }_{\gamma ,\lambda }\big )^2\). We therefore obtain the adaptive \(L_{1/2}\) solution from the minimization problem \(\{\hat{\gamma }, \hat{\lambda }\} = \arg \min _{\gamma , \lambda }\text {GCV}_{\gamma , \lambda }\), that is, \(\hat{\beta }=\hat{\beta }_{\hat{\gamma }, \hat{\lambda }}\). All simulation code is written in Python using the package SPAMS [15].
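A minimal sketch of this GCV computation is given below, taking \(D^{*}\) as the generalized inverse of D as above and following the formula in the text; the fitting routine that produces \(\hat{\beta }_{\gamma ,\lambda }\) on the \((\gamma ,\lambda )\) grid is assumed to exist, and all names are ours.

```python
import numpy as np

def gcv_statistic(X, y, beta_hat, lam, weights):
    """GCV_{gamma,lambda} following the formula in the text, with D* the
    generalized inverse of D = diag{|beta_hat_j|^{3/2}}."""
    n, l_n = X.shape
    d = np.abs(beta_hat) ** 1.5
    d_star = np.zeros_like(d)
    d_star[d > 1e-12] = 1.0 / d[d > 1e-12]
    A = X.T @ X + lam * np.diag(weights * d_star)
    hat_matrix = X @ np.linalg.solve(A, X.T)   # X (X'X + lam W D*)^{-1} X'
    p_eff = np.trace(hat_matrix)               # approximate number of nonzeros
    rss = np.sum((y - X @ beta_hat) ** 2)
    return (rss / l_n) / (1.0 - p_eff / l_n) ** 2

# (gamma, lambda) would then be chosen by evaluating gcv_statistic over a
# two-dimensional grid of candidate values and keeping the minimizer.
```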

4.1 Variable Selection

Consider the following linear regression model from [1, 12, 16]: \(y = X\beta ^{*}+\sigma \epsilon \), where the true coefficient vector is \(\beta ^{*}=[3,1.5,0,0,2,0,0,0]^{\top }\), \(\epsilon \) is i.i.d. noise following a specified distribution, and \(\sigma \) is the noise strength. We first take \(\sigma =1\) and then \(\sigma =3\) with \(\epsilon \) following the standard normal distribution; finally, we take \(\sigma =1\) with \(\epsilon \) following a mixture of 30% standard Cauchy and 70% standard normal distributions (denoted MIXTURE in the result tables). The correlation between \(x_i\) and \(x_j\) is \((1/2)^{|i-j|}\).
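A minimal sketch of how one such dataset can be generated is given below; sampling \(X\) from a multivariate normal with covariance \((1/2)^{|i-j|}\) and the function names are our illustrative choices.

```python
import numpy as np

def simulate_dataset(n=100, sigma=1.0, noise="normal", rng=None):
    """One dataset from y = X beta* + sigma * eps with Corr(x_i, x_j) = 0.5^{|i-j|}."""
    rng = np.random.default_rng() if rng is None else rng
    beta_star = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
    p = beta_star.size
    cov = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    if noise == "normal":
        eps = rng.standard_normal(n)
    else:  # MIXTURE: 30% standard Cauchy, 70% standard normal
        is_cauchy = rng.random(n) < 0.3
        eps = np.where(is_cauchy, rng.standard_cauchy(n), rng.standard_normal(n))
    return X, X @ beta_star + sigma * eps, beta_star
```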

For each type of noise, we simulate 100 datasets of \(n=100\) observations, and the relative model error on each dataset is defined as \(||\hat{y}-X\beta ^{*}||_2/||y-X\beta ^{*}||_2\), where \(\hat{y} = X\hat{\beta }\); here \(||\hat{y} - X\beta ^{*}||_2\) is the model error and \(||y-X\beta ^{*}||_2\) is the inherent prediction error due to the noise. The results in Tables 1 and 2 show that the adaptive \(L_{1/2}\) is more accurate and sparser than the other three regularizers.
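For completeness, a one-function sketch of this error measure (names are ours):

```python
import numpy as np

def relative_model_error(X, y, beta_star, beta_hat):
    """||X beta_hat - X beta*||_2 / ||y - X beta*||_2, as defined above."""
    return (np.linalg.norm(X @ beta_hat - X @ beta_star)
            / np.linalg.norm(y - X @ beta_star))
```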

Table 1. Median value of relative model error (\(n=100\)) (with min/max)
Table 2. Average number of zero coefficients (\(n=100\)) (with standard deviation)

4.2 Signal Recovery

In this and the next experiment, we show the application of adaptive \(L_p\) regularization to compressed sensing [2, 19, 20]. Consider a real-valued, finite-length signal \(\text {x}\in R^N\), represented in an orthonormal basis \(\{\psi _i\}_{i=1}^N\) of \(R^N\). Let \(\varPsi = [\psi _1, \cdots , \psi _N]\). Then there exists \(s\in R^{N}\) such that \(\text {x} = \varPsi s = \sum _{i=1}^N\psi _i s_i\).

Consider \(y = \varPhi x + \epsilon \), where \(\epsilon \) is a noise term, either stochastic or deterministic, and \(\varPhi \) is the sensing matrix. It was shown in [19, 21] that the reconstruction of \(\text {x}\) can be formulated as the \(L_0\) problem \(\min _{\text {x}\in R^N}\sum _{i=1}^N I_{x_i\ne 0}\text { such that }||y - \varPhi x||_2\le \delta \), where the parameter \(\delta \) is chosen so that the true signal \(\text {x}\) remains feasible. According to [2], if \(\text {x}\) is sufficiently sparse and \(\varPhi \) satisfies the Restricted Isometry Property [23], this \(L_0\) problem is equivalent to the \(L_1\) problem \(\min _{\text {x}\in R^N}\sum _{i=1}^N |x_i|\text { such that }||y - \varPhi x||_2\le \delta \). We instead apply the adaptive \(L_{1/2}\) regularizer to the original problem, that is, we reconstruct the signal by solving \(\min _{\text {x}\in R^N}\sum _{i=1}^N \omega _i|x_i|^{1/2}\text { such that }||y - \varPhi x||_2\le \delta \).

As a numerical experiment, we take \(\text {x} = \sin (2\pi f_1t) + \cos (2\pi f_2t) + \epsilon \) with a fixed signal length \(N=512\), equally spaced samples \(t\in [0, 0.1]\), \(f_1=16\) and \(f_2=384\). The noise \(\epsilon \) is Gaussian with mean \(\mu =0\) and standard deviation \(\sigma =0.01\). The discrete cosine transform (DCT) is used to obtain the sparse representation of \(\text {x}\) (denoted \(\hat{\text {x}}\)). We then set \(M = 128\) and sample a random \(M\times N\) matrix \(\varPhi \) with i.i.d. Gaussian entries. We apply the \(L_{1}\), adaptive \(L_1\), \(L_{1/2}\) and our adaptive \(L_{1/2}\) regularizations to recover \(\hat{\text {x}}\), and then employ the inverse discrete cosine transform (IDCT) to obtain the reconstructed \(\text {x}\). We run each estimator for 100 trials.
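A minimal sketch of this experimental setup is given below; the scaling of the sensing matrix and the variable names are our illustrative choices.

```python
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(0)
N, M, f1, f2 = 512, 128, 16, 384
t = np.linspace(0.0, 0.1, N)
x = (np.sin(2 * np.pi * f1 * t) + np.cos(2 * np.pi * f2 * t)
     + 0.01 * rng.standard_normal(N))            # Gaussian noise, sigma = 0.01
s = dct(x, norm="ortho")                         # sparse DCT representation x_hat
Phi = rng.standard_normal((M, N)) / np.sqrt(M)   # random Gaussian sensing matrix
y = Phi @ x                                      # M compressed measurements
Psi = idct(np.eye(N), axis=0, norm="ortho")      # inverse-DCT basis, x = Psi @ s
A = Phi @ Psi                                    # recover s from y ~ A s, then x_re = Psi @ s_re
```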

We compare the recovery performance for both \(\hat{\text {x}}\) and \(\text {x}\) (the recovered versions are denoted \(\hat{\text {x}}_{\text {re}}\) and \(\text {x}_{\text {re}}\), respectively). For \(\hat{\text {x}}_{\text {re}}\), we measure performance in terms of sparseness, defined as the ratio of the number of nonzero coefficients in \(\hat{\text {x}}_{\text {re}}\) to the signal length, and for \(\text {x}_{\text {re}}\) we consider the relative error \(||\text {x}_{\text {re}} - \text {x}||_2 / ||\text {x}||_2\). Tables 3 and 4 show that, in terms of accuracy (relative error), our adaptive \(L_{1/2}\) performs slightly worse than the adaptive \(L_1\) but better than the others, while yielding the sparsest solution.
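These two performance measures can be computed as in the brief sketch below (the zero-threshold tol is our choice):

```python
import numpy as np

def sparseness(s_re, tol=1e-8):
    """Fraction of (numerically) nonzero DCT coefficients in the recovery."""
    return np.mean(np.abs(s_re) > tol)

def relative_error(x_re, x):
    """Relative 2-norm error of the reconstructed signal."""
    return np.linalg.norm(x_re - x) / np.linalg.norm(x)
```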

Table 3. Averaged sparseness of recovered \(\hat{\text {x}}\) (with standard deviation(SD))
Table 4. Averaged relative error under 2-norm of recovered \(\text {x}\) (with SD)

4.3 Shape Reconstruction

We use the example proposed in [24] to reconstruct an image from a set of parallel projections acquired along different angles. Similar patterns are commonly seen in computed tomography (CT) data. Without prior knowledge of the sample, the number of projections required to reconstruct the image is of order O(N) (in pixels). Here we consider a sparse image containing basic shapes, where only the boundaries of the objects have nonzero values. Such images are artificially generated but still correspond to real-life applications, including the monitoring of cellular material.

The sparse image we use here is of size \(128\times 128\), and Gaussian noise with standard deviation \(\sigma =0.2\) is added (Fig. 1(a)). For reconstruction, we flatten it into a \(128\times 128=16384\)-dimensional vector. The reconstruction results using the Lasso and our adaptive \(L_{1/2}\), with N / 7 and N / 10 of the pixels sampled, are shown in Fig. 1.

Both estimators recover the original image with visually high accuracy when N / 7 of the pixels are sampled, while the adaptive \(L_{1/2}\) regularizer performs better numerically, as shown in Table 5. When the sampling ratio drops to N / 10, the Lasso starts to fail (the shapes break down), whereas our adaptive \(L_{1/2}\) still gives a highly accurate reconstruction (Fig. 1(b)).

Fig. 1. The noisy sparse test image and the reconstruction results of the Lasso and the adaptive \(L_{1/2}\) with N / 7 and N / 10 of the pixels sampled.

We use the Structural Similarity Index Measure (SSIM) [25, 26] to compare the reconstruction performance of the four estimators. The SSIM index can be viewed as a quality measure of one of the images being compared, provided the other image is regarded as being of perfect quality; it is more consistent with human visual perception than traditional indices such as the peak signal-to-noise ratio (PSNR) and the mean squared error (MSE) [27].
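A minimal sketch of this comparison using scikit-image's SSIM implementation is shown below; the exact toolchain used to produce Table 5 is not specified in the text, so this is illustrative only.

```python
from skimage.metrics import structural_similarity

def ssim_score(reconstruction, reference):
    """SSIM of a reconstruction against the noise-free reference image."""
    data_range = float(reference.max() - reference.min())
    return structural_similarity(reconstruction, reference,
                                 data_range=data_range)
```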

Table 5 shows that the adaptive \(L_{1/2}\) regularizer achieves the best reconstruction performance at both sampling ratios, N / 7 and N / 10.

Table 5. Comparison of reconstruction performance measured by SSIM among the Lasso, adaptive \(L_1\), \(L_{1/2}\) and adaptive \(L_{1/2}\).

5 Concluding Remarks

We have studied a specific framework of adaptive \(L_p\text { }(0<p<1)\) regularization, aiming at better performance in the estimation of sparse problems. We have shown that the adaptive \(L_p\) regularized estimators possess the oracle property even when \(l_n\gg n\). We have also proposed a fast and efficient algorithm to solve the adaptive \(L_p\) regularization problem. Our results offer new insights into \(L_p\) (\(0<p<1\))-type methods and reveal their potential in diverse applications of compressed sensing.