Abstract
We study the nonasymptotic properties of a general norm-penalized estimator, which includes the Lasso, weighted Lasso, and group Lasso as special cases, for sparse high-dimensional misspecified Cox models with time-dependent covariates. Under suitable conditions on the true regression coefficients and random covariates, we provide oracle inequalities for the prediction and estimation errors based on the group sparsity of the true coefficient vector. The nonasymptotic oracle inequalities show that the penalized estimator gives a good sparse approximation of the true model and enables the selection of a few meaningful structural variables among the set of features.
1 Introduction
In recent years, high-throughput and nonparametric complex data have been frequently collected in gene biology, signal processing, neuroscience, and other scientific fields. With massive data in regression problems, we encounter the situation that both the number of covariates p and the sample size n are increasing, and p is a function of n, i.e., \(p=:p(n)\). The curse of dimensionality, together with computational complexity, forces us to perform variable selection, since the true regression coefficients \(\beta^{*}\) are often sparse with few nonzero components. Thus only a subset of the variables is preferable as important features. The sparse set of nonzero coordinates in \(\beta^{*}\) also serves to choose the best model. A popular approach is to penalize the log-likelihood by adding a penalty function, which intuitively leads to choosing a sparse model. One popular method is the Lasso (least absolute shrinkage and selection operator), which was introduced in Tibshirani [23] as a modification of the least squares method in linear models. With the development of data science, high-dimensional statistics, including various regularization methods (such as the group Lasso and weighted Lasso), has sprung up through statisticians’ efforts for over two decades.
Ever since the Lasso methodology for linear models, the study of various penalty functions (from data-independent to data-driven penalties) and loss functions (from smooth to nonsmooth, from Lipschitz to non-Lipschitz) has remained an active topic in high-dimensional statistics, even though Lasso regularization has been thoroughly analyzed. In many practical applications, however, predictors may have group structures. Yuan and Lin [29] study the problem of selecting grouped variables for accurate prediction in linear regressions, and their proposed group Lasso is an extension of the Lasso for the purpose of estimation accuracy. When considering variable selection in Cox models, massive data sets bring researchers unprecedented computational challenges, see Tibshirani [24]. Fan and Li [9] study the SCAD penalized partial likelihood approach for Cox models, and the proposed estimator enjoys the oracle property if a proper regularization parameter is chosen. Zhang and Lu [34] consider different penalties for different coefficients (the adaptive Lasso), and their idea is that “unimportant variables receive larger penalties than important ones so that important variables tend to be retained in the selection process, whereas unimportant variables are more likely to be dropped”. Theoretical properties, including consistency and the rate of convergence of this adaptive Lasso estimator, are also shown by Zhang and Lu [34] when the number of covariates is fixed.
A potential characterization, which appears in large-scale gene data associated with survival time, is that we only have a few (maybe several) significant predictors among p (maybe thousands of) covariates, and apparently \(p \gg n\). For example, the survival of patients with diffuse large-B-cell lymphoma (DLBCL) after chemotherapy is affected by molecular features of the tumors, which are measured by high-dimensional microarray gene expression. Rosenwald et al. [20] adopt Cox models to identify individual genes whose expression correlates with the outcome, and the data contain \(n = 240\) patients and \(p = 7399\) gene expression levels associated with a good or an adverse outcome. The main challenge is that directly utilizing low-dimensional (classical and traditional) statistical inference and computing methods for these data is prohibitive. Fortunately, the regularized partial likelihood method can perform parameter estimation and variable selection to enhance the prediction accuracy and interpretability of Cox models.
It is a fact that the Lasso estimator is not asymptotically normal; an accurate limiting distribution of the Lasso estimator is hard to derive and does not have an explicit form, see Knight and Fu [15]. To avoid this trouble, a popular approach is to derive nonasymptotic oracle inequalities based on some regularity conditions. As early as 2004, oracle inequalities for prediction error were derived without sparsity or restricted eigenvalue conditions for Lasso-type estimators [see Greenshtein and Ritov [10], Bartlett et al. [3]].
In the classical consistency analysis, the model size p is fixed and the sample size n goes to infinity. In contrast, in high-dimensional statistical consistency analysis we need nonasymptotic error bounds when both the model size p and the sample size n go to infinity.
Let \({\beta ^{*}}\) be the true regression coefficient behind the regression data \(\{{ X}_{i}, Y_{i}\} _{i = 1}^{n}\), where \({X}_{i}\) is a p-dimensional covariate vector and \(Y_{i} \in \mathbb{R}\) is the response. A modern problem, which will be the focus of this paper, is the behavior of β̂ when its dimension grows with the number of samples. There are two types of statistical guarantees of a penalized estimate that are of interest in this setting (as mentioned by Bartlett et al. [3]):
1. Prediction error (persistence): β̂ performs well on future samples
$$ \bigl(\text{i.e., } {\mathrm{{E}}} {\bigl[{X}\bigl( \hat{\beta }- {\beta ^{*}} \bigr)\bigr]^{2}}\text{ (or its empirical version) is small, called persistence}\bigr). $$
2. \(\ell _{1}\)-estimation error: β̂ approximates some “true” parameter \(\beta ^{*}\)
$$ \bigl(\text{i.e., } \bigl\Vert \hat{\beta }- {\beta ^{*}} \bigr\Vert _{1}\text{ is small with high probability}\bigr). $$
The two types of statistical guarantees can be obtained from the following error bounds (so-called oracle inequalities)
where \({\lambda _{n}}\to 0\) is a tuning parameter and \(s:=\|\beta ^{*}\|_{0}\).
Deriving oracle inequalities is a powerful mathematical technique that provides deep insight into the nonasymptotic fluctuation of an estimator around the ideal unknown parameter (called an oracle). Under linear models with group sparsity of covariates, Lounici et al. [18] show oracle inequalities for the estimation error (in terms of a mixed \((2,p)\)-norm) and the prediction error (for fixed design). Blazere et al. [5] study the properties of the group Lasso estimator in sparse high-dimensional generalized linear models (GLMs) with group sparsity of the covariates and derive oracle inequalities for the prediction and estimation errors. Structured sparsity has recently attracted attention in high-dimensional data. Zhou et al. [36] focus on oracle inequalities for GLMs with overlapping group structures. There have been considerable developments in oracle inequalities beyond linear models and GLMs. Lemler [17] introduces a data-driven weighted Lasso to estimate Cox models by approximating the intensity (without using partial likelihood), and oracle inequalities in terms of an appropriate empirical Kullback–Leibler divergence are obtained. Focusing on misspecified Cox models with their partial likelihood, Kong and Nan [16] derive nonasymptotic oracle inequalities for the weighted Lasso penalized negative log partial likelihood. Similar results have been proposed for Cox models with time-dependent covariates, see Huang et al. [13], who use a martingale analysis of the KKT conditions. Honda and Hardle [11] consider group SCAD-type and adaptive group Lasso estimators for variable selection in Cox models with varying coefficients, and an \(L_{2}\) convergence rate is obtained in the increasing-dimension setting \(p/n \to 0\).
Contributions:
- The existing work on weighted group Lasso penalized Cox models pays little attention to theoretical results. Yan and Huang [28] propose a weighted group Lasso method that selects important time-dependent variables with a group structure. We establish oracle inequalities for the prediction and estimation errors under random design, which differs from Huang et al. [13] and Kong and Nan [16] (neither considers random design or prediction error).
- Huang et al. [13] do not give a clear definition of the true coefficient; our true coefficient in the oracle inequalities is defined as the minimizer of the expected loss function, which makes it applicable to misspecified Cox models.
- We provide unified nonasymptotic results in terms of oracle inequalities for the prediction and estimation errors, giving a theoretical justification for the consistency of the weighted group Lasso estimator in Cox models with time-dependent covariates and random design.
The remaining sections are organized as follows. Section 2 gives a brief review of Cox models. Section 3 presents the weighted group Lasso penalty for misspecified Cox models. Section 4 shows the oracle inequalities for prediction and estimation for the weighted group Lasso penalized partial likelihood in misspecified Cox models, while detailed proofs are included in Sect. 5.
2 A brief review of Cox models
The celebrated Cox models have provided a tremendously successful tool for exploring the association of covariates with failure time and survival distributions. In order to match the drop-out situation in clinical trials, we consider that the continuous survival time \(T_{i}^{*}\) is subject to random right censoring. For subject i, let \(T_{i}: = {T_{i}}^{*} \wedge {C_{i}}\) be the observed survival time, which is right-censored by \({C_{i}}\), and let the censoring indicator be denoted by \({\Delta _{i}} = 1({T_{i}}^{*} \le {C_{i}})\). Let \(\{z_{i}(t)\}_{i=1}^{n}\) be the p-dimensional time-dependent covariates, where \({z_{i}}(t): = ({z_{i1}}(t), \ldots ,{z_{ip}}(t))^{\tau }\). Here we assume that the censoring is noninformative. The time-dependent covariates may degenerate to time-independent covariates, i.e., \({z_{ik}}(t)\equiv {z_{ik}}\) for some index k. For example, the CD4 count (related to a longitudinal process) is time-dependent. The time-independent covariates are baseline covariates, which include age, sex, treatment indicator, and so on.
Suppose that we observe n independent and identically distributed (i.i.d.) data
which is sampling from the random population \((T,\Delta ,{\{ z(t)\} _{0 \le t \le \tau }})\).
Let \(S(t|{\mathcal{Z}}) = P ( {T > t|{\mathcal{Z}}} )\) be the conditional survival function, where \({\mathcal{Z}}\) is the sigma algebra generated by some covariate variables. The relation of conditional distribution function and \(S(t|{\mathcal{Z}})\) is \(F(t|{\mathcal{Z}}) = P ( {T \le t|{\mathcal{Z}}} ) = 1 - S(t|{\mathcal{Z}})\). Denote \(f(t|{\mathcal{Z}}) = \frac{d}{d t} F(t|{\mathcal{Z}})\) as the conditional probability density function. Different from the linear model for modeling conditional mean or the quantile regression for modeling conditional quantiles, the Cox models (also called proportional hazards regression or Cox regressions) aim to model the conditional hazard rate defined by
The \(h (t|{\mathcal{Z}})\) is the conditional hazard rate at time t given survival until time t or later (i.e., \(T \ge t\)). From (2.2), \(S(t|{\mathcal{Z}})\) can be represented as the exponential of the negative cumulative hazard function defined by \(H(t|{\mathcal{Z}}) = \int _{0}^{t} h(s|{\mathcal{Z}})\,ds\), i.e., \(S(t|{\mathcal{Z}}) = \exp \{ { - \int _{0}^{t} h(s|{\mathcal{Z}})\,\mathrm{d}s} \} \equiv {e^{ -H(t|{\mathcal{Z}})}}\).
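The hazard–survival relation \(S(t) = e^{-H(t)}\) can be checked numerically. The following is a minimal Python sketch (function names are ours, not from any survival library), assuming a constant hazard so that the survival time is exponential:

```python
import math

# Numeric check of S(t) = exp(-H(t)) with H(t) = int_0^t h(s) ds.
# For a constant hazard h(s) = h0, T is exponential with rate h0.
def cumulative_hazard(h, t, n_steps=10_000):
    """Approximate H(t) by a midpoint Riemann sum."""
    ds = t / n_steps
    return sum(h((k + 0.5) * ds) * ds for k in range(n_steps))

h0 = 0.5
H = cumulative_hazard(lambda s: h0, t=2.0)   # H(2) = 0.5 * 2 = 1.0
S = math.exp(-H)                             # S(2) = exp(-1)
print(H, S)
```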
Having obtained the covariates \(\{z_{i}(t)\}_{i=1}^{n}\), our aim is to model the conditional hazard function of survival time \(\{T_{i}\}_{i=1}^{n} \) in a finite time interval \([0,\tau ]\) by the following semi-parametric regressions:
where \(h_{0}(t)\) is an unknown baseline hazard function, and \(\beta ^{*} \in \mathbb{R}^{p}\) is an unknown parameter which needs to be estimated.
By profiling out the term \(h_{0}(t)\), Cox [6] suggests that the inference on \(\beta ^{*}\) be based on the random likelihood function
where \(R_{i}= \{ j : T_{j} \geq T_{i} \} \) is the risk set (set of individuals whose survival times are greater than \(T_{i}\)). In a later paper, Cox [7] strictly derives the so-called partial likelihood function.
Suppose that the observed time is a continuous variable, and there is no tie in the observation time. The joint likelihood for the i.i.d. data (2.1) can be written as follows:
which contains the unknown \({h_{0}}(\cdot )\).
The key to deriving (2.4) is by specifying a reasonable estimator \(\hat{h}_{0}(\cdot )\) for \(h_{0}(\cdot )\) in (2.3). Assume that \(h_{0}(\cdot )\) is discrete with mass \({{h_{0}}} ({T_{(1)}}), \ldots ,{{h_{0}}} ({T_{(k)}})\) at the ordered observed survival time \(T_{(1)}< \cdots < T_{(k)}\). Denote \(\{z_{(o)}{({T_{(o)}})}:o=1, \ldots , k\}\) as the k covariates corresponding to the ordered observed survival times \({T_{(o)}}\). The baseline cumulative hazard function \(H_{0}(t)\) is modeled non-parametrically as the step function \({H_{0}}(t) = \sum_{o = 1}^{k} {{h_{0}}} ({T_{(o)}})I({T_{(o)}} \le t)\), and hence \(\sum_{i = 1}^{n}\int _{0}^{{T_{i}}} {{h_{0}}(s){e^{z_{i}^{\tau }(s)\beta }}} \,\mathrm{d}s=\sum_{i = 1}^{n} {\sum_{o = 1}^{k} {{h_{0}}} ({T_{(o)}})I({T_{(o)}} \le T_{i}){e^{z_{i}^{\tau }({T_{(o)}}) \beta }}} \).
From (2.5), the joint log-likelihood function is expressed as follows:
where \({ \{ {j:{T_{j}} \ge {T_{(o)}}} \} }\) denotes the set of individual js who are “at risk” for failure at time \({T_{(o)}}\).
Taking derivatives of \(\log {L_{n}}(\beta ;T,z,\Delta )\) with respect to \({h_{0}}({T_{(o)}}), o=1, \ldots , k\), we get
which is also called Breslow’s estimator for the baseline hazard function.
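For time-independent covariates and no ties, the resulting estimator takes the standard form \(\hat{h}_{0}(T_{(o)}) = 1/\sum_{j: T_{j} \ge T_{(o)}} e^{z_{j}^{\tau }\beta }\); since display (2.6) is not reproduced in this extract, the following Python sketch of that standard no-ties form is given only as an illustration (all names are ours):

```python
import math

# Hedged sketch of Breslow's estimator, no ties, time-independent covariates:
# the baseline hazard mass at each uncensored event time T_(o) is
# 1 / sum_{j: T_j >= T_(o)} exp(z_j' beta).
def breslow(times, events, Z, beta):
    # risk score exp(z_j' beta) for each subject
    risks = [math.exp(sum(zj * bj for zj, bj in zip(z, beta))) for z in Z]
    h0 = {}
    for t, d in zip(times, events):
        if d == 1:  # hazard mass only at uncensored event times
            denom = sum(r for tj, r in zip(times, risks) if tj >= t)
            h0[t] = 1.0 / denom
    return h0

times = [1.0, 2.0, 3.0]
events = [1, 1, 0]          # third subject censored
Z = [[0.0], [0.0], [0.0]]   # null covariates, so all risk scores equal 1
print(breslow(times, events, Z, [0.0]))
```

With null covariates and β = 0, the mass is the reciprocal of the at-risk count: 1/3 at \(T_{(1)}=1\) and 1/2 at \(T_{(2)}=2\); the censored time receives no mass.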
Plugging \({{\hat{h}}_{0}} ( {T_{(o)}} )\) into (2.6), we have
which gives (2.4).
Following the counting process framework in Andersen and Gill [2], let \(N_{i}(t)=1 (T_{i} \leq t, \Delta _{i}=1 )\) be the counting process, and denote \(Y_{i}(t)=: 1 (T_{i} \geq t )\) to be the at-risk process for subject i. The σ-filtration is defined by \({{\mathcal{F}}_{t}} = \sigma \{ {N_{i}}(s),{Y_{i}}(s),{z_{i}}(s),s \le t,i = 1, \ldots ,n\}\), which represents the information that occurs up to time t. Let \(\mathrm{d} N_{i}(s):=1 \{T_{i} \in [s, s+\mathrm{d}s], \Delta _{i}=1 \}\). The negative log-partial-likelihood (2.4) for data (2.1) is rewritten as follows:
where \(R_{n}(u, \beta )=\frac{1}{n}\sum_{j=1}^{n}1 (T_{j} \geq u ) \exp \{ {z_{j}^{\tau }(u)}\beta \} \) is the empirical relative risk function.
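Since display (2.7) itself is not reproduced in this extract, the following Python sketch assumes the averaged form \(\ell_{n}(\beta) = -\frac{1}{n}\sum_{i}\Delta_{i}\{z_{i}^{\tau }\beta - \log R_{n}(T_{i},\beta)\}\), with time-independent covariates and no ties; all names are illustrative:

```python
import math

# Hedged sketch of the averaged negative log partial likelihood, using the
# empirical relative risk R_n(u, beta) = (1/n) sum_j 1(T_j >= u) exp(z_j' beta)
# exactly as defined in the text; the 1/n normalization of the loss itself
# is our assumption about the form of (2.7).
def neg_log_partial_lik(times, events, Z, beta):
    n = len(times)
    def lin(z):  # linear predictor z' beta
        return sum(zj * bj for zj, bj in zip(z, beta))
    def R_n(u):
        return sum(math.exp(lin(z)) for t, z in zip(times, Z) if t >= u) / n
    return -sum(d * (lin(z) - math.log(R_n(t)))
                for t, d, z in zip(times, events, Z)) / n

val = neg_log_partial_lik([1.0, 2.0, 3.0], [1, 1, 1], [[0.0]] * 3, [0.0])
print(val)  # -log(4.5)/3 ≈ -0.5014 at beta = 0 for these data
```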
The negative log-partial likelihood function (2.7), as the summands are neither independent nor Lipschitz, can be approximated by the following intermediate empirical loss function:
with expected relative risk function defined by \(R(t, \beta )={\mathrm{{E}}}[1(T \geq t) \exp \{ {z^{\tau }(t)}\beta \} ]\).
We define the loss function by \(l(\beta ;T,z,\Delta ): = - [{z^{\tau }}(T)\beta - \log R(T,\beta )] \Delta \).
Let \(\overline{N}(t):=\sum_{i=1}^{n} N_{i}(t)\). The gradient of \({\ell _{n}}(\beta ;T,z,\Delta )\) can be written as
where \({{\bar{z}}_{n}}(u,\beta ) = \frac{1}{n}\sum_{j = 1}^{n} {\frac{{{Y_{j}}(u){{\mathrm{{e}}}^{z_{j}^{\tau }(u){\beta }}}}}{{{R_{n}} ( {u,{\beta }} )}}} {z_{j}}(u)\) is the random weighted sum of covariates.
The \(\nabla \ell _{n}(\beta ;T,z,\Delta )\) is called the score process; evaluated at the true parameter, it is a martingale adapted to the filtration \(\mathcal{F}_{t}\). Furthermore, the Hessian matrix of \(\ell _{n}(\beta ;T,z,\Delta )\) is
where \({V_{n}}(u, \beta ) = \frac{1}{n}\sum_{i = 1}^{n} {\frac{{{Y_{i}}(u){{\mathrm{{e}}}^{z_{i}^{\tau }(u){\beta }}}}}{{{R_{n}} ( {u,{\beta }} )}}} [z_{i}(u)-\overline{z}_{n}(u,\beta )][z_{i}(u)-\overline{z}_{n}(u, \beta )]^{\tau }\) is the random weighted sample covariance matrix. Readers can refer to Andersen et al. [1] for the technical details required to make the counting process framework rigorous.
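As an illustration that \(\bar{z}_{n}(u,\beta )\) is a convex combination of the at-risk covariate vectors (weights \(Y_{j}(u)e^{z_{j}^{\tau }\beta }/(nR_{n}(u,\beta ))\) summing to one over the risk set), here is a hedged Python sketch with time-independent covariates; names and data are illustrative only:

```python
import math

# Sketch of bar{z}_n(u, beta): exponentially weighted average of the
# covariates of subjects still at risk at time u.
def zbar_n(u, times, Z, beta):
    lin = [sum(zj * bj for zj, bj in zip(z, beta)) for z in Z]
    w = [math.exp(l) if t >= u else 0.0 for t, l in zip(times, lin)]
    total = sum(w)  # equals n * R_n(u, beta)
    p = len(beta)
    return [sum(wi * z[k] for wi, z in zip(w, Z)) / total for k in range(p)]

times = [1.0, 2.0, 3.0]
Z = [[0.0], [1.0], [2.0]]
# With beta = 0, all at-risk weights are equal, so bar{z}_n(u, 0) is the
# plain average of at-risk covariates: at u = 2, the average of 1.0 and 2.0.
print(zbar_n(2.0, times, Z, [0.0]))
```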
3 Weighted group Lasso for misspecified Cox models
In this section, we present the concepts and mathematical notation for penalized misspecified Cox models with a group structure.
Many high-dimensional variables in microarray data and other scientific applications have a natural group structure. It is better to divide the p variables into small sets of variables based on biological knowledge, see Kanehisa and Goto [14], Wang et al. [27]. Suppose that the p-dimensional covariate X is divided into \(G_{n}\) groups, the gth of size \(d_{g}\) for \(g \in \lbrace 1,\ldots,G_{n} \rbrace \),
where \(X^{g}_{i}=(X_{i,1}^{g},\ldots,X_{i,d_{g}}^{g})^{T}\) and \(\sum_{g=1}^{G_{n}}d_{g}=p\).
It is allowed that the number of groups increases with the sample size n and \(G_{n}\gg n\). We define the two quantities
which are crucial constants in the theoretical analysis.
For \(\beta \in \mathbb{R}^{p}\), let \(\beta ^{g}\) be the sub-vector of β whose indexes correspond to the index set of the gth group of X. Given a proper tuning parameter λ, we are interested in weighted group Lasso estimator which achieves group sparsity. It is obtained as the solution of the convex optimization problem:
where \(\Vert \cdot \Vert _{2}\) refers to the Euclidean norm and \({w_{g}}\) is a given weight.
If all \(d_{g}\) are of size one and \({w_{g}}=1\), then \(\sum_{g=1}^{G_{n}}{w_{g}}\| \beta ^{g}\|_{2}\) reduces to \(\| \beta \|_{1}\), which is essentially a Lasso problem. If all \(d_{g}\) are of size one and \(\{ {w_{j}}\} _{j = 1}^{p}\) are data-dependent weights (the weights only depend on the observed data), let \(W = \operatorname{diag}\{ {w_{1}}, \ldots ,{w_{p}}\} \); then the weighted group Lasso penalty \(\sum_{g=1}^{G_{n}}{w_{g}}\| \beta ^{g}\|_{2}\) becomes the weighted Lasso penalty \(\| {W\beta } \|_{1}\). Increasing λ leads to shrinkage of the \(\beta ^{g}\) toward zero, which indicates that some blocks of β diminish to zero simultaneously, and groups of predictors are eliminated from the model. Typically in the literature, one chooses \({w_{g}}:=\sqrt{ d_{g}}\) to penalize groups of large size more heavily. For the adaptive group Lasso in Cox models, Yan and Huang [28] use \({w_{g}}=\sqrt{d_{g}} /\|\tilde{\beta }^{g}\|\), where \(d_{g}\) is the size of group g and \(\tilde{\beta }^{g}\) is some consistent estimator of \(\beta ^{g}\).
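These reductions are easy to verify numerically. A minimal Python sketch of the penalty \(\sum_{g=1}^{G_{n}} w_{g}\|\beta ^{g}\|_{2}\) follows (function and variable names are ours, not from the paper):

```python
import math

# Weighted group Lasso penalty sum_g w_g ||beta^g||_2.
# "groups" is a list of index lists partitioning {0, ..., p-1};
# the default w_g = sqrt(d_g) is the common choice noted in the text.
def group_penalty(beta, groups, weights=None):
    if weights is None:
        weights = [math.sqrt(len(g)) for g in groups]
    return sum(w * math.sqrt(sum(beta[j] ** 2 for j in g))
               for w, g in zip(weights, groups))

beta = [3.0, -4.0, 2.0]
# One group of size 2 and one singleton, unit weights:
print(group_penalty(beta, [[0, 1], [2]], weights=[1, 1]))      # 5.0 + 2.0 = 7.0
# Singleton groups with unit weights recover the Lasso penalty ||beta||_1:
print(group_penalty(beta, [[0], [1], [2]], weights=[1, 1, 1])) # 9.0
```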
Taking the subdifferential of the objective function (3.1), we get the first order condition:
(It is also called the Karush–Kuhn–Tucker (KKT) condition, see Sect. 2.2 of Huang et al. [13] for the ungrouped version.) From the adaptive estimation point of view, the weights in equation (3.1) can be determined from the observed data so that the KKT conditions (3.2) hold with high probability, for example, \(1-p^{r}, r<0\). Applying concentration inequalities for martingales, the data-driven weights \(\{ {w_{j}}\} _{j = 1}^{p}\) are obtained from the KKT conditions with high probability, see Huang et al. [12] and the references therein. The aim of this work is to derive nonasymptotic oracle inequalities from a mathematical point of view. The choice of optimal adaptive weights and statistical inference (confidence intervals, testing the coefficients, FDR control) is left for future studies.
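The group-elimination mechanism behind conditions of type (3.2) can be illustrated by the blockwise soft-thresholding (proximal) operator of \(\lambda w_{g}\|\cdot \|_{2}\). The displayed KKT equation is not reproduced in this extract, so the following standard group-Lasso proximal operator is offered only as a hedged sketch:

```python
import math

# Proximal operator of thresh * ||.||_2 applied to a group vector v:
#   prox(v) = (1 - thresh / ||v||_2)_+ * v,
# so an entire group is set to zero whenever ||v||_2 <= thresh.
def block_soft_threshold(v, thresh):
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= thresh:
        return [0.0] * len(v)   # whole group eliminated
    return [(1.0 - thresh / norm) * x for x in v]

print(block_soft_threshold([3.0, 4.0], 10.0))  # [0.0, 0.0]: group dropped
print(block_soft_threshold([3.0, 4.0], 2.5))   # [1.5, 2.0]: group shrunk
```

This is exactly how increasing λ drives whole blocks of β to zero simultaneously, as described above.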
In the high-dimensional setting, we study the estimation and prediction oracle inequalities for the weighted group Lasso even when the number of groups is much larger than the sample size, i.e., \(G_{n}\gg n\). Define \(H^{*}= \lbrace g: \beta _{g}^{*}\neq 0 \rbrace \) as the group index set corresponding to the nonzero sub-vectors of \(\beta ^{*}\).
Let \(X_{1}, \ldots , X_{n}\) be a random sample from a measure \(\mathbb{P}\) on a measurable space \((\mathcal{X},\mathcal{A})\). We denote the empirical distribution as a discrete uniform measure \(\mathbb{P}_{n}=n^{-1} \sum_{i=1}^{n} \delta _{X_{i}}\), where \(\delta _{x}\) is the probability distribution that is degenerate at x.
The expected loss function is defined by
Corresponding to the form of estimator, the true parameter of the misspecified Cox models is the minimizer of the expected loss function
where \(R(t, \beta )={\mathrm{{E}}}[1(T \geq t) \exp \{ {z^{\tau }(t)}\beta \} ]\).
Definition (3.3) was first studied in Struthers and Kalbfleisch [21], where the true parameter is characterized as the solution of an estimating equation, as mentioned in the proof of Lemma 3.1 in Andersen and Gill [2].
Here, the expectation of the random variables in the model is unknown, and thus so is \({\beta ^{*}}\). By solving the optimization problem in (3.3), \({\beta ^{*}}\) satisfies
In order to get the unique solution in (3.4), we require that the Hessian matrix for expected loss function
is positive definite.
We aim to estimate sparse \(\beta ^{*}\) and to predict the hazard function \(h ( t|{z_{i}}(t) )\) conditionally on a given process \({z_{i}}(t)\). To facilitate the technical proof, additional assumptions are required.
- (H.1): The covariates \(\{z_{i j}(t)\}\) are almost surely bounded by a positive constant L, i.e.,
$$ \sup_{0 \le t \le \tau } \max_{1 \le i \le n,1 \le j \le p } \bigl\vert z_{i j}(t) \bigr\vert \le L,\quad \mbox{a.s.} $$
- (H.2): Assume that the parameter space is compact, i.e., \(\|\beta ^{*} \|_{1} \le B\), where B is a positive constant.
- (H.3): There exists a large constant M such that β̂ lies in the weighted \(\ell _{2}\)-ball
$$ {{\mathcal{S}}_{M}}\bigl(\beta ^{*}\bigr):= \Biggl\{ { \beta \in {\mathbb{R}^{p}}:{\sum_{g = 1}^{G_{n}} {{w_{g}} \bigl\Vert {\beta ^{g}}- \beta ^{*g} \bigr\Vert _{2}}} \le {M}} \Biggr\} . $$
- (H.4): Under \(\Delta =1\), there exist constants \({c_{l}}> 0\) and \({c_{u}}<\infty \) such that \(\ddot{l}(\beta ;t,z,\Delta )\) is uniformly positive definite for all \({\beta \in {{\mathcal{S}}_{M}}(\beta ^{*})}\):
$$ {c_{u}} {z(t)} {z^{{\tau }}(t)} \succ {\mathrm{{E}}}\bigl[ \ddot{l}(\beta ;T,z, \Delta )|z(t)\bigr]\succ {c_{l}} {z(t)} {z^{{\tau }}(t)}\quad \mbox{a.s.} $$
(H.1) and (H.2) are standard assumptions in deriving consistency for regularized GLMs, see Blazere et al. [5], Zhang and Wu [33]. (H.2) is also used in Zhao et al. [35] for increasing-dimensional Cox models with interval-censored data. (H.3) has been addressed by Kong and Nan [16]. (H.4) ensures that the objective function for the minimizer of the population expected loss is strongly convex; similar assumptions are used in Andersen and Gill [2], Fan and Li [9].
As mentioned by one reviewer, one often assumes that the data are generated from the model with some baseline hazard function and some true parameter \(\beta ^{*}\). In (3.3), the true parameter is defined as the minimizer of the expected loss function. We present this in detail, following Theorem 1 in Struthers and Kalbfleisch [21].
Lemma 3.1
(Consistency)
Let the expectation E be taken with respect to randomness of \(\{ (T_{i}, \Delta _{i}, z_{i}(t) )\}_{i=1}^{n}\) from the true model. Consider the following notations for \(r=0,1,2\):
where, for a column vector a, \(a^{\otimes 2}\) refers to the matrix \(a a^{T}\), \(a^{\otimes 1}\) refers to the vector a, and \(a^{\otimes 0}\) refers to the scalar 1. Consider the following conditions.
Condition 3.1
There exists a neighborhood \({\mathcal{S}}_{M}(\beta ^{*})\) of \(\beta ^{\ast }\) such that, for each \(t<\infty \),
Condition 3.2
(a) \(s^{ ( 0 ) } ( \beta ,x ) \) is bounded away from zero on \({\mathcal{S}}_{M}(\beta ^{*})\times [ 0,t ] \), and \(s^{ ( 0 ) } ( \beta ,x ) \) and \(s^{ ( 1 ) } ( \beta ,x ) \) are bounded on \({\mathcal{S}}_{M}(\beta ^{*})\times [ 0,t ] \); (b) for each \(t<\infty \), we have \(\int _{0}^{t}s^{ ( 2 ) } ( x ) \,dx< \infty \).
- When the data are generated from the correctly specified Cox models (2.3), under Conditions 3.1 and 3.2, the maximum partial likelihood estimator β̂ is a consistent estimator of \(\beta ^{\ast }\), where \(\beta ^{\ast }\) is the solution to the equation \(h ( \beta ) =0\) with
$$ h ( \beta ) := \int _{0}^{\infty }s^{ ( 1 ) } ( t ) \,dt- \int _{0}^{\infty } \frac{s^{ ( 1 ) } ( \beta ,t ) }{s^{ ( 0 ) } ( \beta ,t ) }s^{ ( 0 ) } ( t ) \,dt. $$
- When the model is misspecified, i.e., the true hazard function is \(h_{i} ( t )\ne {h_{0}}(t)e^{ z_{i}^{\tau }(t){\beta ^{*}}}\), if \(S^{(r)}(t)\) and \(s^{(r)}(t)\) are replaced by \(S_{m}^{(r)}(t):=n^{-1} \sum_{i=1}^{n} Y_{i}(t)h_{i} ( t ) z_{i}(t)^{\otimes r}\) and \(s_{m}^{(r)}(t):={\mathrm{E }}[S_{m}^{(r)}(t)]\) in Conditions 3.1 and 3.2, then the solution of the equation \(h_{m} ( \beta ) =0\) with
$$ h_{m} ( \beta ) := \int _{0}^{\infty }s_{m}^{ ( 1 ) } ( t ) \,dt- \int _{0}^{\infty } \frac{s^{ ( 1 ) } ( \beta ,t ) }{s^{ ( 0 ) } ( \beta ,t ) }s_{m}^{ ( 0 ) } ( t ) \,dt \quad (3.6) $$
is the pseudo-true parameter \(\beta ^{\ast }\).
Since \(d M_{i}(t):=d {N_{i}}(t)-{1 ( {{T_{i}} \ge t} )} {h_{0}}(t)e^{{z_{i}^{\tau }(t)}\beta ^{*}}\,dt\) is a mean-zero \({{\mathcal{F}}_{t}}\)-martingale by the theory in Andersen and Gill [2], comparing the empirical version (2.9) and the population version (3.4) with the limits \(s^{ ( 0 ) } ( t )\), \(s^{ ( 1 ) } ( t )\) and \({s^{ ( 0 ) } ( \beta ,t ) }\), \({s^{ ( 1 ) } ( \beta ,t ) }\), we see that (3.4) coincides with (3.6). Moreover, our assumptions (H.1)–(H.5) verify Conditions 3.1 and 3.2 without conflict if the uniform law of large numbers is applied, using the compactness of the parameter space and the boundedness of the covariates.
4 Oracle inequalities for estimation and prediction
As a powerful mathematical technique, oracle inequalities provide deep insight into the nonasymptotic fluctuation of an estimator around the unknown true parameter. A comprehensive theory of oracle inequalities in high-dimensional regression has been developed for the Lasso and its generalizations, see Chap. 7 of Wainwright [26].
4.1 Key of nonasymptotic analysis
In this section, we derive nonasymptotic oracle inequalities for weighted group Lasso estimates of Cox models under a restricted-eigenvalue-type assumption (such as the group stabil condition). The proof proceeds in several steps:
- Step 1: To avoid ill behavior of the Hessian, impose the restricted eigenvalue condition or an analogous condition on the design matrix.
- Step 2: Choose the tuning parameter based on a high-probability event (KKT conditions or other KKT-like conditions).
- Step 3: Using the restricted eigenvalue assumption and the chosen tuning parameter, derive the oracle inequalities via the optimality of the weighted group Lasso estimator, the minimizer of the unknown expected risk function, and some basic inequalities. There are three sub-steps:
  (i) Under the KKT-like conditions, show that the error vector \(\hat{\beta }- \beta ^{*}\) lies in a restricted set with structured sparsity, and moreover check that \(\hat{\beta }- \beta ^{*}\) lies in a large compact set;
  (ii) Show that the likelihood-based divergence between β̂ and \(\beta ^{*}\) can be lower bounded by a quadratic distance between β̂ and \(\beta ^{*}\);
  (iii) By some elementary inequalities and (ii), show that \({\sum_{g = 1}^{G_{n}} {{w_{g}}\| {\hat{\beta }_{n}^{g}}- \beta ^{*g}\|_{2}}}\) lies in a smaller compact set whose radius has the optimal rate (proportional to λ).
As mentioned by one reviewer, our general framework of proof is quite standard, but the consecutive steps of defining certain high-probability events rely on nontrivial new results. For simplicity, we introduce and use the notation of empirical process theory, see van der Vaart and Wellner [25].
Let \(X_{1}, \ldots , X_{n}\) be a random sample from a measure \(\mathbb{P}\) on a measurable space \((\mathcal{X},\mathcal{A})\). We denote the empirical distribution as a discrete uniform measure \(\mathbb{P}_{n}=n^{-1} \sum_{i=1}^{n} \delta _{X_{i}}\), where \(\delta _{x}\) is the probability distribution that degenerates at x.
Given a measurable function \(f : \mathcal{X} \mapsto \mathbb{R}\), we write \(\mathbb{P}_{n} f\) for the expectation of f under the empirical measure \(\mathbb{P}_{n}\), and Pf for the expectation under P. Thus
The \(\mathbb{P}_{n} f\), viewed as indexed by f, is called an empirical process. In fact, we treat \(\mathbb{P}_{n}\) and P as operators rather than measures.
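Concretely, the operator notation amounts to a sample average; a minimal Python sketch (names are illustrative):

```python
# P_n f: average of f over the sample; once the population mean P f is
# subtracted, (P_n - P) f is the centered empirical process.
def Pn(f, sample):
    return sum(f(x) for x in sample) / len(sample)

sample = [1.0, 2.0, 3.0, 4.0]
print(Pn(lambda x: x * x, sample))  # (1 + 4 + 9 + 16) / 4 = 7.5
```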
It follows from (2.8) and \(\mathbb{P}_{n} l(\beta ;T,z,\Delta ):=\tilde{\ell }_{n}(\beta ;T,z, \Delta )\) that
4.2 Define some events with high probability
Using the definition of \(\hat{\beta }_{n}\) in (3.1), we have
Hence we get
Then, by (4.1), the first and second terms on the right-hand side of (4.3) are
It implies
where
To obtain oracle inequalities for the weighted group Lasso applied to misspecified Cox models, it is necessary to study the rate of convergence of the empirical process \(( \mathbb{P}_{n}-\mathbb{P})( l(\beta ^{*};T,z,\Delta )-l( \hat{\beta }_{n};T,z,\Delta ))\) and of \({D_{n}}(\hat{\beta },{\beta ^{*}})\). The centralized empirical loss \(( \mathbb{P}_{n}-\mathbb{P})( l(\beta ^{*};T,z,\Delta )-l( \hat{\beta }_{n};T,z,\Delta ) )\) and the normalized error \({D_{n}}(\hat{\beta },{\beta ^{*}})\) represent the fluctuation between the expected loss and the sample loss. It will be shown that
have stochastic Lipschitz properties with respect to \({\sum_{g = 1}^{G_{n}} {{w_{g}}\|{\hat{\beta }_{n}^{g}}-\beta ^{*g} \|_{2}}}\).
The concentration inequalities are essential tools to obtain an upper bound of (4.4), which is proportional to a regularization parameter that ensures good statistical properties of the regularized estimator with high probability.
Define \(F(s,z)\) as the joint distribution of \((T_{i},z_{i}^{\tau }(t))\). Let \(\tilde{\beta }: = {({{\tilde{\beta }}_{1}}, \ldots ,{{\tilde{\beta }}_{p}})^{T}}\) with the components \(\{ {{\tilde{\beta }}_{j}}\} _{j = 1}^{p}\) between \(\{ {{\hat{\beta }}_{j}}\} _{j = 1}^{p}\) and \(\{ \beta _{j}^{*}\} _{j = 1}^{p}\), respectively, via first-order Taylor’s expansions of the function
with derivative
Plugging \(t=T_{i}\), we have componentwise Taylor’s expansion
Considering the first term in (4.4), we have
To get the stochastic Lipschitz properties, we define the following two events:
The random sum in event \({{\mathcal{A}}_{2}}\) does not consist of independent terms, which renders this problem more challenging. We need to check a uniform version of the event \({{\mathcal{A}}_{2}}\) in terms of β. Concentration inequalities for suprema of empirical processes are powerful tools for checking that event \({{\mathcal{A}}_{2}}\) holds with high probability. This will be derived from Talagrand’s sharper bounds for suprema of empirical processes, which generalize the Dvoretzky–Kiefer–Wolfowitz inequality, see Talagrand [22]. As with an index function class for the empirical distribution function, the boundedness assumption (H.1) on the components of \(z(t)\) guarantees the conditions for concentration of suprema of empirical processes.
Next, an upper bound is obtained for the centralized empirical process \(( \mathbb{P}_{n}-\mathbb{P}) [{l}(\beta ^{*}; T,z,\Delta )-l( \hat{\beta }_{n};T,z,\Delta )]\).
Proposition 4.1
Assume that (H.1)–(H.3) hold. Then the event \(\mathcal{A}={{\mathcal{A}}_{1}}\cap {{\mathcal{A}}_{2}}\) satisfies \(P(\mathcal{A})\ge 1-2d_{\mathrm{max}}(2G_{n})^{1-A^{2}}\). Moreover, the upper bound (4.6) holds with probability at least \(1-2d_{\mathrm{max}}(2G_{n})^{1-A^{2}}\),
where \(\lambda _{a}:=\lambda _{a1}+\lambda _{a2}\) with
This proposition states that the difference between the centralized empirical processes is bounded from above by the tuning parameter multiplied by the weighted group Lasso norm of the difference between the estimated parameter and the true parameter \(\beta ^{*}\).
For the normalized error \({D_{n}}(\beta ,{\beta ^{*}})\), set
where \({D_{n}}(\beta ,{\beta ^{*}}):=\frac{1}{n} [\sum_{i = 1}^{n} {\{ {\log \frac{{{R_{n}}({T_{i}},{\beta ^{*}})}}{{R({T_{i}},{\beta ^{*}})}}} \}} -\sum_{i = 1}^{n} {\{ {\log \frac{{{R_{n}}({T_{i}},{{\beta }})}}{{R({T_{i}},{{\beta }})}}}\}} ]{\Delta _{i}}\) and \({\lambda _{a2}}\) is a suitable tuning parameter.
Observe that
for a certain random variable \({t_{s}}\) taking values in the compact set \([0,\tau ]\).
By the first order Taylor’s expansion of the function \(g_{t_{s}}(\beta ): = \log ( \frac{1}{n}\sum_{i = 1}^{n} {\frac{{{1} ( {{T_{i}} \ge {t_{s}}} ){\mathrm{e} ^{z_{i}^{\tau }({t_{s}})\beta }}}}{{R({t_{s}},\beta )}}} )\), let the corresponding mean value \(\tilde{\beta }= {({{\tilde{\beta }}_{1}}, \ldots ,{{\tilde{\beta }}_{p}})^{T}}\) be between \(\beta _{j}^{*}\) and \(\beta _{j}\) for each \(j=1,2,\ldots ,p\). We have
From the following decomposition and inequality
which implies that
where the last inequality is from
by using assumptions (H.1)–(H.2).
If \(\hat{\beta }\in {{\mathcal{S}}_{M}}(\beta ^{*})\) for some finite M, then \(\tilde{\beta }\in {{\mathcal{S}}_{M}}(\beta ^{*})\) by
Note that summation (4.10) contains a common random variable \({t_{s}}\), which makes (4.10) a sum of dependent terms. In order to bound the quotient and the two centralized summations, we define three events \({{\mathcal{B}}_{0}}\), \({{\mathcal{B}}_{1}}\), \({{\mathcal{B}}_{2}}\), respectively:
and
To solve the problem, we need the concentration inequalities for the suprema of the empirical processes in \(\{ {\mathcal{B}}_{l}\} _{l = 0}^{2}\) uniformly in \(t\in [0,\tau ]\) and \(\beta \in {{\mathcal{S}}_{M}}(\beta ^{*})\), see Sect. 2.14 of van der Vaart and Wellner [25].
Let \({\mathcal{B}} = {{\mathcal{B}}_{0}} \cap {{\mathcal{B}}_{1}} \cap {{\mathcal{B}}_{2}}\). We aim to show that each event in \(\{{\mathcal{B}}_{l}\}_{l=0}^{2}\) holds with high probability; thus \({\mathcal{B}}\) is also a high-probability event by the basic inequality \(P({\mathcal{B}})\ge P({{\mathcal{B}}_{0}}) + P({{\mathcal{B}}_{1}}) + P({{\mathcal{B}}_{2}}) - 2\).
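This basic inequality is simply the union bound applied to the complements; spelled out:

```latex
P(\mathcal{B})
 = 1 - P\Bigl(\bigcup_{l=0}^{2}\mathcal{B}_{l}^{c}\Bigr)
 \ge 1 - \sum_{l=0}^{2} P\bigl(\mathcal{B}_{l}^{c}\bigr)
 = 1 - \sum_{l=0}^{2}\bigl(1 - P(\mathcal{B}_{l})\bigr)
 = P(\mathcal{B}_{0}) + P(\mathcal{B}_{1}) + P(\mathcal{B}_{2}) - 2 .
```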
Based on (4.10), we obtain the following local stochastic Lipschitz condition under the event \({\mathcal{B}}\):
where \({\lambda _{b}}\) can be viewed as the local stochastic Lipschitz constant.
The following proposition is a significant improvement of Corollary 2 in Kong and Nan [16], extending it from the Lasso to the group Lasso case and from the fixed design to the random design.
Proposition 4.2
Let \(p_{\tau }:={{P({T_{1}} \ge \tau )}}>0\), and \({D^{2}}(\sqrt {2})\) be a universal constant. Under (H.1)–(H.3) and some constant \(A^{2}>2\), we have \(P (\mathcal{B} ) \ge 1-2{e^{ - np_{\tau }^{2}/2}}- \frac{{{d_{{\mathrm{{max}}}}}{D^{2}}(\sqrt {2}){A^{2}}\log ({G_{n}})}}{{4n}} {G_{n}^{2 - {A^{2}}}}-\frac{{{D^{2}}(\sqrt {2}){A^{2}}\log p}}{{4n}}{p^{ - {A^{2}}}}\) with
Moreover, letting \(\lambda _{b}:=\lambda _{b1}+\lambda _{b2}\), we have
with probability at least \(1-2{e^{ - np_{\tau }^{2}/2}}- \frac{{{d_{{\mathrm{{max}}}}}{D^{2}}(\sqrt {2}){A^{2}}\log ({G_{n}})}}{{4n}}{G_{n}^{2 - {A^{2}}}}- \frac{{{D^{2}}(\sqrt {2}){A^{2}}\log p}}{{4n}}{p^{ - {A^{2}}}}\).
If the true model is sparse and \(\log p =o(n)\), then the two propositions above illustrate that \({P}( \mathcal{A}),{P}( \mathcal{B}) \to 1\) as \(p,n \to \infty \).
4.3 Sharp oracle inequalities from restricted eigenvalue conditions
In this section, we give sharp bounds for estimation and prediction errors for Cox models using a weaker condition similar to the restricted eigenvalue condition of Bickel et al. [4].
Consider linear models \(\{{\mathrm{{E}}} [Y_{i}|X_{i}]=X_{i}^{\tau }{\beta ^{*}}\}_{i=1}^{n}\) with random covariate vectors \(\{{X}_{i}\}_{i=1}^{n}\). The key condition for deriving oracle inequalities rests on the correlation structure of the covariates, i.e., on the behavior of the sample covariance matrix \({ \Sigma }_{n}=\frac{1}{n} \sum_{i = 1}^{n} {{{ {X}}_{i}}{{X}_{i}^{T}}}\), which is necessarily singular when \(p>n\). Let S be any subset of \(\{1,2,\ldots ,p\}\). The restricted eigenvalue (RE) condition for the \(p \times p\) matrix \({ \Sigma }_{n}\) is defined by
where \({\mathrm{{C}}}(\eta ,S)=\{ {b} \in {\mathbb{R}^{p}}:{\| {{{b}_{{S^{c}}}}} \|_{1}} \le \eta {\| {{{b}_{S}}} \|_{1}}\}\), \(\eta >0\).
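As a purely numerical illustration (not part of the paper’s argument), the sketch below estimates the restricted quantity in (4.13) by Monte Carlo sampling of directions in the cone \(C(\eta ,S)\). The dimensions, support set, and sampling scheme are all illustrative assumptions; sampling yields only an upper estimate of the true minimum over the cone.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                       # p > n, so Sigma_n is rank-deficient
X = rng.standard_normal((n, p))
Sigma_n = X.T @ X / n                # sample covariance matrix

S = np.arange(5)                     # hypothetical support set S, |S| = 5
eta = 1.0

def cone_direction():
    """Draw b in C(eta, S) = {b : ||b_{S^c}||_1 <= eta * ||b_S||_1},
    with the cone constraint made active (equality)."""
    b = np.zeros(p)
    b[S] = rng.standard_normal(S.size)
    off = rng.standard_normal(p - S.size)
    off *= eta * np.abs(b[S]).sum() / np.abs(off).sum()
    b[S.size:] = off
    return b

# Monte Carlo estimate of min over the cone of b' Sigma_n b / ||b_S||_2^2
ratios = []
for _ in range(2000):
    b = cone_direction()
    ratios.append(b @ Sigma_n @ b / (b[S] @ b[S]))
re_sq_hat = min(ratios)
print(re_sq_hat)    # positive for these sampled directions despite Sigma_n being singular
```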
It should be noted that if we drop the restricted set \({\mathrm{{C}}}(\eta ,S)\), (4.13) reduces to \(\frac{{{{{{b}^{T}}{ \Sigma }_{n} {b}}}}}{ \Vert {b} \Vert _{2}^{2}}>RE^{2}( \eta ,S,{\Sigma }_{n})\), i.e., the smallest eigenvalue of the sample covariance matrix \({\Sigma }_{n}\) is positive, which is impossible when \(p>n\) (\({\Sigma }_{n}\) is not of full rank). To circumvent the rank deficiency of \({\Sigma }_{n}\), Bickel et al. [4] consider the restricted eigenvalue condition on the restricted set \({\mathrm{{C}}}(\eta ,S)\), a natural relaxation in sparse high-dimensional estimation. The restricted eigenvalue condition stems from restricted strong convexity, which enforces a type of strong convexity on the negative log-likelihood of linear models over a certain sparse restricted set.
A shortcoming of (4.13) is that we cannot guarantee \(RE(\eta ,S,{\Sigma }_{n})>0\) with high probability. Instead, we replace \({\Sigma }_{n}\) by its non-random version \({ \Sigma }={\mathrm{{E}}}{\Sigma }_{n}\). Observe that \(\frac{{{{{{b}^{T}}{ \Sigma }_{n} {b}}}}}{ \Vert {b}_{S} \Vert _{2}^{2}} \ge \frac{{{{{{b}^{T}}{ \Sigma }_{n} {b}}}}}{ \Vert {b} \Vert _{2}^{2}}>0\) if (4.13) holds. So \({{{{{{b}^{T}}{ \Sigma }_{n} {b}}}}}\ge k { \Vert {b}_{S} \Vert _{2}^{2}}>k { \Vert {b}_{S} \Vert _{2}^{2}}-\varepsilon \) for a constant \(k>0\) and a relaxation constant \(\varepsilon >0\). Technically, for the group penalty, we use a modified version of the restricted eigenvalue conditions presented in Blazere et al. [5] for generalized linear models. Denote by \(H^{*}= \{ g: \beta ^{* g} \neq 0 \} \) the index set of the active groups and set \(\gamma ^{*}:= \vert H^{*} \vert \).
Definition
(Group stabil condition)
Let \(c_{0},\varepsilon ,k >0\) be given constants. A \(p \times p\) non-random matrix Σ satisfies the group stabil condition \(GS(c_{0},\varepsilon ,k,H^{*})\) if
where the restricted set is defined as \(S(c_{0},H^{*}):=\{ \delta : \sum_{g \in {H^{*}}^{c} }{w_{g}}\Vert \delta ^{g}\Vert _{2}\le c_{0}\sum_{g \in H^{*} }{w_{g}}\Vert \delta ^{g}\Vert _{2} \} \).
\(S(c_{0},H^{*})\) is a restricted cone set with group sparsity, similar to the condition used by Lounici et al. [18] to prove oracle inequalities for the group Lasso in linear models. Here ε is an error (relaxation) term that can be set to zero, and k can be viewed as the smallest generalized eigenvalue of Σ.
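To make the restricted cone concrete, here is a small illustrative membership check for \(S(c_{0},H^{*})\); the groups, weights, and test vectors are toy assumptions, not taken from the paper.

```python
import numpy as np

def in_group_cone(delta, groups, weights, H_star, c0):
    """delta is in S(c0, H*) iff the weighted group norm off the active
    groups is at most c0 times the weighted group norm on them."""
    on = sum(weights[g] * np.linalg.norm(delta[groups[g]]) for g in H_star)
    off = sum(weights[g] * np.linalg.norm(delta[groups[g]])
              for g in groups if g not in H_star)
    return off <= c0 * on

# toy configuration: three groups of size 2, group 0 active
groups = {0: [0, 1], 1: [2, 3], 2: [4, 5]}
weights = {0: 1.0, 1: 1.0, 2: 1.0}
H_star = {0}

inside = in_group_cone(np.array([1.0, -1.0, 0.1, 0.0, 0.0, 0.1]),
                       groups, weights, H_star, c0=1.0)
outside = in_group_cone(np.array([0.1, 0.0, 1.0, 1.0, 1.0, 1.0]),
                        groups, weights, H_star, c0=1.0)
print(inside, outside)   # True False
```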
Assuming that the group stabil condition holds for the covariance matrix \(\Sigma :=\mathrm{E}[{z(t)}{z^{{\tau }}(t)}]\) on the restricted cone set \(S(c_{0},H^{*})\) with \(\delta =\hat{\beta }_{n}-\beta ^{*}\), we check that \(\hat{\beta }_{n}-\beta ^{*} \in S(1,H^{*})\) holds with high probability. With the preparation above, we can now present the main result of this paper, which provides sharp, minimax optimal bounds for the estimation and prediction errors when the true model is sparse and \(\log p\) is small compared with n.
Theorem 4.1
Let \(\gamma ^{*}:=\sum_{g\in H^{*}}d_{g}\), \(p_{\tau }:={{P({T_{1}} \ge \tau )}}>0\) and \({D^{2}}(\sqrt {2})\) be a universal constant. Assume that (H.1)–(H.4) and group stabil condition \(GS(1,\varepsilon _{n},k,H^{*})\) are satisfied for \(\Sigma :=\mathrm{E}[{z(t)}{z^{{\tau }}(t)}]\). If λ is chosen such that
Then, with probability at least (for a constant \(A^{2}>2\))
we have \(\hat{\beta }_{n}-\beta ^{*} \in S(1,H^{*})\) and
where \(c_{l}>0\) is a constant given in (H.4).
Moreover, if a new covariate \({z^{*}}(t)\) (the test data) is an independent copy of \({z}(t)\) (as the training data) and \(\mathrm{E}^{*}\) represents expectation only about \({z^{*}}(t)\), then the square prediction error under \(\Delta =1\) is
under the event \({{\mathcal{A}}}\cap {{\mathcal{B}}}\).
Consider \({\varepsilon _{n}}=0\). The obtained results then cover the fixed design and are analogous to the bounds in Lounici et al. [18], who show the optimal convergence rate of the group Lasso estimator for linear models under the fixed design. Note that if \(\gamma ^{*}=O(1)\), then the bound on the estimation error is of the order \(O ( \sqrt{ \frac{\log p}{n}} )+O ( \sqrt{\frac{\log (G_{n})}{n}} )\), and the weighted group Lasso estimator remains consistent in the \(\ell _{2,1}\)-estimation error and the square prediction error under the group stabil condition, provided the number of groups grows no faster than \(e^{o(n)}\). The terms \(\sqrt{\log p}\) and \(\sqrt{\log {G_{n}}}\) are the price to pay for the unknown group sparsity of \({\beta ^{*}}\). If the relaxation error \({\varepsilon _{n}}\) is of larger order than λ, the estimation error \(\sum_{g=1}^{G_{n}}{w_{g}}\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \Vert _{2}\) converges at rate \({\varepsilon _{n}}\).
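A back-of-the-envelope sketch (constants suppressed, purely illustrative) of how the two logarithmic terms in the rate behave when p grows sub-exponentially in n:

```python
import math

def rate(n, p, G_n):
    """Order of the l_{2,1}-estimation error when gamma* = O(1):
    sqrt(log p / n) + sqrt(log G_n / n), constants suppressed."""
    return math.sqrt(math.log(p) / n) + math.sqrt(math.log(G_n) / n)

# even with p = exp(sqrt(n)) covariates, the bound still shrinks with n
for n in (100, 1000, 10000):
    p = int(math.exp(math.sqrt(n)))
    print(n, rate(n, p, G_n=max(p // 10, 2)))
```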
From Theorem 4.1, taking all \(d_{g}=1\) enables us to derive analogous results for the un-weighted Lasso penalty in what follows.
Corollary 4.1
Let \(\gamma ^{*}:=\|\beta ^{*}\|_{0}\), \(p_{\tau }:={{P({T_{1}} \ge \tau )}}>0\), and \({D^{2}}(\sqrt {2})\) be a universal constant in the proof. Assume that (H.1)–(H.4) and condition \(GS(1,\varepsilon _{n},k)\) are fulfilled for \(\Sigma :=\mathrm{E}[{z(t)}{z^{{\tau }}(t)}]\). If λ is chosen such that
Then, with probability at least
we have \(\hat{\beta }_{n}-\beta ^{*} \in S(1,H^{*})\) and
Corollary 4.1 presents an upper bound on the \(\ell _{1}\)-estimation error, similar to the existing result of Theorem 3.2 in Huang et al. [13] for classical Lasso penalized Cox models. An advantage of Corollary 4.1 is that its restricted eigenvalue condition is non-stochastic, whereas Theorem 3.2 in Huang et al. [13] requires further analysis of the (random) restricted eigenvalue condition to guarantee a high-probability event. Another significant difference is that the oracle inequalities in Huang et al. [13] require the sample size to be larger than a given constant, while our oracle inequalities are valid for any finite n on the stated high-probability event.
5 Proofs
5.1 Proofs of Theorem 4.1
The proof is based on the following three steps.
Step 1: Check \(\hat{\beta }_{n}-\beta ^{*} \in S(1,H^{*})\).
Using Proposition 4.1 and Proposition 4.2 to bound the empirical process on the event \(\mathcal{A}\cap \mathcal{B}\) by (4.4), we have
By adding \(\lambda \sum_{g=1}^{G_{n}}{w_{g}}\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \Vert _{2}\) to both sides of inequality (5.2), on \(\mathcal{A}\cap \mathcal{B}\), we can obtain that
If \(g\notin H^{*}\), then \(\Vert \hat{\beta }_{n}^{g}- {\beta ^{*}}^{g}\Vert _{2} +\Vert {\beta ^{*}}^{g} \Vert _{2} - \Vert \hat{\beta }_{n}^{g}\Vert _{2}=0\); otherwise \(\Vert {\beta ^{*}}^{g}\Vert _{2} - \Vert \hat{\beta }_{n}^{g}\Vert _{2} \le \Vert \hat{\beta }_{n}^{g}- {\beta ^{*}}^{g}\Vert _{2}\). So the last term in inequality (5.3) can be rewritten as
By the definition of \(\beta ^{*}\), we have \(\mathbb{P}( l(\hat{\beta }_{n};T,z,\Delta )-l(\beta ^{*};T,z,\Delta ))\ge 0\) and therefore
i.e., \(\hat{\beta }_{n}-\beta ^{*} \in S(1,H^{*})\).
Step 2: Find a lower bound for \(\mathbb{P}( l(\hat{\beta }_{n};T,z,\Delta )-l(\beta ^{*};T,z,\Delta ))\).
The next proposition provides the desired lower bound.
Proposition 5.1
Under (H.4), conditioning on \(\Delta =1\), we have
where \(c_{l}>0\) is a constant given in (H.4).
Proof
By a second-order Taylor expansion of the function \(\beta \mapsto l(\beta ;T,z,\Delta )\), there exists a mean value \(\tilde{\beta }= {({{\tilde{\beta }}_{1}}, \ldots ,{{\tilde{\beta }}_{p}})^{T}}\) with \({{\tilde{\beta }}_{j}}\) lying between \(\beta _{j}^{*}\) and \(\beta _{j}\) for each \(j=1,2,\ldots ,p\).
Let \(z^{\tau }(t)\tilde{\beta }\) be the intermediate point between \(z^{\tau }(t)\beta ^{*}\) and \(z^{\tau }(t)\hat{\beta }_{n}\) given by a second order Taylor’s expansion of \(l(\beta ^{*};T,z,\Delta )\). Then, conditioning on \(\Delta =1\), we have
where the second to last equality follows from the estimating equation in (3.4). □
From Proposition 5.1 and (5.4), it is deduced that
Step 3: Squeeze error bounds from the group stabil condition.
Let Σ be the \(p \times p\) covariance matrix whose entries are \(\mathrm{E}[{z_{j}(t)}{z_{k}(t)}]=\mathrm{E}^{*}[{z_{j}(t)}{z_{k}(t)}]\). We have
since \(\Sigma :=\mathrm{E}[{z(t)}{z^{{\tau }}(t)}]\) is assumed to satisfy the group stabil condition \(GS(1,\varepsilon _{n},k,H^{*})\) and \(\hat{\beta }_{n}-\beta ^{*} \in S(1,H^{*})\) has been verified. Multiplying (4.14) by \(c_{l}/2\), we have
Then, substituting the above inequality into (5.7) and using the Cauchy–Schwarz inequality, we get
Now the fact that \(2xy\le tx^{2}+y^{2}/t\) for all \(t>0\) leads to the following inequality:
Putting \(t:=\frac{2}{kc_{l}}\) in (5.8), we have the oracle inequality
Finally, for the prediction oracle inequality, it is deduced from (5.7) that
Therefore,
Note that the term \(\sum_{g \notin {H^{*}}} {w_{g}} {\| {\hat{\beta }_{n}^{g} - {\beta ^{*}}^{g}} \|_{2}} = \sum_{g \notin {H^{*}}} {w_{g}} {\| {\hat{\beta }_{n}^{g}} \|_{2}}\), which we discarded at the first inequality sign in the display above, is very small since \({\beta ^{*}}^{g} = 0\) for \(g \notin {H^{*}}\).
Then using oracle inequality for \(\sum_{g=1}^{G_{n}}{w_{g}}\Vert \hat{\beta }_{n}^{g}-{\beta ^{*}}^{g} \Vert _{2}\) leads to
Finally, we conclude the proof by Propositions 4.1 and 4.2, which show that the desired oracle inequalities hold with high probability on the event \({{\mathcal{A}}}\cap {{\mathcal{B}}}\).
5.2 Proofs of the propositions
5.2.1 Proof of Proposition 4.1
First we bound the summation in \({{\mathcal{A}}_{1}}\) by applying Hoeffding’s inequality, see Wainwright [26].
Lemma 5.1
(Hoeffding’s inequality)
Let \({X_{1}}, \ldots ,{X_{n}}\) be independent random variables on \(\mathbb{R}\) satisfying bound condition \({a_{i}}\le {{X_{i}}} \le {b_{i}}\) for \(i = 1,2, \ldots ,n \). Then we have
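A quick Monte Carlo sanity check of Hoeffding's two-sided tail bound \(P(|\frac{1}{n}\sum_i X_i - \mathrm{E}\frac{1}{n}\sum_i X_i| \ge t) \le 2\exp(-2n^{2}t^{2}/\sum_i (b_i-a_i)^{2})\); the uniform distribution, sample size, and threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, t = 200, 20000, 0.1
a, b = -1.0, 1.0                       # common bounds a_i = a, b_i = b

# empirical P(|mean(X) - E mean(X)| >= t) for X_i i.i.d. uniform on [a, b]
X = rng.uniform(a, b, size=(reps, n))
deviation = np.abs(X.mean(axis=1))     # E X_i = 0 here
empirical = float((deviation >= t).mean())

# Hoeffding bound: 2 exp(-2 n^2 t^2 / sum_i (b_i - a_i)^2)
hoeffding = 2 * np.exp(-2 * (n * t) ** 2 / (n * (b - a) ** 2))
print(empirical, hoeffding)
```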
For \({{\mathcal{A}}_{1}} = \bigcap_{g = 1}^{{G_{n}}} \{ {{ \Vert {\frac{1}{n}\sum_{i = 1}^{n} {[ \frac{{z_{ig}^{\tau }({T_{i}})}}{{{w_{g}}}}{\Delta _{i}} - {\mathrm{{E}}}( \frac{{z_{ig}^{\tau }({T_{i}})}}{{{w_{g}}}}{\Delta _{i}})]} } \Vert _{2}} \le {\lambda _{a1}}} \} \), let \(W_{i}^{g}:= \frac{{z_{ig}^{\tau }({T_{i}})}}{{{w_{g}}}}{\Delta _{i}}- \mathrm{E}(\frac{{z_{ig}^{\tau }({T_{i}})}}{{{w_{g}}}}{\Delta _{i}})\) and
We have
due to \(\lbrace \Vert \frac{1}{n}\sum_{i=1}^{n}W_{i}^{g} \Vert _{2}^{2}> {\lambda _{a1}^{2}} \rbrace \subset \bigcup_{j \in \mathrm{Group}_{g}, \vert \mathrm{Group}_{g} \vert = {d_{g}}} \{ {|\frac{1}{n} {\sum_{i = 1}^{n} {W_{ij}^{g}} } |^{2} > \frac{\lambda _{a1}^{2}}{ {{d_{g}}} }} \} \).
Applying Hoeffding’s inequality with \({a_{i}}=\frac{-L}{n{w_{\min }}}\le \frac{1}{n} \frac{{z_{ij}^{\tau }({T_{i}})}}{{{w_{g}}}}{\Delta _{i}} \le \frac{L}{n{w_{\min }}}= {b_{i}}\), we obtain
Finally, from (5.10) and (5.11), it is deduced that
which gives \(\lambda _{a1} = \frac{{L\sqrt{2{d_{\max }}} }}{{w_{\min }}}\sqrt{\frac{{\log (2{G_{n}})}}{n}}\).
For \({\mathcal{A}}^{\prime }_{2}\), we resort to McDiarmid’s concentration inequality with the bounded difference condition for random vectors, see Wainwright [26].
Lemma 5.2
Suppose that \(X_{1},\ldots ,X_{n}\) are independent random vectors all taking values in the set A, and assume that \(f:A^{n}\rightarrow \mathbb{R}\) is a function satisfying the bounded difference condition
Then, for all \(t>0\),
If there are no absolute value signs in the above event, then the upper bound becomes \(\exp ( -2{t^{2}}/\sum_{i = 1}^{n} {c_{i}^{2}} )\).
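A numerical illustration of the bounded-difference bound, using \(f(x_{1},\ldots ,x_{n})=|\bar{x}|\) with \(x_{i}\in [-1,1]\), so each \(c_{i}=2/n\); the uniform distribution and threshold are illustrative assumptions, and the expectation of f is estimated by its Monte Carlo average.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, t = 200, 20000, 0.15

# f(x_1, ..., x_n) = |mean(x)| with x_i in [-1, 1]: changing one coordinate
# moves f by at most c_i = 2/n, so the bounded difference condition holds
X = rng.uniform(-1.0, 1.0, size=(reps, n))
f_vals = np.abs(X.mean(axis=1))
empirical = float((np.abs(f_vals - f_vals.mean()) >= t).mean())

# McDiarmid: P(|f - E f| >= t) <= 2 exp(-2 t^2 / sum_i c_i^2) = 2 exp(-n t^2 / 2)
mcdiarmid = 2 * np.exp(-2 * t**2 / (n * (2.0 / n) ** 2))
print(empirical, mcdiarmid)
```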
Similar to the treatment of \({\mathcal{A}}_{1}\), let
and
Then \({{\mathcal{A}}_{2}}: = \bigcap_{g = 1}^{{G_{n}}} {\{ {{\| {\frac{1}{n}\sum_{i = 1}^{n} {Z_{i}^{g}} } \|_{2}} \le {\lambda _{a2}}} \}} \). We have
Let
and
Then we have
Note that, for \(j=1,\ldots,d_{g}\) and \(i=1,\ldots,n\), we have
For fixed j, (5.15) gives
for all \({z_{1},\ldots ,z_{n},\tilde{z}_{k}}\).
Lemma 5.2 implies
It suffices to derive sharper upper bounds on \({\mathrm{{E}}} ( {\sup_{\beta \in {S_{M}}({\beta ^{*}})} \vert {\frac{1}{n}\sum_{i = 1}^{n} {Z_{ij}^{g}} (\beta )} \vert } )\) via the symmetrization theorem and the contraction theorem below, which can be found in van der Vaart and Wellner [25] and Wainwright [26].
Lemma 5.3
(Symmetrization theorem)
Let \(\varepsilon _{1},\ldots,\varepsilon _{n}\) be a Rademacher sequence, i.e., i.i.d. uniform on \(\{ - 1,1\}\), independent of \(X_{1},\ldots,X_{n}\), and let \(f\in \mathcal{F}\). Then we have
where \({\mathrm{{E}}}[\cdot ]\) refers to the expectation w.r.t. \(X_{1},\ldots,X_{n}\) and \({\mathrm{{E}}}_{\epsilon } \{ \cdot \} \) w.r.t. \(\epsilon _{1},\ldots,\epsilon _{n}\).
Using the symmetrization theorem, we have
where \(w_{i}(\beta ):= \frac{{\int {z_{ij} (s)1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }}{{\int {1(s \ge {T_{i}}){e^{z_{i}^{\tau }(T_{i})\beta }}\,dF(s,z)} }} \) for \(i=1,2,\ldots,n\).
For any \(w_{i}(\beta )\), we can find a sequence of random vectors \(\{a_{i}\}_{i=1}^{n}\in \mathbb{R}^{p}\) with \({ \Vert a_{i} \Vert _{\infty }}=1\) and vector \(b \in \mathbb{R}^{p}\) with \({{ \Vert b \Vert }_{1}} \le L\) such that
Then we have
Next, we are going to use the following maximal inequality for bounded variables; see [31] for more discussions.
Lemma 5.4
(Maximal inequality)
Let \(X_{1},\ldots,X_{n}\) be independent random vectors that take values in a measurable space \(\mathcal{X}\) and \(f_{1},\ldots,f_{n}\) be real-valued functions in \(\mathcal{X}\) which satisfy, for all \(j=1,\ldots,p\) and all \(i=1,\ldots,n\),
Then
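The maximal inequality can be checked numerically in a simple Rademacher case. The bound \(\sqrt{2\log (2p)/n}\) on the expected maximum is the standard sub-Gaussian form for bounded, centered summands, which is an assumption here (the lemma's display is not reproduced above), and the setup is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, reps = 400, 50, 500

# Monte Carlo estimate of E max_{j<=p} | n^{-1} sum_i eps_{ij} |
# with independent Rademacher entries eps_{ij} (so |f_j| <= 1, mean zero)
vals = []
for _ in range(reps):
    eps = rng.choice([-1.0, 1.0], size=(n, p))
    vals.append(np.abs(eps.mean(axis=0)).max())
lhs = float(np.mean(vals))

# standard sub-Gaussian maximal bound for p bounded, centered averages
rhs = float(np.sqrt(2 * np.log(2 * p) / n))
print(lhs, rhs)
```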
By Lemma 5.4, with \({\mathrm{{E}}}[\epsilon _{i}{{a_{ij}}}]=0\) and \(\epsilon _{i}{{a_{ij}}} \le \max_{1 \le i \le n}{ \Vert a_{i} \Vert _{\infty }} =1\), we get
Then
Therefore, (5.13) can be further bounded by letting \(\frac{{\lambda _{a2}}}{{\sqrt{{d_{\max }}} }}= \frac{{2L}}{{{w_{\min }}}}\sqrt{\frac{{2\log 2p}}{n}} + t\)
Let \(2d_{\mathrm{max}}G_{n}\exp ( { - \frac{{nt^{2}w_{\min }^{2}}}{{8{L^{2}}{e^{4LB}}}}} )= d_{\mathrm{max}}(2G_{n})^{1-A^{2}}\), which gives
Finally, we have
by letting \({\lambda _{a2}} = \frac{{2L\sqrt{2{d_{\max }}} }}{{{w_{\min }}}} ( {\sqrt{\frac{{\log 2p}}{n}} + A{e^{2LB}}\sqrt{\frac{{\log (2{G_{n}})}}{n}} } )\). Together with (5.12), it gives
Then (4.6) is obtained by using (4.5) conditioning on the event \({{\mathcal{A}}_{1}}\cap {{\mathcal{A}}_{2}}\).
5.2.2 Proof of Proposition 4.2
For the event \({{\mathcal{B}}_{0}}\), we need the exponential concentration inequality for the uniform convergence of the empirical distribution function.
Lemma 5.5
(DKW inequality, Massart [19])
For \({x\in {\mathbb{R}} }\), the DKW inequality bounds the probability that the random function \(F_{n}(x)\) differs from \(F(x)\) by more than a given constant \(\varepsilon > 0\):
Dvoretzky, Kiefer, and Wolfowitz [8] proved the inequality with an unspecified multiplicative constant in front of the exponential factor in the tail bound. Massart [19] showed that the DKW inequality holds with the multiplicative constant 2. Let \(p_{\tau }:={{P({T_{1}} \ge \tau )}}=2U{{\mathrm{{e}}}^{ LB}}\), so that \(U = {p_{\tau }}{{\mathrm{{e}}}^{ - LB}}/2\). We have
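Massart's constant 2 can be checked numerically. The sketch below simulates the Kolmogorov–Smirnov statistic \(\sup_{x}|F_{n}(x)-F(x)|\) for uniform samples (sample size, threshold, and distribution are illustrative assumptions); for \(U(0,1)\) the supremum is attained at the order statistics.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps, eps = 100, 20000, 0.1

# KS statistic sup_x |F_n(x) - F(x)| for U(0,1), computed at order statistics
x = np.sort(rng.uniform(size=(reps, n)), axis=1)
i = np.arange(1, n + 1)
ks = np.maximum((i / n - x).max(axis=1), (x - (i - 1) / n).max(axis=1))

empirical = float((ks > eps).mean())
dkw_bound = 2 * np.exp(-2 * n * eps**2)   # Massart's two-sided bound
print(empirical, dkw_bound)               # the constant-2 bound is quite tight here
```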
Let \((\mathcal{F},\|\cdot \|)\) be a subset of a normed space of real functions \(f: \mathcal{X} \rightarrow \mathbb{R}\) on some set \(\mathcal{X}\). For a probability measure Q, define the \(L_{r}(Q)\)-norm by \(\|f\|_{L_{r}(Q)}= (\int |f|^{r} \,d Q )^{1 / r}\), which endows the \(L_{r}(Q)\)-space. Given two functions \(l(\cdot )\) and \(u(\cdot )\), the bracket \([l, u]\) is the set of all functions \(f \in \mathcal{F}\) with \(l(x) \leq f(x) \leq u(x)\) for all \(x \in \mathcal{X} \). An ε-bracket is a bracket \([l, u]\) with \(\|l-u\|_{L_{r}(Q)}<\varepsilon \); see van der Vaart and Wellner [25]. The bracketing number \(N_{[\,]} ({\varepsilon }, \mathcal{F}, L_{r}(Q) )\) is the minimum number of ε-brackets needed to cover \(\mathcal{F}\), i.e.,
For the event \({{\mathcal{B}}_{1}}\), let \(B_{i}^{g}(\beta ):= {{1} ( {{T_{i}} \ge {t_{s}}} ) \frac{{z_{ig}({t_{s}})}{{\mathrm{{e}}}^{z_{i}^{\tau }({t_{s}})\beta }}}{{{w_{g}}}}} - {\mathrm{{E}}} [1(T \ge {t_{s}}) \frac{{{z_{ig}}(T)}{{\mathrm{{e}}}^{{z^{\tau }}(T) \beta }}}{{{w_{g}}}} ]\) and
Similar to the analysis of \({{\mathcal{A}}_{1}}\) and \({{\mathcal{A}}_{2}}\), we have
Then we apply sub-Gaussian concentration for suprema of empirical processes to the following event:
with bracketing numbers \(\{ N_{[\,]} ({\varepsilon }, {\mathcal{B}}_{1gj}, L_{2}(P) )\}\) relative to \(L_{2}(P)\)-norm, see Theorem 2.14.9 of van der Vaart and Wellner [25].
Lemma 5.6
(Sharper bounds for suprema of empirical processes, Talagrand [22])
Consider a probability space \((\Omega , \Sigma , P)\) and n i.i.d. random variables \(X_{1}, \ldots , X_{n}\), valued in Ω, of law P. Let \(\mathcal{F}\) be a class of measurable functions \(f: \mathcal{X} \mapsto [0,1]\) that satisfy
Then, for every \(t>0\),
for a constant \(D(K)\) that depends on K only.
The explicit constant \(D(K)\) can be found in Zhang [30], who studies tail bounds for the suprema of the empirical process \(\{ n^{-1 / 2} \sum_{i=1}^{n} [f (X_{i} )- \mathrm{E} f (X_{i} ) ] \} \), where \(\{X_{i}\}\) is a sequence of (possibly non-i.i.d. and unbounded) independent random vectors with values in a general measurable space \((\mathcal{X}, \mathcal{A})\) and f is a measurable real function on \((\mathcal{X}, \mathcal{A})\).
In what follows, we assume that \(z(t)\) is non-random. For \(\{{\mathcal{B}}_{1gj}\}\) in (5.22), we have the function classes
so \(0\le f_{t,\beta }(x,z)\le 1\).
In \({\mathcal{B}}_{2}\), we focus on the class of functions \(0\le g_{t,\beta }(x,z)\le 1\),
Let \(\lceil x\rceil \) denote the smallest integer greater than or equal to x. For any \(\varepsilon \in (0,1)\), let \(t_{s}\) be the \(s\)th \(\lceil 1 / \varepsilon \rceil \)-quantile of \(T_{1}\); thus
For \({\mathcal{F}}_{1gj}\) and \({\mathcal{G}}_{2}\), we consider two types of brackets of the forms
and
for a grid of points \(-\infty =s_{0}< s_{1}<\cdots <s_{\lceil 1 / \varepsilon \rceil }= \infty \) with the property \(F (s_{k} )-F (s_{k-1} )<\varepsilon \) for all k.
Then, for given j and g, the bracket functions satisfy
provided \(s_{k-1}< s \leq s_{k}\).
For \(\{{\mathcal{B}}_{1gj}\}\), the \(L_{2}(P)\)-norm of \(U_{jg,k}^{\mathcal{F}}(x,z)-L_{jg,k}^{\mathcal{F}}(x,z)\) is
For \({\mathcal{B}}_{2}\), the \(L_{2}(P)\)-norm for \(U_{k}^{\mathcal{G}}(x,z)-L_{k}^{\mathcal{G}}(x,z)\) is
In both cases, by the definition of bracketing number, we get
Hence, \(N_{[\,]} ({\varepsilon }, \mathcal{F}, L_{2}(P) )\le 2 / \varepsilon ^{2}\).
For the event \({{\mathcal{B}}_{1}}\) with relation (5.22), we take \(K=\sqrt{2}\) and \(V=2\) in Lemma 5.6. Then, conditioning on the random design z, we apply Lemma 5.6 and define
Note that \(U = {p_{\tau }}{{\mathrm{{e}}}^{ - LB}}/2\), thus we put \(\frac{{2L{e^{LB}}t}}{{{w_{\min }}}}= \frac{{\lambda _{b1}U}}{{\sqrt{{d_{\max }}} }}= \frac{{\lambda _{b1}{p_{\tau }}{{\mathrm{{e}}}^{ - LB}}}}{2{\sqrt{{d_{\max }}} }}\) in (5.22), which implies
with \(t = \frac{{{\lambda _{b1}}{p_{\tau }}{{\mathrm{{e}}}^{ - 2LB}}{w_{\min }}}}{{4L\sqrt{{d_{\max }}} }}\).
Setting \(d_{\mathrm{max}}G_{n}\frac{D^{2}(\sqrt {2}) t^{2}}{{2}} e^{-2n t^{2}}=d_{\mathrm{max}}G_{n}\frac{D^{2}(\sqrt {2}) t^{2}}{{2}} (G_{n})^{1-A^{2}}\) gives \(t = \frac{A}{{\sqrt{2} }}\sqrt{\frac{{\log ({G_{n}})}}{n}}\). Then we have
with the tuning parameter \({\lambda _{b1}}\) determined by
For the event \({{\mathcal{B}}_{2}}\), we have \(K=\sqrt{2}\) and \(V=2\) in Lemma 5.6. Define
Note that \(U = {p_{\tau }}{{\mathrm{{e}}}^{ - LB}}/2\), thus we set \({{e^{LB}}t}={{\lambda _{b2}U}}= \frac{{\lambda _{b2}{p_{\tau }}{{\mathrm{{e}}}^{ - LB}}}}{2}\) in (4.11). It gives \({P}({\mathcal{B}}_{2}^{c})\le \frac{D^{2}(\sqrt {2}) t^{2}}{{2}} e^{-2n t^{2}}\) with \(t = \frac{{\lambda _{b2}{p_{\tau }}{{\mathrm{{e}}}^{ - 2LB}}}}{2}\).
Setting \(\frac{D^{2}(\sqrt {2}) t^{2}}{{2}} e^{-2n t^{2}}=\frac{D^{2}(\sqrt {2}) t^{2}}{{2}}p^{-A^{2}}\) implies \(t = \frac{A}{{\sqrt{2} }}\sqrt{\frac{{\log p}}{n}}\). Therefore, the tuning parameter \({\lambda _{b2}}\) is determined by
such that
Finally, combining (5.21), (5.23), and (5.24), we obtain
6 Conclusions and future study
In this paper, we focus on the survival analysis problem via proportional hazards regression, covering situations in which both the number of covariates p and the sample size n increase and \(p\gg n\). When \(p>n\), the classical partial likelihood estimator is over-parameterized, and Lasso or weighted group Lasso regularization is required to obtain a stable and satisfactory fit of proportional hazards regressions. Under the group stabil condition, we derive sharp oracle inequalities for weighted group Lasso regularized misspecified Cox models. The upper bound on the \(\ell _{2,1}\)-estimation error is determined by the tuning parameter and is of order \(O ( \sqrt{ \frac{\log p}{n}} )+O ( \sqrt{\frac{\log (G_{n})}{n}} )\). The obtained nonasymptotic oracle inequalities imply that the penalized estimator is consistent when \(\log p /n \to 0\) under mild conditions, and the rate is optimal in the minimax sense.
Statistical inference (confidence intervals and tests for the coefficients, FDR control) is left for future study.
Availability of data and materials
This is a purely mathematical paper. Data analysis is not applicable.
References
Andersen, P.K., Borgan, O., Gill, R.D., Keiding, N.: Statistical Models Based on Counting Processes. Springer, Berlin (1993)
Andersen, P.K., Gill, R.D.: Cox’s regression model for counting processes: a large sample study. Ann. Stat. 10(4), 1100–1120 (1982)
Bartlett, P.L., Mendelson, S., Neeman, J.: L1-regularized linear regression: persistence and oracle inequalities. Probab. Theory Relat. Fields 154(1), 193–224 (2012)
Bickel, P.J., Ritov, Y.A., Tsybakov, A.B.: Simultaneous analysis of lasso and Dantzig selector. Ann. Stat. 37, 1705–1732 (2009)
Blazere, M., Loubes, J.M., Gamboa, F.: Oracle inequalities for a group lasso procedure applied to generalized linear models in high dimension. IEEE Trans. Inf. Theory 60(4), 2303–2318 (2014)
Cox, D.R.: Regression models and life-tables. J. R. Stat. Soc., Ser. B, Methodol. 34, 187–220 (1972)
Cox, D.R.: Partial likelihood. Biometrika 62, 269–276 (1975)
Dvoretzky, A., Kiefer, J., Wolfowitz, J.: Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Math. Stat. 27(3), 642–669 (1956)
Fan, J., Li, R.: Variable selection for Cox’s proportional hazards model and frailty model. Ann. Stat. 30, 74–99 (2002)
Greenshtein, E., Ritov, Y.A.: Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10(6), 971–988 (2004)
Honda, T., Hardle, W.K.: Variable selection in Cox regression models with varying coefficients. J. Stat. Plan. Inference 148, 67–81 (2014)
Huang, H., Gao, Y., Zhang, H., Li, B.: Weighted lasso estimates for sparse logistic regression: non-asymptotic properties with measurement error. Acta Math. Sci. (2021, in press). arXiv preprint, arXiv:2006.06136
Huang, J., Sun, T., Ying, Z., Yu, Y., Zhang, C.H.: Oracle inequalities for the lasso in the Cox model. Ann. Stat. 41(3), 1142–1165 (2013)
Kanehisa, M., Goto, S.: KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28(1), 27–30 (2000)
Knight, K., Fu, W.: Asymptotics for lasso-type estimators. Ann. Stat. 28, 1356–1378 (2000)
Kong, S., Nan, B.: Non-asymptotic oracle inequalities for the high-dimensional Cox regression via lasso. Stat. Sin. 24(1), 25–42 (2014)
Lemler, S.: Oracle inequalities for the lasso in the high-dimensional Aalen multiplicative intensity model. Ann. Inst. Henri Poincaré Probab. Stat. 52(2), 981–1008 (2016)
Lounici, K., Pontil, M., Van De Geer, S., Tsybakov, A.B.: Oracle inequalities and optimal inference under group sparsity. Ann. Stat. 39(4), 2164–2204 (2011)
Massart, P.: The tight constant in the Dvoretzky–Kiefer–Wolfowitz inequality. Ann. Probab. 18, 1269–1283 (1990)
Rosenwald, A., Wright, G., Chan, W.C., Connors, J.M., Campo, E., Fisher, R.I., Giltnane, J.M.: The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N. Engl. J. Med. 346(25), 1937–1947 (2002)
Struthers, C.A., Kalbfleisch, J.D.: Misspecified proportional hazard models. Biometrika 73(2), 363–369 (1986)
Talagrand, M.: Sharper bounds for Gaussian and empirical processes. Ann. Probab. 22, 28–76 (1994)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc., Ser. B, Methodol. 58, 267–288 (1996)
Tibshirani, R.: The lasso method for variable selection in the Cox model. Stat. Med. 16(4), 385–395 (1997)
van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, Berlin (1996)
Wainwright, M.J.: High-Dimensional Statistics: A Non-asymptotic Viewpoint, vol. 48. Cambridge University Press, Cambridge (2019)
Wang, S., Nan, B., Zhu, N., Zhu, J.: Hierarchically penalized Cox regression with grouped variables. Biometrika 96(2), 307–322 (2009)
Yan, J., Huang, J.: Model selection for Cox models with time-varying coefficients. Biometrics 68(2), 419–428 (2012)
Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc., Ser. B, Stat. Methodol. 68(1), 49–67 (2006)
Zhang, D.X.: Tail bounds for the suprema of empirical processes over unbounded classes of functions. Acta Math. Sin. 22, 339–345 (2006)
Zhang, H., Chen, S.X.: Concentration inequalities for statistical inference. arXiv preprint, arXiv:2011.02258
Zhang, H., Jia, J.: Elastic-net regularized high-dimensional negative binomial regression: consistency and weak signals detection. Stat. Sin. (2021). https://doi.org/10.5705/ss.202019.0315
Zhang, H., Wu, X.: Compound Poisson point processes, concentration and oracle inequalities. J. Inequal. Appl. 2019(1), 312 (2019)
Zhang, H.H., Lu, W.: Adaptive lasso for Cox’s proportional hazards model. Biometrika 94(3), 691–703 (2007)
Zhao, H., Wu, Q., Li, G., Sun, J.: Simultaneous estimation and variable selection for interval-censored data with broken adaptive ridge regression. J. Am. Stat. Assoc. 115, 204–216 (2020)
Zhou, S., Zhou, J., Zhang, B.: High-dimensional generalized linear models incorporating graphical structure among predictors. Electron. J. Stat. 13(2), 3161–3194 (2019)
Acknowledgements
Ting Yan (tingyanty@mail.ccnu.edu.cn) and Huiming Zhang (huimingzhang@um.edu.mo) are co-corresponding authors. The authors are listed in alphabetical order and contributed equally to this work. We would like to thank the two reviewers for taking the time to read our paper and for providing excellent suggestions and comments. The first author would like to express sincere gratitude to the advisor Prof. Jinzhu Jia for his guidance in high-dimensional statistics. The authors also thank Prof. Hui Zhao for helpful discussions.
Funding
Yan Ting is partially supported by the National Natural Science Foundation of China (No. 11771171), the Fundamental Research Funds for the Central Universities. Zhang Huiming is supported in part by the University of Macau under UM Macao Talent Programme (UMMTP-2020-01).
Author information
Contributions
The authors completed the paper and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Xiao, Y., Yan, T., Zhang, H. et al. Oracle inequalities for weighted group lasso in high-dimensional misspecified Cox models. J Inequal Appl 2020, 252 (2020). https://doi.org/10.1186/s13660-020-02517-3