
1 Introduction

Learning from positive and unlabeled (PU) data is an approach in which the training data contain only positive and unlabeled examples. This means that the true labels \(Y\in \left\{ 0,1\right\} \) are not observed directly; only a surrogate variable \(S\in \left\{ 0,1\right\} \) is observable, which equals 1 if an example is labeled and 0 otherwise. PU datasets appear in a large number of applications. For example, they often arise when dealing with the so-called under-reporting data from medical surveys, fraud detection and ecological modeling (see, e.g., Hastie and Fithian [6]). Some other interesting examples of under-reported survey data may be found in Bekker and Davis [1] and Teisseyre et al. [12].

Suppose that X is a feature vector and, as mentioned earlier, \(Y\in \left\{ 0,1\right\} \) denotes the true class label, while \(S\in \left\{ 0,1\right\} \) indicates whether an example is labeled (\(S=1\)) or not (\(S=0\)). We adopt a commonly used assumption, called the Selected Completely At Random (SCAR) condition, which states that the labeled examples are selected at random from the set of positive examples, independently of X, i.e. \(P(S=1|Y=1,X)=P(S=1|Y=1).\) Let \(c=P(S=1|Y=1)\). The parameter c is called the label frequency and plays a key role in the PU learning problem. The primary objective of our note is to introduce a new PU learning classification procedure leading to the estimation of the posterior probability \(f(x)=P(Y=1|X=x)\).
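To see why the label frequency c is central, note that under the SCAR assumption (and the fact that only positive examples can be labeled) the observed and the true posterior probabilities differ only by a multiplicative constant; this identity reappears in the Remark in Sect. 2:

$$\begin{aligned} P(S=1\mid X=x)=P(S=1\mid Y=1,X=x)\,P(Y=1\mid X=x)=c\,f(x), \quad \text {so that } f(x)=\frac{P(S=1\mid X=x)}{c}. \end{aligned}$$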

Three basic methods of this estimation have been proposed so far. They consist in minimizing the empirical risk based on the logistic loss function and are known as the naive method, the weighted method, and the joint method (the last one was recently introduced in the paper of Teisseyre et al. [12]). All of these approaches have been thoroughly described in [12]. As the joint method is applied in the construction of our procedure, some details regarding this method are presented in the next section of our article. We have named our proposed classification method the LassoJoint procedure, since it is a three-step approach combining the thresholded Lasso procedure with the joint method from Teisseyre et al. [12]. Namely, in its first two steps we perform, for some prespecified level, the thresholded Lasso procedure in order to obtain the support of the coefficients of the feature vector X, while in the third step we apply the joint method on the previously determined support.

Apart from the works in which different learning methods applying logistic regression to PU data have been proposed, there are also other interesting articles where various machine learning tools have been used for PU learning problems. In this context, it is worthwhile to mention the papers of Hou [7] and Guo [5], where generative adversarial networks (GANs) are employed for the PU problem, the work of Mordelet and Vert [10], where a bagging Support Vector Machine (SVM) approach is applied to PU data, and the article of Song and Raskutti [11], where the multidimensional PU problem is investigated with regard to feature selection and the so-called PUlasso design is established. It turns out that the LassoJoint procedure proposed in our work is computationally simple and efficient in comparison to other existing methods for the PU problem. The simplicity and efficiency of our approach have been confirmed by the conducted simulation study.

The remainder of the paper is structured as follows. In Sect. 2 we describe our classification procedure in detail; in particular, we prove that the introduced method is a so-called screening procedure (i.e., with high probability it selects the most significant predictors and the number of selected features is not greater than the sample size), as this screening property is necessary to apply the joint method in the final step of the procedure. In Sect. 3 we carry out a numerical study in order to check the efficiency of the proposed approach, while in Sect. 4 we summarize and conclude our research. The results of numerical experiments on real data are given in the Supplement (Footnote 1).

2 The Proposed LassoJoint Algorithm

In our considerations, we assume that we have a random vector (Y, X), where \(Y\in \left\{ 0,1\right\} \) and \(X\in \mathbb {R}^{p}\) is a feature vector, and that a random sample \(\left( Y_{1},X_{1}\right) ,\ldots ,\left( Y_{n},X_{n}\right) \) is distributed as \(\left( Y,X\right) \) and independent of it. In addition, we suppose that the coordinates \(X_{ji}\) of \(X_{i}\), \(i=1,\ldots ,n\), \(j=1,\ldots ,p\), are subgaussian with a parameter \(\sigma _{jn}^{2}\), i.e. \(E\exp \left( uX_{ji}\right) \le \exp \left( u^{2}\sigma _{jn}^{2}/2\right) \) for all \(u\in \mathbb {R}.\)
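For instance (a standard fact recalled here only for illustration), a centered Gaussian coordinate satisfies this condition with equality,

$$\begin{aligned} X_{ji}\sim N(0,\sigma _{jn}^{2})\;\Longrightarrow \;E\exp \left( uX_{ji}\right) =\exp \left( u^{2}\sigma _{jn}^{2}/2\right) \text { for all }u\in \mathbb {R}, \end{aligned}$$

so Gaussian features are subgaussian with parameter \(\sigma _{jn}^{2}\); centered bounded features are subgaussian as well (by Hoeffding's lemma).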

Let: \(s_{n}^{2}=\max _{1\le j\le p}\sigma _{jn}^{2}\), \(\lim \sup _{n}s_{n}^{2}<\infty \), and \(P\left( Y=1\mid X=x\right) =q(x^{T}\beta )\) for some function \(0<q(x)<1\) and all \(x\in \mathbb {R}^{p}\), where p may depend on n and \(p>n\). Put: \(I_{0}=\left\{ j:\beta _{j}\ne 0\right\} \), \( I_{1}=\{1,\ldots ,p\}\backslash I_{0}\) and \(\left| I_{0}\right| =p_{0}.\) We shall assume - as in Kubkowski and Mielniczuk [9] - that the distribution of X satisfies the linear regression condition (LRC), which means that

$$\begin{aligned} E(X\mid X^{T}\beta )=u_{0}+u_{1}X^{T}\beta \text { for some } u_{0},u_{1}\in \mathbb {R}^{p}. \end{aligned}$$

This condition is fulfilled (for all \(\beta \)) by the class of elliptical distributions (such as, e.g., the normal distribution or the multivariate Student's t distribution). Reasoning as in Kubkowski and Mielniczuk [9], we obtain that under (LRC) there exists \(\eta \) satisfying \(\beta ^{*}=\eta \beta ,\) where \(\eta \ne 0\) if \(\mathrm{cov}(Y,X^{T}\beta )\ne 0\), and where \(\beta ^{*}=\arg \min _{\beta }R(\beta ),\) with R standing for the risk function given by \(R\left( \beta \right) =-E_{\left( X,Y\right) }l\left( \beta ,X,Y\right) ,\) where in turn \(l\left( \beta ,X,Y\right) =Y\log \sigma (X^{T}\beta )+\left( 1-Y\right) \log \left( 1-\sigma (X^{T}\beta )\right) ,\) with \(\sigma \) denoting the logistic function of the form \(\sigma (X^{T}\beta )=\exp (X^{T}\beta )/\left[ 1+\exp (X^{T}\beta )\right] .\) Put: \(I_{0}^{*}=\left\{ j:\beta _{j}^{*}\ne 0\right\} \), \( I_{1}^{*}=\{1,\ldots ,p\}\backslash I_{0}^{*}\). It may be observed that under (LRC) we have \(I_{0}=I_{0}^{*}\), i.e. the supports of \(\beta \) and \(\beta ^{*}\) coincide. In addition, put \(H(b)=E(X^{T}X\sigma ^{\prime }(X^{T}b))\) and define a cone \( C(d,w)=\{\varDelta \in \mathbb {R}^{p}:\left\| \varDelta _{w^{c}}\right\| _{1}\le d\left\| \varDelta _{w}\right\| _{1}\}\), where: \(w\subseteq \{1,\ldots ,p\}\), \(w^{c}=\{1,\ldots ,p\}\backslash w\), \(\varDelta _{w}=(\varDelta _{w_{1}},\ldots ,\varDelta _{w_{k}})\), for \(w=(w_{1},\ldots ,w_{k})\). Furthermore, let \(\kappa \) be the generalized minimal eigenvalue of the matrix \(H(\beta ^{*})\) given by \(\kappa =\inf _{\varDelta \in C(3,I_{0}^{*})}\frac{\varDelta ^{T}H(\beta ^{*})\varDelta }{\varDelta ^{T}\varDelta }\text {.}\) Moreover, we also define \(\beta _{\min }^{*}\) and \(\beta _{\min }\) as \( \beta _{\min }^{*}:=\min _{j\in I_{0}^{*}}\left| \beta _{j}^{*}\right| \) and \(\beta _{\min }:=\min _{j\in I_{0}}\left| \beta _{j}\right| \), respectively.

After these preliminaries, we are now in a position to describe the proposed method. Our procedure, called the LassoJoint approach, is a three-step method (a minimal code sketch is given after the steps below):

(1) For the available PU dataset \((s_{i},x_{i})\), \(i=1,\ldots ,n\), we perform the ordinary Lasso procedure (see Tibshirani [14]) with some tuning parameter \(\lambda >0\), i.e. we compute the following Lasso estimator of \(\beta ^{*}\): \(\hat{\beta }^{(L)}=\arg \min _{\beta \in \mathbb {R}^{p+1}}\hat{R}(\beta )+\lambda \sum _{j=1}^{p}\left| \beta _{j}\right| ,\) where \(\hat{R}(\beta )=-\frac{1}{n}\sum _{i=1}^{n}\left[ s_{i}\log \left( \sigma (x_{i}^{T}\beta )\right) +\left( 1-s_{i}\right) \log \left( 1-\sigma (x_{i}^{T}\beta )\right) \right] \) and subsequently obtain the corresponding support \(\mathrm{Supp}^{(L)}=\{1\le j\le p:\hat{\beta }_{j}^{(L)}\ne 0\}\);

(2) We perform the thresholded Lasso for some prespecified level \(\delta \) and obtain the support \(\mathrm{Supp}^{(TL)}=\{1\le j\le p:\left| \hat{\beta } _{j}^{(L)}\right| \ge \delta \}\);

(3) We apply the joint method from Teisseyre et al. [12] for the predictors from \(\mathrm{Supp}^{(TL)}\).
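The first two (screening) steps can be sketched in R as follows; this is only an illustration under the tuning used later in Sect. 3, with the function name lassojoint_screen introduced solely for this sketch, and it is not the authors' reference implementation:

```r
library(glmnet)

# Step 1: ordinary Lasso fit of the PU labels s on the features x (logistic loss)
lassojoint_screen <- function(x, s, lambda, delta) {
  fit <- glmnet(x, s, family = "binomial", lambda = lambda)
  beta_hat <- as.matrix(coef(fit))[-1, 1]   # estimated coefficients, intercept dropped

  # Step 2: thresholded Lasso -- keep the indices j with |beta_hat_j| >= delta
  which(abs(beta_hat) >= delta)
}

# Example call with the tuning of Sect. 3: lambda = ((log p)/n)^(1/3), delta = lambda/2
# p <- ncol(x); n <- nrow(x)
# supp_tl <- lassojoint_screen(x, s, (log(p)/n)^(1/3), 0.5 * (log(p)/n)^(1/3))
# Step 3: apply the joint method of [12] to x[, supp_tl] (see the Remark and the sketch below).
```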

Remark. The PU problem is related to incorrect specification of the logistic model. Under the SCAR assumption, we obtain that \(P(S=1|X=x)=cq(x^T\beta )\) and consequently, if \(cq(\cdot )\ne \sigma (\cdot )\), then in step (1) we fit a misspecified logistic model to (S, X). Generally speaking, the joint method from [12] consists in fitting a logistic model to the PU data and minimizing, with respect to \(\beta \) and \(c=P(S=1\mid Y=1)\), the following empirical risk: \(\hat{R}\left( \beta ,c\right) =-\frac{1}{n}\sum _{i=1}^{n}\left[ s_{i}\log \left( c\sigma (x_{i}^{T}\beta )\right) +\left( 1-s_{i}\right) \log \left( 1-c\sigma (x_{i}^{T}\beta )\right) \right] ,\) where \(\sigma \) stands for the logistic function and \(\left\{ \left( s_{i},x_{i}\right) \right\} \) is a sample of observations from the distribution of the random vector \(\left( S,X\right) .\)
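For illustration, the minimization of \(\hat{R}(\beta ,c)\) can be carried out with a generic optimizer, as in the sketch below; the constant c is kept in (0, 1) through a logit reparametrization. This is only a hedged sketch of the objective from the Remark, not the implementation of [12]:

```r
# Joint method: minimize the empirical PU risk R_hat(beta, c) over (beta, c)
joint_risk <- function(theta, x, s) {
  p <- ncol(x)
  beta0 <- theta[1]                          # intercept
  beta  <- theta[2:(p + 1)]                  # slope coefficients
  c_hat <- plogis(theta[p + 2])              # c = P(S=1 | Y=1), via logit to stay in (0,1)
  sig   <- plogis(beta0 + as.vector(x %*% beta))
  prob  <- pmin(pmax(c_hat * sig, 1e-12), 1 - 1e-12)   # numerical safeguard
  -mean(s * log(prob) + (1 - s) * log(1 - prob))
}

fit_joint <- function(x, s) {
  p <- ncol(x)
  init <- c(rep(0, p + 1), 0)                # start at beta = 0, c = 0.5
  opt <- optim(init, joint_risk, x = x, s = s, method = "BFGS")
  list(beta = opt$par[1:(p + 1)], c = plogis(opt$par[p + 2]))
}

# Usage after screening (step 3): fit_joint(x[, supp_tl, drop = FALSE], s)
```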

The newly proposed LassoJoint procedure is similar to the LassoSD approach introduced in Furmańczyk and Rejchel [3]. The only difference between these two methods is that we apply the joint method from [12] in the last step of our procedure, whereas the authors of [3] use multiple hypothesis testing in the final stage. The introduced LassoJoint procedure is determined by the two parameters \( \lambda \) and \(\delta \), which may depend on n. The selection of \(\lambda \) and \(\delta \) is possible if we impose the following conditions, denoted as assumptions (A1)–(A4):

  1. (A1)

    The generalized eigenvalue \(\kappa \), of the matrix \(H(\beta ^{*})\), is such that \(m\le \kappa \le M\), for some \(0<m<M;\)

  2. (A2)

    \(p_{0}^{2}\log (p)=o(n)\), \(\log (p)=o(n\lambda ^{2})\), \(\lambda ^{2}p_{0}^{2}\log (np)=o(1),\) as \(n\rightarrow \infty ;\)

  3. (A3)

    \(p_{0}+\frac{c_{n}^{2}}{\delta ^{2}}\le n\), where \(c_{n}=10 \frac{\sqrt{p_{0}}}{\kappa }\lambda ;\)

  4. (A4)

    \(\beta _{\min }\ge \left( \delta +c_{n}/\sqrt{p_0}\right) /\eta .\)

Clearly, in view of (LRC), the condition from (A4) is equivalent to the constraint that \(\beta _{\min }^{*}\ge \delta +c_{n}/\sqrt{p_0}\). In addition, due to (A2), we get that \(c_{n}\rightarrow 0.\)
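As an illustration (our own quick check, not part of the original assumptions), consider the tuning used later in Sect. 3, namely \(\lambda =((\log p)/n)^{1/3}\) and \(\delta =\lambda /2\). The second condition in (A2) then reduces to a mild requirement on p,

$$\begin{aligned} \log (p)=o(n\lambda ^{2})=o\bigl (n^{1/3}(\log p)^{2/3}\bigr )\;\Longleftrightarrow \;\log p=o(n), \end{aligned}$$

which already follows from \(p_{0}^{2}\log (p)=o(n)\), while the third condition in (A2) amounts to \(p_{0}^{2}(\log p)^{2/3}\log (np)=o(n^{2/3})\).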

The main theoretical result of our work is the following assertion.

Theorem 1

(Screening property) Under the conditions (LRC) and (A1)–(A4), we have that with probability at least \(1-\epsilon _{n}\), where \(\epsilon _{n}\rightarrow 0\):

  1. (a)

    \(\left| \mathrm{Supp}^{(TL)}\right| \le p_{0}+\frac{c_{n}^{2}}{ \delta ^{2}}\le n\),

  2. (b)

    \(I_{0}\subset \mathrm{Supp}^{(TL)}\).

The presented theorem states that the proposed LassoJoint procedure is a so-called screening procedure, which means that (in the first two steps) this method selects, with high probability, the most significant predictors of the model and that the number of selected features is not greater than the sample size n. This screening property guarantees that, with high probability, we may apply the joint procedure from Teisseyre et al. [12], based on fitting logistic regression to PU data. The proof of Theorem 1 uses the following lemma, which follows straightforwardly from Theorem 4.9 in Kubkowski [8].

Lemma 1

Under the assumptions (A1)–(A2), we obtain that with probability at least \( 1-\epsilon _{n}\), where \(\epsilon _{n}\rightarrow 0 \), the following property holds: \(\left\| \hat{\beta }^{(L)}-\beta ^{*}\right\| _{2}\le c_{n},\) where \(c_{n}\rightarrow 0\) and \(\left\| x\right\| _{2}=\sqrt{ \sum _{j=1}^{p}x_{j}^{2}}\) for \(x\in \mathbb {R}^{p}\).

Proof

First, we prove the relation stated in (a). By the definition of \(\mathrm{Supp}^{(TL)}\), we get \(\sum _{j\in I_{1}\cap \mathrm{Supp}^{(TL)}}\left( \hat{\beta }_{j}^{(L)}\right) ^{2}\ge \delta ^{2}\left| I_{1}\cap \mathrm{Supp}^{(TL)}\right| \text {.}\) Hence,

$$\begin{aligned} \left| I_{1}\cap \mathrm{Supp}^{(TL)}\right| &\le \frac{1}{\delta ^{2}} \sum _{j\in I_{1}\cap \mathrm{Supp}^{(TL)}}\left( \hat{\beta }_{j}^{(L)}\right) ^{2} \\ &= \frac{1}{\delta ^{2}}\sum _{j\in I_{1}\cap \mathrm{Supp}^{(TL)}}\left( \hat{\beta }_{j}^{(L)}-\beta _{j}^{*}\right) ^{2}\le \frac{1}{\delta ^{2}} \left\| \hat{\beta }^{(L)}-\beta ^{*}\right\| _{2}^{2}, \end{aligned}$$

and

\(\left| \mathrm{Supp}^{(TL)}\right| \le \left| I_{0}\cap \mathrm{Supp}^{(TL)}\right| +\left| I_{1}\cap \mathrm{Supp}^{(TL)}\right| \le p_{0}+\frac{1}{\delta ^{2}}\left\| \hat{\beta }^{(L)}-\beta ^{*}\right\| _{2}^{2}\text {.}\)

It follows from Lemma 1 that with probability at least \(1-\epsilon _{n}\), where \(\epsilon _{n}\rightarrow 0\), we have \(\left| \mathrm{Supp}^{(TL)}\right| \le p_{0}+\frac{c_{n}^{2}}{\delta ^{2}} \text {.}\) Combining this inequality with (A3), we obtain (a). Thus, it remains to prove the property in (b).

Since \(\left\{ \min _{j\in I_{0}}\left( \hat{\beta }_{j}^{(L)}\right) ^{2}\ge \delta ^{2}\right\} \subseteq \left\{ I_{0}\subset \mathrm{Supp}^{(TL)}\right\} \) , it is sufficient to show that

$$\begin{aligned} P\left( \min _{j\in I_{0}}\left( \hat{\beta }_{j}^{(L)}\right) ^{2}\ge \delta ^{2}\right) \ge 1-\epsilon _{n} \text {.} \end{aligned}$$
(1)

Let:

\(\hat{\beta }_{I_{0}}^{(L)}:=\left\{ \hat{\beta }_{j}^{(L)}:j\in I_{0}\right\} \), \(\beta _{I_{0}}^{*}:=\left\{ \beta _{j}^{*}:j\in I_{0}\right\} \). As \(\left\| \hat{\beta }^{(L)}-\beta ^{*}\right\| _{2}^{2}\ge \left\| \hat{\beta }_{I_{0}}^{(L)}-\beta _{I_{0}}^{*}\right\| _{2}^{2}\), we have from Lemma 1 that with probability at least \(1-\epsilon _{n}\),

$$p_0\min _{j\in I_{0}}\left( \hat{\beta }_{j}^{(L)}-\beta _j^*\right) ^{2} \le \left\| \hat{\beta }_{I_{0}}^{(L)}-\beta _{I_{0}}^{*}\right\| _{2}^{2}=\sum _{j\in I_{0}}\left( \hat{\beta }_{j}^{(L)}-\beta _{j}^{*}\right) ^{2}\le c_{n}^{2}.$$

Hence, \(\min _{j\in I_{0}}\left| \hat{\beta }_{j}^{(L)}-\beta _j^*\right| \le c_n/\sqrt{p_0}.\) In addition, by the triangle inequality, we obtain that for \(j\in I_{0}\), \(\left| \hat{\beta }_{j}^{(L)}\right| \ge \left| \beta _{j}^{*}\right| -\left| \hat{\beta }_{j}^{(L)}-\beta _{j}^{*}\right| \) and therefore \(\min _{j \in I_0}\left| \hat{\beta }_{j}^{(L)}\right| \ge \min _{j \in I_0}\left| \beta _{j}^{*}\right| -c_n/\sqrt{p_0}. \) This and (A4) imply that with probability at least \(1-\epsilon _{n}\), \(\min _{j\in I_{0}}\left( \hat{\beta }_{j}^{(L)}\right) ^{2}\ge \delta ^2,\) which yields (1) and consequently (b).

3 Numerical Study

Suppose that \(X_{1},\ldots ,X_{p}\) are generated independently from N(0, 1), and \(Y_{i}\), \(i=1,\ldots ,n\), are generated from the \(\mathrm{binom}(1,p_{i})\) distribution, where \(p_{i}=\sigma (\beta _{0}+\beta _{1}X_{1i}+\ldots +\beta _{p}X_{pi})\) and \(\beta _{0}=1\). The following high-dimensional models were simulated (a data-generating sketch in R is given after the list):

  1. (M1)

    \(p_{0}=5,p=1.2\cdot 10^3,n=10^3, \beta _{1}=\ldots =\beta _{p_0}=1,\beta _{p_{0}+1}=\ldots =\beta _{p}=0\);

  2. (M2)

    \(p_{0}=5,p=1.2\cdot 10^3,n=10^3, \beta _{1}=\ldots =\beta _{p_0}=2,\beta _{p_{0}+1}=\ldots =\beta _{p}=0\);

  3. (M3)

    \(p_{0}=5,p=10^3,n=10^3, \beta _{1}=\ldots =\beta _{p_0}=2,\beta _{p_{0}+1}=\ldots =\beta _{p}=0\);

  4. (M4)

    \(p_{0}=20,p=10^3,n=10^3, \beta _{1}=\ldots =\beta _{p_0}=2,\beta _{p_{0}+1}=\ldots =\beta _{p}=0\);

  5. (M5)

    \(p_{0}=5,p=2\cdot 10^3,n=2\cdot 10^3, \beta _{1}=\ldots =\beta _{p_0}=2,\beta _{p_{0}+1}=\ldots =\beta _{p}=0\);

  6. (M6)

    \(p_{0}=5,p=2\cdot 10^3,n=2\cdot 10^3,\beta _{1}=\ldots =\beta _{p_0}=3,\beta _{p_{0}+1}=\ldots =\beta _{p}=0\).
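A minimal R sketch of this data-generating scheme, including the SCAR labeling of S used in the experiments below, could look as follows (function and argument names are introduced only for this illustration):

```r
# Generate one PU dataset of the type (M1)-(M6)
simulate_pu <- function(n, p, p0, signal, c, beta0 = 1) {
  x <- matrix(rnorm(n * p), nrow = n, ncol = p)     # X_1, ..., X_p ~ N(0, 1), independent
  beta <- c(rep(signal, p0), rep(0, p - p0))        # p0 nonzero coefficients
  prob <- plogis(beta0 + as.vector(x %*% beta))     # p_i = sigma(beta_0 + x_i^T beta)
  y <- rbinom(n, size = 1, prob = prob)             # true labels Y_i
  s <- rbinom(n, size = 1, prob = c * y)            # SCAR: P(S=1 | Y=1) = c, P(S=1 | Y=0) = 0
  list(x = x, y = y, s = s)
}

# Example: model (M2) with label frequency c = 0.5
# dat <- simulate_pu(n = 1000, p = 1200, p0 = 5, signal = 2, c = 0.5)
```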

For all of the specified models, the LassoJoint method was implemented. In its first step, the Lasso method was used with a tuning parameter \( \lambda \) chosen either on the basis of a 10-fold cross-validation scheme (the first scenario) or by putting \(\lambda =((\log p)/n)^{1/3}\) (the second scenario). In the second step, we applied the thresholded Lasso with \(\delta =0.5\cdot ((\log p)/n)^{1/3}\).
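In R, these two tuning scenarios could be realized, for instance, as below (an illustrative sketch, with x, s, p and n as in the earlier sketches; cv.glmnet comes from the 'glmnet' package [2]):

```r
library(glmnet)

# Scenario 1: lambda chosen by 10-fold cross-validation on the PU labels s
cv_fit    <- cv.glmnet(x, s, family = "binomial", nfolds = 10)
lambda_cv <- cv_fit$lambda.min

# Scenario 2: fixed-rate lambda, with the threshold delta tied to the same rate
lambda_fix <- (log(p) / n)^(1/3)
delta      <- 0.5 * (log(p) / n)^(1/3)
```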

In the third and final step of our procedure, the variables selected by the thresholded Lasso were passed to the joint method from [12] for the PU data classification problem. For each of the listed models, we randomly labeled \(c\cdot 100\%\) of the positive observations (i.e., set \(S=1\) for them), for \(c=0.1, 0.3, 0.5, 0.7, 0.9\). Next, we generated a test sample of size 1000 from our models and determined the classification accuracy based on 100 Monte Carlo replications of the experiment. The idea of our procedure's accuracy assessment is similar to that of Furmańczyk and Rejchel [4]. We applied the 'glmnet' package [2] in the R software [13] in our computations. The results of our simulation study are collected in Table 1 (the column 'oracle' shows the accuracy of the classifier that uses only the significant predictors and the true parameters of the logistic models).

Table 1. Results for M1–M6

Real data experiments and all R codes are presented in the Supplement, available at https://github.com/kfurmanczyk/ICCS21.

4 Conclusions

The results of our simulation study show that as c increases, the percentage of correct classifications increases as well. They also show that the classifications obtained with the proposed LassoJoint method display smaller classification errors (and thus better classification accuracy) for the models with larger signals (i.e., for the M5 and M6 models). Comparing the M3 and M4 models, we can see that as the number of significant predictors (\(p_{0}\)) increases, the classification accuracy slightly decreases. Furthermore, in all cases except \(c=0.1\), selecting the tuning parameter \(\lambda \) by cross-validation results in better classification accuracy. In addition, we may observe that for \(c=0.7\) or \(c=0.9\), our LassoJoint approach is nearly as good as the 'oracle' method. In turn, for \( c=0.1\) the classification accuracy was low, ranging from 0.41 to 0.60, while in the 'easiest' case, i.e. \(c=0.9\), it ranged from 0.7 to 0.82.

Furthermore, the results of our experiments conducted on real datasets show that the percentage of correct classifications likewise increases with c. These results also show similar classification accuracy among all of the considered classification methods (see Supplement). The proposed LassoJoint classification method for PU data allows, at relatively low computational cost, for the analysis of data in the high-dimensional case, i.e. when the number of predictors exceeds the size of the available sample (\(p>n\)). We aim to devote our further research to a more detailed analysis of the introduced procedure, in particular to the optimal selection of the model parameters.