1 Introduction

As an important statistical and machine learning method, classification has been widely used in genomic studies. Binary classification is the basis of all types of classification problems, and there exist many approaches to ensemble binary classifiers to solve multi-class classification problems or to reduce multi-class to binary classification. Important genomic applications of binary classification include labeling microarray data as tumor or non-tumor samples [18, 53], dividing genes into housekeeping and single-tissue specific groups [54], classifying genomic hairpin structures into precursor microRNAs and pseudo hairpins (i.e., genomic inverted repeats that are not precursor microRNAs) [32], and predicting transcription factor binding sites and other DNA regulatory elements based on genomic features [7, 20].

The aim of binary classification is to accurately predict binary (i.e., 0 or 1) labels for new observations on the basis of labeled training data. There are two types of errors: the type I error (the conditional probability that the predicted label for a new observation is 1 given that the observation has a true label 0) and the type II error (the conditional probability that the predicted label for a new observation is 0 given that the observation has a true label 1). For more than half a century, significant advances have been made in the development of binary classification theory and methods to construct good classifiers with various desirable properties [24]. Most existing binary classification methods aim to optimize the risk, which is the expected classification error (the probability that the predicted label is different from the true label) and can be expressed as a weighted sum of the type I and II errors, where the two weights are the marginal probabilities of the true label being 0 and 1, respectively. In real-world applications, however, users’ priorities for the type I and type II errors may differ from these weights, and minimizing the risk may then lead to unsatisfactory classifiers. For example, in tumor diagnosis, suppose that we label a tumor sample as 0 and a normal sample as 1. The risk minimization approach fails if it leads to a classifier whose type I error (i.e., the conditional probability that a tumor sample is misclassified as a normal sample) equals 0.3, while doctors prefer to constrain the type I error under 0.05.

There are many scenarios where users need asymmetric error control, and they often occur when the two types of classification errors lead to vastly different consequences. Again in the example of tumor diagnosis, mispredicting a normal sample as a tumor sample may increase a patient’s anxiety and impose additional medical costs, but misclassifying a tumor sample as a normal sample may delay a patient’s treatment and even cost a life. Hence, the latter error, the type I error, is more severe and should be controlled at a low level. In another example, classifying genes into housekeeping ones (say class 0) and cell-specific ones (say class 1), if the research aim is to identify novel cell-specific genes for a cell type (e.g., human embryonic stem cells) and the identified genes will be validated by experiments, researchers would generally prefer to control the type I error (the conditional probability of misclassifying a housekeeping gene as a cell-specific gene) at a low level to reduce experimental costs.

One common approach to addressing asymmetric classification errors is cost-sensitive learning, which allows users to assign two different costs as weights for the type I and type II errors [14, 56]. Although this approach has many merits, its effectiveness is largely limited when there is no consensus on how to assign the costs. Cost-sensitive learning is also unable to serve the purpose when users desire a probabilistic guarantee that the type I or II error stays below a specific bound α [e.g., \(\mathbb{P}(\mbox{ type I error} \leq \ \alpha ) > 1-\delta\), i.e., the probability that a chosen classifier has type I error not exceeding α is greater than 1 −δ for some small positive α (e.g., 0.05) and δ (e.g., 0.05)], even though users may vary the two costs to achieve a small type I or type II error. There are several other classification approaches that target small type I errors. Examples include asymmetric support vector machines (SVM) [52] and p-values for classification [13]. But like cost-sensitive learning, these approaches provide no probabilistic guarantee for the type I error bound and could lead to a non-negligible probability of large type I errors. A long-standing, intuitive common practice is to tune the observed type I error (also called the empirical type I error) on the training data to the desired type I error bound α, for example, by adjusting the costs of the two errors or by changing the classification threshold. However, this approach cannot control the type I error of the chosen classifier on a new data set under α with high probability; in fact, a classifier chosen in this way will have type I error greater than α with approximately 50% probability. Figure 1 illustrates this phenomenon with a simple two-class Gaussian example.

Fig. 1

An example to illustrate that tuning the empirical type I error on training data to α cannot control the type I error on test data under α with high probability. The population is a two-class Gaussian distribution, where X follows N(−1, 1) under class 0 and N(1, 1) under class 1. The two classes have equal probabilities. A training data set with size n = 1000 is generated from this population, and a threshold t = 0.635 (the dark blue vertical line) is chosen so that the resulting classifier \(\mathbb{I}(X \geq t)\) has the observed (empirical) type I error equal to α = 0.05 on the training data. This classifier is then applied to B = 1000 test data sets from the same population, and the resulting empirical type I errors on these test data sets are summarized in the histogram, which shows that approximately 50% of the type I errors are greater than α and 18.1% of the errors are even greater than 0.06
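The phenomenon in Fig. 1 is easy to reproduce. The R sketch below is our own minimal illustration (not the code used to produce the figure; the seed and sample sizes are arbitrary): for many training sets, it tunes the threshold so that the empirical type I error equals 0.05 and then records the true type I error of the resulting classifier.

> # Minimal sketch of the phenomenon in Fig. 1 (illustrative only)
> set.seed(1)
> type1 <- replicate(1000, {
+   y <- rbinom(1000, 1, 0.5)                        # equal class probabilities
+   x <- rnorm(1000, mean = ifelse(y == 1, 1, -1))   # two-class Gaussian data
+   t <- quantile(x[y == 0], probs = 0.95)           # empirical type I error = 0.05
+   pnorm(t, mean = -1, sd = 1, lower.tail = FALSE)  # true type I error of I(X >= t)
+ })
> mean(type1 > 0.05)   # roughly 0.5: about half of the tuned classifiers violate alpha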

Unlike the above approaches, Neyman–Pearson (NP) classification, which was motivated by the century-long Neyman–Pearson paradigm in hypothesis testing, specifically aims to bound one type of error with high probability while minimizing the other type of error. The main advantage of NP classification is that it provides a high-probability guarantee on controlling one type of error under a user-desired level.

This chapter is organized as follows. Section 2 provides a review of the Neyman–Pearson classification paradigm, including its theoretical and algorithmic advances. Section 3 presents three simulation examples to demonstrate how to implement the NP classification with popular classification algorithms (logistic regression, SVM, and random forests) that are widely used in genomic applications. Section 4 implements the NP classification on a genomic case study, where the goal is to classify DNA regions containing transcription factor motifs into two classes: transcription factor binding sites and non-binding sites, using two genomic features (absolute DNase-seq tag counts and DNase-seq footprint scores). Section 5 describes future research directions and potential genomic applications of the NP classification.

2 Neyman–Pearson Paradigm

A few commonly used notations are set up to facilitate our discussion. Let (X, Y ) be random variables where \(X \in \mathcal{X} \subset \mathbb{R}^{d}\) is a vector of features and Y ∈ {0, 1} is a class label. A classifier h(⋅) is a mapping \(h: \mathcal{X} \rightarrow \{ 0,1\}\) that returns the predicted class given X. An error occurs when h(X) ≠ Y, and the binary loss is \(\mathbb{I}(h(X)\neq Y )\), where \(\mathbb{I}(\cdot )\) denotes the indicator function. The risk is the expected loss with respect to the joint distribution of (X, Y ): \(R(h) =\mathrm{ E}\left [\mathbb{I}(h(X)\neq Y )\right ] = \mathbb{P}\left (h(X)\neq Y \right ),\) which can be expressed as a weighted sum of the type I and II errors: \(R(h) = \mathbb{P}(Y = 0)R_{0}(h) + \mathbb{P}(Y = 1)R_{1}(h),\) where \(R_{0}(h) = \mathbb{P}\left (h(X)\neq Y \vert Y = 0\right )\) denotes the type I error, and \(R_{1}(h) = \mathbb{P}\left (h(X)\neq Y \vert Y = 1\right )\) denotes the type II error. While classical binary classification aims to minimize the risk R(⋅), NP classification aims to mimic the NP oracle classifier \(\phi ^{{\ast}}\), which is defined as

$$\displaystyle{\phi ^{{\ast}} =\mathop{ \mathrm{arg\ min}}\limits _{\phi:\ R_{0}(\phi )\leq \alpha }R_{1}(\phi )\,,}$$

where the user-specified level α reflects a conservative attitude (priority) towards the type I error. Figure 2 shows a toy example that demonstrates the difference between a classical classifier that aims to minimize the risk and an NP classifier.

Fig. 2

Classical vs. NP classification in two binary classification examples. (a) A one-dimensional toy example where X has a two-class Gaussian distribution. X follows N(−1, 1) under class 0 and N(1, 1) under class 1. The two classes have equal marginal probabilities. Suppose that a user prefers a type I error ≤ 0.05. The classical classifier \(\mathbb{I}(X \geq 0)\) that minimizes the risk would result in a type I error = 0.16 > 0.05. On the other hand, the NP classifier \(\mathbb{I}(X \geq 0.65)\) that minimizes the type II error under the type I error constraint (≤ 0.05) delivers the desirable type I error. (b) A two-dimensional toy example where (X 1, X 2) has a two-class uniform distribution over circles. (X 1, X 2) follows a uniform distribution on \(\{X_{1}^{2} + X_{2}^{2} \leq 4\}\) under class 0 and a uniform distribution on \(\{(X_{1} - 2)^{2} + (X_{2} - 2)^{2} \leq 4\}\) under class 1. The two classes have equal marginal probabilities. Suppose that a user prefers a classifier that is linear in X 1 and X 2 and has type I error ≤ 0.05. The classical classifier \(\mathbb{I}(X_{1} + X_{2} \leq 2)\) that minimizes the risk would result in a type I error = 0.29 > 0.05. On the other hand, the NP classifier \(\mathbb{I}(X_{1} + X_{2} \leq 2.28)\) that minimizes the type II error under the type I error constraint (≤ 0.05) delivers the desirable type I error
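The population quantities in panel (a) can be verified directly from the Gaussian model; the short R check below is our own (not the code used to produce the figure).

> # Verifying the population quantities in panel (a)
> pnorm(0, mean = -1, lower.tail = FALSE)   # type I error of I(X >= 0), about 0.16
> t_np <- qnorm(0.95, mean = -1); t_np      # NP threshold, about 0.645
> pnorm(t_np, mean = 1)                     # type II error of the NP classifier, about 0.36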

Earlier work on NP classification came from the engineering community. Early theoretical development for NP classification includes traditional statistical learning results such as probably approximately correct bounds and oracle-type inequalities [8, 38, 40]. Performance measures for NP classification were then proposed [39]. More recently, a general solution to semi-supervised novelty detection via reduction to NP classification was developed [3]. There is also other related work [9, 19]. All these works follow an empirical risk minimization (ERM) approach and suffer from a common limitation: a relaxed empirical type I error constraint is used in the optimization program, and as a result, the type I error can only be shown to satisfy a relaxed upper bound that is larger than α.

We have worked extensively on NP classification using ERM and plug-in approaches [37, 46, 58]. We initiated a significant departure from the previous NP classification literature in [37] by arguing that a good classifier \(\hat{\phi }\) under the NP paradigm should respect the chosen significance level α, rather than some relaxation of it. More concretely, the following two properties should both be satisfied with high probability.

  (1) The type I error constraint is respected, i.e., \(R_{0}(\hat{\phi }) \leq \alpha\).

  (2) The excess type II error \(R_{1}(\hat{\phi }) - R_{1}(\phi ^{{\ast}})\) diminishes with an explicit rate (with respect to the sample size).

We say a classifier satisfies the NP oracle inequalities if it has properties (1) and (2) with high probability. The NP oracle inequalities measure the theoretical performance of classifiers under the NP paradigm and define an NP counterpart of the well-established oracle inequalities under the classical paradigm (see [23] and the references therein). In contrast, for a classifier \(\hat{h}\), the classical oracle inequality insists that with high probability,

$$\displaystyle{\mbox{ the excess risk $R(\hat{h}) - R(h^{{\ast}})$ diminishes with an explicit rate,}}$$

where \(h^{{\ast}}(x) = \mathbb{I}(\eta (x) \geq 1/2)\) is the Bayes classifier under the classical paradigm, with \(\eta (x) =\mathrm{ E}[Y \vert X = x] = \mathbb{P}(Y = 1\vert X = x)\) denoting the regression function of Y on X.

Using a more stringent empirical type I error constraint (less than α), we established NP oracle inequalities for the proposed classifiers under convex loss functions (as opposed to the binary loss) [37]. We also proved a negative result by constructing a counterexample: under the binary loss, ERM approaches cannot guarantee a diminishing excess type II error if one insists that the type I error of the classifier be bounded from above by α with high probability. This negative result motivated us to develop a plug-in approach to NP classification, described in [46].

In classical binary classification, plug-in methods that target the Bayes classifier \(\mathbb{I}(\eta (x) \geq 1/2)\) have been studied extensively. The earliest works cast doubt on the efficacy of the plug-in approach to classification. For example, it was shown that plug-in estimators cannot achieve excess risk rates faster than \(O(1/\sqrt{n})\) under certain assumptions [55], while direct methods can achieve fast rates up to O(1∕n) under the margin assumption [30, 45, 49, 50]. However, a more recent work combined a smoothness condition on η with the margin assumption and showed that plug-in classifiers \(\mathbb{I}(\hat{\eta }_{n} \geq 1/2)\) based on local polynomial estimators can achieve rates faster than O(1∕n) [1].

The oracle classifier under the NP paradigm arises from its close connection to the Neyman–Pearson Lemma in statistical hypothesis testing. Hypothesis testing bears a strong resemblance to binary classification if we assume the following model. Let P 1 and P 0 be two known probability distributions on \(\mathcal{X} \subset \mathbb{R}^{d}\). Let ζ ∈ (0, 1) and assume that Y ∼ Bernoulli(ζ). Assume further that the conditional distribution of X given Y is P Y . Given such a model, the goal of statistical hypothesis testing is to determine whether X was generated from P 1 or from P 0. To this end, we construct a randomized test \(\phi: \mathcal{X} \rightarrow [0,1]\); the conclusion of the test based on ϕ is that X is generated from P 1 with probability ϕ(X) and from P 0 with probability 1 −ϕ(X). Two types of errors arise: a type I error occurs when P 0 is rejected given X ∼ P 0, and a type II error occurs when P 0 is not rejected given X ∼ P 1. The Neyman–Pearson paradigm in hypothesis testing amounts to choosing ϕ that

$$\displaystyle{\text{maximizes }E[\phi (X)\vert Y = 1]\,,\text{ subject to }E[\phi (X)\vert Y = 0] \leq \alpha \,,}$$

where α ∈ (0, 1) is the significance level of the test. A solution to this constrained optimization problem is called a most powerful test of level α. The Neyman–Pearson Lemma gives mild sufficient conditions for the existence of such a test.

Theorem 1 (Neyman–Pearson Lemma).

Let P 0 and P 1 be probability distributions possessing densities q and p, respectively, with respect to some measure μ. Let \(r(x) = p(x)/q(x)\) and \(C_{\alpha }\) be such that \(P_{0}(r(X) > C_{\alpha }) \leq \alpha\) and \(P_{0}(r(X) \geq C_{\alpha }) \geq \alpha\). Then for a given level α, the most powerful test of level α is defined by

$$\displaystyle{\phi ^{{\ast}}(X) = \left \{\begin{array}{lll} 1 &~~~\text{if }&r(X) > C_{\alpha } \\ 0 &~~~\text{if }&r(X) < C_{\alpha } \\ \frac{\alpha -P_{0}(r(X)>C_{\alpha })} {P_{0}(r(X)=C_{\alpha })} & ~~~\text{if }&r(X) = C_{\alpha }\, \end{array} \right..}$$

Therefore, our plug-in target under the NP paradigm is the oracle classifier

$$\displaystyle{\phi ^{{\ast}}(x) = \mathbb{I}(r(x) \geq C_{\alpha }) = \mathbb{I}(\eta (x) \geq D_{\alpha }),\text{ where }D_{\alpha } = \frac{\mathbb{P}(Y = 1)C_{\alpha }} {\mathbb{P}(Y = 1)C_{\alpha } + \mathbb{P}(Y = 0)}\,.}$$
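For completeness, the equivalence between thresholding r at C α and thresholding η at D α follows from a one-line application of Bayes’ rule (this short derivation is not in the original text but uses only the definitions above):

$$\displaystyle{\eta (x) = \mathbb{P}(Y = 1\vert X = x) = \frac{\mathbb{P}(Y = 1)p(x)} {\mathbb{P}(Y = 1)p(x) + \mathbb{P}(Y = 0)q(x)} = \frac{\mathbb{P}(Y = 1)r(x)} {\mathbb{P}(Y = 1)r(x) + \mathbb{P}(Y = 0)}\,,}$$

where the last equality divides the numerator and denominator by q(x); since the map \(t\mapsto \mathbb{P}(Y = 1)t/\{\mathbb{P}(Y = 1)t + \mathbb{P}(Y = 0)\}\) is strictly increasing, \(r(x) \geq C_{\alpha }\) holds if and only if \(\eta (x) \geq D_{\alpha }\).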

Note that under the classical paradigm, the oracle classifier \(\mathbb{I}(\eta (x) \geq 1/2)\) puts a threshold on the regression function η at precisely 1∕2, so plug-in methods do not involve estimating the threshold level. In contrast, the NP paradigm poses more challenges because the threshold level C α or D α needs to be estimated in addition to r(x) or η(x).

Also note that in practice, the threshold on \(\hat{\eta }(x)\) is often not set to 1∕2 but is chosen by data-driven approaches such as cross validation and bootstrap. In contrast to the NP classification, these data-driven approaches aim to minimize an estimated classification risk, not the type II error with a type I error constraint.

2.1 An Estimate of C α

Pinning down a good estimate of C α is of central importance for classifiers under the NP paradigm. Contrary to common intuition, naïvely tuning the empirical type I error to α does not deliver a desirable classifier, as we have shown in Fig. 1. To facilitate our discussion, we assume that our sample contains n i.i.d. observations \(\mathcal{S}^{1} =\{ U_{1},\cdots \,,U_{n}\}\) from class 1 with density p, and m i.i.d. observations \(\mathcal{S}^{0} =\{ V _{1},\cdots \,,V _{m}\}\) from class 0 with density q. The sample \(\mathcal{S}^{0}\) is decomposed as follows: \(\mathcal{S}^{0} = \mathcal{S}_{1}^{0} \cup \mathcal{S}_{2}^{0}\), where \(\vert \mathcal{S}_{1}^{0}\vert = m_{1}\) and \(\vert \mathcal{S}_{2}^{0}\vert = m_{2}\). Below is a generic procedure introduced in our recent paper [58].

General Neyman–Pearson Plug-In Procedure

Step 1:

Use \(\mathcal{S}^{1}\) and \(\mathcal{S}_{1}^{0}\) to construct a density ratio estimate \(\hat{r}\).

Step 2:

Given \(\hat{r},\) choose a threshold estimate \(\widehat{C} _{\alpha }\) from the set \(\hat{r}(\mathcal{S}_{2}^{0}) =\{\hat{r}(V _{i+m_{1}})\}_{i=1}^{m_{2}}.\) Denote by \(\hat{r}_{(k)}(\mathcal{S}_{2}^{0})\) the kth order statistic of \(\hat{r}(\mathcal{S}_{2}^{0})\), k ∈ {1, ⋯ , m 2}. The corresponding plug-in classifier obtained by setting \(\widehat{C} _{\alpha } = \hat{r}_{(k)}(\mathcal{S}_{2}^{0})\) is

$$\displaystyle\begin{array}{rcl} \hat{\phi }_{k}(x) = \mathbb{I}\{\hat{r}(x) \geq \hat{r}_{(k)}(\mathcal{S}_{2}^{0})\}\,.& & {}\end{array}$$
(1)

The general strategy is that, for any given estimate \(\hat{r}\), we want to find a proper order statistic \(\hat{r}_{(k)}(\mathcal{S}_{2}^{0})\) to estimate the threshold C α , so that the type I error of the classifier defined in (1) is bounded above by α with probability at least 1 −δ. To achieve this, it is necessary to study the distribution of order statistics, which (after the probability integral transform) are beta-distributed. Based on a concentration inequality for beta-distributed variables, we have derived the following high probability bound for \(R_{0}(\hat{\phi }_{k})\):

Proposition 1.

Suppose \(\hat{r}\) is such that \(F_{0,\hat{r}}(t) = P_{0}(\hat{r}(X) \leq t)\) is continuous almost surely. For any δ ∈ (0,1) and k ∈{ 1,⋯ ,m 2 }, it holds that

$$\displaystyle\begin{array}{rcl} \mathbb{P}\left (R_{0}(\hat{\phi }_{k}) > g(\delta,m_{2},k)\right ) \leq \delta \,,& & {}\\ \end{array}$$

where

$$\displaystyle{ g(\delta,m_{2},k) = \frac{m_{2} + 1 - k} {m_{2} + 1} + \sqrt{ \frac{k(m_{2 } + 1 - k)} {\delta (m_{2} + 2)(m_{2} + 1)^{2}}}\,. }$$

Let \(\mathcal{K} = \mathcal{K}(\alpha,\delta,m_{2}) = \left \{k \in \{ 1,\cdots \,,m_{2}\}: g(\delta,m_{2},k) \leq \alpha \right \}.\) Proposition 1 implies that \(k \in \mathcal{K}(\alpha,\delta,m_{2})\) is a sufficient condition for the classifier \(\hat{\phi }_{k}\) to satisfy property (1) of the NP oracle inequalities. The next step is to characterize \(\mathcal{K}\). Choosing the smallest \(k \in \mathcal{K}\) keeps the excess type II error of \(\hat{\phi }_{k}\) small; for details, please see [58].

Proposition 2.

The minimum k that satisfies g(δ,m 2 ,k) ≤α is \(k_{\min }:= \left \lceil A_{\alpha,\delta }(m_{2}) \cdot (m_{2} + 1)\right \rceil,\) where ⌈z⌉ denotes the smallest integer larger than or equal to z and

$$\displaystyle{ A_{\alpha,\delta }(m_{2}) = \frac{1 + 2\delta (m_{2} + 2)(1-\alpha ) + \sqrt{1 + 4\delta (1-\alpha )\alpha (m_{2 } + 2)}} {2\left [\delta (m_{2} + 2) + 1\right ]} \,. }$$
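To make Propositions 1 and 2 concrete, the following R sketch (ours; the values of α, δ, and m 2 are arbitrary examples) computes g(δ, m 2, k), finds the admissible set \(\mathcal{K}\) by direct search, and checks that its minimum matches the closed-form k min. Note that \(\mathcal{K}\) can be empty when m 2 is too small relative to α and δ.

> # Illustration of Propositions 1 and 2 (alpha, delta, m2 are example values)
> g <- function(delta, m2, k) {
+   (m2 + 1 - k) / (m2 + 1) +
+     sqrt(k * (m2 + 1 - k) / (delta * (m2 + 2) * (m2 + 1)^2))
+ }
> alpha <- 0.05; delta <- 0.05; m2 <- 1000
> K <- which(g(delta, m2, 1:m2) <= alpha)   # admissible orders k (may be empty)
> min(K)                                    # k_min by direct search
[1] 974
> A <- (1 + 2 * delta * (m2 + 2) * (1 - alpha) +
+       sqrt(1 + 4 * delta * (1 - alpha) * alpha * (m2 + 2))) /
+      (2 * (delta * (m2 + 2) + 1))
> ceiling(A * (m2 + 1))                     # k_min by Proposition 2
[1] 974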

The choice k min coupled with a good estimate of r or η delivers a plug-in NP classifier that satisfies the NP oracle inequalities. We have worked out estimates based on parametric and nonparametric naïve Bayes models [58], but estimates for more complex model assumptions are not yet developed. While these directions are interesting to explore, we would like to note a limitation in the use of theoretical estimates for the threshold C α . That is, the theoretical results require concentration inequalities, which are not specific to certain types of data distributions and sometimes give threshold estimates that are too conservative in practice. Therefore, we have developed an alternative route to implement the NP paradigm. This route makes the NP classification more adaptable to popular classification algorithms and thus more useful in practice.

Algorithm 1: The NP umbrella classification algorithm

2.2 The NP Umbrella Algorithm

Here we present the alternative route, the NP umbrella algorithm we developed in [47], as pseudocode in Algorithm 1. The essential idea is to use the bootstrap to approximate the distribution of type I errors and determine a threshold such that the corresponding classifier has its type I error bounded by a predefined level with high probability. This algorithm is widely applicable to the scoring type of classification methods, which includes a wide range of popular methods, such as logistic regression [11], SVM [10], random forests [6], naïve Bayes [26], and neural networks [43]. Methods of the scoring type output a numeric value, i.e., a classification score, to represent the degree to which a test data point belongs to class 1. The classification scores can be strict probabilities or uncalibrated numeric values, as long as a higher score indicates a higher probability of an observation belonging to class 1. Many other classification methods that only output class labels can be converted to the scoring type via bagging: an ensemble of classifiers is generated, each of which predicts a class label for a test data point, and the proportion of predicted labels equal to 1 serves as a classification score. Since almost all state-of-the-art classification methods belong to or can be converted to the scoring type, the NP umbrella algorithm is easily adaptable in practice, though its theoretical properties are difficult to establish.
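Algorithm 1 itself is given in [47]; the R sketch below is only our simplified illustration of the idea just described, not the exact published algorithm: bootstrap the class 0 scores to approximate the distribution of the empirical type I error at each candidate threshold, and keep the smallest threshold whose estimated violation probability is at most δ. The function name, the toy Gaussian scores, and the small bootstrap size are our own choices.

> # Simplified illustration of the umbrella idea (not the exact algorithm of [47]):
> # choose a score threshold whose type I error exceeds alpha with estimated
> # probability at most delta.
> np_threshold <- function(scores0, alpha = 0.05, delta = 0.05, B = 200) {
+   # scores0: classification scores of class 0 data left out of score training
+   cand <- sort(scores0)                  # candidate thresholds
+   viol <- sapply(cand, function(t) {
+     # bootstrap estimate of P(empirical type I error at t exceeds alpha)
+     mean(replicate(B, mean(sample(scores0, replace = TRUE) >= t) > alpha))
+   })
+   cand[which(viol <= delta)[1]]          # smallest admissible threshold
+ }
> set.seed(1)
> t_np <- np_threshold(rnorm(500))         # toy class 0 scores
> pnorm(t_np, lower.tail = FALSE)          # true type I error at t_np, typically below 0.05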

3 Simulation

In this section, we demonstrate the use of NP classification with three popular classification algorithms: logistic regression [11], SVM [10], and random forests [6]. The three simulation examples, each employing one algorithm, are implemented by calling the R package nproc [15], which we developed in recent work [47].

The nproc package can be installed by calling

> install.packages("nproc")

and loaded into the R environment using the next command. All the following numerical results were generated by the nproc package version 0.1.

> library(nproc)

We first simulate a training data set from a logistic regression model. The training data have 1000 observations with binary responses and two-dimensional features.

> # training data

> set.seed(1)

> x1 <- rnorm(1000)      # feature 1

> x2 <- rnorm(1000)      # feature 2

> x <- cbind(x1, x2)     # matrix of features

> z <- 1 + 2*x1 + 3*x2   # linear combination of the two features

> pr <- 1/(1+exp(-z))    # logistic function to generate probability

> y <- rbinom(1000,1,pr) # response as a Bernoulli variable

> df <- data.frame(x1=x1, x2=x2, y=y)

Figure 3 shows a scatterplot of the training data. We also simulate 1000 test data sets, each with 1000 observations, from the same model to evaluate the performance of classifiers.

Fig. 3

Scatterplot of the training data with 1000 observations and two-dimensional features. Black circles and red crosses represent class 0 and class 1, respectively

> # test data

> test_data <- lapply(1:1000, FUN=function(i) {

+   set.seed(i+1)

+   x1 <- rnorm(1000)

+   x2 <- rnorm(1000)

+   x <- cbind(x1, x2)

+   z <- 1 + 2*x1 + 3*x2

+   pr <- 1/(1+exp(-z))

+   y <- rbinom(1000, 1, pr)

+   df <- data.frame(x1=x1, x2=x2, y=y)

+   return(list(x=x, y=y, df=df))

+ })

3.1 Logistic Regression

Logistic regression is a type of generalized linear model and can serve as a binary classification method. One way to interpret logistic regression is that it models

$$\displaystyle{p:= \mathbb{P}(Y = 1\vert X) = \frac{1} {1 + e^{-(\beta _{0}+\beta ^{T}X)}}}$$

for a binary response Y with d features \(X \in \mathbb{R}^{d}\). Given training data \(\{(x_{i},y_{i})\}_{i=1}^{n}\), logistic regression estimates β 0 and β by their maximum likelihood estimates \(\hat{\beta }_{0}\) and \(\hat{\beta }\), and then estimates p as \(\hat{p} = \left (1 + e^{-(\hat{\beta }_{0}+\hat{\beta }^{T}X) }\right )^{-1}\). This estimated probability can be interpreted as a classification score, as logistic regression predicts Y by a linear decision rule, i.e., \(\hat{Y }= \mathbb{I}(\hat{\beta }^{T}X \geq c) = \mathbb{I}\left (\hat{p} \geq \left (1 + e^{-\hat{\beta }_{0}-c}\right )^{-1}\right )\) for some threshold \(c \in \mathbb{R}\). Under the classical classification paradigm, \(c = -\hat{\beta }_{0}\), which corresponds to \(\hat{p} \geq 1/2\). Under the NP paradigm, we potentially have different choices for c. By regarding \(\hat{p}\) as a classification scoring function of X, we use the umbrella algorithm to find a threshold on \(\hat{p}\) so that the resulting classifier has type I error below the desired level α with probability at least 1 −δ.
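As a quick numeric check of the equivalence between the linear rule and the rule on \(\hat{p}\) (our illustration; the coefficient values, the threshold cc, and the point x_new below are arbitrary):

> # The linear rule and the rule on p_hat agree for any threshold
> beta0 <- 1; beta <- c(2, 3); cc <- 0.8; x_new <- c(0.2, 0.1)
> p_hat <- 1 / (1 + exp(-(beta0 + sum(beta * x_new))))
> (sum(beta * x_new) >= cc) == (p_hat >= 1 / (1 + exp(-beta0 - cc)))
[1] TRUE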

We use the simulated data to demonstrate the use of NP classification with logistic regression and compare it with the classical paradigm. We first train a logistic regression model on the training data under the classical paradigm.

> lr_model1 <- glm(y~x1+x2, data=df, family="binomial")

Then we apply the trained model lr_model1 to the 1000 test data sets to evaluate the distribution of its empirical type I errors on test data.

> lr_model1_err <- sapply(test_data, FUN=function(tdat) {

+   pred <- predict(lr_model1, tdat$df, type="response") > 0.5

+   ind0 <- which(tdat$y==0)

+   typeI <- mean(pred[ind0]!=tdat$y[ind0])

+ })

> summary(lr_model1_err)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

 0.1290  0.1723  0.1846  0.1849  0.1980  0.2368

We next train a logistic regression model on the training data under the NP paradigm with type I error bound α = 0.05, using the npc function.

> set.seed(1001)

> # The seed is set for reproducibility, because npc() involves bootstrap,
> # whose results are reproducible only with a fixed seed.

> lr_model2 <- npc(x=x, y=y, method='logistic', alpha=0.05)

Then we also apply the trained model lr_model2 to the test data.

> lr_model2_err <- sapply(test_data, FUN=function(tdat){

+   pred <- predict(lr_model2, tdat$x)

+   ind0 <- which(tdat$y==0)

+   typeI <- mean(pred$pred.label[ind0]!=tdat$y[ind0])

+ })

> summary(lr_model2_err)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.

0.009732 0.027230 0.033530 0.033950 0.040400 0.062330

> sum(lr_model2_err <= 0.05) / 1000

[1] 0.948

Comparing the empirical type I errors of the two logistic regression classifiers found under the classical and the NP paradigm, respectively, we can see that the NP classifier gives much smaller type I errors, 94.8% of which are under α = 0.05.

3.2 Support Vector Machines

Similar to logistic regression, SVM is also a scoring type of classification method, for which approximate posterior probabilities of class labels proposed by Platt [35] can be used as classification scores. We demonstrate the use of NP classification with SVM on the simulated data as follows.

We first train an SVM model on the training data under the NP paradigm.

> set.seed(1001)

> svm_model <- npc(x=x, y=y, method='svm', alpha=0.05)

Then we apply the trained model svm_model to the 1000 test data sets to evaluate the distribution of its empirical type I errors on test data.

> svm_model_err <- sapply(test_data, FUN=function(tdat) {

+   pred <- predict(svm_model, tdat$x)

+   ind0 <- which(tdat$y==0)

+   typeI <- mean(pred$pred.label[ind0]!=tdat$y[ind0])

+ })

> summary(svm_model_err)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.

0.007772 0.023530 0.029560 0.029900 0.035350 0.062330

> sum(svm_model_err <= 0.05) / 1000

[1] 0.987

We see that the SVM classifier found by the NP algorithm has empirical type I errors under α = 0.05 with high probability.

3.3 Random Forests

Random forests is another popular and powerful classification method. It is an ensemble method of tree-based classifiers. We can also interpret it as a scoring type of method, if we consider the proportion of output votes for class 1 (i.e., the proportion of trees that predict an observation as class 1) as classification scores. We demonstrate the use of NP classification with random forests on the toy data as follows.

We first train a random forest model on the training data under the NP paradigm.

> set.seed(1001)

> rf_model <- npc(x=x, y=y, method='randomforest', alpha=0.05)

Then we apply the trained model rf_model to the 1000 test data sets to evaluate the distribution of its empirical type I errors on test data.

> rf_model_err <- sapply(test_data, FUN=function(tdat) {

+   pred <- predict(rf_model, tdat$x)

+   ind0 <- which(tdat$y==0)

+   typeI <- mean(pred$pred.label[ind0]!=tdat$y[ind0])

+ })

> summary(rf_model_err)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.

0.002451 0.021000 0.026280 0.026810 0.032260 0.054460

> sum(rf_model_err <= 0.05) / 1000

[1] 0.996

We see that the random forest classifier found by the NP algorithm has empirical type I errors under α = 0.05 with high probability.

4 Case Study

We demonstrate the use of NP classification in a genomic case study on the prediction of transcription factor binding sites. A recent study [20] found that DNase-seq signals can accurately predict whether genomic regions containing transcription factor sequence motifs are transcription factor binding sites. DNase-seq is a recent high-throughput technology that combines traditional DNaseI footprinting [17] and next-generation DNA sequencing to identify genomic regions where regulatory factors interact with DNA to modify chromatin structure [5, 12, 31, 42]. An important question investigated in this study is which DNase-seq features best predict binding sites of CTCF, a transcription factor that acts as an insulator to regulate the 3D structure of chromatin [34].

The study [20] formulated this question as a binary classification problem, where the goal is to classify genomic regions that contain CTCF sequence motifs into CTCF binding sites (i.e., class 1) and non-binding sites (i.e., class 0). For this task, two one-dimensional genomic features extracted from DNase-seq data are compared: (1) the number of DNase-seq tags in a 200 base pair window centered at each CTCF motif site and (2) the DNaseI footprint score calculated using the formula \(f = -\left [(n_{C} + 1)/(n_{R} + 1) + (n_{C} + 1)/(n_{L} + 1)\right ]\), where n C , n R , and n L represent, respectively, the tag count in the motif region and in the flanking regions to the right and left of the motif (the lengths of the flanks are both the same as that of the motif). In the data, there are n = 216,929 genomic regions, each with one tag count and one footprint score. Among these regions, 27,220 regions were found to be CTCF binding sites (class 1), and the remaining 189,709 regions were considered non-binding sites (class 0). By varying the threshold on each feature, the study showed that the footprint score outperforms the tag count at low false positive rates (FPR; here the FPR equals the type I error) and underperforms at higher FPR. In other words, if users desire a small type I error, for example, if they prefer to predict fewer but more confident CTCF binding sites, the footprint score is the better genomic feature; if users instead prefer a small type II error, for example, if they prefer to predict more potential CTCF binding sites, the tag count is the better feature.

However, this study only reported the observed (empirical) type I and II errors on one data set, without assessing the randomness of these empirical errors. If a user is interested in knowing which feature is better when the type I error (i.e., the FPR) is constrained under α (e.g., 5%) with high probability, this analysis cannot provide a good answer. Here we address this question using the NP classification algorithm described in Sect. 2.2. We also compare the performance of the NP classification with the common practice, which is to tune the empirical type I error on the training data to α. At three type I error bounds α = 0.01, 0.05, and 0.1, we find the corresponding thresholds on the number of DNase-seq tags or on the footprint scores via the common practice and via the NP approach. Specifically, in the common practice we take the thresholds to be the 99th, 95th, and 90th percentiles of the number of DNase-seq tags or of the footprint scores of the 189,709 non-binding sites; in the NP approach we use the NP algorithm with the number of bootstrap runs B = 1000 and the violation tolerance level δ = 0.05 to find the thresholds on the same feature values of the 189,709 non-binding sites. Since the values of each feature serve as natural classification scores in this case, we do not need to train a classification scoring function, and all the class 0 data points can be used to find the NP thresholds. That is, in the algorithm, \(\mathcal{S}^{0}\) contains all the class 0 data points, and the f function in Step 3 is just the identity map.
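To make the two threshold rules concrete, the R sketch below contrasts them on a single feature. The vector s0 is a hypothetical stand-in for the class 0 feature values (not the actual DNase-seq data), and for brevity it uses the order-statistic rule of Proposition 2 in place of the bootstrap-based umbrella algorithm that was actually used in this case study.

> # Common-practice threshold vs. NP threshold on one feature
> # (s0 is a hypothetical stand-in for the class 0 feature values)
> set.seed(1)
> s0 <- rexp(189709)
> alpha <- 0.05; delta <- 0.05; m <- length(s0)
> t_common <- quantile(s0, 1 - alpha)      # tunes the empirical type I error to alpha
> A <- (1 + 2 * delta * (m + 2) * (1 - alpha) +
+       sqrt(1 + 4 * delta * (1 - alpha) * alpha * (m + 2))) /
+      (2 * (delta * (m + 2) + 1))
> t_np <- sort(s0)[ceiling(A * (m + 1))]   # NP threshold: a slightly higher order statistic
> c(common = unname(t_common), np = t_np)  # the NP threshold is slightly larger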

We evaluate the thresholds and their corresponding classifiers found by the NP approach and by the common practice via the bootstrap. We generate B′ = 1000 sets of n = 216,929 bootstrap regions with corresponding tag counts and footprint scores from the original data via random sampling with replacement. We then evaluate the empirical type I and type II errors of each threshold on each bootstrap data set and summarize the distributions of these empirical errors across the bootstrap data sets. The results in Table 1 show that the classifiers found by the NP approach have empirical type I errors bounded by α with high probability, while the classifiers found by the common practice have large portions of empirical type I errors above α. Comparing the means of the empirical classification errors, we can see that the classifiers found by the NP approach have slightly smaller mean type I errors and slightly larger mean type II errors, a reasonable result given their high-probability bound on the type I error. The standard deviations are similar for the two approaches. Returning to the question of which feature is better when the type I error is kept under α with high probability, the NP classification results suggest that the footprint score is the better feature at α = 0.01 and 0.05, while the tag count is the better feature at α = 0.1.

Table 1 Comparison of classifiers established by the common practice vs. the NP approach

5 Future Research and Genomic Applications of the Neyman–Pearson Classification

The Neyman–Pearson classification paradigm handles binary classification problems, but it can be extended to address multi-class problems where the errors are asymmetric in nature. Neyman–Pearson classification has wide application potential in genomics. In Sects. 5.2–5.4 we describe three potential applications as future research directions: sample size determination, automatic disease diagnosis, and disease marker detection. The latter two applications were also discussed in our recent review paper [48].

5.1 Extension to Multi-class

Originating from binary trade-offs, the NP classification methods can also be applied to multi-class (Y ∈ { 1, ⋯, K}, K ≥ 3) problems using the following two strategies:

– [Strategy 1] Missing class 1 has more severe consequences than missing the other classes. A two-step procedure can be implemented: apply an NP method to classify a subject into class 1 versus the other classes; stop if the subject is assigned to class 1; otherwise, apply a (multi-class) classification algorithm to assign the subject to one of the other classes {2, ⋯ , K}.

– [Strategy 2] There is a hierarchical order (class 1 > ⋯ > class K) of class priorities (i.e., of the severity of missing each class). A possible procedure is to first apply an NP method to classify a subject into class 1 versus the other classes {2, ⋯ , K}; stop if the subject is assigned to class 1; otherwise, apply the NP method again to classify it into class 2 versus classes {3, ⋯ , K}, and continue along this line until the subject is assigned to a class (a minimal sketch of this sequential procedure is given below).
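For instance, Strategy 2 could be implemented by calling the npc function sequentially, one class at a time. The sketch below is hypothetical: the helper np_hierarchical, the numeric coding of classes as 1, ⋯ , K, and the vector of per-class bounds alphas are our own illustration rather than an existing feature of the nproc package.

> # Hypothetical sketch of Strategy 2: sequential one-vs-rest NP classification.
> # At step k, class k is coded as 0 so that the NP bound alphas[k] controls
> # the probability of missing class k.
> np_hierarchical <- function(x, y, x_new, alphas, method = "logistic") {
+   K <- length(alphas) + 1                  # classes are 1, ..., K
+   pred <- rep(K, nrow(x_new))              # default: the last (least severe) class
+   remaining <- rep(TRUE, nrow(x_new))
+   for (k in seq_along(alphas)) {
+     if (!any(remaining)) break
+     keep <- y >= k                         # only classes k, ..., K remain in play
+     fit <- npc(x = x[keep, , drop = FALSE], y = as.numeric(y[keep] != k),
+                method = method, alpha = alphas[k])
+     lab <- predict(fit, x_new[remaining, , drop = FALSE])$pred.label
+     idx <- which(remaining)[lab == 0]      # predicted as class k: stop here
+     pred[idx] <- k
+     remaining[idx] <- FALSE
+   }
+   pred
+ }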

5.2 Sample Size Determination

In clinical trials and other experimental designs, sample acquisition can be quite expensive, so determining the minimal sufficient sample size is an important question. Admittedly, different criteria would lead to different sample size selection procedures, so there is no hope of finding a universal golden rule. The Neyman–Pearson paradigm inspires one procedure that chooses the sample size based on the theoretical upper bound on the excess type II error in the NP oracle inequalities.

We describe a contrived version of this procedure in the following. Suppose a classifier \(\hat{\varphi }\) satisfies the NP oracle inequalities, and the excess type II error of \(\hat{\varphi }\) is bounded from above by f(m, n), where m is the sample size of class 0, n is the sample size of class 1, and f(m, n) → 0 as m and n go to infinity. If the user wishes to control the excess type II error at some γ > 0, then m and n can be selected such that f(m, n) ≤ γ. As this procedure is based on concentration inequalities, it is conservative and might overestimate the required sample size; nevertheless, it provides a valid upper bound on the needed sample size. Future statistical research is needed for more accurate sample size determination.
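As a purely illustrative example of the last step, suppose the bound had the hypothetical form f(m, n) = C(1∕√m + 1∕√n) for some known constant C; the actual form of f depends on the classifier and on the conditions in [58], and the values of C and γ below are placeholders. The smallest balanced sample size meeting the target could then be found as follows.

> # Purely illustrative: the form of f and the constant C are placeholders,
> # not results from [58].
> f <- function(m, n, C = 2) C * (1 / sqrt(m) + 1 / sqrt(n))
> gamma <- 0.1
> sizes <- seq(100, 10000, by = 100)       # candidate balanced sizes m = n
> min(sizes[f(sizes, sizes) <= gamma])     # smallest size with f(m, n) <= gamma
[1] 1600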

5.3 Automatic Disease Diagnosis

Another application is a long-time challenge in clinical research: automatic disease diagnosis from patient genomic data. This challenge involves a classification problem, where diseases correspond to different classes, and the goal is to predict the diseases that are most likely associated with a patient’s genomic sample. Thanks to the development of high-throughput genomic technologies [e.g., microarray and next-generation sequencing (NGS)], a large amount of disease related genomic data can serve as training data in this classification problem. Taking gene expression data as an example, the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) contains more than 690,000 human gene expression samples that are related to hundreds of diseases, such as heart diseases, mental illnesses, infectious diseases, and various cancers.

We can study automatic disease diagnosis by NP classification and network-assisted correction using a two-step approach (Fig. 4). Step 1: using public microarray and NGS gene expression data sets with disease labels (e.g., > 100 Unified Medical Language System standardized disease concepts), we can (a) use NP classification to build a binary classifier for each disease class, and (b) classify a patient’s microarray gene expression sample into these disease classes. Step 2: (c) correct the predicted diseases based on the disease taxonomy (network). In Step 1, since the disease classes are non-exclusive (one data set may have multiple disease class labels), this multi-label classification problem is inherently composed of multiple binary classification problems, where every disease class needs a binary decision. In previous work [21, 27], binary classifiers such as SVM and naïve Bayes classifiers were used, and all disease classes were treated as interchangeable. This raises an important issue, though: some diseases are more life-threatening than others, e.g., lung cancer vs. arthritis. Therefore, it is important to allow doctors to have different levels of conservativeness, i.e., different bounds α on the type I error (the chance of missing a disease when a patient in fact has it), for different diseases. Although previous researchers have attempted to address this trade-off between false positives and false negatives in disease diagnosis [16], they failed to control the false negative rates under a desired threshold with high probability. Given the pressing need for precise disease diagnosis, the NP classification algorithms described here are well suited to address this issue.

Fig. 4

Automatic disease diagnosis via NP classification and network-assisted correction

5.4 Disease Marker Detection

The multi-class extension of NP classification has application potential in detecting and screening key markers (i.e., genes and genomic features) to aid disease diagnosis and to understand the molecular mechanisms of diseases. In early cancer diagnosis studies that aimed to determine which genes should be included as features (markers) [16, 36, 41], the classification error of each disease class versus the others was used as a criterion. In other words, "the smallest set" of genes that results in a low classification error for a disease class was retained as the markers for that disease. However, this criterion lacks consideration of asymmetric classification errors, and as a result, the selected markers for a disease could lead to high false negative rates in diagnosis, a dangerous situation for severe diseases such as cancers. Therefore, in the diagnosis of severe diseases, a more reasonable criterion would be to minimize the FPR given a pre-specified false negative rate control. The multi-class NP classification (Strategy 1) serves this purpose: key markers are selected so that a low FPR is attained while the false negative rates are constrained below a threshold (see Fig. 5). Markers selected by this new detection strategy can be pooled to make disease predictions, in the hope of increasing the sensitivity of disease diagnosis. To implement and evaluate this strategy, we need to compare it with recent state-of-the-art disease prediction methods, which are based, for example, on multi-task learning [2, 28, 57, 59], group lasso [29], multicategory support vector machines [25], partial least squares regression [4, 33], neural networks [22, 51], and others [44].

Fig. 5

Marker detection via multi-class NP strategy 1