1 Introduction

Estimation of the cumulative distribution function (CDF) and probability density function (PDF) of a random variable is a classical and fundamental problem in statistics. It is essential for describing random phenomena and has significant applications in signal processing [1], pattern recognition [2], machine learning [3], and other fields. When the distribution of a continuous random variable is known to belong to a family such as the Gaussian, Rayleigh, log-normal, or exponential, the CDF and PDF can be estimated with maximum likelihood estimation or Bayes estimation [4]. When no such distributional assumption is justified, a nonparametric approach must be employed.

Estimating the PDF accurately with a nonparametric approach is still a challenging problem. Kernel density estimation, the most widely used method, was proposed by Rosenblatt [5] in 1956 and Parzen [6] in 1962. Many refinements optimize the kernel function and the bandwidth: based on the normal distribution, Silverman's normal scale rule [7] determines a good bandwidth; Terrell's oversmoothed bandwidth selection rule [8] is more flexible and more widely applicable; and Alexandre [9] provided an iterative algorithm and a plug-in estimator that yield the bandwidth minimizing the mean integrated square error. For large samples with high complexity, fast Parzen density estimation by Jeon and Landgrebe [10], the weighted Parzen window by Babich and Camps [11], and optimally condensed data samples by Girolami and He [12] operate on a subset of the large sample to reduce the running time without reducing the accuracy. Other approaches have also been proposed: sums of gamma densities [13] or of exponential random variables [14] have been used in place of the kernel function to express the PDF in different fields, along with orthogonal series [15], Haar series [16], and wavelets [17]; recently, methods based on the characteristic function [18, 19] were presented. However, all these methods express the PDF through a series, whose complexity grows with the sample size, and their accuracy is determined by the form of the series, so each method suits only certain distributions. Since little information about the distribution is available before estimation, it is very hard to choose a proper series in advance.

Spline functions have been widely applied in interpolation [20], smoothing of observed data [21], regression [22], and PDF estimation [23,24,25]. A spline function is a continuous function defined piecewise by polynomial functions that possesses a high degree of smoothness where the pieces connect. Inspired by these characteristics, and to overcome the shortcomings above, a new method to estimate the CDF and PDF is introduced in this paper. Unlike previous methods, in our proposed spline regression the spline function is not restricted to polynomial functions or B-splines; it can be set freely and may consist of entirely different types of functions in each segment. The resulting estimation method has the following advantages:

1. The PDF is expressed by piecewise functions instead of series; the estimation accuracy increases with the sample size, but the complexity of the function does not.

2. The method is suitable for most types of continuous distributions, and the form of the spline function and the other parameters need not be changed unless the distribution is quite special.

3. The estimation is accurate for most types of distributions and is superior to kernel density estimation.

4. The estimated PDF is always smooth and is not unduly influenced by the parameters.

5. The estimated CDF is positive, monotonically increasing, and bounded above by 1; the estimated PDF is positive and integrates to approximately 1.

6. It is easy to select a subset of a large sample that reduces the running time while achieving similar accuracy.

The paper is organized as follows. The new spline regression is introduced first, and its application to the estimation of the CDF and PDF is then described. After that, comprehensive numerical experiments with Monte Carlo simulation illustrate the characteristics and advantages of the method. Finally, PDF estimation for high-dimensional random variables is discussed, and its potential application in classification and regression models is presented.

2 Method

Let \(F\left( x \right)\) and \(f\left( x \right)\) denote the CDF and PDF of the random variable X, respectively, and let \(y\left( x \right)\) and \(y^{\prime}\left( x \right)\) denote the estimated CDF and PDF, respectively.

For the ascending sorted samples from the random variable X,

$$x_{1} ,x_{2} , \ldots ,x_{n} ,{\text{where}}\,x_{1} \le x_{2} \le \cdots \le x_{n}$$

the CDF at \(x_{i}\) is \(F\left( {x_{i} } \right) = P\left( {X \le x_{i} } \right) \approx i/\left( {n + 1} \right)\); that is, the probability of the event \(\left\{ {X \le x_{i} } \right\}\) is approximately \(i/\left( {n + 1} \right)\). If we let \(y_{i} = i/\left( {n + 1} \right)\), the data points \((x_{i} ,y_{i} ), i = 1, \ldots ,n\) can be fitted with spline regression, and the PDF at \(x_{i}\) can then be estimated with the first-order derivative of \(y\left( x \right)\), denoted \(y^{\prime}\left( x \right)\). Spline regression rather than spline interpolation is used in this paper, since \(F\left( {x_{i} } \right)\) is only approximately equal to \(y_{i}\).
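For concreteness, the construction of the regression data points can be sketched as follows (assuming numpy; the standard normal sample is only an illustration, and the spline fit itself is sketched in Sect. 2.1):

```python
import numpy as np

samples = np.random.default_rng(0).normal(size=1000)  # illustrative sample
x = np.sort(samples)                  # ascending sorted samples x_1 <= ... <= x_n
n = len(x)
y = np.arange(1, n + 1) / (n + 1)     # y_i = i/(n+1), the target CDF values
# (x_i, y_i) are then fitted by spline regression; the PDF estimate is y'(x)
```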

To avoid the large errors that can arise from differentiation, a transformation of the random variable is also employed in this paper.

2.1 Spline Regression

Inspired by the characteristics of spline functions, a new method of spline regression is introduced here, in which the spline function can be set freely and the basis functions may be entirely different in each segment.

The spline function is defined as

$$y\left( x \right) = \mathop \sum \limits_{j = 1}^{v} a_{ij} \varphi_{ij} \left( x \right)\quad {\text{with}}\quad x \in \left[ {s_{i} ,s_{i + 1} } \right] ,\quad i = 1,2, \ldots ,u.$$
(1)

There are u segments in this function. On each interval \(x \in \left[ {s_{i} ,s_{i + 1} } \right]\), \(i = 1,2, \ldots ,u\), v basis functions \(\varphi_{i1} \left( x \right), \ldots ,\varphi_{iv} \left( x \right)\) are specified; each is smooth on its segment, and their derivatives of every order are nonzero at the knots \(s_{2} ,s_{3} , \ldots ,s_{u}\). To guarantee the smoothness of the spline function, the following constraints are imposed:

$$\left\{ {\begin{array}{*{20}ll} {\mathop \sum \nolimits_{j = 1}^{v} a_{ij} \varphi_{ij} \left( {s_{i + 1} } \right) = \mathop \sum \nolimits_{j = 1}^{v} a_{i + 1,j} \varphi_{i + 1,j} \left( {s_{i + 1} } \right)\quad i = 1, \ldots ,u - 1 } \\ {\mathop \sum \nolimits_{j = 1}^{v} a_{ij} \varphi_{ij}^{\left( k \right)} \left( {s_{i + 1} } \right) = \mathop \sum \nolimits_{j = 1}^{v} a_{i + 1,j} \varphi_{i + 1,j}^{\left( k \right)} \left( {s_{i + 1} } \right) \quad i = 1, \ldots ,u - 1,\; k = 1, \ldots ,v - 2,} \\ \end{array} } \right.$$
(2)

where \(\varphi_{ij}^{\left( k \right)} \left( x \right)\) is the kth order derivative of \(\varphi_{ij} \left( x \right)\).

These \(u \cdot v\) parameters \(a_{11} , \ldots ,a_{uv}\) of the spline function must satisfy \(\left( {u - 1} \right) \cdot \left( {v - 1} \right)\) linear constraint equations, so there are \(u + v - 1\) free variables in total, denoted \(I_{1} ,I_{2} , \ldots ,I_{u + v - 1}\). Since the constraints form a homogeneous linear system, the form of its solutions allows every \(a_{ij}\) to be written as

$$a_{ij} = \mathop \sum \limits_{k = 1}^{u + v - 1} b_{ijk} I_{k} .$$
(3)

The spline function is fully determined once the values of all \(b_{ijk}\) and \(I_{k}\) are known, since these give the values of the \(a_{ij}\).

(1) Values of \(b_{ijk}\)

Rearranging the constraint equations, we get

$$\begin{aligned} \left( {\begin{array}{*{20}c} {\varphi_{i + 1,2} \left( {s_{i + 1} } \right)} & {\varphi_{i + 1,3} \left( {s_{i + 1} } \right)} & \cdots & {\varphi_{i + 1,v} \left( {s_{i + 1} } \right)} \\ {\varphi_{i + 1,2}^{'} \left( {s_{i + 1} } \right)} & {\varphi_{i + 1,3}^{'} \left( {s_{i + 1} } \right)} & \cdots & {\varphi_{i + 1,v}^{'} \left( {s_{i + 1} } \right)} \\ \vdots & \vdots & \ddots & \vdots \\ {\varphi_{i + 1,2}^{{\left( {v - 2} \right)}} \left( {s_{i + 1} } \right)} & {\varphi_{i + 1,3}^{{\left( {v - 2} \right)}} \left( {s_{i + 1} } \right)} & \cdots & {\varphi_{i + 1,v}^{{\left( {v - 2} \right)}} \left( {s_{i + 1} } \right)} \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {\begin{array}{*{20}c} {a_{i + 1,2} } \\ {a_{i + 1,3} } \\ \end{array} } \\ {\begin{array}{*{20}c} \vdots \\ {a_{i + 1,v} } \\ \end{array} } \\ \end{array} } \right) \hfill \\ \quad = \left( {\begin{array}{*{20}c} { - \varphi_{i + 1,1} \left( {s_{i + 1} } \right)} & {\varphi_{i1} \left( {s_{i + 1} } \right)} & \cdots & {\varphi_{iv} \left( {s_{i + 1} } \right)} \\ { - \varphi_{i + 1,1}^{'} \left( {s_{i + 1} } \right)} & {\varphi_{i1}^{'} \left( {s_{i + 1} } \right)} & \cdots & {\varphi_{iv}^{'} \left( {s_{i + 1} } \right)} \\ \vdots & \vdots & \ddots & \vdots \\ { - \varphi_{i + 1,1}^{{\left( {v - 2} \right)}} \left( {s_{i + 1} } \right)} & {\varphi_{i1}^{{\left( {v - 2} \right)}} \left( {s_{i + 1} } \right)} & \cdots & {\varphi_{iv}^{{\left( {v - 2} \right)}} \left( {s_{i + 1} } \right)} \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {\begin{array}{*{20}c} {a_{i + 1,1} } \\ {a_{i1} } \\ \end{array} } \\ {\begin{array}{*{20}c} \vdots \\ {a_{iv} } \\ \end{array} } \\ \end{array} } \right), \hfill \\ \end{aligned}$$
(4)

in which \(i = 1,2, \ldots ,u - 1\).

Note that once the values of \(a_{11} ,a_{12} , \ldots ,a_{1v} ,a_{21} ,a_{31} , \ldots ,a_{u1}\) are known, all remaining \(a_{ij}\) follow from Eq. (4); these \(u + v - 1\) variables are therefore taken as the free variables.

If, in the equation \(a_{ij} = \sum\nolimits_{k = 1}^{u + v - 1} {b_{ijk} I_{k} }\), some \(I_{k} = 1\) and all other free variables are 0, then \(a_{ij} = b_{ijk}\). The \(b_{ijk}\) are therefore derived as follows: set one of \(a_{11} ,a_{12} , \ldots ,a_{1v} ,a_{21} ,a_{31} , \ldots ,a_{u1}\) to 1 and the others to 0, substitute into Eq. (4), and solve for all remaining \(a_{ij}\); the resulting values are the \(b_{ijk}\).
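This propagation can be sketched as follows (a minimal sketch assuming numpy; `dphi`, `propagate`, and `all_bijk` are hypothetical helper names, and the per-knot linear systems are assumed nonsingular, in line with the full-rank remark later in this section):

```python
import numpy as np

def propagate(a_free, knots, v, dphi):
    """Recover all coefficients a[i, j] from the free variables
    a_free = (a_11..a_1v, a_21, a_31, ..., a_u1) via Eq. (4).
    dphi(i, j, k, x) evaluates the k-th derivative of phi_ij at x (0-based i, j)."""
    u = len(knots) - 1
    a = np.zeros((u, v))
    a[0, :] = a_free[:v]
    for i in range(u - 1):
        s = knots[i + 1]
        # (v-1)x(v-1) system in a_{i+1,2..v}; assumed nonsingular
        M = np.array([[dphi(i + 1, j, k, s) for j in range(1, v)]
                      for k in range(v - 1)])
        rhs = np.array([sum(a[i, j] * dphi(i, j, k, s) for j in range(v))
                        - a_free[v + i] * dphi(i + 1, 0, k, s)
                        for k in range(v - 1)])
        a[i + 1, 0] = a_free[v + i]
        a[i + 1, 1:] = np.linalg.solve(M, rhs)
    return a

def all_bijk(knots, v, dphi):
    """b_ijk of Eq. (3): one propagation per unit free-variable vector."""
    u = len(knots) - 1
    nfree = u + v - 1
    b = np.zeros((u, v, nfree))
    for k in range(nfree):
        e = np.zeros(nfree)
        e[k] = 1.0
        b[:, :, k] = propagate(e, knots, v, dphi)
    return b
```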

(2) Values of \(I_{k}\)

For each \(x_{i} \in \left[ {s_{k} ,s_{k + 1} } \right]\),

$$\begin{aligned} & y\left( {x_{i} } \right) = \left( {\begin{array}{*{20}c} {a_{k1} } & {a_{k2} } & \cdots & {a_{kv} } \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {\varphi_{k1} \left( {x_{i} } \right)} \\ {\varphi_{k2} \left( {x_{i} } \right)} \\ \vdots \\ {\varphi_{kv} \left( {x_{i} } \right)} \\ \end{array} } \right) \\ & \quad = \left( {\begin{array}{*{20}c} {\begin{array}{*{20}c} {I_{1} } & {I_{2} } \\ \end{array} } & {\begin{array}{*{20}c} \cdots & {I_{u + v - 1} } \\ \end{array} } \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {b_{k11} } & {b_{k21} } & \cdots & {b_{kv1} } \\ {b_{k12} } & {b_{k22} } & \cdots & {b_{kv2} } \\ \vdots & \vdots & \ddots & \vdots \\ {b_{k1,u + v - 1} } & {b_{k2,u + v - 1} } & \cdots & {b_{kv,u + v - 1} } \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {\varphi_{k1} \left( {x_{i} } \right)} \\ {\varphi_{k2} \left( {x_{i} } \right)} \\ \vdots \\ {\varphi_{kv} \left( {x_{i} } \right)} \\ \end{array} } \right). \\ \end{aligned}$$
(5)

Let

$$B_{im} = \mathop \sum \limits_{j = 1}^{v} b_{kjm} \varphi_{kj} \left( {x_{i} } \right).$$

Substituting this into the above equation, \(y\left( {x_{i} } \right)\) can be represented as

$$y\left( {x_{i} } \right) = \mathop \sum \limits_{m = 1}^{u + v - 1} B_{im} I_{m} .$$
(6)

According to the constraint equations, the derivatives of the spline function up to order v − 2 are continuous, but the (v − 1)-th derivative is not. To encourage stronger smoothness, following De Boor's smoothing spline [26], we define

$$G = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left[ {y^{{\left( {v - 1} \right)}} \left( {x_{i} } \right)} \right]^{2} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left[ {\mathop \sum \limits_{j = 1}^{v} a_{kj} \varphi_{kj}^{{\left( {v - 1} \right)}} \left( {x_{i} } \right)} \right]^{2} ,\quad x_{i} \in \left[ {s_{k} ,s_{k + 1} } \right].$$

Let

$$A_{im} = \mathop \sum \limits_{j = 1}^{v} b_{kjm} \varphi_{kj}^{{\left( {v - 1} \right)}} \left( {x_{i} } \right).$$

Then

$$G = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left( {\mathop \sum \limits_{m = 1}^{u + v - 1} A_{im} I_{m} } \right)^{2} .$$
(7)

To minimize both the value of G and the sum of squared residuals, we define

$$Q = \mathop \sum \limits_{i = 1}^{n} \left[ {y_{i} - y\left( {x_{i} } \right)} \right]^{2} + \sigma G = \mathop \sum \limits_{i = 1}^{n} \left( {y_{i} - \mathop \sum \limits_{m = 1}^{u + v - 1} B_{im} I_{m} } \right)^{2} + \frac{\sigma }{n}\mathop \sum \limits_{i = 1}^{n} \left( {\mathop \sum \limits_{m = 1}^{u + v - 1} A_{im} I_{m} } \right)^{2} ,$$
(8)

where σ is a preset parameter called the smooth factor.

Based on the least square method,

$$\frac{\partial Q}{{\partial I_{t} }} = - 2\mathop \sum \limits_{i = 1}^{n} B_{it} \left( {y_{i} - \mathop \sum \limits_{m = 1}^{u + v - 1} B_{im} I_{m} } \right) + \frac{2\sigma }{n}\mathop \sum \limits_{i = 1}^{n} A_{it} \left( {\mathop \sum \limits_{m = 1}^{u + v - 1} A_{im} I_{m} } \right) = 0\quad t = 1, \ldots ,u + v - 1.$$

Rearranging,

$$\mathop \sum \limits_{i = 1}^{n} B_{it} y_{i} = \mathop \sum \limits_{i = 1}^{n} B_{it} \left( {\mathop \sum \limits_{m = 1}^{u + v - 1} B_{im} I_{m} } \right) + \frac{\sigma }{n}\mathop \sum \limits_{i = 1}^{n} A_{it} \left( {\mathop \sum \limits_{m = 1}^{u + v - 1} A_{im} I_{m} } \right) \quad t = 1, \ldots ,u + v - 1.$$

Rewriting this in matrix form gives

$$\left( {B^{\text{T}} B + \frac{\sigma }{n}A^{\text{T}} A} \right)I = B^{\text{T}} Y,$$
(9)

where \(B = \left( {B_{im} } \right)_{{n \times \left( {u + v - 1} \right)}}\), \(A = \left( {A_{im} } \right)_{{n \times \left( {u + v - 1} \right)}}\), n is the sample size, σ is the smooth factor, \(I = \left( {\begin{array}{*{20}c} {I_{1} } & {I_{2} } & \cdots & {I_{u + v - 1} } \\ \end{array} } \right)^{\text{T}}\) and \(Y = \left( {\begin{array}{*{20}c} {y_{1} } & {y_{2} } & \cdots & {y_{n} } \\ \end{array} } \right)^{\text{T}}\).

The values of \(I_{k}\) can be derived with the following steps (a code sketch follows the full-rank remark below):

1. For each \(x_{i}\), calculate \(B_{im} = \mathop \sum \nolimits_{j = 1}^{v} b_{kjm} \varphi_{kj} \left( {x_{i} } \right)\) and \(A_{im} = \mathop \sum \nolimits_{j = 1}^{v} b_{kjm} \varphi_{kj}^{{\left( {v - 1} \right)}} \left( {x_{i} } \right)\), \(i = 1, \ldots ,n\), \(m = 1, \ldots ,u + v - 1\).

2. Set an appropriate smooth factor σ and solve the equation \(\left( {B^{\text{T}} B + \sigma /n \cdot A^{\text{T}} A} \right)I = B^{\text{T}} Y\) to obtain \(I_{1} ,I_{2} , \ldots ,I_{u + v - 1}\).

3. Derive all parameters of the spline function from \(a_{ij} = \mathop \sum \nolimits_{k = 1}^{u + v - 1} b_{ijk} I_{k}\).
Note that all matrices above are assumed to be of full rank; if some matrix is not, another set of basis functions or knots should be used.
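A minimal sketch of these three steps (assuming numpy; `basis` and `dbasis` are hypothetical callables evaluating \(\varphi_{kj}\) and \(\varphi_{kj}^{(v-1)}\), and `b` is the \(b_{ijk}\) array from the sketch above):

```python
import numpy as np

def fit_spline_cdf(x, knots, v, sigma, basis, dbasis, b):
    """Solve (B^T B + sigma/n A^T A) I = B^T Y (Eq. (9)) and return a_ij, I_k.
    basis(k, j, t): phi_kj(t); dbasis(k, j, t): its (v-1)-th derivative."""
    x = np.sort(np.asarray(x))
    n = len(x)
    y = np.arange(1, n + 1) / (n + 1)            # y_i = i/(n+1)
    nfree = b.shape[2]                           # u + v - 1 free variables
    B = np.zeros((n, nfree))
    A = np.zeros((n, nfree))
    # segment index k with x_i in [s_k, s_{k+1}]
    seg = np.clip(np.searchsorted(knots, x, side='right') - 1, 0, len(knots) - 2)
    for i in range(n):
        k = seg[i]
        for m in range(nfree):
            B[i, m] = sum(b[k, j, m] * basis(k, j, x[i]) for j in range(v))
            A[i, m] = sum(b[k, j, m] * dbasis(k, j, x[i]) for j in range(v))
    I = np.linalg.solve(B.T @ B + sigma / n * A.T @ A, B.T @ y)
    a = np.einsum('ijk,k->ij', b, I)             # a_ij = sum_k b_ijk I_k (Eq. (3))
    return a, I
```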

2.2 Transformation of Random Variable X

Some CDFs are not easily estimated with a general spline function. For example, for \(F\left( x \right) = \sqrt x ,x \in \left[ {0,1} \right]\), where \(\mathop {\lim }\limits_{x \to 0} f\left( x \right) = + \infty\), the CDF cannot be estimated by fitting the data with a polynomial spline function. However, if we set \(\hat{x} = { \ln }x\), then \(F = {\text{e}}^{{\hat{x}/2}} ,\hat{x} \in \left( { - \infty ,0} \right]\), which is much easier to estimate.

For the random variable X, set \(\hat{X} = \psi \left( X \right)\), where ψ is a monotonically increasing analytic function, and let \(\hat{F}\left( {\hat{x}} \right)\) and \(\hat{f}\left( {\hat{x}} \right)\) denote the distribution function and density function of \(\hat{X}\), respectively.

Then

$$F\left( x \right) = P\left( {X \le x} \right) = P\left( {\psi^{ - 1} \left( {\hat{X}} \right) \le x} \right) = P\left( {\hat{X} \le \psi \left( x \right)} \right) = P\left( {\hat{X} \le \hat{x}} \right) = \hat{F}\left( {\hat{x}} \right)$$
$$f\left( x \right) = \frac{{{\text{d}}F\left( x \right)}}{{{\text{d}}x}} = \frac{{{\text{d}}\hat{F}\left( {\hat{x}} \right)}}{{{\text{d}}\hat{x}}} \cdot \frac{{{\text{d}}\hat{x}}}{{{\text{d}}x}} = \hat{f}\left( {\hat{x}} \right)\psi^{\prime}\left( x \right).$$
(10)

Fitting the data points \(\left( {\hat{x}_{i} ,y_{i} } \right), i = 1, \ldots ,n\), where \(\hat{x}_{i} = \psi \left( {x_{i} } \right)\), with the spline function yields \(\hat{F}\left( {\hat{x}} \right)\) and \(\hat{f}\left( {\hat{x}} \right)\), from which \(F\left( x \right)\) and \(f\left( x \right)\) follow by the above equations.

In this paper, we transformed the random variables based on the following steps:

For the ordered samples \(x_{1} ,x_{2} , \ldots ,x_{n} \left( {x_{1} \le x_{2} \le \cdots \le x_{n} } \right)\), denote the a-quantile by \(x_{an}\).

Step 1:

If \(\frac{{x_{0.02n} - x_{1} }}{{x_{0.2n} - x_{1} }} < 0.02\) and \(\frac{{x_{n} - x_{0.98n} }}{{x_{n} - x_{0.8n} }} \ge 0.02\), \(\psi_{1} = { \ln }\left( {x - x_{1} } \right)\).

If \(\frac{{x_{0.02n} - x_{1} }}{{x_{0.2n} - x_{1} }} \ge 0.02\) and \(\frac{{x_{n} - x_{0.98n} }}{{x_{n} - x_{0.8n} }} < 0.02\), \(\psi_{1} = - { \ln }\left( {x_{n} - x} \right)\).

If \(\frac{{x_{0.02n} - x_{1} }}{{x_{0.2n} - x_{1} }} \le 0.02\) and \(\frac{{x_{n} - x_{0.98n} }}{{x_{n} - x_{0.8n} }} \le 0.02\), \(\psi_{1} = { \ln }\left( {x - x_{1} } \right) - { \ln }\left( {x_{n} - x} \right)\).

Otherwise, the random variable is not transformed.

Step 2:

If \(\frac{{x_{0.05n} - x_{1} }}{{x_{0.5n} - x_{0.05n} }} > 1\) and \(\frac{{x_{n} - x_{0.95n} }}{{x_{0.95n} - x_{0.5n} }} > 1,\)\(\psi_{2} = \ln \left( {cx - cx_{0.5n} + \sqrt {1 + \left( {cx - cx_{0.5n} } \right)^{2} } } \right).\)

If \(\frac{{x_{0.05n} - x_{1} }}{{x_{0.5n} - x_{0.05n} }} > 1\) and \(\frac{{x_{n} - x_{0.95n} }}{{x_{0.95n} - x_{0.5n} }} \le 1,\)\(\psi_{2} = - { \ln }\left( {1 + cx_{n} - cx} \right).\)

If \(\frac{{x_{0.05n} - x_{1} }}{{x_{0.5n} - x_{0.05n} }} \le 1\) and \(\frac{{x_{n} - x_{0.95n} }}{{x_{0.95n} - x_{0.5n} }} > 1,\)\(\psi_{2} = { \ln }\left( {1 + cx - cx_{1} } \right).\)

Repeat these transformations until \(\frac{{x_{0.05n} - x_{1} }}{{x_{0.5n} - x_{0.05n} }} \le 1\) and \(\frac{{x_{n} - x_{0.95n} }}{{x_{0.95n} - x_{0.5n} }} \le 1\),

where c is the value that minimizes \(Q = \mathop \sum \limits_{i = 1}^{n} \left[ {y_{i} - y\left( {x_{i} } \right)} \right]^{2}\).

Step 3:

In all situations, do the transformation \(\psi_{3} = \frac{{5\left( {x - x_{0.5n} } \right)}}{{x_{0.95n} - x_{0.05n} }}\).

After these three steps, most distributions can be estimated by the spline function. Note that these transformations address singular behavior at the two ends; if the singularity lies in the middle, the samples should be separated into several parts and spline regression applied to each part.
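A sketch of Step 1 and Step 3 follows (Step 2's iterative tail transformation with its scale c is omitted; the small offsets guarding \(\ln 0\) at the boundary samples are our own assumption, since the boundary handling is not specified above):

```python
import numpy as np

def transform(x):
    """Sketch of the variable transformation of Sect. 2.2 (Steps 1 and 3)."""
    x = np.sort(np.asarray(x, dtype=float))
    q = lambda a: x[min(int(round(a * len(x))), len(x) - 1)]  # a-quantile x_{an}
    left = (q(0.02) - x[0]) / (q(0.2) - x[0])
    right = (x[-1] - q(0.98)) / (x[-1] - q(0.8))
    # Step 1: log transforms when samples crowd an endpoint
    if left < 0.02 and right >= 0.02:
        x = np.log(x - x[0] + 1e-12)
    elif left >= 0.02 and right < 0.02:
        x = -np.log(x[-1] - x + 1e-12)
    elif left <= 0.02 and right <= 0.02:
        x = np.log(x - x[0] + 1e-12) - np.log(x[-1] - x + 1e-12)
    # Step 3: rescale around the median (q uses the current, transformed x)
    return 5 * (x - q(0.5)) / (q(0.95) - q(0.05))
```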

2.3 Spline Function

To define the spline function, basis functions can be set as below:

$$\varphi_{11} \left( x \right) = {\text{e}}^{{ - x^{2} }} ,\varphi_{12} \left( x \right) = x{\text{e}}^{{ - x^{2} }} ,\varphi_{13} \left( x \right) = {\text{e}}^{x} ,\varphi_{14} \left( x \right) = 1,\varphi_{15} \left( x \right) = x,$$
$$\varphi_{i1} \left( x \right) = 1,\varphi_{i2} \left( x \right) = x,\varphi_{i3} \left( x \right) = x^{2} ,\varphi_{i4} \left( x \right) = x^{3} ,\varphi_{i5} \left( x \right) = x^{4} ,\quad i = 2, \ldots ,6,$$
(11)
$$\varphi_{71} \left( x \right) = {\text{e}}^{{ - x^{2} }} ,\varphi_{72} \left( x \right) = x{\text{e}}^{{ - x^{2} }} ,\varphi_{73} \left( x \right) = {\text{e}}^{ - x} ,\varphi_{74} \left( x \right) = 1,\varphi_{75} \left( x \right) = x.$$

With these predefined basis functions, the middle segments form a quartic spline, while the first and last segments employ special basis functions to describe the asymptotic behavior of the distribution function.

The knots \(s_{2} ,s_{3} , \ldots ,s_{u}\) are set as the 0.05, 0.23, 0.41, 0.59, 0.77, and 0.95 quantiles of \(x_{1} ,x_{2} , \ldots ,x_{n}\), respectively. If knot s is the a-quantile of \(x_{1} ,x_{2} , \ldots ,x_{n}\), then \(s = x_{k}\), where k is the integer nearest to \(a \cdot n\). With this choice, the first and last segments each cover 5% of the sample, and each of the other segments covers 18%.

Here we take u = 7 segments in all and v = 5 basis functions in each segment, so the third-order derivative of the spline function is continuous. The value of u may be increased beyond 7 when the distribution function is more complex or the sample size is very large; however, for any fixed choice of parameters, the complexity of the function does not grow with the sample size, which is quite different from most methods.
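For illustration, the E1 basis and all the derivatives required above can be generated symbolically (a sketch assuming sympy; `phi` is a hypothetical helper):

```python
import sympy as sp

X = sp.symbols('x')
first = [sp.exp(-X**2), X*sp.exp(-X**2), sp.exp(X), sp.Integer(1), X]
mid = [sp.Integer(1), X, X**2, X**3, X**4]
last = [sp.exp(-X**2), X*sp.exp(-X**2), sp.exp(-X), sp.Integer(1), X]
E1 = [first] + [mid] * 5 + [last]        # u = 7 segments, v = 5 bases each

def phi(i, j, k=0):
    """Numeric callable for the k-th derivative of phi_ij (0-based i, j)."""
    return sp.lambdify(X, sp.diff(E1[i][j], X, k), 'numpy')
```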

As an important parameter, the smooth factor σ greatly influences the performance of the estimation. The proper value of σ for different distributions and sample sizes is discussed in the following sections.

2.4 Adjustment of the Spline Function

By definition, a CDF should not exceed 1 and should be positive and monotonically increasing, yet the fitted spline may violate these requirements (for example, the estimated PDF may take negative values). Enforcing such constraints within the spline regression would make the calculation much more complex. In this paper, a simple method is introduced to resolve the problem by adjusting the spline function after regression.

In most cases, only the first and last segments of the spline function need to be adjusted. In the first segment, the constraints are

$$\left( {\begin{array}{*{20}c} {\varphi_{11} \left( {s_{2} } \right)} & {\varphi_{12} \left( {s_{2} } \right)} & \cdots & {\varphi_{1v} \left( {s_{2} } \right)} \\ {\varphi_{11}^{'} \left( {s_{2} } \right)} & {\varphi_{12}^{'} \left( {s_{2} } \right)} & \cdots & {\varphi_{1v}^{'} \left( {s_{2} } \right)} \\ \vdots & \vdots & \ddots & \vdots \\ {\varphi_{11}^{{\left( {v - 2} \right)}} \left( {s_{2} } \right)} & {\varphi_{12}^{{\left( {v - 2} \right)}} \left( {s_{2} } \right)} & \cdots & {\varphi_{1v}^{{\left( {v - 2} \right)}} \left( {s_{2} } \right)} \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {\begin{array}{*{20}c} {a_{11} } \\ {a_{12} } \\ \end{array} } \\ {\begin{array}{*{20}c} \vdots \\ {a_{1v} } \\ \end{array} } \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} {\begin{array}{*{20}c} {y\left( {s_{2} } \right)} \\ {y^{\prime}\left( {s_{2} } \right)} \\ \end{array} } \\ {\begin{array}{*{20}c} \vdots \\ {y^{{\left( {v - 2} \right)}} \left( {s_{2} } \right)} \\ \end{array} } \\ \end{array} } \right),$$
(12)

with v unknowns in these v − 1 equations, leaving one free variable. For simplicity, \(a_{11}\) is taken as the free variable, and Eq. (12) can be rewritten as

$$\left( {\begin{array}{*{20}c} {\varphi_{12} \left( {s_{2} } \right)} & \cdots & {\varphi_{1v} \left( {s_{2} } \right)} \\ {\varphi_{12}^{'} \left( {s_{2} } \right)} & \cdots & {\varphi_{1v}^{'} \left( {s_{2} } \right)} \\ \vdots & \ddots & \vdots \\ {\varphi_{12}^{{\left( {v - 2} \right)}} \left( {s_{2} } \right)} & \cdots & {\varphi_{1v}^{{\left( {v - 2} \right)}} \left( {s_{2} } \right)} \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {\begin{array}{*{20}c} {a_{12} } \\ \end{array} } \\ {\begin{array}{*{20}c} \vdots \\ {a_{1v} } \\ \end{array} } \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} {\begin{array}{*{20}c} {y\left( {s_{2} } \right)} \\ {y^{\prime}\left( {s_{2} } \right)} \\ \end{array} } \\ {\begin{array}{*{20}c} \vdots \\ {y^{{\left( {v - 2} \right)}} \left( {s_{2} } \right)} \\ \end{array} } \\ \end{array} } \right) - a_{11} \left( {\begin{array}{*{20}c} {\begin{array}{*{20}c} {\varphi_{11} \left( {s_{2} } \right)} \\ {\varphi_{11}^{'} \left( {s_{2} } \right)} \\ \end{array} } \\ {\begin{array}{*{20}c} \vdots \\ {\varphi_{11}^{{\left( {v - 2} \right)}} \left( {s_{2} } \right)} \\ \end{array} } \\ \end{array} } \right).$$
(13)

Values of \(a_{1j}\), \(j = 2, \ldots ,v\) then follow from the preset \(a_{11}\); different preset values of \(a_{11}\) are tried until the spline function yields a reasonable estimated CDF and PDF over all samples. The last segment is adjusted in the same way. If no reasonable result can be obtained with one free variable, the constraints should be relaxed to introduce a second free variable.

The algorithm for estimating a probability distribution with the spline regression model is summarized as Algorithm 1.

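In place of the algorithm figure, a hypothetical end-to-end run assembled from the sketches of Sects. 2.1, 2.2 and 2.3 (all helper names come from those sketches, not from the original article):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                          # illustrative sample
xt = transform(x)                                  # Sect. 2.2 sketch
knots = np.quantile(xt, [0, 0.05, 0.23, 0.41, 0.59, 0.77, 0.95, 1])
dphi = lambda i, j, k, t: phi(i, j, k)(t)          # Sect. 2.3 sketch (E1 basis)
b = all_bijk(knots, v=5, dphi=dphi)                # Sect. 2.1 sketch, Eqs. (3)-(4)
a, I = fit_spline_cdf(xt, knots, v=5, sigma=2.0,   # Eq. (9); sigma = 2 (Sect. 3.2)
                      basis=lambda i, j, t: dphi(i, j, 0, t),
                      dbasis=lambda i, j, t: dphi(i, j, 4, t), b=b)
# a[i, j] are the fitted spline coefficients; the CDF/PDF estimates follow Eq. (1)
```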

2.5 Method Evaluation and Comparison

For the 40 distributions in Table 1, with different types and parameters, the most widely used statistics were considered for evaluating the estimated CDF and PDF. However, as the sample size increases, the integrated square error (ISE), mean absolute error (MAE), and mean square error (MSE) are not valid statistics for evaluating the estimated PDF. Consider, for example, the distribution \(F\left( x \right) = \sqrt[3]{x}\,{\text{and}}\,f\left( x \right) = \frac{1}{3}x^{{ - \frac{2}{3}}} ,x \in \left( {0,1} \right)\), the sample data \(x_{i} = \left( {i/n} \right)^{3} ,i = 1, \ldots ,n - 1\), and the estimated PDF \(y^{\prime}\left( x \right) = \left\{ {\begin{array}{*{20}c} {\frac{1.01}{3}x^{{ - \frac{2}{3}}} ,x \in \left( {0,\frac{1}{8}} \right]} \\ {\frac{0.99}{3}x^{{ - \frac{2}{3}}} ,x \in \left( {\frac{1}{8},1} \right)} \\ \end{array} } \right.\), whose relative error is 1%. Then

$${\text{integrated}}\,{\text{absolute}}\,{\text{error}}\,\left( {\text{IAE}} \right) = \mathop \smallint \limits_{{x_{1} }}^{{x_{n} }} \left| {y^{\prime}\left( x \right) - f\left( x \right)} \right|{\text{d}}x = 0.01 \cdot \frac{n - 2}{n},$$
$${\text{integrated}}\,{\text{square}}\,{\text{error}}\,\left( {\text{ISE}} \right) = \mathop \smallint \limits_{{x_{1} }}^{{x_{n} }} \left( {y^{\prime}\left( x \right) - f\left( x \right)} \right)^{2} {\text{d}}x = \frac{{0.01^{2} }}{3}\left( {n - \frac{n}{n - 1}} \right),$$
$${\text{mean}}\,{\text{absolute}}\,{\text{error}}\,\left( {\text{MAE}} \right) = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left| {y^{\prime}\left( {x_{i} } \right) - f\left( {x_{i} } \right)} \right| = \frac{0.01}{3} \cdot \frac{{n^{2} }}{n - 1}\mathop \sum \limits_{i = 1}^{n - 1} \frac{1}{{i^{2} }},$$
$${\text{mean}}\,{\text{square}}\,{\text{error}}\,\left( {\text{MSE}} \right) = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left( {y^{\prime}\left( {x_{i} } \right) - f\left( {x_{i} } \right)} \right)^{2} = \frac{{0.01^{2} }}{9} \cdot \frac{{n^{4} }}{n - 1}\mathop \sum \limits_{i = 1}^{n - 1} \frac{1}{{i^{4} }},$$

so \({\text{IAE}} \to 0.01,{\text{ ISE}} \to + \infty ,{\text{ MAE}} \to + \infty ,{\text{ MSE}} \to + \infty\) as \(n \to \infty\). The integrated absolute error (IAE) is therefore used as the statistic for evaluating the estimated PDF. Similarly, since IAE and ISE do not converge as the sample size increases, the root mean square error \({\text{RMSE}} = \sqrt {\frac{1}{n}\mathop \sum \nolimits_{i = 1}^{n} \left( {y\left( {x_{i} } \right) - F\left( {x_{i} } \right)} \right)^{2} }\) is used to evaluate the estimated CDF.

Table 1 The 40 distributions used in the method evaluation and comparison

Moreover, the IAE is bounded:

$${\text{IAE}} = \mathop \int \limits_{{x_{1} }}^{{x_{n} }} \left| {y^{\prime}\left( x \right) - f\left( x \right)} \right|{\text{d}}x \le \mathop \int \limits_{{x_{1} }}^{{x_{n} }} \left( {y^{\prime}\left( x \right) + f\left( x \right)} \right){\text{d}}x = y\left( {x_{n} } \right) - y\left( {x_{1} } \right) + F\left( {x_{n} } \right) - F\left( {x_{1} } \right) \le 2.$$
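For reference, both evaluation statistics are straightforward to compute numerically (a sketch assuming numpy; `y`, `yp`, `F` and `f` are callables for the estimated and true CDF and PDF):

```python
import numpy as np

def rmse_cdf(y, F, x):
    """RMSE between the estimated CDF y and the true CDF F at the samples x."""
    return np.sqrt(np.mean((y(x) - F(x)) ** 2))

def iae_pdf(yp, f, x1, xn, m=100_000):
    """IAE of the estimated PDF yp against the true PDF f on [x1, xn],
    approximated by the trapezoidal rule."""
    t = np.linspace(x1, xn, m)
    return np.trapz(np.abs(yp(t) - f(t)), t)
```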


3 Result

3.1 Basis Functions

For the normal (0,1), exponential (1), and Rayleigh (1) distributions, 1000 random samples were generated by Monte Carlo simulation. The six sets of basis functions below were employed in the estimation of the CDF and PDF.

E1:

$$\varphi_{11} \left( x \right) = {\text{e}}^{{ - x^{2} }} ,\varphi_{12} \left( x \right) = x{\text{e}}^{{ - x^{2} }} ,\varphi_{13} \left( x \right) = {\text{e}}^{x} ,\varphi_{14} \left( x \right) = 1,\varphi_{15} \left( x \right) = x,$$
$$\varphi_{i1} \left( x \right) = 1,\varphi_{i2} \left( x \right) = x,\varphi_{i3} \left( x \right) = x^{2} ,\varphi_{i4} \left( x \right) = x^{3} ,\varphi_{i5} \left( x \right) = x^{4} \quad i = 2, \ldots ,6,$$
$$\varphi_{71} \left( x \right) = {\text{e}}^{{ - x^{2} }} ,\varphi_{72} \left( x \right) = x{\text{e}}^{{ - x^{2} }} ,\varphi_{73} \left( x \right) = {\text{e}}^{ - x} ,\varphi_{74} \left( x \right) = 1,\varphi_{75} \left( x \right) = x.$$

E2:

$$\varphi_{11} \left( x \right) = {\text{e}}^{{ - x^{2} }} ,\varphi_{12} \left( x \right) = {\text{e}}^{x} ,\varphi_{13} \left( x \right) = 1,\varphi_{14} \left( x \right) = x,\varphi_{15} \left( x \right) = x^{2} ,$$
$$\varphi_{i1} \left( x \right) = 1,\varphi_{i2} \left( x \right) = x,\varphi_{i3} \left( x \right) = x^{2} ,\varphi_{i4} \left( x \right) = x^{3} ,\varphi_{i5} \left( x \right) = x^{4} \quad i = 2, \ldots ,6,$$
$$\varphi_{71} \left( x \right) = {\text{e}}^{{ - x^{2} }} ,\varphi_{72} \left( x \right) = {\text{e}}^{ - x} ,\varphi_{73} \left( x \right) = 1,\varphi_{74} \left( x \right) = x,\varphi_{75} \left( x \right) = x^{2} .$$

E3:

$$\varphi_{11} \left( x \right) = {\text{e}}^{x} ,\varphi_{12} \left( x \right) = 1,\varphi_{13} \left( x \right) = x,\varphi_{14} \left( x \right) = x^{2} ,\varphi_{15} \left( x \right) = x^{3} ,$$
$$\varphi_{i1} \left( x \right) = 1,\varphi_{i2} \left( x \right) = x,\varphi_{i3} \left( x \right) = x^{2} ,\varphi_{i4} \left( x \right) = x^{3} ,\varphi_{i5} \left( x \right) = x^{4} \quad i = 2, \ldots ,6,$$
$$\varphi_{71} \left( x \right) = {\text{e}}^{ - x} ,\varphi_{72} \left( x \right) = 1,\varphi_{73} \left( x \right) = x,\varphi_{74} \left( x \right) = x^{2} ,\varphi_{75} \left( x \right) = x^{3} .$$

E4:

$$\varphi_{i1} \left( x \right) = 1,\varphi_{i2} \left( x \right) = x,\varphi_{i3} \left( x \right) = x^{2} ,\varphi_{i4} \left( x \right) = x^{3} ,\varphi_{i5} \left( x \right) = x^{4} \quad i = 1, \ldots ,7.$$

E5:

$$\varphi_{11} \left( x \right) = {\text{e}}^{{ - x^{2} }} ,\varphi_{12} \left( x \right) = x{\text{e}}^{{ - x^{2} }} ,\varphi_{13} \left( x \right) = {\text{e}}^{x} ,\varphi_{14} \left( x \right) = x{\text{e}}^{x} ,\varphi_{15} \left( x \right) = 1,\varphi_{16} \left( x \right) = x,$$
$$\varphi_{i1} \left( x \right) = 1,\varphi_{i2} \left( x \right) = x,\varphi_{i3} \left( x \right) = x^{2} ,\varphi_{i4} \left( x \right) = x^{3} ,\varphi_{i5} \left( x \right) = x^{4} ,\varphi_{i6} \left( x \right) = x^{5} \quad i = 2, \ldots ,6,$$
$$\varphi_{71} \left( x \right) = {\text{e}}^{{ - x^{2} }} ,\varphi_{72} \left( x \right) = x{\text{e}}^{{ - x^{2} }} ,\varphi_{73} \left( x \right) = {\text{e}}^{ - x} ,\varphi_{74} \left( x \right) = x{\text{e}}^{ - x} ,\varphi_{75} \left( x \right) = 1,\varphi_{76} \left( x \right) = x.$$

E6:

$$\varphi_{i1} \left( x \right) = 1,\varphi_{i2} \left( x \right) = x,\varphi_{i3} \left( x \right) = x^{2} ,\varphi_{i4} \left( x \right) = x^{3} ,\varphi_{i5} \left( x \right) = x^{4} ,\varphi_{i6} \left( x \right) = x^{5} \quad i = 1, \ldots ,7.$$

From the results in Table 2, the RMSE of the estimated CDF is not sensitive to the basis functions, but the IAE of the estimated PDF is influenced by them significantly. Figure 1 also shows that the two ends of the distribution are usually the hardest to estimate; basis set E1 gives the best estimation and successfully describes the asymptotic behavior, a significant advantage over the pure polynomial splines (E4 and E6). We therefore use E1 as the basis functions, as described in Sect. 2.3.

Table 2 The evaluation result of different basis functions
Fig. 1

Illustrations of the density functions estimated by different basis functions. a Normal distribution. b Exponential distribution. c Rayleigh distribution

3.2 Smooth Factor

With Monte Carlo simulation, estimation was repeated on 50 batches of generated random samples for each distribution, and the IAEs were averaged to evaluate the estimated PDF. In Fig. 2a, as the smooth factor \(\sigma\) increases, the averaged IAE for most distributions first decreases and then increases after reaching its minimum, while for the uniform distribution it decreases monotonically. The value of σ that minimizes the averaged IAE differs among distributions, but at σ = 2 the averaged IAE of most distributions is close to its minimum. In Fig. 2b, the minimizing σ is essentially identical across sample sizes, and the averaged IAE decreases as the sample size increases, with the minima all lying near σ = 2. Therefore, σ = 2 is chosen as the smooth factor for all distributions and sample sizes.

Fig. 2

The influence of the smooth factor σ on the estimation error of the PDF. a Different distributions. b Different sample sizes

3.3 Evaluation for Well-Proportioned Samples

Consider the ideal case of well-proportioned samples \(x_{1} ,x_{2} , \ldots ,x_{n} \left( {x_{1} \le x_{2} \le \cdots \le x_{n} } \right)\), for which \(F\left( {x_{i} } \right) = y_{i} = i/\left( {n + 1} \right)\) holds exactly. With Monte Carlo simulation, all 40 distributions from Sect. 2.5 were evaluated using such samples. As Fig. 3 shows, the RMSEs of the estimated CDFs are all below 0.00016 and the IAEs of the estimated PDFs are all below 0.008, which means that the proposed method estimates most distributions well.
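Well-proportioned samples come directly from the inverse CDF; for instance, for the Rayleigh (1) distribution via scipy (a sketch):

```python
import numpy as np
from scipy import stats

n = 1000
x = stats.rayleigh.ppf(np.arange(1, n + 1) / (n + 1))  # F(x_i) = i/(n+1) exactly
```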

Fig. 3

RMSE of the estimated CDF and IAE of the estimated PDF for well-proportioned samples

3.4 Evaluation for Random Samples

With Monte Carlo simulation, for each distribution from Sect. 2.5, estimation was repeated on 50 batches of generated random samples, and the RMSE and IAE were averaged to evaluate the estimated CDF and PDF. Each batch of random samples was constructed in the following steps (see the sketch after the list):

1. Sort n random samples drawn from the standard uniform distribution \(U\left( {0,1} \right)\).

2. Apply \(F^{ - 1}\) (the inverse of the distribution function) to these n values to obtain the random samples for the target distribution.
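A sketch of this two-step construction, with Exponential (1) as an assumed target distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
u = np.sort(rng.uniform(size=300))   # step 1: sorted U(0,1) samples
x = stats.expon.ppf(u)               # step 2: F^{-1}(u) for the target distribution
```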

Since \(F\left( X \right)\) follows the standard uniform distribution \(U\left( {0,1} \right)\), for each set of random samples,

$${\text{RMSE}}_{\text{rand}} = \sqrt {\frac{1}{n} \cdot \mathop \sum \limits_{i = 1}^{n} \left( {\frac{i}{n + 1} - F\left( {x_{i} } \right)} \right)^{2} } ,$$

is used to evaluate the deviation of the samples from their distribution; it can also be seen as the error of the empirical distribution function, with \(i/\left( {n + 1} \right)\) as the estimate of \(F\left( {x_{i} } \right)\). In Fig. 4, the RMSE of our proposed method is almost always smaller than RMSErand for every distribution, which indicates that our method is superior to the empirical distribution function.

Fig. 4

Red dots: RMSErand for each set of random numbers generated by step 1. Black dots: RMSE of the CDF estimated from the corresponding random samples generated by step 2 for each distribution. a Sample size 300. b Sample size 1000

For PDF estimation, as shown in Fig. 5 and Table 3, our proposed method is superior to kernel density estimation for every distribution: most of the averaged IAEs lie in the range 0.05–0.06, while kernel density estimation produces larger errors. In Fig. 6, both the normal and Rayleigh distributions are well estimated by kernel density estimation with a normal kernel and by spline regression, but spline regression is much more accurate than kernel density estimation for all of these distributions.

Fig. 5

Comparison of spline regression and kernel density estimation. a Sample size 300. b Sample size 1000

Table 3 The evaluation result of spline regression and kernel density estimation by random samples
Fig. 6

Illustrations of the density functions estimated by spline regression and kernel density function with different distributions

3.5 Large Sample Size

For a very large sample, the CDF and PDF can be estimated from a subset of the sample. With Monte Carlo simulation, our method achieves similar accuracy on the subset and on the full sample. In the simulation, for a sample of size n = 100,000, the subset was obtained by taking every 100th value of the ascending sorted sample, and these 1000 observations were used to estimate the CDF and PDF. As Table 4 shows, the RMSE of the estimated CDF and the IAE of the estimated PDF are quite similar for the full sample and the subset.
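The subsetting itself is simple (a sketch with simulated data):

```python
import numpy as np

x_full = np.random.default_rng(2).standard_normal(100_000)
x_sub = np.sort(x_full)[99::100]   # every 100th order statistic -> 1000 points
```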

Table 4 The evaluation result of the subsets of large samples

4 Discussion

In this section, PDF estimation for high-dimensional random variables and its application in classification and regression models are discussed.

4.1 Probability Distribution of n Dimensional Random Variables

It is almost impossible to estimate the joint probability distribution of an n-dimensional random variable from a limited number of samples because of the curse of dimensionality, but the problem can be simplified by modeling only the pairwise linear correlations of the variables, a rough but quite practical approach.

To simplify the problem, the following assumption is made:

If the random variables \(Y_{1} ,Y_{2} , \ldots ,Y_{n}\) each follow a normal distribution, the n-dimensional random variable \(\left( {Y_{1} ,Y_{2} , \ldots ,Y_{n} } \right)\) approximately follows an n-dimensional joint normal distribution.

This approximation uses the normal distribution as a bridge to construct the high-dimensional probability distribution. For any n-dimensional random variable \(\left( {X_{1} ,X_{2} , \ldots ,X_{n} } \right)\), denote the marginal distribution functions by \(F_{1} \left( {x_{1} } \right),F_{2} \left( {x_{2} } \right), \ldots ,F_{n} \left( {x_{n} } \right)\) and define

$$\hat{X}_{i} = \varPhi^{ - 1} \left( {F_{i} \left( {X_{i} } \right)} \right) \quad i = 1, \ldots ,n.$$

Then \(P\left( {\hat{X}_{i} \le \hat{x}_{i} } \right) = P\left( {\varPhi^{ - 1} \left( {F_{i} \left( {X_{i} } \right)} \right) \le \hat{x}_{i} } \right) = P\left( {X_{i} \le F_{i}^{ - 1} \left( {\varPhi \left( {\hat{x}_{i} } \right)} \right)} \right) = F_{i} \left( {F_{i}^{ - 1} \left( {\varPhi \left( {\hat{x}_{i} } \right)} \right)} \right) = \varPhi \left( {\hat{x}_{i} } \right)\), where \(\varPhi \left( x \right)\) is the distribution function of the standard normal distribution, so \(\hat{X}_{1} ,\hat{X}_{2} , \ldots ,\hat{X}_{n}\) each follow the standard normal distribution.

Based on the previous assumption, the joint distribution function of the n-dimensional random variable \(\left( {X_{1} ,X_{2} , \ldots ,X_{n} } \right)\) can be estimated as

$$\begin{aligned} & F\left( {x_{1} ,x_{2} , \ldots ,x_{n} } \right) = P\left( {X_{1} \le x_{1} ,X_{2} \le x_{2} , \ldots ,X_{n} \le x_{n} } \right) \\ & \quad = P\left( {F_{1}^{ - 1} \left( {\varPhi \left( {\hat{X}_{1} } \right)} \right) \le x_{1} ,F_{2}^{ - 1} \left( {\varPhi \left( {\hat{X}_{2} } \right)} \right) \le x_{2} , \ldots ,F_{n}^{ - 1} \left( {\varPhi \left( {\hat{X}_{n} } \right)} \right) \le x_{n} } \right) \\ & \quad = P\left( {\hat{X}_{1} \le \varPhi^{ - 1} \left( {F_{1} \left( {x_{1} } \right)} \right),\hat{X}_{2} \le \varPhi^{ - 1} \left( {F_{2} \left( {x_{2} } \right)} \right), \ldots ,\hat{X}_{n} \le \varPhi^{ - 1} \left( {F_{n} \left( {x_{n} } \right)} \right)} \right) \approx \varPhi_{n} \left( {\hat{x}_{1} ,\hat{x}_{2} , \ldots ,\hat{x}_{n} } \right) \\ \end{aligned}$$

And the joint density function is

$$\begin{aligned} & f\left( {x_{1} ,x_{2} , \ldots ,x_{n} } \right) = \frac{{\partial^{n} }}{{\partial x_{1} \partial x_{2} \ldots \partial x_{n} }}F\left( {x_{1} ,x_{2} , \ldots ,x_{n} } \right) \\ & \quad \approx \frac{{\partial^{n} }}{{\partial x_{1} \partial x_{2} \ldots \partial x_{n} }}\varPhi_{n} \left( {\varPhi^{ - 1} \left( {F_{1} \left( {x_{1} } \right)} \right),\varPhi^{ - 1} \left( {F_{2} \left( {x_{2} } \right)} \right), \ldots ,\varPhi^{ - 1} \left( {F_{n} \left( {x_{n} } \right)} \right)} \right) \\ & \quad = \varphi_{n} \left( {\varPhi^{ - 1} \left( {F_{1} \left( {x_{1} } \right)} \right),\varPhi^{ - 1} \left( {F_{2} \left( {x_{2} } \right)} \right), \ldots ,\varPhi^{ - 1} \left( {F_{n} \left( {x_{n} } \right)} \right)} \right) \cdot \mathop \prod \limits_{i = 1}^{n} \frac{{f_{i} \left( {x_{i} } \right)}}{{\varphi \left( {\varPhi^{ - 1} \left( {F_{i} \left( {x_{i} } \right)} \right)} \right)}} \\ & \quad = \left| R \right|^{{ - \frac{1}{2}}} { \exp }\left( { - \frac{1}{2}\hat{x}\left( {R^{ - 1} - E} \right)\hat{x}^{T} } \right) \cdot \mathop \prod \limits_{i = 1}^{n} f_{i} \left( {x_{i} } \right) = { \exp }\left[ { - \frac{1}{2}\mathop \sum \limits_{i = 1}^{n} \left( {{ \ln }\lambda_{i} + \frac{{\beta_{i}^{2} }}{{\lambda_{i} }} - \beta_{i}^{2} - 2{ \ln }f_{i} \left( {x_{i} } \right)} \right)} \right] \\ \end{aligned}$$

where \(\hat{x} = \left( {\hat{x}_{1} ,\hat{x}_{2} , \ldots ,\hat{x}_{n} } \right)\), \(R = \left( {\rho_{ij} } \right)\) is the correlation coefficient matrix of \(\left( {\hat{X}_{1} ,\hat{X}_{2} , \ldots ,\hat{X}_{n} } \right)\), \(\lambda_{i}\) and \(\alpha_{i}\) are the eigenvalues and orthonormal eigenvectors of R, respectively, and \(\left( {\beta_{1} ,\beta_{2} , \ldots ,\beta_{n} } \right) = \hat{x}\left( {\alpha_{1} ,\alpha_{2} , \ldots ,\alpha_{n} } \right)\).

\(\varphi_{n} \left( x \right) = \left( {2\pi } \right)^{{ - \frac{n}{2}}} \left| R \right|^{{ - \frac{1}{2}}} { \exp }\left( { - \frac{1}{2}xR^{ - 1} x^{\text{T}} } \right),x = \left( {x_{1} ,x_{2} , \ldots ,x_{n} } \right)\) is the density function of the n-dimensional standard normal distribution, and \(\varPhi_{n} \left( x \right) = \mathop \smallint \limits_{ - \infty }^{{x_{1} }} \cdots \mathop \smallint \limits_{ - \infty }^{{x_{n} }} \varphi_{n} \left( {t_{1} , \ldots ,t_{n} } \right){\text{d}}t_{1} \cdots {\text{d}}t_{n}\) is the corresponding distribution function.
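A minimal sketch of evaluating this approximate joint density (assuming numpy and scipy; `copula_joint_pdf` and its argument structure are hypothetical, with the marginals and R estimated beforehand, e.g., by the spline regression of Sect. 2):

```python
import numpy as np
from scipy import stats

def copula_joint_pdf(xs, Fs, fs, R):
    """Approximate joint PDF of (X_1, ..., X_n) as derived above.
    Fs[i], fs[i]: marginal CDF/PDF callables; R: correlation matrix of
    the transformed variables X_hat_i = Phi^{-1}(F_i(X_i))."""
    z = np.array([stats.norm.ppf(F(x)) for F, x in zip(Fs, xs)])  # x_hat
    quad = z @ (np.linalg.inv(R) - np.eye(len(z))) @ z
    marg = np.prod([f(x) for f, x in zip(fs, xs)])
    return np.linalg.det(R) ** -0.5 * np.exp(-0.5 * quad) * marg
```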

4.2 Application in Bayesian Classification

For a sample with n features \(\left( {x_{1} ,x_{2} , \ldots ,x_{n} } \right)\), by Bayes' theorem, the probability that it belongs to class Ck is

$$P\left( {C_{k} |x_{1} ,x_{2} , \ldots ,x_{n} } \right) = \frac{{P\left( {C_{k} } \right)P\left( {x_{1} ,x_{2} , \ldots ,x_{n} |C_{k} } \right)}}{{P\left( {x_{1} ,x_{2} , \ldots ,x_{n} } \right)}} ,$$

where \(P\left( {C_{k} } \right)\) is the prior probability that a sample belongs to class Ck, \(P\left( {x_{1} ,x_{2} , \ldots ,x_{n} |C_{k} } \right)\) is the likelihood, and \(P\left( {C_{k} |x_{1} ,x_{2} , \ldots ,x_{n} } \right)\) is the posterior.

Under the assumption of conditional independence, that is, that the features are independent of each other given the class label, we have

$$P\left( {C_{k} |x_{1} ,x_{2} , \ldots ,x_{n} } \right) = \frac{{P\left( {C_{k} } \right)}}{{P\left( {x_{1} ,x_{2} , \ldots ,x_{n} } \right)}}\mathop \prod \limits_{i = 1}^{n} P\left( {x_{i} |C_{k} } \right)$$

Note that \(P\left( {x_{1} ,x_{2} , \ldots ,x_{n} } \right)\) is independent of Ck, so the predicted class of \(\left( {x_{1} ,x_{2} , \ldots ,x_{n} } \right)\) is

$$\hat{y} = \mathop {\arg \hbox{max} }\limits_{k \in 1,2, \ldots ,K} P\left( {C_{k} } \right)\mathop \prod \limits_{i = 1}^{n} P\left( {x_{i} |C_{k} } \right).$$

This is the naïve Bayes classifier, a basic algorithm in machine learning that shows surprisingly good efficacy in many complex real-world situations [27,28,29].

Relaxing this strong assumption and using the estimated density function of the n-dimensional random variable, the predicted class can be calculated straightforwardly:

$$\hat{y} = \mathop {\arg \hbox{max} }\limits_{k \in 1,2, \ldots ,K} P\left( {C_{k} } \right)P\left( {x_{1} ,x_{2} , \ldots ,x_{n} |C_{k} } \right).$$

With this update, the correlation among features is included in the model prediction, and the applicability of Bayesian classification is greatly extended.
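A sketch of the resulting classifier, reusing `copula_joint_pdf` from the sketch in Sect. 4.1 (the per-class dictionary structure is our own assumption):

```python
import numpy as np

def classify(xs, classes):
    """Pick argmax_k P(C_k) * p(x_1, ..., x_n | C_k) over a list of classes,
    each a dict with prior, marginal CDFs/PDFs and correlation matrix."""
    scores = [c['prior'] * copula_joint_pdf(xs, c['Fs'], c['fs'], c['R'])
              for c in classes]
    return int(np.argmax(scores))
```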

4.3 Application in Maximum Likelihood Regression

With the proposed approach to CDF and PDF estimation, maximum likelihood estimation can be extended from parameter estimation [30] to regression models.

The likelihood function for a sample \(\left( {x_{1} ,x_{2} , \ldots ,x_{n} } \right)\) is \(L\left( y \right) = P\left( {Y = y |x_{1} ,x_{2} , \ldots ,x_{n} } \right)\); the prediction is the value of y at which L(y) attains its maximum.

Then

$$\begin{aligned} L\left( y \right) & = \frac{{P\left( {X_{1} = x_{1} ,X_{2} = x_{2} , \ldots ,X_{n} = x_{n} ,Y = y} \right)}}{{P\left( {X_{1} = x_{1} ,X_{2} = x_{2} , \ldots ,X_{n} = x_{n} } \right)}} \\ & = \frac{{\left| {\begin{array}{*{20}c} R & {r^{\text{T}} } \\ r & 1 \\ \end{array} } \right|^{{ - \frac{1}{2}}} { \exp }\left( { - \frac{1}{2}\left( {\hat{x},\hat{y}} \right)\left( {\left( {\begin{array}{*{20}c} R & {r^{\text{T}} } \\ r & 1 \\ \end{array} } \right)^{ - 1} - E} \right)\left( {\hat{x},\hat{y}} \right)^{\text{T}} } \right) \cdot f_{Y} \left( y \right) \cdot \mathop \prod \nolimits_{i = 1}^{n} f_{i} \left( {x_{i} } \right)}}{{\left| R \right|^{{ - \frac{1}{2}}} { \exp }\left( { - \frac{1}{2}\left( {\hat{x},\hat{y}} \right)\left( {\left( {\begin{array}{*{20}c} {R^{ - 1} } & O \\ O & 1 \\ \end{array} } \right) - E} \right)\left( {\hat{x},\hat{y}} \right)^{\text{T}} } \right) \cdot \mathop \prod \nolimits_{i = 1}^{n} f_{i} \left( {x_{i} } \right)}} \\ & = \frac{{f_{Y} \left( y \right)}}{{\sqrt {1 - rR^{ - 1} r^{T} } }}{ \exp }\left( { - \frac{1}{2}\left( {\hat{x},\hat{y}} \right)\left( {\left( {\begin{array}{*{20}c} R & {r^{\text{T}} } \\ r & 1 \\ \end{array} } \right)^{ - 1} - \left( {\begin{array}{*{20}c} {R^{ - 1} } & O \\ O & 1 \\ \end{array} } \right)} \right)\left( {\hat{x},\hat{y}} \right)^{\text{T}} } \right), \\ \end{aligned}$$

where \(F_{Y} \left( y \right)\) and \(f_{Y} \left( y \right)\) are the distribution and density functions of Y, respectively, \(\hat{x} = \left( {\hat{x}_{1} ,\hat{x}_{2} , \ldots ,\hat{x}_{n} } \right)\), \(\hat{y} = \varPhi^{ - 1} \left( {F_{Y} \left( y \right)} \right)\), R is the correlation coefficient matrix of \(\left( {\hat{X}_{1} ,\hat{X}_{2} , \ldots ,\hat{X}_{n} } \right)\), and r is the vector of correlation coefficients between \(\hat{Y}\) and \(\hat{X}_{1} ,\hat{X}_{2} , \ldots ,\hat{X}_{n}\).

With \(\left( {\begin{array}{*{20}c} R & {r^{\text{T}} } \\ r & 1 \\ \end{array} } \right)^{ - 1} = \left( {\begin{array}{*{20}c} {\left( {R - r^{\text{T}} r} \right)^{ - 1} } & { - \left( {R - r^{\text{T}} r} \right)^{ - 1} r^{\text{T}} } \\ { - r\left( {R - r^{\text{T}} r} \right)^{ - 1} } & {1 + r\left( {R - r^{\text{T}} r} \right)^{ - 1} r^{\text{T}} } \\ \end{array} } \right)\), the log-likelihood function can be rewritten, up to an additive constant, as

$${ \ln }L\left( y \right) = { \ln }f_{Y} \left( y \right) - \frac{1}{2}\left\{ {\hat{x}\left[ {\left( {R - r^{\text{T}} r} \right)^{ - 1} - R^{ - 1} } \right]\hat{x}^{\text{T}} - 2r\left( {R - r^{\text{T}} r} \right)^{ - 1} \hat{x}^{\text{T}} \hat{y} + r\left( {R - r^{\text{T}} r} \right)^{ - 1} r^{\text{T}} \hat{y}^{2} } \right\}.$$

The maximum likelihood prediction y is obtained by solving the following equation:

$$\frac{{{\text{d}}\ln L\left( y \right)}}{{{\text{d}}y}} = \frac{{f^{\prime}\left( y \right)}}{f\left( y \right)} + \frac{f\left( y \right)}{{\varphi \left( {\varPhi^{ - 1} \left( {F\left( y \right)} \right)} \right)}} r\left( {R - r^{\text{T}} r} \right)^{ - 1} \left( {\hat{x}^{\text{T}} - r^{\text{T}} \varPhi^{ - 1} \left( {F\left( y \right)} \right)} \right) = 0.$$

If Y follows the normal distribution N(μ, σ), the above equation can be simplified to

$$- \left( {y - \mu } \right) + r\left( {R - r^{\text{T}} r} \right)^{ - 1} \left( {\sigma \hat{x}^{\text{T}} - r^{\text{T}} \left( {y - \mu } \right)} \right) = 0.$$

Then

$$y = \mu + \frac{{\sigma r\left( {R - r^{\text{T}} r} \right)^{ - 1} \hat{x}^{\text{T}} }}{{1 + r\left( {R - r^{\text{T}} r} \right)^{ - 1} r^{\text{T}} }},$$

which is the same prediction as given by linear regression.
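For this normal case, the closed-form prediction is easy to compute (a sketch assuming numpy; `mle_regression_normal` is a hypothetical name):

```python
import numpy as np

def mle_regression_normal(x_hat, r, R, mu, sigma):
    """Closed-form MLE prediction for Y ~ N(mu, sigma):
    y = mu + sigma * r B x_hat^T / (1 + r B r^T), with B = (R - r^T r)^{-1}."""
    B = np.linalg.inv(R - np.outer(r, r))
    return mu + sigma * (r @ B @ x_hat) / (1 + r @ B @ r)
```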

5 Conclusion

In this study, we proposed a new method to estimate the CDF and PDF based on a new spline regression, in which the spline function is not restricted to polynomial functions or B-splines but can be set freely and may consist of entirely different types of functions in each segment. In this method, the PDF is expressed by piecewise functions instead of series; as the sample size increases, the estimation accuracy increases but the complexity of the function does not. The method is suitable for most types of continuous distributions, and the form of the spline function and the other parameters need not be changed unless the distribution is quite special. The estimation is accurate for various types of distributions and is superior to kernel density estimation. The estimated PDF is always smooth and is not unduly influenced by the parameters. The estimated CDF is positive, monotonically increasing, and bounded above by 1; the estimated PDF is positive and integrates to approximately 1. Moreover, it is easy to select a subset of a large sample that reduces the running time while achieving similar accuracy. PDF estimation for high-dimensional random variables was also discussed, and its potential applications in Bayesian classification and maximum likelihood regression models were presented.