1 Introduction

In this article we introduce a Bayesian semiparametric bivariate copula; to state our objectives clearly, we start with some definitions. Let (X, Y) be a bivariate random vector with joint cumulative distribution function (CDF) H(x, y) and marginal CDFs F(x) and G(y), respectively. According to Sklar (1959), there exists a copula function C(u, v), with \(C:[0,1]^2\rightarrow [0,1]\) satisfying the conditions to be a proper CDF with uniform marginals, such that \(H(x,y)=C(F(x),G(y))\).

Dependence or association measures between the two random variables (X, Y), independently of their marginal distributions, can be written entirely in terms of the copula. For instance, Kendall’s \(\tau \) and Spearman’s \(\rho \) are given by

$$\begin{aligned} \tau =4\int _0^1\int _0^1 C(u,v)f_C(u,v)\hbox {d}u \hbox {d}v-1\quad \hbox {and}\quad \rho =12\int _0^1\int _0^1 uvf_C(u,v)\hbox {d}u\hbox {d}v-3, \end{aligned}$$
(1)

respectively, where \(f_C(u,v)\) is the corresponding copula density (e.g. Nelsen 2006). Therefore, our interest lies in estimating the copula, either the CDF C(u, v) or the density \(f_C(u,v)\).
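As a quick illustration of (1), both integrals can be evaluated numerically. The sketch below (the Farlie–Gumbel–Morgenstern copula and the parameter value are our own illustrative choices, not taken from the paper) recovers the known closed forms \(\tau =2\theta /9\) and \(\rho =\theta /3\) for that family:

```python
from scipy.integrate import dblquad

theta = 0.5  # FGM dependence parameter (illustrative choice)

def copula_cdf(u, v):
    # Farlie-Gumbel-Morgenstern copula C(u, v)
    return u * v * (1 + theta * (1 - u) * (1 - v))

def copula_density(u, v):
    # its density f_C(u, v)
    return 1 + theta * (1 - 2 * u) * (1 - 2 * v)

# Eq. (1): tau = 4*int C*f_C - 1  and  rho = 12*int u*v*f_C - 3
tau = 4 * dblquad(lambda v, u: copula_cdf(u, v) * copula_density(u, v),
                  0, 1, 0, 1)[0] - 1
rho = 12 * dblquad(lambda v, u: u * v * copula_density(u, v),
                   0, 1, 0, 1)[0] - 3
# for the FGM family, tau = 2*theta/9 and rho = theta/3
```

This is the same numerical-integration route used later in Sect. 5 to obtain theoretical Spearman values for copulas without closed forms.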

Nonparametric estimation of copulas was first proposed by Deheuvels (1979), who introduced the empirical copula based on evaluating the multivariate empirical distribution at the marginal empirical distributions. Later, Fermanian et al. (2004) studied weak convergence properties of the empirical copula. Smoother estimators were also proposed based on kernels (e.g. Gijbels and Mielniczuk 1990; Fermanian and Scaillet 2003). For example, Chen and Huang (2007) proposed a bivariate kernel copula based on local linear kernels that is everywhere consistent on \([0,1]^2\). From a Bayesian perspective, Hoyos-Argüelles and Nieto-Barajas (2020) proposed a nonparametric estimator of the generator, in an Archimedean copula, based on quadratic splines.

Recently, as a generalisation of the empirical copula, González-Barrios and Hoyos-Argüelles (2018) introduced a sample copula of order m based on a modified rank transformation of the data. Since our model is a generalisation of the sample copula, we review it here in detail. Let \((X_i,Y_i)\), \(i=1,\ldots ,n\), be a bivariate sample with support \(\Omega \subset \mathbb {R}^2\). Based on the probability integral transformation and using the empirical CDF of each coordinate, the modified rank transformation (Deheuvels 1979) \((U_i,V_i)\) is defined as

$$\begin{aligned} U_i=\textrm{rank}(i,{\textbf{X}})/n\quad \hbox {and}\quad V_i=\textrm{rank}(i,{\textbf{Y}})/n, \end{aligned}$$
(2)

where \(\textrm{rank}(i,{\textbf{X}})=k\) if and only if \(X_i=X_{(k)}\), for \(i,k=1,\ldots ,n\). The modified sample \((U_i,V_i)\), \(i=1,\ldots ,n\), has support in \([0,1]^2\) and contains all the information in the data relevant to characterise the copula (dependence). In particular, the original sample \((X_i,Y_i)\) and the modified rank transformed sample \((U_i,V_i)\) produce exactly the same sample Spearman’s rho coefficient (e.g. Nelsen 2006).
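A minimal sketch of transformation (2) in Python (the toy data are hypothetical and assume no ties, as in the continuous case):

```python
import numpy as np

def modified_rank_transform(x, y):
    """Eq. (2): U_i = rank(i, X)/n and V_i = rank(i, Y)/n."""
    n = len(x)
    # argsort of argsort yields 0-based ranks: rank(i, X) = k iff X_i = X_(k)
    u = (np.argsort(np.argsort(x)) + 1) / n
    v = (np.argsort(np.argsort(y)) + 1) / n
    return u, v

x = np.array([2.3, -1.1, 0.4, 5.0])
y = np.array([1.0, 0.2, 3.3, 2.1])
u, v = modified_rank_transform(x, y)
# u = [0.75, 0.25, 0.5, 1.0]; the sample Spearman's rho of (u, v)
# coincides with that of the original (x, y)
```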

Now, independently of the data, let us consider a uniform partition of size m, \(2\le m\le n\), for each of the two coordinates in [0, 1]. Then \(\{Q_{j,k},\;j,k=1,\ldots ,m\}\) defines a partition of size \(m^2\) of \([0,1]^2\) such that

$$\begin{aligned} Q_{j,k}=\left( \frac{j-1}{m},\frac{j}{m}\right] \times \left( \frac{k-1}{m},\frac{k}{m}\right] \end{aligned}$$
(3)

is the region in the unit square formed by the cross product of the jth and kth intervals of the first and second coordinates, respectively, for \(j,k=1,\ldots ,m\). To illustrate, Fig. 1 depicts a partition with \(m=5\). Let \(r_{j,k}\) be the number of modified sample points belonging to region \(Q_{j,k}\), in notation

$$\begin{aligned} r_{j,k}=\sum _{i=1}^n I((u_i,v_i)\in Q_{j,k}), \end{aligned}$$
(4)

for \(j,k=1,\ldots ,m\) such that \(\sum _{j=1}^m\sum _{k=1}^m r_{j,k}=n\). Then the sample copula density of order m is defined as

$$\begin{aligned} f_S(u,v)=(m^2/n)\sum _{j=1}^m\sum _{k=1}^m r_{j,k}I\left( (u,v)\in Q_{j,k}\right) . \end{aligned}$$
(5)

Further properties of this sample copula were studied in González-Barrios and Hoyos-Argüelles (2021).
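The counts (4) and the sample copula density (5) can be sketched as follows (our own minimal implementation; since the modified sample points are ranks \(k/n>0\), the half-open boxes are indexed by \(j=\lceil mu\rceil \)):

```python
import numpy as np

def sample_copula_density(u, v, m):
    """Eqs. (4)-(5): box counts r_{j,k} and the sample copula density."""
    n = len(u)
    # (u, v) lies in Q_{j,k} iff u in ((j-1)/m, j/m], i.e. j = ceil(m*u)
    j = np.ceil(m * np.asarray(u)).astype(int)
    k = np.ceil(m * np.asarray(v)).astype(int)
    r = np.zeros((m, m), dtype=int)
    np.add.at(r, (j - 1, k - 1), 1)  # accumulate counts per region

    def f_S(u0, v0):
        j0, k0 = int(np.ceil(m * u0)), int(np.ceil(m * v0))
        return (m ** 2 / n) * r[j0 - 1, k0 - 1]

    return r, f_S

u = np.array([0.2, 0.4, 0.6, 0.8])
v = np.array([0.2, 0.4, 0.6, 0.8])
r, f_S = sample_copula_density(u, v, m=2)
# r = [[2, 0], [0, 2]] and f_S(0.1, 0.1) = (4/4)*2 = 2.0
```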

In this paper we propose a semiparametric copula model whose maximum likelihood estimator coincides, under certain conditions, with the sample copula of order m. We further propose a Bayesian approach for inference purposes and introduce a novel prior that borrows strength across neighbouring regions in the space and produces smooth estimates. Posterior inference is obtained via a Markov chain Monte Carlo algorithm that requires a Metropolis–Hastings step, for which we propose a novel adaptation scheme for the random walk proposal distributions.

The outline of the rest of the paper is as follows: In Sect. 2 we define the semiparametric copula model and obtain the maximum likelihood estimators of the model parameters. In Sect. 3 we introduce the spatially dependent prior and study its properties. Section 4 characterises posterior distributions and deals with posterior computations. In Sect. 5 we present a simulation study to show the performance of our model under different scenarios and carry out a real data analysis. We conclude in Sect. 6 with a discussion.

2 Model

Let us consider a uniform partition of size \(m^2\), \(2\le m\le n\), of \([0,1]^2\) as in (3). We define a semiparametric copula density of the form

$$\begin{aligned} f_C(u,v\mid \varvec{\theta })=m^2\sum _{j=1}^m\sum _{k=1}^m\theta _{j,k}I\left( (u,v)\in Q_{j,k}\right) , \end{aligned}$$
(6)

where \(\varvec{\theta }=\left\{ \theta _{j,k}\in [0,1/m], j,k=1,\ldots ,m \right\} \) is the set of model parameters, which satisfy the following conditions:

$$\begin{aligned} \sum _{j=1}^m\theta _{j,k}=\sum _{k=1}^m\theta _{j,k}=\frac{1}{m}\quad \text {and}\quad \sum _{j=1}^m\sum _{k=1}^m\theta _{j,k}=1. \end{aligned}$$
(7)

Our semiparametric model (6) can be seen as a bivariate probability histogram with \(m^2\) bins, and (7) gives the conditions required so that the marginal induced densities are uniform and the bivariate density is proper, respectively. Note that (6) resembles the sample copula (5); however, (6) is a parametrised model, whereas (5) is a nonparametric estimator of a bivariate copula.

Conditions (7) constrain the parameter space \(\Theta \) leaving us with a reduced number of parameters. That is, instead of having \(m^2\) parameters, we end up having \((m-1)^2\) free parameters, \(\{\theta _{j,k}\}\) for \(j,k=1,\ldots ,m-1\), where the boundary parameters are defined as

$$\begin{aligned} \theta _{j,m}=\frac{1}{m}-\sum _{k=1}^{m-1}\theta _{j,k},\quad \theta _{m,k}=\frac{1}{m}-\sum _{j=1}^{m-1}\theta _{j,k}\quad \hbox {and}\quad \theta _{m,m}=\sum _{j=1}^{m-1}\sum _{k=1}^{m-1}\theta _{j,k}-\frac{m-2}{m} \end{aligned}$$
(8)

for \(j,k=1,\ldots ,m-1\). In this case, the parameter space \(\Theta \) is defined by the following constraints for the free parameters

$$\begin{aligned} 0&<\sum _{j=1}^{m-1}\theta _{j,k}<\frac{1}{m},\quad \forall k,\quad 0<\sum _{k=1}^{m-1}\theta _{j,k}<\frac{1}{m},\quad \forall j\quad \text {and}\nonumber \\ \frac{m-2}{m}&<\sum _{j=1}^{m-1}\sum _{k=1}^{m-1}\theta _{j,k}<\frac{m-1}{m}. \end{aligned}$$
(9)
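To make (8) concrete, the sketch below fills in the boundary parameters from the \((m-1)^2\) free ones and verifies the marginal-uniformity conditions (7); the independence input \(\theta _{j,k}=1/m^2\) is our own illustrative choice:

```python
import numpy as np

def complete_theta(theta_free, m):
    """Eq. (8): recover the full m x m parameter matrix from the free block."""
    th = np.zeros((m, m))
    th[:m - 1, :m - 1] = theta_free
    th[:m - 1, m - 1] = 1 / m - theta_free.sum(axis=1)   # theta_{j,m}
    th[m - 1, :m - 1] = 1 / m - theta_free.sum(axis=0)   # theta_{m,k}
    th[m - 1, m - 1] = theta_free.sum() - (m - 2) / m    # theta_{m,m}
    return th

m = 4
th = complete_theta(np.full((m - 1, m - 1), 1 / m ** 2), m)
# Conditions (7): every row and column sums to 1/m and the grand total is 1
```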

The corresponding copula can be obtained as the CDF of the copula density (6). This has the expression

$$\begin{aligned} C(u,v\mid \varvec{\theta })=\sum _{j=1}^m\sum _{k=1}^m\left( A_{j,k}+B_{j,k}u+D_{j,k}v+m^2\theta _{j,k}uv\right) I\left( (u,v)\in Q_{j,k}\right) , \end{aligned}$$
(10)

where

$$\begin{aligned} A_{j,k}&=\sum _{r=1}^j\sum _{s=1}^k\theta _{r,s}-j\sum _{s=1}^k\theta _{j,s}-k\sum _{r=1}^j\theta _{r,k}+jk\theta _{j,k},\\ B_{j,k}&=m\sum _{s=1}^k\theta _{j,s}-mk\theta _{j,k}\quad \hbox {and}\quad D_{j,k}=m\sum _{r=1}^j\theta _{r,k}-mj\theta _{j,k}. \end{aligned}$$

Although the copula density (6) is piecewise constant, the corresponding copula (10) is absolutely continuous. Moreover, Spearman’s \(\rho \) coefficient has a simple expression

$$\begin{aligned} \rho (\varvec{\theta })=\frac{3}{m^2}\left\{ 4\sum _{j=1}^m\sum _{k=1}^m jk\theta _{j,k}-(m+1)^2\right\} . \end{aligned}$$
(11)

An expression similar to (11) was obtained by González-Barrios and Hernández-Cedillo (2013) for the sample copula.
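Expression (11) is straightforward to evaluate. As a sanity check (toy inputs of our own), the independence matrix \(\theta _{j,k}=1/m^2\) gives \(\rho =0\), while placing all mass on the diagonal, \(\theta _{j,j}=1/m\), gives \(\rho =(m^2-1)/m^2\):

```python
import numpy as np

def spearman_rho(theta):
    """Eq. (11): Spearman's rho for the piecewise-constant copula density."""
    m = theta.shape[0]
    j = np.arange(1, m + 1)
    jk = np.outer(j, j)  # matrix with entries j*k
    return (3 / m ** 2) * (4 * np.sum(jk * theta) - (m + 1) ** 2)

m = 5
rho_indep = spearman_rho(np.full((m, m), 1 / m ** 2))  # independence: 0
rho_diag = spearman_rho(np.eye(m) / m)                 # (m^2-1)/m^2 = 0.96
```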

To establish a connection with the sample copula, we provide the maximum likelihood estimators of the model parameters, which are given in the following proposition.

Proposition 1

Let \((U_i,V_i)\), \(i=1,\ldots ,n\), be a bivariate sample of size n from copula density (6). The maximum likelihood estimators (MLE) \(\widehat{\theta }_{j,k}\) of the parameters \(\theta _{j,k}\), for \(j,k=1,\ldots ,m-1\), satisfy

$$\begin{aligned} \frac{r_{j,k}}{\widehat{\theta }_{j,k}}+\frac{r_{m,m}}{\sum _{t=1}^{m-1}\sum _{s=1}^{m-1}\widehat{\theta }_{t,s}-(m-2)/m}=\frac{r_{j,m}}{1/m-\sum _{s=1}^{m-1}\widehat{\theta }_{j,s}}+\frac{r_{m,k}}{1/m-\sum _{t=1}^{m-1}\widehat{\theta }_{t,k}} \end{aligned}$$
(12)

where \(r_{j,k}\), for \(j,k=1,\ldots ,m\), are given in (4).

Proof

Given the observed sample \({\textbf{u}}=\{u_i\}\) and \({\textbf{v}}=\{v_i\}\), the log-likelihood function for \(\varvec{\theta }\) is given by

$$\begin{aligned} \log f({\textbf{u}},{\textbf{v}}\mid \varvec{\theta })&= 2n\log (m)+\sum _{j=1}^{m-1}\sum _{k=1}^{m-1} r_{j,k}\log \left( \theta _{j,k}\right) \\&\quad +\sum _{j=1}^{m-1}r_{j,m}\log (\theta _{j,m})+ \sum _{k=1}^{m-1}r_{m,k}\log (\theta _{m,k})+ r_{m,m}\log (\theta _{m,m}), \end{aligned}$$

where \(\theta _{j,m}\), \(\theta _{m,k}\) and \(\theta _{m,m}\) are given in (8). Taking the first derivative with respect to \(\theta _{j,k}\) we obtain

$$\begin{aligned} \frac{\partial }{\partial \theta _{j,k}}\log f=\frac{r_{j,k}}{\theta _{j,k}}-\frac{r_{j,m}}{\theta _{j,m}}-\frac{r_{m,k}}{\theta _{m,k}}+\frac{r_{m,m}}{\theta _{m,m}}, \end{aligned}$$

for \(j,k=1,\ldots ,m-1\). Equating the first derivative to zero we obtain condition (12). To prove that the critical point is a maximum we further take the second derivative and obtain

$$\begin{aligned} \frac{\partial ^2}{\partial \theta _{j,k}^2}\log f=-\frac{r_{j,k}}{\theta _{j,k}^2}-\frac{r_{j,m}}{\theta _{j,m}^2}-\frac{r_{m,k}}{\theta _{m,k}^2}-\frac{r_{m,m}}{\theta _{m,m}^2} \end{aligned}$$

which is clearly negative. \(\square \)

Proposition 1 provides conditions to obtain the MLEs of the parameters \(\theta _{j,k}\). However, these conditions rely on nonlinear equations. For the specific case of \(m=2\), we can obtain explicit analytic expressions. Condition (12) simplifies to

$$\begin{aligned} \frac{r_{1,1}}{\widehat{\theta }_{1,1}}+\frac{r_{2,2}}{\widehat{\theta }_{1,1}}=\frac{r_{1,2}}{1/2-\widehat{\theta }_{1,1}}+\frac{r_{2,1}}{1/2-\widehat{\theta }_{1,1}}, \end{aligned}$$

which, after some algebra, yields

$$\begin{aligned} \widehat{\theta }_{1,1}=\frac{r_{1,1}+r_{2,2}}{2n}. \end{aligned}$$

This is an interesting result since the MLE of the unique parameter of the model (when \(m=2\)), \(\theta _{1,1}\), is the average of the proportions of points that lie in the diagonally opposite regions \(Q_{1,1}\) and \(Q_{2,2}\).
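The closed form for \(m=2\) is easy to check numerically (toy points of our own):

```python
import numpy as np

def mle_theta11(u, v):
    """MLE for m = 2: theta_hat_{1,1} = (r_{1,1} + r_{2,2}) / (2n)."""
    u, v = np.asarray(u), np.asarray(v)
    n = len(u)
    r11 = np.sum((u <= 0.5) & (v <= 0.5))  # points in Q_{1,1}
    r22 = np.sum((u > 0.5) & (v > 0.5))    # points in Q_{2,2}
    return (r11 + r22) / (2 * n)

# all four points fall in the diagonal regions, so the MLE attains its
# upper bound 1/2, indicating strong positive dependence
theta_hat = mle_theta11([0.2, 0.3, 0.8, 0.9], [0.1, 0.4, 0.7, 0.6])
```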

In practice we do not observe data directly from the copula, that is, \((U_i,V_i)\), \(i=1,\ldots ,n\), with support in \([0,1]^2\). What we usually observe are data \((X_i,Y_i)\), \(i=1,\ldots ,n\), with support \(\Omega \subset \mathbb {R}^2\), coming from the CDF \(H(x_i,y_i)=C(F(x_i),G(y_i))\), with F and G the marginal CDFs of each coordinate, respectively. However, as in Deheuvels (1979), we can obtain a modified sample \((U_i,V_i)\) using the modified rank transformation (2) to estimate the copula. This leads to another interesting consequence of Proposition 1, whose result is given in the following corollary.

Corollary 1

Let \((U_i,V_i)\), \(i=1,\ldots ,n\), be a bivariate modified rank transformed sample of size n for data \((X_i,Y_i)\) coming from copula density (6), or from the CDF \(H(x_i,y_i)=C(F(x_i),G(y_i))\). If, additionally, m divides n, the MLEs of the parameters \(\theta _{j,k}\) reduce to

$$\begin{aligned} \widehat{\theta }_{j,k}=\frac{r_{j,k}}{n}, \end{aligned}$$

for \(j,k=1,\ldots ,m-1\), where \(r_{j,k}\) is given in (4). Furthermore, the MLE of the copula density \(\widehat{f}_C(u,v\mid \varvec{\theta })=f_C(u,v\mid \widehat{\varvec{\theta }})\), reduces to the sample copula (5) of González-Barrios and Hoyos-Argüelles (2018).

Proof

For modified rank transformed data and when m divides n the following marginal conditions are satisfied: \(\sum _{j=1}^m r_{j,k}=\sum _{k=1}^m r_{j,k}=\frac{n}{m}\). We only need to prove that \(\widehat{\theta }_{j,k}=\frac{r_{j,k}}{n}\) for \(j,k=1,\ldots ,m-1\) satisfy condition (12). Working on the boundary elements and considering the hypothesis, \(\sum _{t=1}^{m-1}\sum _{s=1}^{m-1}\widehat{\theta }_{t,s}-(m-2)/m\) becomes \(r_{m,m}/n\); \(1/m-\sum _{s=1}^{m-1}\widehat{\theta }_{j,s}\) becomes \(r_{j,m}/n\); and \(1/m-\sum _{t=1}^{m-1}\widehat{\theta }_{t,k}\) becomes \(r_{m,k}/n\). Substituting these values into (12) we obtain that \(2n=2n\) which is clearly true. \(\square \)

To carry out inference on the model parameters, we suggest following a Bayesian approach instead.

3 Prior distributions

Here we propose a prior that recognises dependence across the \(\theta _{j,k}\)’s belonging to neighbouring regions \(Q_{j,k}\) in the partition grid. The most common prior for spatial dependence across areas is the conditionally autoregressive model (Besag 1974); however, this model is defined in terms of normal distributions and its marginal support is the real line. In our case the parameter space is constrained to the interval [0, 1], so we propose an alternative prior that extends the work of Jara et al. (2013) and uses ideas from Nieto-Barajas and Bandyopadhyay (2013).

Fig. 1
figure 1

Graphical representation of the unit square partition with \(m=5\). Neighbouring regions of location (3, 3) are shaded in grey

Let \(\partial _{j,k}\) be the set of indexes of spatial neighbours of region \(Q_{j,k}\), \(j,k=1,\ldots ,m\). Since all regions are rectangles, for the purpose of this work two regions are considered neighbours if they share an edge; for instance, the region defined by indexes (j, k) has the set of neighbours \(\partial _{j,k}=\{(j,k),(j,k-1),(j,k+1),(j-1,k),(j+1,k)\}\). This is illustrated in Fig. 1, with grey shadows showing the neighbouring regions of location (3, 3). Note that a region is considered a neighbour of itself and that regions (j, k) located at the boundaries of the grid have fewer than five neighbours.
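The neighbourhood structure \(\partial _{j,k}\), including the boundary truncation, can be sketched in a few lines:

```python
def neighbours(j, k, m):
    """Edge-sharing neighbours of region (j, k), including itself."""
    candidates = [(j, k), (j, k - 1), (j, k + 1), (j - 1, k), (j + 1, k)]
    # drop candidates that fall outside the m x m grid
    return {(r, s) for r, s in candidates if 1 <= r <= m and 1 <= s <= m}

# interior regions have 5 neighbours, edge regions 4 and corner regions 3
```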

Instead of defining the dependence directly on the \(\{\theta _{j,k}\}\), we rely on a set of latent parameters \(\{\eta _{j,k}\}\) associated with each of the regions (j, k) in the grid. This latter set is conditionally independent given a common parameter \(\omega \). Our spatial dependence prior is therefore based on conjugate distributions, in a Bayesian context, and is defined through a three-level hierarchical model of the form

$$\begin{aligned} \theta _{j,k}\mid \varvec{\eta }&{\mathop {\sim }\limits ^{\textrm{ind}}}\hbox {Be}\left( a+\sum _{(r,s)\in \partial _{j,k}}\eta _{r,s}\,,\,b+\sum _{(r,s)\in \partial _{j,k}}\left( c_{r,s}-\eta _{r,s}\right) \right) \nonumber \\ \eta _{j,k}\mid \omega&{\mathop {\sim }\limits ^{\textrm{ind}}}\hbox {Bin}\left( c_{j,k},\,\omega \right) \nonumber \\ \omega&\sim \hbox {Be}(a,b), \end{aligned}$$
(13)

where \(a,b>0\) and \(c_{j,k}\in \mathbb {N}\), for \(j,k=1,\ldots ,m-1\). We will refer to prior (13) as the spatial beta process and denote it by \(\hbox {SBeP}(a,b,{\textbf{c}})\), where \({\textbf{c}}=\{c_{j,k}\}\).

The reason for requiring a three-level hierarchical model becomes clear when we study its properties. In particular, the marginal distribution induced for each \(\theta _{j,k}\) and the correlation between any two of them are given in the following proposition.

Proposition 2

Let \(\varvec{\theta }=\{\theta _{j,k}\}\sim \hbox {SBeP}(a,b,{\textbf{c}})\) given in (13). Then, \(\theta _{j,k}\sim \hbox {Be}(a,b)\) marginally for all \(j,k=1,\ldots ,m-1\). Moreover, the correlation between any two parameters, say \(\theta _{j,k}\) and \(\theta _{j',k'}\) is given by

$$\begin{aligned} \hbox {Corr}\left( \theta _{j,k},\theta _{j',k'}\right) =\frac{(a+b)\left( \sum _{(r,s)\in \partial _{j,k}\cap \partial _{j',k'}}c_{r,s}\right) +\left( \sum _{(r,s)\in \partial _{j,k}}c_{r,s}\right) \left( \sum _{(r,s)\in \partial _{j',k'}}c_{r,s}\right) }{\left( a+b+\sum _{(r,s)\in \partial _{j,k}}c_{r,s}\right) \left( a+b+\sum _{(r,s)\in \partial _{j',k'}}c_{r,s}\right) }. \end{aligned}$$

Proof

To obtain the marginal distribution of \(\theta _{j,k}\), we note that the conditional distribution of the sum of the latent variables is \(\sum _{(r,s)\in \partial _{j,k}}\eta _{r,s}\mid \omega \sim \hbox {Bin}(\sum _{(r,s)\in \partial _{j,k}}c_{r,s},\omega )\), so unconditionally its distribution is a beta-binomial \(\hbox {BeBin}(a,b,\sum _{(r,s)\in \partial _{j,k}}c_{r,s})\). Therefore, due to conjugacy (e.g. Bernardo and Smith 2000), it follows that the marginal distribution of \(\theta _{j,k}\) is beta. For the correlation we apply the iterated conditional expectation formula twice to obtain the covariance, and use the marginal distribution result to obtain the variances. \(\square \)

Another important property of our prior is that if \(c_{j,k}=c\), that is, common for all \(j,k=1,\ldots ,m-1\), the latent parameters \(\{\eta _{j,k}\}\) are exchangeable and the joint distribution of \(\{\theta _{j,k}\}\) becomes a strictly stationary process. Additionally, if \(c_{j,k}=0\) for all (j, k), then the \(\theta _{j,k}\)’s all become independent. Therefore, for \(c_{j,k}>0\), the spatial beta process (13) defines a prior for parameters with bounded support [0, 1] and with dependence across neighbouring \(\theta _{j,k}\)’s according to the sets \(\partial _{j,k}\). This produces a smoothing effect in the Bayesian estimation of the parameters \(\varvec{\theta }\) of copula model (6).
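A forward simulation from the hierarchy (13) can be sketched as follows. This is our own minimal implementation: we restrict neighbours to the \((m-1)\times (m-1)\) free-parameter grid and take a common \(c_{j,k}=c\); note that prior draws need not satisfy the copula constraints (7), which are enforced only at the posterior stage through the indicator \(I_{\Theta }\):

```python
import numpy as np

def neighbours(j, k, size):
    cand = [(j, k), (j, k - 1), (j, k + 1), (j - 1, k), (j + 1, k)]
    return [(r, s) for r, s in cand if 1 <= r <= size and 1 <= s <= size]

def sample_sbep(m, a, b, c, rng):
    """One draw of {theta_{j,k}} from the SBeP(a, b, c) hierarchy (13)."""
    size = m - 1
    omega = rng.beta(a, b)                        # top level
    eta = rng.binomial(c, omega, (size, size))    # latent counts
    theta = np.empty((size, size))
    for j in range(1, size + 1):
        for k in range(1, size + 1):
            nb = neighbours(j, k, size)
            s_eta = sum(eta[r - 1, s - 1] for r, s in nb)
            s_c = c * len(nb)
            theta[j - 1, k - 1] = rng.beta(a + s_eta, b + s_c - s_eta)
    return theta

rng = np.random.default_rng(0)
theta = sample_sbep(m=5, a=0.1, b=0.1, c=2, rng=rng)
# marginally each theta_{j,k} ~ Be(a, b), as stated in Proposition 2
```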

4 Posterior distributions

Let \((U_i,V_i)\), \(i=1,\ldots ,n\) be a bivariate sample of size n from copula density (6). Then the likelihood function in terms of the \((m-1)^2\) free parameters has the form

$$\begin{aligned} f({\textbf{u}},{\textbf{v}}\mid \varvec{\theta })=m^{2n}\theta _{m,m}^{r_{m,m}} \left\{ \prod _{j=1}^{m-1}\theta _{j,m}^{r_{j,m}}\right\} \left\{ \prod _{k=1}^{m-1}\theta _{m,k}^{r_{m,k}}\right\} \prod _{j=1}^{m-1}\prod _{k=1}^{m-1}\theta _{j,k}^{r_{j,k}} \end{aligned}$$

where \(\varvec{\theta }\in \Theta \), as in (9), and the boundary parameters \(\theta _{m,m}\), \(\theta _{j,m}\) and \(\theta _{m,k}\) are given in (8) and \(r_{j,k}\), for \(j,k=1,\ldots ,m\), are defined in (4).

We assume the prior distribution for the \(\theta _{j,k}\)’s is a spatial beta process \(\hbox {SBeP}(a,b,{\textbf{c}})\), given in (13). Therefore the extended prior distribution, considering the latent variables, \(\varvec{\eta }=\{\eta _{j,k}\}\) and \(\omega \), is given by

$$\begin{aligned} f(\varvec{\theta },\varvec{\eta },\omega )&=\prod _{j=1}^{m-1}\prod _{k=1}^{m-1}\left\{ \hbox {Be}\left( \theta _{j,k}\mid a+\sum _{(r,s)\in \partial _{j,k}}\eta _{r,s},\,b+\sum _{(r,s)\in \partial _{j,k}}(c_{r,s}-\eta _{r,s}) \right) \right. \\&\quad \times \hbox {Bin}(\eta _{j,k}\mid c_{j,k},\omega )\bigg \}\,\hbox {Be}(\omega \mid a,b). \end{aligned}$$

The posterior distribution of \((\varvec{\theta },\varvec{\eta },\omega )\) is given by the product of the likelihood and the prior, up to a proportionality constant. In order to characterise the posterior distribution, we implement a Gibbs sampler (Smith and Roberts 1993) and sample \((\varvec{\theta },\varvec{\eta },\omega )\) from the following conditional posterior distributions.

  1. (i)

    Posterior conditional distribution for \(\theta _{j,k}\), \(j,k=1,\ldots ,m-1\)

    $$\begin{aligned} f(\theta _{j,k}\mid \hbox {rest})&\propto \theta _{j,k}^{a+\sum _{(r,s)\in \partial _{j,k}}\eta _{r,s}+r_{j,k}-1}\left( 1-\theta _{j,k}\right) ^{b+\sum _{(r,s)\in \partial _{j,k}}(c_{r,s}-\eta _{r,s})-1}\\&\quad \times \theta _{m,m}^{r_{m,m}}\theta _{j,m}^{r_{j,m}}\theta _{m,k}^{r_{m,k}}I_{\Theta }(\theta _{j,k}). \end{aligned}$$
  2. (ii)

    Posterior conditional distribution for \(\eta _{j,k}\), \(j,k=1,\ldots ,m-1\)

    $$\begin{aligned} f(\eta _{j,k}\mid \hbox {rest})\propto \frac{ {c_{j,k}\atopwithdelims ()\eta _{j,k}}\left\{ \left( \frac{\omega }{1-\omega }\right) \prod _{(t,s)\in \varrho _{j,k}} \left( \frac{\theta _{t,s}}{1-\theta _{t,s}}\right) \right\} ^{\eta _{j,k}}I_{\{0,\ldots ,c_{j,k}\}}(\eta _{j,k})}{\prod _{(t,s)\in \varrho _{j,k}}\Gamma \left( a+\sum _{(l,z)\in \partial _{t,s}}\eta _{l,z}\right) \Gamma \left( b+\sum _{(l,z)\in \partial _{t,s}}(c_{l,z}-\eta _{l,z})\right) }, \end{aligned}$$

    where \(\varrho _{j,k}\) is the set of reversed neighbours, that is, the set of pairs (ts) such that \((j,k)\in \partial _{t,s}\).

  3. (iii)

    Posterior conditional distribution for \(\omega \)

    $$\begin{aligned} f(\omega \mid \hbox {rest})=\hbox {Be}\left( \omega \left| a+\sum _{j=1}^{m-1}\sum _{k=1}^{m-1} \eta _{j,k},b+\sum _{j=1}^{m-1}\sum _{k=1}^{m-1}(c_{j,k}-\eta _{j,k})\right. \right) . \end{aligned}$$

Looking at posterior conditional (i) we realise that the sum of latent variables \(\sum \eta _{r,s}\) appears in the posterior in the same way as the data counts \(r_{j,k}\). Moreover, since \(\eta _{r,s}\in \{0,\ldots ,c_{r,s}\}\) and considering that \(\partial _{j,k}\) has between three and five elements, to avoid overwhelming the data we advise taking values \(c_{j,k}\le \sqrt{n}/5\).

Sampling from (iii) is straightforward, and sampling from (ii) can easily be done by evaluating at the different points of the support and normalising. However, sampling from (i) is not trivial and requires a Metropolis–Hastings step (Tierney 1994). We suggest sampling \(\theta _{j,k}^*\) at iteration \((t+1)\) from a random walk proposal distribution

$$\begin{aligned} q(\theta _{j,k}\mid \varvec{\theta }_{-(j,k)},\theta _{j,k}^{(t)})=\hbox {Un}\left( \theta _{j,k}\mid \max \{l_{j,k},\theta _{j,k}^{(t)}-\delta _{j,k} d_{j,k}\},\min \{u_{j,k},\theta _{j,k}^{(t)}+\delta _{j,k} d_{j,k}\}\right) \end{aligned}$$

where the interval \((l_{j,k},u_{j,k})\) represents the conditional support of \(\theta _{j,k}\), \(d_{j,k}=u_{j,k}-l_{j,k}\) is its length, with

$$\begin{aligned} l_{j,k} = \max \left\{ 0, \frac{m-2}{m} - \sum _{r=1}^{m-1} \sum _{s=1}^{m-1} \theta _{r,s} I\left( (r,s)\ne (j,k)\right) \right\} \end{aligned}$$

and

$$\begin{aligned} u_{j,k} = \min \left\{ \frac{m-1}{m} - \sum _{r=1}^{m-1} \sum _{s=1}^{m-1} \theta _{r,s}I\left( (r,s)\ne (j,k)\right) , \frac{1}{m} - \sum _{s=1,s \ne k}^{m-1}\theta _{j,s}, \frac{1}{m} - \sum _{r=1,r \ne j}^{m-1} \theta _{r,k} \right\} , \end{aligned}$$

for \(j,k=1,\ldots ,m-1\). Therefore, at iteration \((t+1)\) we accept \(\theta _{j,k}^*\) with probability

$$\begin{aligned} \alpha \left( \theta _{j,k}^*,\theta _{j,k}^{(t)}\right) =\min \left\{ 1, \frac{f(\theta _{j,k}^*\mid \hbox {rest})\,q(\theta _{j,k}^{(t)}\mid \varvec{\theta }_{-(j,k)},\theta _{j,k}^{*})}{f(\theta _{j,k}^{(t)}\mid \hbox {rest})\,q(\theta _{j,k}^*\mid \varvec{\theta }_{-(j,k)},\theta _{j,k}^{(t)})}\right\} . \end{aligned}$$

The parameter \(\delta _{j,k}\) is a tuning parameter that controls the acceptance rate. As suggested by Roberts and Rosenthal (2009), this parameter can be adapted every fixed number of iterations, within the MCMC algorithm, to achieve a target acceptance rate. Differing slightly from the proposal in Roberts and Rosenthal (2009), instead of considering a single target acceptance rate, we consider the interval [0.3, 0.4], which, according to Robert and Casella (2010), defines optimal acceptance rates in random walk MH steps. Specifically, our adaptation method uses batches of 50 iterations and, for every batch b, we compute the acceptance rate \(AR^{(b)}\) and define

$$\begin{aligned} \delta ^{(b+1)}=\left\{ \begin{array}{ll} \delta ^{(b)}(1.1)^{\sqrt{b}} &{} \hbox {if}\; AR^{(b)}>0.4 \\ \delta ^{(b)}(1.1)^{-\sqrt{b}} &{} \hbox {if}\;AR^{(b)}<0.3 \\ \delta ^{(b)} &{} \hbox {otherwise.} \end{array}\right. \end{aligned}$$
(14)

For the examples considered here we use \(\delta ^{(1)}=0.25\) as starting value.
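The adaptation rule (14), together with the bounds that keep \(\delta \) between 0.01 and 1 as a proportion of the conditional support, can be sketched as:

```python
import math

def adapt_delta(delta, acc_rate, b, lo=0.3, hi=0.4):
    """Eq. (14): rescale the tuning parameter after batch b of 50
    iterations; delta is kept inside [0.01, 1] since it is a proportion
    of the interval in which theta_{j,k} can take values."""
    if acc_rate > hi:
        delta *= 1.1 ** math.sqrt(b)
    elif acc_rate < lo:
        delta *= 1.1 ** (-math.sqrt(b))
    return min(max(delta, 0.01), 1.0)

delta = 0.25                                     # starting value delta^(1)
delta = adapt_delta(delta, acc_rate=0.55, b=4)   # too many acceptances: grow
```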

This algorithm was implemented in Python. Figure 2 shows the performance of this adaptive algorithm for the parameter \(\theta _{1,1}\) in the real data analysis, with \(m=5\) and \(c=2\) (see Sect. 5 for specific definitions of these parameters). The left panel shows the values of the tuning parameter \(\delta _{1,1}\), which stabilise around values between 0.8 and 1 (it is worth mentioning that, since \(\delta _{j,k}\) corresponds to a proportion of the interval in which the parameter \(\theta ^{(t)}_{j,k}\) can take values, we bound it between 0.01 and 1). The right panel shows that the acceptance rate is kept around the target interval [0.3, 0.4], as desired.

Fig. 2
figure 2

Example of acceptance rate and tuning parameter in the adaptation method with a batch every 50 iterations

5 Numerical analyses

5.1 Simulation study

We first assess the performance of our model in a controlled scenario. For this we consider five families of copulas: Product, Gumbel, Clayton, Ali-Mikhail-Haq (AMH) and Normal. From these families we generated samples of size \(n = 200\) with parameters \(\theta =1.3\) for the Gumbel, \(\theta \in \{-0.3,1\}\) for the Clayton, \(\theta \in \{-0.5, 0.7\}\) for the AMH and \(\theta \in \{-0.5, 0.5\}\) for the Normal copula. In all but the first two cases, negative/positive parameters induce negative/positive dependence. We use Spearman’s \(\rho \) coefficient to characterise the dependence. Since this measure is not available in closed form for all the copulas considered, we computed the theoretical value via numerical integration of expression (1).

For the prior distributions (13) we took \(a = 0.1\), \(b = 0.1\) and a range of values \(c_{jk}\in \{0,1,2\}\) to compare different degrees of prior dependence. For the partition (3) we considered two sizes, \(m\in \{5,8\}\), so that we can compare with the sample copula. We carried out two analyses, one with the original simulated data as they come from the model, and another with rank transformed data. We implemented an MCMC with the adaptive scheme described in Sect. 4. Chains were run for 5000 iterations with a burn-in of 500, keeping every second iteration to produce posterior estimates. Computational times using an Intel Core i7 microprocessor average around 65 min.

To assess goodness of fit we computed several statistics. The logarithm of the pseudo marginal likelihood (LPML), originally suggested by Geisser and Eddy (1979), is an average of a function of the conditional density with respect to the posterior distribution of the model parameters, aggregated for all data points, which allows us to assess the fitting of the model to the data. The supremum norm (SN), defined by \(\sup _{(u,v)} |C(u,v)-\widehat{C}(u,v)|\) assesses the discrepancy between our posterior estimate (posterior mean) \(\widehat{C}(u,v)\) from the true copula C(uv). We also computed the Spearman’s rho coefficient, as in (11), and compare the 95% interval estimates with the true value. Additionally, as a graphical aid to see the performance of our model, we compare the posterior estimates (posterior mean) of copula densities with the true ones using heat maps.
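When the true copula is known, as in a simulation, the supremum norm can be approximated by maximising over a fine grid; a sketch (the FGM comparison below is our own toy example, not one of the paper's scenarios):

```python
import numpy as np

def sup_norm(C_true, C_hat, grid=201):
    """Approximate SN = sup_{(u,v)} |C(u,v) - C_hat(u,v)| on a grid."""
    g = np.linspace(0.0, 1.0, grid)
    U, V = np.meshgrid(g, g)
    return float(np.max(np.abs(C_true(U, V) - C_hat(U, V))))

# distance between the product copula and an FGM copula with theta = 0.5;
# the difference 0.5*u*v*(1-u)*(1-v) is maximised at u = v = 1/2,
# giving 0.5 * (1/4)^2 = 1/32
sn = sup_norm(lambda u, v: u * v,
              lambda u, v: u * v * (1 + 0.5 * (1 - u) * (1 - v)))
```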

In Tables 1 and 2 we show the goodness of fit (GOF) statistics with the sampled data as they come from the models and after applying the rank transformation; the latter are indicated with a superscript r. We note that the LPML statistics are not comparable between original and rank transformed data; however, the supremum norms are comparable. We have included the supremum norm for the frequentist sample copula, adding the subscript F to differentiate it from that of our Bayesian model, which has the subscript B. In all cases we observe that the Spearman’s rho coefficient 95% interval estimates \(\widehat{\rho }\) contain the true value \(\rho \).

Table 1 GOF measures for original and rank transformed data, for different copulas
Table 2 GOF measures for original and rank transformed data, for different copulas

For the Product copula (Table 1, first block) the LPML and supremum norm choose the model with \(m=5\) and \(c=2\) for both original and rank transformed data. These cases behave similarly to the sample copula according to the supremum norm. For the Gumbel copula (Table 1, second block), there is no agreement between the LPML and the supremum norm, but both prefer the model with \(m=5\) for the rank transformed data. Interestingly, as in the Product copula case, the sample copula obtains a supremum norm slightly smaller than our best Bayesian model; however, the Bayesian model with \(m=8\) and rank transformed data obtains a similar supremum norm for \(c=0\).

For the Clayton copula (Table 1, third and fourth blocks), with \(\theta =-0.3\) and \(\theta =1\), the LPML selects the model with \(m=5\) and \(c=2\), for original and rank transformed data, and in both cases our Bayesian model is superior to the sample copula. We note that for \(\theta = 1\) the supremum norm is slightly smaller for \(m = 8\) than for \(m = 5\).

For the AMH copula (Table 2, first and second blocks), with \(\theta =-0.5\) and \(\theta =0.7\), there is a slight discrepancy between the LPML and the supremum norm. For original data, the LPML chooses the model with \(m=5\) and \(c=2\), but the supremum norm chooses that with \(m=8\) and \(c=0\) or \(c=1\). For rank transformed data, the best model is that with \(m=5\) and \(c=2\) for \(\theta =-0.5\). In the case of \(\theta =0.7\), the LPML selects the model with \(m = 5\) and \(c=2\); however, the supremum norm chooses the model with \(m = 8\) and \(c=0\) or \(c=1\). Compared with the sample copula, our model performs similarly.

For the Normal copula (Table 2, third and fourth blocks), the behaviour is similar to that of the AMH copula. For both values of \(\theta \), the LPML prefers the model with \(m=5\) and the supremum norm that with \(m=8\). In both cases, our best Bayesian model behaves similarly to the sample copula.

In Fig. 3 we compare the copula density estimates (Bayesian and frequentist) with the true density using heatmaps. For the families shown, product, AMH and normal (across rows), there are some differences between the Bayesian and frequentist (sample copula) estimates. These differences are due to the prior that smooths the intensities by borrowing information from the neighbouring regions. Moreover, in the five families of copulas studied here, none of the GOF statistics select the prior independence case of \(c=0\), which confirms the benefit of the prior dependence in the \(\theta _{j,k}\)’s.

Fig. 3
figure 3

Copula density estimation (heat maps) for some copulas based on rank transformed data. Product (top row), AMH with \(\theta =-0.5\) (middle row), and Normal with \(\theta =0.5\) (bottom row). True copula (first column), Bayesian estimation with \(m = 5\) and \(c = 2\) (second column) and frequentist sample copula (third column)

In some of the examples considered here, there is no agreement between the models selected by the two GOF criteria, LPML and SN. The former is an average and the latter depends on a single point. Additionally, for real data examples, where the true model is unknown, it is not possible to compute the SN. Therefore we suggest to use the LPML to select the best model.

5.2 Real data analysis

In this section we show the performance of our model in estimating the dependence between variables in a real life application, where the data are not obtained directly from the copula but from some arbitrary unknown distribution.

In Mexico, the pension system comprises ten pension fund managers called AFOREs (Spanish acronym for Administradoras de Fondos para el Retiro); each of these fund managers operates ten investment funds, according to the age group of the worker. On a monthly basis, the National Commission for the Pension System (CONSAR) publishes statistical information and risk metrics that describe the performance of these pension funds in an open data platform, which can be accessed at https://www.consar.gob.mx/gobmx/aplicativo/siset/Enlace.aspx.

The information provided by CONSAR allows workers to choose the AFORE that can provide them with the best benefits in their retirement. Because of their importance, we consider two of these statistics: the Net Return Indicator (IRN), an indicator of the average of the short-, medium- and long-term returns offered by an investment fund, above the cost of a life annuity, minus the applicable commissions, which reflects the past performance of the investments in each fund; and the tracking error (ES), an indicator that shows the average difference between the actual investment path of a fund and its optimal glide path.

In general, it is considered that these two variables, IRN and ES, maintain a positive dependency relationship; that is, a higher return may come with a higher error (risk). We use our semiparametric copula model to verify this assumption and to quantify the possible degree of dependence. If these two variables were independent or negatively dependent, workers could freely choose the AFORE that maximises the IRN without incurring any risk. The available data consist of \(n=100\) observations of the variables IRN and ES in December 2021. As a first step we apply the rank transformation given in (2) to the original data. In Fig. 4 we show scatter plots of the original data (upper left panel) and the rank transformed data (upper right panel). Note that the scale of the data changes, but the main features of the dependence are maintained.
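As a small illustration, the rank transformation of each coordinate can be sketched as follows. This is a minimal sketch assuming the common convention \(u_i=\text{rank}(x_i)/n\), which is one standard version of the modified rank transformation; the exact definition in (2) may differ slightly (e.g. in the normalising constant).

```python
import numpy as np

def rank_transform(x, y):
    """Map a bivariate sample to normalised ranks, u_i = rank(x_i)/n.

    A common convention for the (modified) rank transformation; the
    paper's Eq. (2) may use a slightly different normalisation.
    """
    n = len(x)
    # double argsort yields the rank (1..n) of each observation
    u = np.argsort(np.argsort(x)) + 1
    v = np.argsort(np.argsort(y)) + 1
    return u / n, v / n

u, v = rank_transform(np.array([3.2, 1.1, 2.5, 4.0]),
                      np.array([10.0, 20.0, 5.0, 15.0]))
print(u)  # [0.75 0.25 0.5  1.  ]
print(v)  # [0.5  1.   0.25 0.75]
```

The transformed pairs live on \((0,1]^2\) while preserving the ordering, and hence the dependence structure, of the original data.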

Fig. 4
figure 4

Real data. Scatter plots (top row) of original data (left) and rank transformed data (right); Bayesian copula estimators (bottom), density (left) and CDF (right) obtained with \(m=5\) and \(c=2\)

To define the prior distribution we used the same specifications as in the simulations of the previous section, \(a=0.1\), \(b=0.1\), \(c_{j,k} \in \{0,1,2\}\), except for m, which is now taken in \(m\in \{4,5\}\) due to the smaller sample size relative to that of the previous section. We consider the same MCMC specifications as those used for the simulation study. Our posterior sampling procedure behaves well, with good convergence of the chains, and the adaptation reaches the desired target. Computational times using an Intel Core i7 microprocessor average around 30 min.

In Table 3 we report some GOF measures, namely the Spearman's rho estimate and the LPML. According to the LPML, the values \(m=5\) and \(c=2\) are preferred. The 95% credible interval for Spearman's rho is (0.013, 0.275), which confirms that the association is positive. For reference, the sample Spearman's rho is 0.177; however, on its own there is no way of knowing whether this value is significantly positive, whereas our model confirms that it is. These estimates suggest a weak positive dependence between the return and the risk in an investment fund. Therefore, workers should pay attention to the IRN indicator as well as the ES in order to make a decision. Finally, in Fig. 4 we also show the Bayesian estimators of the copula density as a heatmap (bottom left panel) and of the copula CDF as a perspective plot (bottom right panel). In the heatmap we can appreciate slightly more intense colors along the 45 degree diagonal, which confirms the existence of a positive dependence.
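For a piecewise-constant copula density of the form \(f_C(u,v)=m^2\theta_{j,k}\) on cell \(Q_{j,k}=\left(\frac{j-1}{m},\frac{j}{m}\right]\times\left(\frac{k-1}{m},\frac{k}{m}\right]\), the Spearman's rho integral in (1) admits a closed form, since \(\int_{Q_{j,k}}uv\,\hbox{d}u\,\hbox{d}v=\frac{(2j-1)(2k-1)}{4m^4}\), giving \(\rho=\frac{3}{m^2}\sum_{j,k}\theta_{j,k}(2j-1)(2k-1)-3\). A minimal sketch of this computation, where the array `theta` is a hypothetical stand-in for a posterior estimate of the \(\theta_{j,k}\)'s:

```python
import numpy as np

def spearman_rho(theta):
    """Spearman's rho for the piecewise-constant copula density
    f_C(u,v) = m^2 * theta[j,k] on cell Q_{j,k}; Eq. (1) reduces to
    rho = (3/m^2) * sum_{j,k} theta[j,k]*(2j-1)*(2k-1) - 3."""
    m = theta.shape[0]
    w = 2 * np.arange(1, m + 1) - 1      # weights (2j-1), j = 1..m
    return (3 / m**2) * (w @ theta @ w) - 3

# sanity check: the independence case theta = 1/m^2 gives rho = 0
m = 5
theta = np.full((m, m), 1 / m**2)
print(spearman_rho(theta))  # approximately 0
```

As a further check, concentrating all mass on the diagonal cells (\(\theta_{j,j}=1/m\)) gives \(\rho=1-1/m^2\), which approaches the comonotone value 1 as m grows.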

Table 3 Real data: Posterior 95% CI and GOF measure for rank transformed data

6 Concluding remarks

We proposed a semiparametric copula model that is flexible enough to approximate the dependence between any two random variables. The maximum likelihood estimator of our model coincides with the sample copula of González-Barrios and Hoyos-Argüelles (2018) under certain conditions, namely rank transformation of the data and a choice of m that divides n. However, our model is more general and, thanks to the Bayesian analysis, produces better estimates by borrowing strength among grid neighbours through the prior distribution.

The computational times reported depend on the sample size; for datasets with larger sample sizes (\(n>200\)), faster computers or low-level programming languages, such as C++ or Fortran, might be needed.

Throughout this paper we have concentrated on the bivariate copula; however, the extension to a d-dimensional copula can also be considered. For instance, if we consider a partition of size \(m^d\), \(2\le m\le n\), of \([0,1]^d\) such that \(Q_{j_1,\ldots ,j_d}=\times _{k=1}^d \left( \frac{j_k-1}{m},\frac{j_k}{m}\right] \) for \(j_k=1,\ldots ,m\) and \(k=1,\ldots ,d\), then a semiparametric d-copula density would be

$$\begin{aligned} f_C(u_1,\ldots ,u_d\mid \varvec{\theta })=m^d\sum _{j_1=1}^m\cdots \sum _{j_d=1}^m \theta _{j_1,\ldots ,j_d}I((u_1,\ldots ,u_d)\in Q_{j_1,\ldots ,j_d}), \end{aligned}$$

where \(\varvec{\theta }=\{\theta _{j_1,\ldots ,j_d}:j_1,\ldots ,j_d=1,\ldots ,m\}\) is the set of model parameters, satisfying \(\sum _{j_k=1}^m\theta _{j_1,\ldots ,j_d}=\frac{1}{m^{d-1}}\) for all \(k=1,\ldots ,d\) and \(\sum _{j_1=1}^m\cdots \sum _{j_d=1}^m\theta _{j_1,\ldots ,j_d}=1\). Extending the prior to this d-dimensional setting is also possible. The performance of our semiparametric copula model in this multivariate setting is worth studying.
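As an illustration, the d-dimensional density above can be evaluated with a short routine. This is a sketch only, assuming \(\varvec{\theta }\) is stored as a d-dimensional array and using the half-open cells \(\left(\frac{j_k-1}{m},\frac{j_k}{m}\right]\) defined above:

```python
import numpy as np

def dcopula_density(u, theta):
    """Evaluate the semiparametric d-copula density
    f_C(u | theta) = m^d * theta[j_1,...,j_d] for u in cell Q_{j_1,...,j_d}.

    theta is an m x ... x m (d-dimensional) array; u is a point in [0,1]^d.
    """
    m = theta.shape[0]
    d = theta.ndim
    # cell index j_k is determined by u_k in ((j_k-1)/m, j_k/m];
    # clip handles the boundary points u_k = 0 and u_k = 1
    idx = np.clip(np.ceil(np.asarray(u) * m).astype(int) - 1, 0, m - 1)
    return m**d * theta[tuple(idx)]

# sanity check: the uniform case theta = 1/m^d gives the independence
# copula, whose density is identically 1
m, d = 4, 3
theta = np.full((m,) * d, 1.0 / m**d)
print(dcopula_density([0.3, 0.7, 0.1], theta))  # 1.0
```

Note that the uniform choice \(\theta_{j_1,\ldots,j_d}=1/m^d\) satisfies both constraints above and recovers the independence copula, mirroring the bivariate case.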