1 Introduction

In recent years, there has been great demand for models describing the evolution of environmental or geophysical spatial processes. In particular, there is a considerable need for modeling the simultaneous behavior of different variables observed in the same spatial region and for providing accurate prediction maps with associated uncertainties.

Multivariate Gaussian random fields (MGRF throughout) are important tools to model and predict this kind of data. Since optimal prediction at unobserved locations requires knowledge of the covariance function, a common practice in geostatistics is to select a parametric class of covariance functions, so that the covariance is known up to a parameter vector, which is then estimated from the data.

Nevertheless, modeling the covariance function of such MGRF is not an easy task. For instance, the linear model of coregionalization [LMC, see Wackernagel (2003)] has been criticized for a number of drawbacks; we refer the reader to Porcu et al. (2013), Gneiting et al. (2010) and Genton and Kleiber (2015), among others.

The lack of flexible multivariate models has justified recent efforts (see Genton and Kleiber (2015) for a review). Among them, Gneiting et al. (2010) and Apanasovich et al. (2012) extend the Matérn model to the multivariate case, and Li and Zhang (2011) propose a class of asymmetric models obtained from stationary multivariate symmetric covariance models. These models are very flexible and can capture both the marginal and the cross-spatial dependence, as well as different levels of smoothness, the colocated correlation between the components and the possible presence of asymmetry in the data.

The literature on estimation of MGRF has focused on the LMC model. Goulard and Voltz (1992) and Pelletier et al. (2004) proposed estimating it by extending least squares-type estimators, while Zhang (2007) proposed a maximum likelihood estimation method capable of handling high-dimensional data by exploiting the EM algorithm.

Effective estimation methods for covariance models outside the LMC are somewhat unexplored. Maximum likelihood (ML) is probably the best method of estimation when dealing with MGRF, but the exact computation of the likelihood, for N irregularly spaced data, requires \(O((Np)^3)\) operations and \(O((Np)^2)\) memory, with p being the number of components of the MGRF.

In the case of scalar-valued fields, different approaches have been proposed in order to find estimation methods with a good balance between computational complexity and statistical efficiency for large data sets. Some of these approaches are based on the composite likelihood (CL) method as, for instance, in Bevilacqua et al. (2012), Bevilacqua and Gaetan (2015), Stein et al. (2004) and Eidsvik et al. (2014).

Composite likelihood (Lindsay 1988) is a general class of estimation methods based on the likelihood of marginal or conditional events (see Varin et al. (2011) for a review), useful for performing statistical inference in complex problems where standard likelihood methods are difficult to apply. In this paper, we propose two estimation methods for multivariate covariance models, based on the CL idea: the former is based on the weighted cross-pairwise likelihood (\(pl_{1}\)) and the latter on weighted pairs of pairwise likelihoods (\(pl_2\)).

We discuss pros and cons of the two methods, including the type of weights that can improve the statistical and computational efficiency, the computation of the standard errors and the asymptotic properties of the estimates under increasing domain asymptotics. We show that the \(pl_1\) and \(pl_2\) estimation methods are useful tools when estimating the covariance model of a MGRF. This is done through simulation experiments, considering different covariance models, such as the multivariate Matérn, the LMC and the asymmetric model proposed in Li and Zhang (2011), using the multivariate covariance tapering (CT) (Bevilacqua et al. 2016) as a benchmark.

Finally, we apply the proposed method to the analysis of a large bivariate dataset (approximately 70000 observations) on chlorophyll concentration and sea surface temperature along the Chilean coast. In particular, we fit a bivariate Matérn model and consider the problem of model selection among three different versions of the bivariate Matérn (separable, constrained and full).

The paper is organized as follows. Section 2 briefly reviews some multivariate covariance models and their estimation through the ML method. Section 3 describes our proposal. In Sect. 4, through a simulation study, we compare \(pl_{1}\) and \(pl_{2}\) in terms of statistical and computational efficiency, using the ML and CT methods as benchmarks. The real data example is described in Sect. 5. Finally, in Sect. 6, we give some conclusions. The Supplementary material contains the asymptotic results of the proposed methods.

2 Multivariate Covariance Models and Maximum Likelihood

Let \(\mathbf{Z}(\varvec{s})= \{(Z_1(\varvec{s}),\ldots ,Z_p(\varvec{s}))^{T} \}\) be a p-variate Gaussian field with continuous spatial index \(\varvec{s}\in {\mathbb {R}}^d\). The assumption of Gaussianity implies that the first- and second-order moments characterize completely the finite-dimensional distributions. In particular, we assume weak stationarity throughout, so that the mean vector \(\varvec{\mu }= {\mathbb {E}}( \mathbf{Z})\) is constant and the covariance function between \(\mathbf{Z}(\varvec{s}_1)\) and \(\mathbf{Z}(\varvec{s}_2)\), for any pair \(\varvec{s}_1,\varvec{s}_2\) in the spatial domain, is represented by a mapping \(\varvec{C}: {\mathbb {R}}^d \rightarrow M_{p \times p}\) defined through

$$\begin{aligned} \varvec{C}(\varvec{h})= \left[ C_{ij}(\varvec{h})\right] _{i,j=1}^p = \left[ \mathrm{cov} \left( Z_i(\varvec{s}_1),Z_j(\varvec{s}_2) \right) \right] _{i,j=1}^p, \quad \varvec{h}= \varvec{s}_1-\varvec{s}_2 \in {\mathbb {R}}^d. \end{aligned}$$
(1)

The function \(\varvec{C}(\varvec{h})\) is called a matrix-valued covariance function. Here, \(M_{p \times p}\) is the set of square, symmetric and positive definite matrices. For \(i=j\), the functions \(C_{ii}\) are called autocovariances or marginal covariances of \(Z_i (\cdot )\), \(i=1,\ldots ,p\), while for \(i \ne j\) the mapping \(C_{ij}\) is called the cross-covariance between \(Z_i(\cdot )\) and \(Z_j(\cdot )\). The matrix-valued mapping \(\varvec{C}\) must be positive definite, which means that, for a given realization \(\varvec{Z}=(\varvec{Z}(\varvec{s}_1)^{T},\ldots , \varvec{Z}(\varvec{s}_N)^T)^T\), the \((Np)\times (Np)\) covariance matrix \(\Gamma := [ \varvec{C}(\varvec{s}_m-\varvec{s}_n)]_{m,n=1}^N\) is positive definite, where \(\varvec{Z}(\varvec{s}_i)=(Z_{1}(\varvec{s}_{i}),\ldots ,Z_{p}(\varvec{s}_{i}))^T\) and \(\varvec{C}(\varvec{s}_m-\varvec{s}_n)=[C_{ij}(\varvec{s}_m-\varvec{s}_n)]_{i,j=1}^p\), for \(m,n=1,\ldots ,N\), is the generic submatrix. Hereafter, for convenience of notation, we set \(C_{ij}(\varvec{s}_m-\varvec{s}_n)=c_{ijmn}\).

On the other hand, the second-order properties of the MGRF can be represented by the cross-variogram, defined as

$$\begin{aligned} 2\gamma _{ij}({\varvec{s}}_{1}-{\varvec{s}}_{2}) = {\mathbb {E}}[ \{Z_{i}({\varvec{s}}_{1})-Z_{i}({\varvec{s}}_{2})\} \{Z_{j}({\varvec{s}}_{1})-Z_{j}({\varvec{s}}_{2})\} ]. \end{aligned}$$

Note that, under weak stationarity assumptions, the linear identity

$$\begin{aligned} 2\gamma _{ij}({\varvec{h}}) = 2C_{ij}({\varvec{0}}) - C_{ij}({\varvec{h}}) - C_{ji}({\varvec{h}}) \end{aligned}$$

shows that covariance and variogram are equivalent in terms of modeling.
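The identity above can be checked directly. Since the mean is constant, the increments have zero mean and the expectation in the definition of the cross-variogram is a covariance. Expanding the product, with \(\varvec{h}=\varvec{s}_1-\varvec{s}_2\), gives

$$\begin{aligned} 2\gamma _{ij}(\varvec{h}) &= \mathrm{cov}(Z_i(\varvec{s}_1),Z_j(\varvec{s}_1)) - \mathrm{cov}(Z_i(\varvec{s}_1),Z_j(\varvec{s}_2)) - \mathrm{cov}(Z_i(\varvec{s}_2),Z_j(\varvec{s}_1)) + \mathrm{cov}(Z_i(\varvec{s}_2),Z_j(\varvec{s}_2)) \\ &= 2C_{ij}(\varvec{0}) - C_{ij}(\varvec{h}) - C_{ij}(-\varvec{h}), \end{aligned}$$

and \(C_{ij}(-\varvec{h})=\mathrm{cov}(Z_i(\varvec{s}_2),Z_j(\varvec{s}_1))=C_{ji}(\varvec{h})\), which yields the stated form.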

In order to illustrate the estimation methods proposed in this paper, we shall assume throughout that the mapping \(\varvec{C}\) comes from a parametric family of matrix-valued covariances \(\{\varvec{C}(\cdot ; \varvec{\theta }), \varvec{\theta }\in \varvec{\Theta }\subseteq {\mathbb {R}}^k\}\), with \(\varvec{\Theta }\) an arbitrary parameter space. Recent literature has concentrated on proposing new models for matrix-valued covariances. For a thorough review, the reader is referred to Genton and Kleiber (2015), with their exhaustive list of references. Here, we shall merely list those parametric models that will be used throughout the paper. We have already mentioned the linear model of coregionalization (LMC), which has been popular for over 30 years (Wackernagel 2003). It consists of representing the p-variate Gaussian field as a linear combination of q independent univariate fields, with \(q=1,\ldots ,p\). The resulting matrix-valued covariance function takes the form:

$$\begin{aligned} \varvec{C}(\varvec{h};\varvec{\theta })= \left[ \sum _{k=1}^q \alpha _{ik}\alpha _{jk}R_k(\varvec{h},\varvec{\beta }_k)\right] _{i,j=1}^p, \end{aligned}$$
(2)

with \(A:=\left[ \alpha _{lk} \right] _{l,k=1}^{p,q}\) being a \(p\times q\)-dimensional matrix with full rank and with \(R_k\) being a univariate parametric correlation model with parameter vector \({\varvec{\beta }}_{k}\). Clearly, we have \(\varvec{\theta }= (\mathrm{vec}(A)^{T},\varvec{\beta _1}^{T},\ldots ,\varvec{\beta _q}^{T})^T\).

Constructive criticism of this model has been expressed, for instance, by Gneiting et al. (2010), Apanasovich et al. (2012) and Daley et al. (2015). For instance, if \(\alpha _{ik}\ne 0\) for each i and k, the smoothness of any component defaults to that of the roughest latent process.
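The construction in Eq. (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the latent correlations \(R_k\) are taken to be exponential, the lag is passed as a scalar distance, and the loading matrix `A` and scales `betas` are hypothetical values, not taken from the paper.

```python
import math

def exp_corr(dist, beta):
    """Exponential correlation exp(-dist/beta); an assumed choice for R_k."""
    return math.exp(-abs(dist) / beta)

def lmc_cov(dist, A, betas):
    """p x p matrix of Eq. (2): C_ij = sum_k a_ik * a_jk * R_k(dist; beta_k)."""
    p, q = len(A), len(A[0])
    r = [exp_corr(dist, betas[k]) for k in range(q)]
    return [[sum(A[i][k] * A[j][k] * r[k] for k in range(q))
             for j in range(p)] for i in range(p)]

A = [[1.0, 0.5], [0.3, 0.8]]   # hypothetical loadings (full rank)
betas = [0.1, 0.3]             # hypothetical scale parameters
C0 = lmc_cov(0.0, A, betas)    # at lag zero R_k = 1, so C0 equals A A^T
C1 = lmc_cov(0.2, A, betas)    # covariance matrix at distance 0.2
```

At lag zero the matrix reduces to \(AA^{T}\), which gives a quick sanity check on the implementation.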

Another popular construction, called separable, is obtained through:

$$\begin{aligned} \varvec{C}(\varvec{h},\varvec{\theta })= \left[ \rho _{ij}\sigma _i\sigma _j R(\varvec{h},\varvec{\psi })\right] _{i,j=1}^p , \quad \rho _{ii}=1, \end{aligned}$$
(3)

where \(R(\cdot ,\varvec{\psi })\) is a univariate parametric correlation model, \(\sigma ^2_i>0\), \(i=1,\ldots ,p\), are the marginal variances, and \(\rho _{ij}\) is the colocated correlation coefficient between \(Z_i(\varvec{s})\) and \(Z_j(\varvec{s})\). In this case, \(\varvec{\theta }= (\varvec{\rho },\varvec{\psi }^{T},\sigma ^2_1,\ldots , \sigma _p^2)^T\) where \(\varvec{\rho }\) is the vector containing all the pairwise colocated correlation coefficients \(\rho _{ij}\), \(i=1,\ldots , p-1\), \(j=i+1, \ldots , p\). This construction assumes that the components of the multivariate random field have the same spatial correlation structure. Therefore, the model is not able to capture the different scales and smoothness of the components. Note that the separable model is a special case of the LMC model in Eq. (2).

A generalization of model (3) that overcomes this drawback is:

$$\begin{aligned} \varvec{C}(\varvec{h},\varvec{\theta })= \left[ \rho _{ij} \sigma _i \sigma _j R(\varvec{h}; \varvec{\psi }_{ij})\right] _{i,j=1}^p, \quad \rho _{ii}=1. \end{aligned}$$
(4)

Bevilacqua et al. (2015) use this general class in order to test the significance of the correlation between the components of a bivariate Gaussian random field. In this general approach, the difficulty lies in deriving conditions on the model parameters that result in a valid multivariate covariance model. For instance, Gneiting et al. (2010) proposed a special case of the construction in Eq. (4), with R belonging to the Matérn family, being defined through:

$$\begin{aligned} R(\varvec{h}; \varvec{\psi })= \frac{2^{1-\nu }}{\Gamma (\nu )} \left( \frac{ \Vert \varvec{h}\Vert }{\alpha } \right) ^{\nu } {\mathcal {K}}_{\nu } \left( \frac{ \Vert \varvec{h}\Vert }{\alpha }\right) , \end{aligned}$$
(5)

where \(\varvec{\psi }= (\alpha ,\nu )\), \(\alpha >0\) is the scale parameter, and \(\nu >0\) is a smoothness parameter. In the bivariate case, the authors find necessary and sufficient conditions based on the parameters for positive definiteness, while for the case \(p\ge 3 \) they only offer sufficient conditions. This kind of construction allows for a nice closed form, together with the possibility of different spatial scales and smoothness, different variances and colocated correlation parameters.
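As an illustration of Eqs. (4) and (5), the sketch below evaluates the Matérn correlation in the parameterization of Eq. (5) for the half-integer smoothness values \(\nu \in \{0.5, 1.5, 2.5\}\), where the Bessel function reduces to a polynomial times an exponential, and assembles a \(2\times 2\) covariance matrix. The function names and parameter values are hypothetical, and the code does not check the validity conditions on the parameters; the example uses a common scale and smoothness, which corresponds to the separable model (3).

```python
import math

def matern_half_integer(dist, alpha, nu):
    """Matern correlation of Eq. (5), closed form for nu in {0.5, 1.5, 2.5}."""
    x = dist / alpha
    if nu == 0.5:
        poly = 1.0
    elif nu == 1.5:
        poly = 1.0 + x
    elif nu == 2.5:
        poly = 1.0 + x + x * x / 3.0
    else:
        raise ValueError("closed form implemented only for nu = 0.5, 1.5, 2.5")
    return poly * math.exp(-x)

def bivariate_matern(dist, sigma, rho12, alpha, nu):
    """2 x 2 matrix of Eq. (4); alpha and nu are dicts keyed by (i, j), i <= j.
    Validity of the parameter combination is NOT checked here."""
    C = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            key = (min(i, j), max(i, j))
            rho = 1.0 if i == j else rho12
            C[i][j] = (rho * sigma[i] * sigma[j]
                       * matern_half_integer(dist, alpha[key], nu[key]))
    return C

# Hypothetical parameters: common scale and smoothness (a separable model).
alpha = {(0, 0): 0.1, (0, 1): 0.1, (1, 1): 0.1}
nu = {(0, 0): 0.5, (0, 1): 0.5, (1, 1): 0.5}
C_zero = bivariate_matern(0.0, [1.0, 1.0], 0.5, alpha, nu)
C_lag = bivariate_matern(0.1, [1.0, 1.0], 0.5, alpha, nu)
```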

The previous models are symmetric, i.e., they assume that \(C_{ij}({\varvec{h}})=C_{ji}({\varvec{h}})\), but in some circumstances, such a hypothesis can be restrictive. In this case, asymmetric models can be used, such as the one proposed in Li and Zhang (2011):

$$\begin{aligned} \varvec{C}(\varvec{h},\varvec{\theta })=[C^*_{ij}({\varvec{h}}+{\varvec{a}}_i-{\varvec{a}}_j; \varvec{\vartheta })]_{i,j=1}^p, \end{aligned}$$
(6)

where \(C^*_{ij}({\varvec{h}},\varvec{\vartheta } )\) is a valid symmetric parametric covariance model. Here \(\varvec{\theta }{=}(\varvec{\vartheta },\varvec{a}_1,\ldots , \varvec{a}_p)^T\) where \({\varvec{a}}_i \in {\mathbb {R}}^d\), for \(i=1,\ldots ,p\), are vector parameters that introduce asymmetry into the symmetric model.
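The effect of the shifts in Eq. (6) can be seen in a small sketch. The symmetric model \(C^*_{ij}\) used here (a separable exponential with unit variances) and the shift vectors are hypothetical choices made only for illustration, with \(d=2\).

```python
import math

def sym_cov(h, i, j, rho12=0.5, scale=0.2):
    """Hypothetical symmetric model C*_ij: separable exponential, unit variances."""
    rho = 1.0 if i == j else rho12
    return rho * math.exp(-math.hypot(h[0], h[1]) / scale)

def asym_cov(h, i, j, shifts):
    """Eq. (6): C_ij(h) = C*_ij(h + a_i - a_j)."""
    lag = [hk + ai - aj for hk, ai, aj in zip(h, shifts[i], shifts[j])]
    return sym_cov(lag, i, j)

shifts = [(0.0, 0.0), (0.05, 0.0)]   # hypothetical a_1 and a_2 in R^2
h = (0.1, 0.0)
c12 = asym_cov(h, 0, 1, shifts)      # evaluated at shifted lag (0.05, 0)
c21 = asym_cov(h, 1, 0, shifts)      # evaluated at shifted lag (0.15, 0)
c21_rev = asym_cov((-0.1, 0.0), 1, 0, shifts)  # C_21(-h)
```

Here `c12` and `c21` differ, while `c12` agrees with `c21_rev`, reflecting the general relation \(C_{ij}(\varvec{h})=C_{ji}(-\varvec{h})\).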

Since we are assuming that the state of truth is represented by some parametric family of matrix-valued covariances \(\{\varvec{C}(\cdot ; \varvec{\theta }), \varvec{\theta }\in {\varvec{\Theta }} \subseteq {\mathbb {R}}^k\}\), we may use the abuse of notation \(\Gamma (\varvec{\theta })\) for the covariance matrix \(\Gamma \), in order to emphasize the dependence on the unknown parameter vector. For an N-dimensional realization from a p-variate Gaussian field, the log-likelihood can be written as

$$\begin{aligned} l(\varvec{\theta })=-\frac{1}{2}\log |\Gamma (\varvec{\theta }) |-\frac{1}{2}{\varvec{Z}}\,^T [\Gamma (\varvec{\theta })] ^{-1}{\varvec{Z}}. \end{aligned}$$
(7)
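A minimal sketch of how Eq. (7) can be evaluated through a Cholesky factorization is given below, in plain Python and for small matrices only. It assumes a zero-mean field and omits the additive constant, as in Eq. (7); the helper names are hypothetical. The triple loop in `cholesky` is where the cubic cost in the matrix dimension arises.

```python
import math

def cholesky(G):
    """Lower-triangular L with G = L L^T (plain cubic-time algorithm)."""
    n = len(G)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = G[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

def gaussian_loglik(G, z):
    """Eq. (7): -0.5 * log|G| - 0.5 * z^T G^{-1} z."""
    L = cholesky(G)
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(len(z)))
    # Forward substitution solves L y = z, so that z^T G^{-1} z = y^T y.
    y = []
    for i in range(len(z)):
        y.append((z[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return -0.5 * logdet - 0.5 * sum(v * v for v in y)

# Toy 2 x 2 example with hypothetical values.
ll = gaussian_loglik([[2.0, 1.0], [1.0, 2.0]], [1.0, 0.0])
```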

The most time-consuming part when calculating (7) is to evaluate the determinant and inverse of \(\Gamma (\varvec{\theta })\). The most widely used algorithms, such as Cholesky decomposition, require up to \({\mathcal {O}}((Np)^ 3)\) operations and \(O((Np)^2)\) memory, and this can be prohibitive if N is large. Various approximation methods have been developed to address this computational problem. For instance, the CT method has been proposed in order to reduce the number of operations and memory requirement in the case of scalar-valued Gaussian fields, for both estimation (see Kaufman et al. (2008), Shaby and Ruppert (2012), and Bevilacqua et al. (2016) in the multivariate case) and prediction (Furrer et al. 2006). The method consists in setting to zero certain elements of the covariance matrix \(\Gamma (\varvec{\theta })\), by multiplying \(\Gamma (\varvec{\theta })\) element by element with a sparse matrix coming from a multitaper function. Let us denote with \(\varvec{T}(\mathbf{d})\) the \(Np \times Np\) matrix associated with a multitaper function as, for instance, those described in Bevilacqua et al. (2016).

The “tapered” covariance matrix is then obtained through \(\Gamma _{CT}(\varvec{\theta })= \Gamma (\varvec{\theta }) \circ \varvec{T}(\mathbf{d})\), where \(\circ \) denotes the Schur product. The multitaper vector parameters \(\mathbf{d}\) include the (possibly different) compact supports, these being fixed in a way to determine the desired level of sparseness for the construction above. Then, the tapered likelihood is defined as:

$$\begin{aligned} l_{{CT}}(\varvec{\theta },\mathbf{d})=-\frac{1}{2}\log |\Gamma _{CT}(\varvec{\theta }) |-\frac{1}{2}{\varvec{Z}}\,^{T}(\Gamma _{CT}(\varvec{\theta })^{-1}\circ \varvec{T}(\mathbf{d}))\varvec{Z}, \end{aligned}$$
(8)

and algorithms for sparse matrices can be exploited in order to compute (8) efficiently.

3 Composite Likelihood Methods

We propose two alternative versions of the CL approach in the multivariate context:

  (a)

    First, let us consider the bivariate random vector based on the cross-pairs \({\varvec{Y}}_{ijmn}=\left[ Z_{i}(\varvec{s}_{m}),Z_{j}(\varvec{s}_{n})\right] ^T\), for \(i,j=1,\ldots ,p\) and \(m,n=1,\ldots ,N\), with associated log-likelihood

    $$\begin{aligned} l_{ijmn}(\varvec{\theta }) =-\frac{1}{2}\left( \log |Q_{ijmn}|+{\varvec{Y}}_{ijmn}^\intercal Q_{ijmn}^{-1}{\varvec{Y}}_{ijmn}\right) , \, {\varvec{\theta }}\in {\varvec{\Theta }}, \end{aligned}$$

    where

    $$\begin{aligned} Q_{ijmn}=\left( \begin{array}{cc}c_{iimm}&{}c_{ijmn}\\ -&{}c_{jjnn} \end{array} \right) . \end{aligned}$$
    (9)

Here, for convenience of notation, we drop the dependence on \(\varvec{\theta }\) of the generic element \(c_{ijmn}\). We define the index set \(\Lambda _{1}=\Lambda _{1}^{(1)}\cup \Lambda _{1}^{(2)} \cup \Lambda _{1}^{(3)}\), where

$$\begin{aligned} \Lambda _{1}^{(1)}= & {} \{(i,j,m,n): m=1,\ldots ,N-1, \, n=m+1,\ldots ,N, \, i{=}j{=}1,\ldots ,p\}, \\ \Lambda _{1}^{(2)}= & {} \{(i,j,m,n): m=1,\ldots ,N, \, n=m,\ldots ,N, \, i{=}1,\ldots ,p-1, \, j{=}i{+}1,\ldots ,p\}, \\ \Lambda _{1}^{(3)}= & {} \{(i,j,m,n): m{=}1,\ldots ,N-1, \, n{=}m+1,\ldots ,N,\\&\, j{=}1,\ldots ,p-1, \, i{=}j+1,\ldots ,p\}. \end{aligned}$$

Then, the first objective function, based on cross-pairs, is defined as

$$\begin{aligned} pl_{1}(\varvec{\theta })={\displaystyle \sum _{(i,j,m,n)\in \Lambda _1}} l_{ijmn}(\varvec{\theta })w_{ijmn}, \end{aligned}$$
(10)

where \(w_{ijmn}\) are suitable positive weights.

  (b)

    The second version of CL we propose is based on the four-dimensional random vector \({\varvec{X}}_{ijmn}=\left[ Z_{i}(\varvec{s}_{m}),Z_{i}(\varvec{s}_{n}),Z_{j}(\varvec{s}_{m}),Z_{j}(\varvec{s}_{n})\right] ^T\), for \(i,j=1,\ldots ,p\) and \(m,n=1,\ldots ,N\), with associated log-likelihood

    $$\begin{aligned} g_{ijmn}(\varvec{\theta })=-\frac{1}{2}\left( \log |\Sigma _{ijmn}|+{\varvec{X}}_{ijmn}^T \Sigma _{ijmn}^{-1}{\varvec{X}}_{ijmn}\right) , \end{aligned}$$

    where

    $$\begin{aligned} \Sigma _{ijmn}=\left( \begin{array}{cccc}c_{iimm}&{}c_{iimn}&{}c_{ijmm}&{}c_{ijmn}\\ -&{}c_{iinn}&{}c_{ijnm}&{}c_{ijnn}\\ -&{}-&{}c_{jjmm}&{}c_{jjmn}\\ -&{}-&{}-&{}c_{jjnn}\\ \end{array} \right) . \end{aligned}$$
    (11)

Thus, the second approach of the CL we propose is defined as:

$$\begin{aligned} pl_{2}(\varvec{\theta })={\displaystyle \sum _{(i,j,m,n)\in \Lambda _2}}g_{ijmn}(\varvec{\theta }) w_{ijmn}, \end{aligned}$$
(12)

where the index set \(\Lambda _2\) is defined as:

$$\begin{aligned} \Lambda _2= & {} \{(i,j,m,n):i=1,\ldots , p-1, \, j=i+1,\ldots , p,\\&\, m=1,\ldots ,N-1, \, n=m+1,\ldots ,N \}. \end{aligned}$$
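The index sets \(\Lambda _1\) and \(\Lambda _2\) and the objective function in Eq. (10) can be sketched as follows. The storage layout `z[i-1][m-1]` for \(Z_i(\varvec{s}_m)\), the callables `cov` (returning \(c_{ijmn}\)) and `weight`, and the toy covariance in the example are assumptions made for illustration. Enumerating the sets also confirms that, without weights, they contain \(Np(Np-1)/2\) and \(Np(N-1)(p-1)/4\) elements, respectively.

```python
import math

def lambda1(N, p):
    """Index set Lambda_1 (1-based indices): union of its three subsets."""
    s1 = [(i, i, m, n) for i in range(1, p + 1)
          for m in range(1, N) for n in range(m + 1, N + 1)]
    s2 = [(i, j, m, n) for i in range(1, p) for j in range(i + 1, p + 1)
          for m in range(1, N + 1) for n in range(m, N + 1)]
    s3 = [(i, j, m, n) for j in range(1, p) for i in range(j + 1, p + 1)
          for m in range(1, N) for n in range(m + 1, N + 1)]
    return s1 + s2 + s3

def lambda2(N, p):
    """Index set Lambda_2 (1-based indices)."""
    return [(i, j, m, n) for i in range(1, p) for j in range(i + 1, p + 1)
            for m in range(1, N) for n in range(m + 1, N + 1)]

def pair_loglik(y1, y2, c11, c12, c22):
    """l_ijmn: bivariate Gaussian log-likelihood with Q = [[c11, c12], [c12, c22]]."""
    det = c11 * c22 - c12 * c12
    if det <= 0.0:
        raise ValueError("Q must be positive definite")
    quad = (c22 * y1 * y1 - 2.0 * c12 * y1 * y2 + c11 * y2 * y2) / det
    return -0.5 * (math.log(det) + quad)

def pl1(z, cov, weight=lambda i, j, m, n: 1.0):
    """Eq. (10); z[i-1][m-1] holds Z_i(s_m) and cov(i, j, m, n) returns c_ijmn."""
    p, N = len(z), len(z[0])
    total = 0.0
    for (i, j, m, n) in lambda1(N, p):
        w = weight(i, j, m, n)
        if w == 0.0:
            continue  # pairs with zero weight are never evaluated
        total += w * pair_loglik(z[i - 1][m - 1], z[j - 1][n - 1],
                                 cov(i, i, m, m), cov(i, j, m, n), cov(j, j, n, n))
    return total

# Toy example: two sites on a line, hypothetical separable exponential model.
sites = [0.0, 0.1]
def cov(i, j, m, n):
    rho = 1.0 if i == j else 0.5
    return rho * math.exp(-abs(sites[m - 1] - sites[n - 1]) / 0.1)

value = pl1([[0.0, 0.0], [0.0, 0.0]], cov)
```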

Therefore, the maximum CL estimator for both methods is given by \(\widehat{\varvec{\theta }}_a{=}{\text {argmax}}_{\varvec{\theta }}\, pl_a(\varvec{\theta })\), \(a=1,2\). Some comments are in order:

  • The function \(pl_{1}(\varvec{\theta })\) involves the marginal pairwise log-likelihood associated with the \(i\)-th component for \(i=1,\ldots ,p\) and the cross-pairwise log-likelihood. In fact, \(pl_{1}(\varvec{\theta })\) can be rewritten as

    $$\begin{aligned} pl_{1}(\varvec{\theta })= & {} {\displaystyle \sum _{i=1}^p} {\displaystyle \sum _{m=1}^{N-1}} {\displaystyle \sum _{n=m+1}^N} l_{iimn}(\varvec{\theta })w_{iimn}+ {\displaystyle \sum _{i=1}^{p-1}}{\displaystyle \sum _{j=i+1}^{p}} {\displaystyle \sum _{m=1}^{N}} {\displaystyle \sum _{n=m}^N} l_{ijmn}(\varvec{\theta })w_{ijmn}\\&+{\displaystyle \sum _{j=1}^{p-1}} {\displaystyle \sum _{i=j+1}^{p}}{\displaystyle \sum _{m=1}^{N-1}} {\displaystyle \sum _{n=m+1}^N} l_{ijmn}(\varvec{\theta })w_{ijmn}. \end{aligned}$$

    The first term is the contribution of each pairwise likelihood associated with the \(i\)-th component. The second and the third terms, instead, involve the cross-pairwise log-likelihoods. The reason to split the contribution of the cross-pairwise log-likelihood into two parts is that \(l_{ijmn}(\varvec{\theta }) \ne l_{jimn}(\varvec{\theta })\) because \(c_{ijmn}(\varvec{\theta }) \ne c_{jimn}(\varvec{\theta })\) under asymmetry. In the special case of multivariate symmetric models, \(pl_{1}(\varvec{\theta })\) simplifies to

    $$\begin{aligned} pl_{1}(\varvec{\theta })={\displaystyle \sum _{i=1}^p} {\displaystyle \sum _{m=1}^{N-1}} {\displaystyle \sum _{n=m+1}^N} l_{iimn}(\varvec{\theta })w_{iimn} +{\displaystyle \sum _{i=1}^{p-1}}{\displaystyle \sum _{j=i+1}^p} {\displaystyle \sum _{m=1}^{N}} {\displaystyle \sum _{n=1}^N} l_{ijmn}(\varvec{\theta })w_{ijmn}. \end{aligned}$$
    (13)

    Genton et al. (2015) use (13) in order to estimate multivariate max-stable processes, considering only multivariate symmetric covariances of a certain type.

  • The role of the weights \(w_{ijmn}\) in \(pl_{a}(\varvec{\theta })\), \(a=1,2\), is to save computational time and to improve statistical efficiency. Weight functions with compact support have clear computational advantages and can improve the statistical efficiency, as shown in Joe and Lee (2009), Davis and Yau (2011) and Bevilacqua et al. (2012). A possible choice for the weights \(w_{ijmn}\) in Eqs. (10) and (12) is the simple cutoff function:

    $$\begin{aligned} u_{mn}(\kappa _{ij}) = {\left\{ \begin{array}{ll}1,&{} h_{ijmn} \le 1 \\ 0, &{} h_{ijmn} > 1\end{array}\right. }, \end{aligned}$$
    (14)

    where \(\displaystyle h_{ijmn}=\frac{\Vert {\varvec{s}}_{m}-{\varvec{s}}_{n}\Vert }{\kappa _{ij}}\), or alternatively, a smoother function such as a correlation function with compact support can be used. An example is the Bohman function (Gneiting 2002):

    $$\begin{aligned} v_{mn}(\kappa _{ij}) = {\left\{ \begin{array}{ll}\displaystyle \left( 1-h_{ijmn}\right) \left( \frac{\sin (2 \pi h_{ijmn})}{2 \pi h_{ijmn}} \right) + \left( \frac{1-\cos (2 \pi h_{ijmn})}{2 \pi ^2 h_{ijmn}} \right) , &{} h_{ijmn}\le 1 \\ 0, &{} h_{ijmn}>1 \end{array}\right. }. \end{aligned}$$
    (15)

    The choice of the compact supports, \(\kappa _{ij}\)’s, in the weight function is not in general an easy task. Bevilacqua et al. (2012) proposed, in the univariate case, a formal criterion based on the optimization of the trace of the Godambe information matrix, but this method can be computationally hard. An ad hoc strategy is to fix the compact support as a fraction of the (empirically estimated) practical range of the covariance function (Bevilacqua and Gaetan 2015). We follow this strategy in the simulation study.

    Note that with the proposal denoted \(pl_1\), it is possible to use different types of weights. In particular, Eq. (10) shows that it is possible to use different weights for the marginal and cross-pairwise likelihoods. Instead, for the \(pl_2\) case, only weights of the second type are allowed. For instance, if \(p=2\), then we can use \(w_{11mn}\), \(w_{22mn}\) and \(w_{12mn}\) for \(pl_1\), while for \(pl_2\) we can just use \(w_{12mn}\). In this respect, \(pl_1\) is more flexible since it allows different weight functions to be used.

  • For \(pl_1\), a bivariate Gaussian density must be evaluated for each summand, while \(pl_2\) requires a four-variate Gaussian density for each summand. The order of computation is, in the first case, \({\mathcal {O}}\bigg (\frac{Np(Np-1)}{2}\bigg )\), while in the second case, it is \({\mathcal {O}}\bigg (\frac{Np(N-1)(p-1)}{4}\bigg )\), when considering the unweighted version of the two methods. When considering the weighted versions, the order of computation depends also on the choice of the (compactly supported) weight functions. In the simulation study, we compare the computational performance of the two methods using both the weighted and unweighted versions.

  • One of the benefits of \(pl_{1}\) and \(pl_{2}\) is that, as outlined in Varin et al. (2011), composite likelihood requires only model assumptions on the lower-dimensional marginal densities and not detailed specification of the full joint distribution. Varin et al. (2011) call this feature robustness of composite likelihood. In our case, only a two-dimensional or a four-dimensional Gaussian assumption is required for \(pl_1\) and \(pl_2\), respectively. In this sense, \(pl_1\) is more robust than \(pl_2\).

  • Throughout the study, we assume the isotopic case (Wackernagel 2003). Under this setting, each component of the MGRF is observed at the same location sites. In the heterotopic case, each random field \(Z_i (\cdot )\), \(i=1,\ldots ,p\), is observed at points \(\varvec{s}_{ni}\), \(n=1,\ldots ,N_i\), at possibly different location sites. In this case, a slight modification of the index set \(\Lambda _{1}\) in \(pl_1\) is needed in order to take into account the different location sites and the different number of locations. Note that \(pl_2\) is not defined in this case unless \(N_i=N_j\) for \(i \ne j\).
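The two weight functions in Eqs. (14) and (15) can be sketched as follows. The function names are hypothetical, and the value returned at distance zero by `smooth_weight` is the limit of the expression in Eq. (15) as \(h_{ijmn}\rightarrow 0\), which equals 1.

```python
import math

def cutoff_weight(dist, kappa):
    """Eq. (14): 1 if dist / kappa <= 1, and 0 otherwise."""
    return 1.0 if dist / kappa <= 1.0 else 0.0

def smooth_weight(dist, kappa):
    """Eq. (15): compactly supported smooth weight with support kappa."""
    h = dist / kappa
    if h > 1.0:
        return 0.0
    if h == 0.0:
        return 1.0  # limit of the expression below as h -> 0
    a = 2.0 * math.pi * h
    # (1 - cos(a)) / (pi * a) equals (1 - cos(2 pi h)) / (2 pi^2 h)
    return (1.0 - h) * math.sin(a) / a + (1.0 - math.cos(a)) / (math.pi * a)

# Hypothetical support kappa = 0.1
w_near = cutoff_weight(0.05, 0.1)
w_far = cutoff_weight(0.20, 0.1)
v_mid = smooth_weight(0.05, 0.1)
```

With either choice, pairs of sites farther apart than \(\kappa _{ij}\) receive zero weight and can be skipped entirely, which is the source of the computational savings.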

In the Supplementary material, we show that the maximum CL estimators are consistent and asymptotically normal under increasing domain asymptotics. Under these results, the inverse of the Godambe information matrix \(G_{a}\), defined through

$$\begin{aligned} G_{{a}}(\varvec{\theta })=H_{{a}}(\varvec{\theta })J_{{a}}(\varvec{\theta })^{-1}H_{{a}}(\varvec{\theta })^T , \quad a=1,2, \end{aligned}$$
(16)

where

$$\begin{aligned} H_{{a}}(\varvec{\theta })=-{\mathrm {E}}[\nabla ^{2}pl_{{a}}(\varvec{\theta })],\quad J_{{a}}(\varvec{\theta })={\mathrm {E}}[\nabla pl_{{a}}(\varvec{\theta })\nabla pl_{{a}}(\varvec{\theta })\,^T ], \quad a=1,2, \end{aligned}$$
(17)

is the asymptotic variance of the \(pl_a\), \(a=1,2\) estimators. In the Supplementary material, we offer closed-form expressions for the matrices \(H_{{a}}(\varvec{\theta })\) and \(J_{{a}}(\varvec{\theta })\), for \(a=1,2\).

Computation of the standard errors requires consistent estimation of the matrices \(H_{{a}}(\varvec{\theta })\) and \(J_{{a}}(\varvec{\theta })\), for \(a=1,2\). This can be achieved through the plug-in estimates \(H_a(\widehat{\varvec{\theta }}_a)\) and \(J_a(\widehat{\varvec{\theta }}_a)\), \(a=1,2\). Nevertheless, the latter becomes computationally infeasible for large data sets since it is of order \(O((Np)^4)\). In order to estimate \(J_a(\widehat{\varvec{\theta }}_a)\), we extend to the multivariate case the subsampling method described in Heagerty and Lumley (2000). Provided that \(W^{-1} J_a(\widehat{\varvec{\theta }}_a)\) converges to a matrix \(J^*_{a} \) as \(N \rightarrow \infty \), where W is the sum of the weights involved in the estimation, we use the subsampling method in order to obtain an estimate \(\widehat{J}^*_a\) of \(J^*_{a} \) and then estimate \(J_a(\widehat{\varvec{\theta }}_a)\) through \(W \widehat{J}^*_a\). Given \(S_1,\ldots ,S_{M}\) subsets of \(\{1,2, \ldots , N\}\), associated with the observation points \(\{\varvec{s}_1,\varvec{s}_2, \ldots , \varvec{s}_N\}\), the estimator is defined as

$$\begin{aligned} \widehat{J}^*_a =\frac{1}{M} \sum _{k=1}^{M} \frac{1}{W_a^{(k)}} {\mathop {\mathop {\sum }\limits _{(i,j,m,n)\in \Lambda _a}}\limits _{(i',j',m',n')\in \Lambda _a}} [\nabla pl_a]_{ijmn} [\nabla pl_a]_{i'j'm'n'} ^{\intercal } \, w_{ijmn} w_{i'j'm'n'},\quad a=1,2, \end{aligned}$$
(18)

where \(W_a^{(k)} =\sum _{(i,j,m,n)\in \Lambda _a} w_{ijmn}\). Here, the pairs \((i,j)\) and \((i',j')\) belong to \(S_k\), and the analytic expression of \([\nabla pl_a]_{ijmn}\), \(a=1,2\), is given in the Supplementary material. The subsets are obtained by gathering the points that fall in a collection of overlapping subregions with the same shape as the region of observations but of smaller volume (Lee and Lahiri 2002). The asymptotic covariance matrix of \(\widehat{\varvec{\theta }}_a\) can then be estimated by using the subsampling approximation

$$\begin{aligned} \widehat{G_a^{-1}}({\widehat{\varvec{\theta }}_a})= {H}^{-1}_a(\widehat{\varvec{\theta }}_a) (W\widehat{J}^*_a){H}^{-1}_a(\widehat{\varvec{\theta }}_a), \end{aligned}$$
(19)

and the standard error of each parameter is computed by taking the square root of the corresponding diagonal element of \(\widehat{G_a^{-1}}({\widehat{\varvec{\theta }}_a})\). As outlined in Bevilacqua et al. (2012), computational reasons can drive the choice of the number of subsets M. For instance, if M grows at a rate \({\mathcal {O}}((Np)^{2/3})\), evaluation of \(J^*_a\) implies an order of computation \({\mathcal {O}}((Np)^2)\), which is the same as that of \(H_a\) and \(pl_a\).
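Once estimates of \(H_a\) and of \(J_a\) (for instance \(W\widehat{J}^*_a\)) are available, the sandwich form in Eq. (19) is a small matrix computation. The sketch below is only an illustration for low-dimensional parameter vectors, with hypothetical helper names and toy input matrices; it does not implement the subsampling step itself.

```python
import math

def inverse(M):
    """Inverse of a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(M)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(M)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if abs(aug[piv][c]) < 1e-14:
            raise ValueError("matrix is singular")
        aug[c], aug[piv] = aug[piv], aug[c]
        d = aug[c][c]
        aug[c] = [v / d for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def matmul(A, B):
    """Product of two small dense matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def sandwich_se(H, J):
    """Square roots of the diagonal of H^{-1} J H^{-1}, as in Eq. (19)."""
    Hinv = inverse(H)
    V = matmul(matmul(Hinv, J), Hinv)
    return [math.sqrt(V[i][i]) for i in range(len(V))]

# Toy matrices (hypothetical values, two parameters)
se = sandwich_se([[2.0, 0.0], [0.0, 4.0]], [[8.0, 0.0], [0.0, 16.0]])
```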

Finally, as in the classical likelihood framework, model selection based on CL is feasible. Model selection is a major problem in multivariate spatial statistics, since when increasing the number of components of the vector random field, the complexity of the associated covariance models grows quickly, and the choice of a parsimonious model can be troublesome. Following Varin and Vidoni (2005), the CL information criterion (CLIC) selects the model maximizing

$$\begin{aligned} \text {CLIC}(\widehat{\varvec{\theta }}_a)= pl_a(\widehat{\varvec{\theta }}_a)-\text {tr}(J_{{a}}(\widehat{\varvec{\theta }}_a)H_{{a}}(\widehat{\varvec{\theta }}_a)^{-1}), \end{aligned}$$
(20)

where the estimation of \(J_{{a}}(\widehat{\varvec{\theta }}_a)\) has been discussed previously.

4 Simulation Study

The aim of this section is to compare the performance of \(pl_1\) and \(pl_2\) with that of the ML estimation method in terms of statistical and computational efficiency. Multivariate CT, described in Sect. 2, is used as a benchmark. We work in the bivariate case (\(p=2\)) and, in particular, we consider N location sites uniformly distributed on the square \([0,1]^2\). Since we want to compare \(pl_a\), \(a=1,2\), with ML, we choose a relatively small number of sites, fixing \(N=300\), so that we are considering 600 observations. The maximum distance between the location sites is \(\kappa _{max}= 1.339\).

We consider two bivariate covariance models, the former being obtained through Eqs. (4) and (5), fixing \(\nu _{ij}=0.5\); this choice corresponds to the bivariate exponential model, defined as

$$\begin{aligned} C(\varvec{h},\varvec{\theta })=\left[ \rho _{ij}\sigma _i\sigma _j e^{-\frac{\Vert \varvec{h} \Vert }{\psi _{ij}}}\right] _{i,j=1}^2, \end{aligned}$$
(21)

where \(\rho _{ii}=1\) and \(\varvec{\theta }=(\rho _{12},\psi _{11},\psi _{12},\psi _{22},\sigma ^2_1,\sigma ^2_2)^T\). Here, \(\sigma ^2_i>0\) and \(\psi _{ii}>0\), \(i=1,2\), are, respectively, the variance and scale parameters, while \(\psi _{12}>0\) and \(\rho _{12}\) are, respectively, the cross-scale and the correlation parameter that describes the marginal correlation between the components.

The second model is a special case of the bivariate LMC model in Eq. (2):

$$\begin{aligned} C(\varvec{h};\varvec{\theta })= \left[ \sum _{k=1}^2 \alpha _{ik}\alpha _{jk}e^{-\frac{\Vert \varvec{h} \Vert }{\beta _k}}\right] _{i,j=1}^2 , \end{aligned}$$
(22)

where \(\beta _k>0\), \(k=1,2\), and \([\alpha _{lm}]_{l,m=1}^2\) is a matrix of rank 2.

Multivariate CT is performed considering a separable bivariate taper function, given by \([T(\varvec{h},b)]_{i,j=1}^2\) (Bevilacqua et al. 2016), where \(T(\varvec{h},b)\) is a univariate compactly supported correlation function of the Wendland type, i.e., \(T(\varvec{h}; b)=\left( 1+4\frac{\Vert \varvec{h} \Vert }{b}\right) \left( 1-\frac{\Vert \varvec{h} \Vert }{b}\right) ^{4}_+\). This is a special case of a general class of tapers proposed in Daley et al. (2015). In this case, the taper matrix is given by \(\varvec{T}(b)= \mathbf{1} \otimes W \) where \(\mathbf{1}\) is the \(2 \times 2\) matrix of ones and W is the \(N \times N\) sparse correlation matrix associated with the Wendland function. In the simulation study, the compact support b has been fixed in order to achieve 2.5% of nonzero values in the tapered matrices.
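A minimal sketch of the separable taper is given below: it evaluates the Wendland function, builds the matrix W for a set of planar sites, forms \(\mathbf{1}\otimes W\) and reports the fraction of nonzero entries. The sites, the support value and the helper names are hypothetical, and dense lists are used only for readability; in practice a sparse matrix format would be used.

```python
import math

def wendland(dist, b):
    """T(h; b) = (1 + 4 h / b) * (1 - h / b)^4 for h < b, and 0 otherwise."""
    x = dist / b
    return (1.0 + 4.0 * x) * (1.0 - x) ** 4 if x < 1.0 else 0.0

def taper_matrix(sites, b):
    """N x N matrix W of Wendland correlations between planar sites."""
    return [[wendland(math.hypot(s[0] - t[0], s[1] - t[1]), b) for t in sites]
            for s in sites]

def kron_ones(W, p=2):
    """Kronecker product of the p x p matrix of ones with W."""
    n = len(W)
    return [[W[r % n][c % n] for c in range(p * n)] for r in range(p * n)]

def fraction_nonzero(M):
    """Share of nonzero entries, used to tune the support b."""
    total = sum(len(row) for row in M)
    return sum(1 for row in M for v in row if v != 0.0) / total

sites = [(0.0, 0.0), (0.05, 0.0), (1.0, 1.0)]   # hypothetical sites
W = taper_matrix(sites, 0.1)                     # hypothetical support b = 0.1
T = kron_ones(W)
sparsity = fraction_nonzero(T)
```

In a setting like the one described above, b could be tuned by evaluating `fraction_nonzero` over a grid of candidate supports until the target share of nonzero entries is reached.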

4.1 Statistical Efficiency Comparison

Comparison in terms of statistical efficiency is assessed through the sample relative mean squared error, defined through

$$\begin{aligned} \text {RE}_a(\varvec{\theta }_i)=\frac{\mathrm{MSE}_{\text {ML}}(\varvec{\theta }_i)}{\mathrm{MSE} _a(\varvec{\theta }_i)}, \quad a=1,2,CT \end{aligned}$$
(23)

for \(i=1,\ldots ,k\) where \(\mathrm{MSE}_{\text {ML}}(\varvec{\theta }_i)\), \(\mathrm{MSE} _{CT}(\varvec{\theta }_i)\) and \(\mathrm{MSE} _a(\varvec{\theta }_i)\) are the sample mean squared error of ML, CT and \(pl_a\), \(a=1,2\), respectively, associated with the generic parameter \(\varvec{\theta }_i\).

As global measure of relative efficiency, we consider

$$\begin{aligned} \text {GRE}_a(\varvec{\theta },k)=\left( \frac{\det (V_{\text {ML}}(\varvec{\theta }))}{\det (V_a(\varvec{\theta }))} \right) ^{1/k}, \quad a=1,2,CT \end{aligned}$$

where \(V_{\text {ML}}\), \(V_{\text {CT}}\) and \(V_a\), \(a=1,2\), are the sample variance–covariance matrices of ML, CT and \(pl_a\), respectively, k is the number of parameters involved in the estimation, and \(\det (F)\) is the determinant of the matrix F.
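The two efficiency measures above can be computed from the simulated estimates as in the following minimal sketch. The function names and the synthetic estimates are hypothetical: the competing estimator is constructed so that its errors are exactly twice those of ML, which means both measures should come out close to 0.25.

```python
import random
import statistics

def mse(estimates, truth):
    """Sample mean squared error of scalar estimates around the true value."""
    return statistics.fmean((e - truth) ** 2 for e in estimates)

def sample_cov(samples):
    """Sample variance-covariance matrix of a list of k-dimensional estimates."""
    n, k = len(samples), len(samples[0])
    means = [statistics.fmean(s[j] for s in samples) for j in range(k)]
    return [[sum((s[i] - means[i]) * (s[j] - means[j]) for s in samples) / (n - 1)
             for j in range(k)] for i in range(k)]

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    k, d = len(a), 1.0
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        if a[p][c] == 0.0:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, k):
            f = a[r][c] / a[c][c]
            for j in range(c, k):
                a[r][j] -= f * a[c][j]
    return d

def relative_efficiency(ml, other, truth):
    """RE per parameter: MSE of ML divided by MSE of the competing estimator."""
    return [mse([s[i] for s in ml], truth[i]) / mse([s[i] for s in other], truth[i])
            for i in range(len(truth))]

def global_relative_efficiency(ml, other):
    """GRE: ratio of the determinants of the sample covariances, to the power 1/k."""
    k = len(ml[0])
    return (det(sample_cov(ml)) / det(sample_cov(other))) ** (1.0 / k)

# Hypothetical estimates of k = 2 parameters over 500 replications.
truth = [0.5, 1.0]
rng = random.Random(0)
ml = [[truth[0] + rng.gauss(0, 0.1), truth[1] + rng.gauss(0, 0.2)] for _ in range(500)]
other = [[truth[j] + 2.0 * (s[j] - truth[j]) for j in range(2)] for s in ml]
re = relative_efficiency(ml, other, truth)
gre = global_relative_efficiency(ml, other)
print(re, gre)
```

Values below 1 indicate a loss of efficiency with respect to ML, which is the reading used for the tables in this section.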

Table 1 Scenario A: \(\text {RE}_a(\varvec{\theta }_i)\) and \(\text {GRE}_a(\varvec{\theta },5)\) for the parameters of the bivariate exponential model with constraints, for increasing marginal correlation between the two random fields, when considering \(pl_a\) and its weighted version \(pl_a^w\), \(a=1,2\), and covariance tapering (\(l_{CT}\)) estimation methods.

Fig. 1 Boxplots of the bivariate exponential model (scenario A) estimates using maximum likelihood (ML), covariance tapering (CT) and CL using (\(pl_a^w\)) and not using (\(pl_a\)) (\(a=1,2\)) compact support weight functions when the correlation between the components is 0.5.

We consider three scenarios (denoted “A,” “B,” “C”), and for each scenario, we simulate 1000 realizations from a bivariate zero mean GRF, using Cholesky decomposition. Then, we estimate using ML, CT and both weighted and unweighted \(pl_a\), \(a=1,2\). Let us give more details about the indicated scenarios:

  (A) The model in Eq. (21) with constraints on the scale parameters. Specifically, we set \(\sigma ^2_1=\sigma _2^2=1\), \(\psi _{ii}=\frac{pr_{ii}}{3}\), \(i=1,2\), \(\psi _{12}=\frac{\psi _{11}+\psi _{22}}{2}\) and \(\rho _{12}=0.1,0.5,0.9\), where \(pr_{11}=0.2\) and \(pr_{22}=0.15\) are the practical ranges for the first and second components. In this case, \(\varvec{\theta }=(\rho _{12},\psi _{11},\psi _{22},\sigma ^2_1,\sigma ^2_2)^T\) and we estimate through CT and \(pl_a\), for \(a=1,2\), with the following weight functions:

    • For \(pl_1\), we choose \(w_{iimn}=u_{mn}(\kappa _{ii}) \) with \(\kappa _{ii}=\frac{pr_{ii}}{2} \), \(i=1,2\) and \(w_{12mn}=v_{mn}(\kappa _{12}) \) with \(\kappa _{12}=0.5(\kappa _{11}+\kappa _{22})\).

    • For \(pl_2\), we choose \(w_{12mn}=u_{mn}(\kappa )\) with \(\kappa =\frac{\kappa _{11}+\kappa _{22}+\kappa _{12}}{3} \) and for CT we fix \(b=\kappa \).

Note that in \(pl_1\) we choose the marginal compact supports as a fraction of the practical ranges, and for \(pl_2\) and CT, we fix the compact support equal to the practical range. Table 1 shows \(\text {RE}_a(\varvec{\theta }_i)\), \(i=1,\ldots ,5\) and \(\text {GRE}_a(\varvec{\theta },5)\) for \(a=1,2,CT\) when \(\rho _{12}=0.1,0.5,0.9\). In the same table, we report the results for unweighted \(pl_1\) and \(pl_2\).

Note that, when increasing the correlation between the random fields from 0.1 to 0.9, the relative efficiency of the colocated and scale parameters changes substantially for \(pl_1\), but not for \(pl_2\). By contrast, the efficiencies of the variance parameters are not affected by the correlation. Table 1 also shows that using compactly supported weight functions is a good choice for both methods when looking at the global measure of relative efficiency. Nevertheless, the relative efficiency of the variance parameters does not improve when considering the weighted version. Overall, \(pl_2\) improves on the global relative efficiency of \(pl_1\) by approximately 10 percent. CT performs satisfactorily, in particular for \(\rho _{12}\) and the variance parameters, but \(pl_2\) outperforms CT in terms of the global measure of efficiency. Figure 1 depicts the boxplots of each parameter for ML, CT and weighted and unweighted \(pl_a\), \(a=1,2\).
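The simulation step for scenario A can be sketched as follows. The sketch assumes that Eq. (21) has the form \(C_{ij}(\varvec{h})=\rho _{ij}\sigma _i\sigma _j e^{-\Vert \varvec{h} \Vert /\psi _{ij}}\) with \(\rho _{ii}=1\), which is an assumption of this example rather than a restatement of the equation. The number of sites, the seed and the function names are hypothetical, and a hand-written Cholesky routine is used only to keep the example self-contained.

```python
import math
import random

def scenario_a_covariance(sites, rho12, pr11=0.2, pr22=0.15):
    """2N x 2N covariance, assuming C_ij(h) = rho_ij * exp(-|h| / psi_ij) and unit variances."""
    psi11, psi22 = pr11 / 3.0, pr22 / 3.0          # scale = practical range / 3
    psi12 = (psi11 + psi22) / 2.0                  # constraint on the cross scale
    psi = [[psi11, psi12], [psi12, psi22]]
    rho = [[1.0, rho12], [rho12, 1.0]]
    n = len(sites)
    cov = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i in range(2):
        for j in range(2):
            for m in range(n):
                for k in range(n):
                    h = math.dist(sites[m], sites[k])
                    cov[i * n + m][j * n + k] = rho[i][j] * math.exp(-h / psi[i][j])
    return cov

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive definite matrix a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = a[r][c] - sum(low[r][t] * low[c][t] for t in range(c))
            low[r][c] = math.sqrt(s) if r == c else s / low[c][c]
    return low

def simulate(low, rng):
    """One zero-mean realization z = L e, with e a vector of independent standard normals."""
    e = [rng.gauss(0.0, 1.0) for _ in low]
    return [sum(low[r][t] * e[t] for t in range(r + 1)) for r in range(len(low))]

# Hypothetical setting: 40 sites on the unit square, colocated correlation 0.5.
rng = random.Random(1)
sites = [(rng.random(), rng.random()) for _ in range(40)]
cov = scenario_a_covariance(sites, rho12=0.5)
low = cholesky(cov)
fields = [simulate(low, rng) for _ in range(5)]   # the study uses 1000 realizations
print(len(fields), len(fields[0]))
```

The Cholesky factor is computed once and reused for every realization, so the cost of each additional draw is a single triangular matrix-vector product.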

Table 2 Scenario B: \(\text {RE}_a(\varvec{\theta }_i)\) and \(\text {GRE}_a(\varvec{\theta },5)\) for the parameters of the LMC model with constraints, for increasing marginal correlation between the two random fields, when considering \(pl_a\) and its weighted version \(pl_a^w\), \(a=1,2\), and covariance tapering (\(l_{CT}\)) estimation methods.
  (B) The model in Eq. (22), with \(\beta _1=0.2/3\), \(\beta _2=0.15/3\), \(\alpha _{ii}=\sqrt{1-x}\), \(i=1,2\), and the constraint \(\alpha _{12}=\alpha _{21}=\sqrt{x}\), with \(x=0.003,0.067,0.29\). Under this setting, the bivariate random field has unit variance, with marginal practical ranges equal to \(pr_{11}=0.2\) and \(pr_{22}=0.15\) and correlation between the components approximately equal to 0.1, 0.5, 0.9 when \(x=0.003, 0.067,0.29\), respectively. In this case, \(\varvec{\theta }=(\alpha _{11},\alpha _{12},\alpha _{22},\beta _1,\beta _2)^T\). We estimate with CT and \(pl_a\), \(a=1,2\), using the same weights and the same compact support b in the CT method as in scenario A. Table 2 shows \(\text {RE}_a(\varvec{\theta }_i)\), \(i=1,\ldots ,5\) and \(\text {GRE}_a(\varvec{\theta },5)\) for \(a=1,2,CT\) when \(\rho _{12}=0.1,0.5,0.9\).

    Note that, when increasing the correlation between the random fields from 0.1 to 0.9, the relative efficiency of \(\alpha _{12}\) and of the scale parameters \(\beta _1\) and \(\beta _2\) changes substantially for \(pl_1\), but not for \(pl_2\). By contrast, the efficiencies of the \(\alpha _{11}\) and \(\alpha _{22}\) parameters are not affected by the correlation. Table 2 shows that, also in this scenario, using compactly supported weight functions is a good choice for both methods from a global relative efficiency point of view. As in scenario A, \(pl_2\) improves on the global relative efficiency of \(pl_1\) by approximately 10 percent. The relative efficiency of CT is good overall, in particular when estimating \(\alpha _{ii}\), \(i=1,2\). Also in this scenario, \(pl_2\) outperforms CT in terms of global relative efficiency.
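The unit variances and the colocated correlations quoted for scenario B can be checked directly from Eq. (22) evaluated at \(\varvec{h}=\varvec{0}\), where every exponential factor equals 1. With \(\alpha _{11}=\alpha _{22}=\sqrt{1-x}\) and \(\alpha _{12}=\alpha _{21}=\sqrt{x}\), a short worked computation gives

$$\begin{aligned} C_{ii}(\varvec{0})&=\alpha _{i1}^2+\alpha _{i2}^2=(1-x)+x=1, \quad i=1,2, \\ C_{12}(\varvec{0})&=\alpha _{11}\alpha _{21}+\alpha _{12}\alpha _{22}=2\sqrt{x(1-x)}, \end{aligned}$$

so that \(2\sqrt{x(1-x)}\approx 0.109\), 0.500 and 0.908 for \(x=0.003\), 0.067 and 0.29, respectively, in line with the approximate correlations 0.1, 0.5 and 0.9.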

We replicate the simulation study using unconstrained versions of the covariance models in Eqs. (21) and (22), and the conclusions are very similar to those associated with scenarios A and B.

  (C) The asymmetric model defined in Eq. (6) with \(p=2\) and \(C^*_{ij}({\varvec{h}},\cdot )\) equal to the separable bivariate exponential model, obtained by setting \(\psi =\psi _{ij}\) for \(i,j=1,2\) in scenario (A). As outlined in Li and Zhang (2011), \({\varvec{a}}_1\) and \({\varvec{a}}_2\) are not jointly identifiable, so we set \({\varvec{a}}_1=[0,0]^T\) and \({\varvec{a}}_2=[k,k]^T\). We set \(\sigma _1^2=\sigma _2^2=1\), \(\psi =0.1/3\), \(k=-0.1\) and consider different colocated correlations, namely \(\rho _{12}=0.15, 0.25, 0.35\). In this case, \(\varvec{\theta }=(k,\rho _{12},\psi ,\sigma ^2_1,\sigma ^2_2)^T\), but we note some numerical evidence of multimodality of the CL surface when estimating \(\varvec{\theta }\). For this reason, we fix the colocated parameter and then estimate with ML, CT and \(pl_a\), \(a=1,2\), using the same weights and the same compact support b of CT as in scenario (A) but with \(d=\kappa _{max}/3\). Table 3 shows \(\text {RE}(\varvec{\theta }_i)\), \(i=1,2,3,4\) and \(\text {GRE}(\varvec{\theta },4)\) for CT, weighted and unweighted \(pl_a\), \(a=1,2\). Note that when increasing the correlation, \(pl_1\) performs better than \(pl_2\) when estimating the scale parameter, whereas the relative efficiency associated with the asymmetry parameter is better for \(pl_2\). The global measure of efficiency suggests that \(pl_1\) performs very similarly to \(pl_2\) for this specific model. Finally, CT is very inefficient when estimating the asymmetry parameter. This suggests that the use of symmetric bivariate tapers, as proposed in Daley et al. (2015), can be very inefficient when estimating asymmetric models.

Table 3 Scenario C: \(\text {RE}_a(\varvec{\theta }_i)\) and \(\text {GRE}_a(\varvec{\theta },4)\) for the parameters of the bivariate separable asymmetric exponential model with marginal correlation between the two random fields equal to 0.15, 0.25, 0.35 when considering \(pl_a\) and its weighted version \(pl_a^w\), \(a=1,2\), and covariance tapering (\(l_{CT}\)) estimation methods.

We can summarize our findings as follows:

  • The use of compactly supported weight functions significantly improves the global statistical relative efficiency of \(pl_a\), \(a=1,2\).

  • \(pl_2\) generally outperforms \(pl_1\) in terms of global statistical relative efficiency.

  • CT shows in general good statistical relative efficiency, in particular for some specific parameters. Nevertheless, \(pl_2\), in its weighted version, performs slightly better than CT.

4.2 Computational efficiency comparison

In order to give an idea of the computational gains of \(pl_1\) and \(pl_2\) with respect to ML and CT, we consider \(n_w=N\times 2^w\) location sites, uniformly distributed on the square \(W_w=[0,2^{w}]\times [0,2^{w}]\), \(w=0,1,\ldots , 7\). The case \(w=0\) has been used as the simulation setting in the previous section. In Table 4, we report times in seconds (elapsed time measured with the R function proc.time) for the evaluations of the ML, CT and \(pl_a\), \(a=1,2\), functions, considering the unweighted and weighted versions of both methods. Results in Table 4 have been obtained using an upcoming version of the package CompRandFld (Padoan and Bevilacqua 2015), where ML, CT and \(pl_a\), \(a=1,2\), have been implemented. The experiment was run on a 2.4 GHz processor with 16 GB of memory.

The compact support of the weight functions and the bivariate taper has been fixed equal to 0.3. Computational advantages are clear for both types of CL when increasing the number of location sites. In particular, when increasing the number of observations, \(pl_1\) slightly outperforms \(pl_2\), especially in the unweighted version. Considerable computational gains are obtained when \(pl_1\) and \(pl_2\) are evaluated with compactly supported weights. For instance, when \(w=5\), \(pl_1\) and \(pl_2\) are, respectively, 5800 and 20600 times faster than the standard likelihood.

For CT, we use the sparse matrix implementation in the R package spam (Furrer and Sain 2010). The spam package allows users to separate the structural and numerical computations needed for the Cholesky factorization, so that, for a given sparsity structure, the full factorization needs to be done only once. This can save considerable time and memory when the tapered likelihood function is evaluated repeatedly. The time in Table 4 is associated with the computation of the Cholesky factor, the log determinant of the covariance matrix and the quadratic form in (5), given a fixed sparsity structure. In Table 4, we also report the percentage of nonzero values in the covariance-tapered matrices. Although in our experiment the covariance-tapered matrices are highly sparse (0.03% of nonzero elements when \(w=5\), for instance), the \(pl_a\), \(a=1,2\), methods are clearly computationally preferable to the CT approach when increasing the number of location sites. Further computational gains in the CT approach could be obtained by tapering the covariance matrix only (Furrer et al. 2016), although this choice leads to a biased estimator. Note that the times for ML and CT when \(w=6,7\) are not reported, since evaluations of the ML and CT functions in these cases take a very long time on our machine. Our results are consistent with those of Bevilacqua and Gaetan (2015), where a comparison between CL based on pairs and CT is performed in the univariate case.

Since \(pl_a\), \(a=1,2\), are highly amenable to parallelization, further computational gains can be obtained using parallel computing as, for instance, in Eidsvik et al. (2014). The package CompRandFld uses a sequential implementation of \(pl_a\), \(a=1,2\), so the results in Table 4 could potentially be improved considerably. Finally, a clear advantage of \(pl_a\), \(a=1,2\), with respect to ML and CT is in terms of memory storage, as they require only \(2 \times 2\) or \(4 \times 4\) covariance matrices.
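To illustrate why the memory footprint is so small, the following minimal sketch evaluates a generic weighted pairwise Gaussian log-likelihood. It is not the \(pl_1\) or \(pl_2\) function of Sect. 3: it assumes a zero-mean univariate field with an exponential covariance and a cut-off weight equal to 1 within the compact support and 0 outside, and all names and values are hypothetical. Each term involves only a \(2 \times 2\) covariance matrix, written out in closed form.

```python
import math
import random

def pair_loglik(z1, z2, var, cov):
    """Log-density of a zero-mean Gaussian pair with common variance var and covariance cov."""
    det = var * var - cov * cov                      # determinant of the 2 x 2 covariance
    quad = (var * z1 * z1 - 2.0 * cov * z1 * z2 + var * z2 * z2) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def weighted_pairwise_loglik(sites, z, var, scale, cutoff):
    """Sum of pairwise log-densities over pairs of sites at distance at most cutoff."""
    total = 0.0
    for m in range(len(sites)):
        for k in range(m + 1, len(sites)):
            h = math.dist(sites[m], sites[k])
            if h <= cutoff:                          # cut-off weight: 1 inside the support, 0 outside
                total += pair_loglik(z[m], z[k], var, var * math.exp(-h / scale))
    return total

# Hypothetical data: 300 sites on the unit square with placeholder observations.
rng = random.Random(2)
sites = [(rng.random(), rng.random()) for _ in range(300)]
z = [rng.gauss(0.0, 1.0) for _ in sites]
value = weighted_pairwise_loglik(sites, z, var=1.0, scale=0.2 / 3.0, cutoff=0.1)
print(value)
```

No \(N \times N\) matrix is ever formed, and the terms of the double loop are independent of each other, which is what makes this kind of objective straightforward to split across workers.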

Table 4 Time in seconds for evaluating the standard likelihood l(.), \(pl_1\) and \(pl_2\) using and not using compactly supported weight functions with compact support equal to 0.3 and \(l_{CT}(0.3)\) with associated percentage of nonzero values in the covariance-tapered matrices for \(w=0,1,\ldots ,7\). When \(w=6,7\), time in seconds for l(.) and \(l_{CT}(0.3)\) is not reported for computational reasons.

5 A real data example

In marine ecosystems, rising atmospheric CO2 and climate change are associated with concurrent shifts in temperature, circulation, stratification, nutrient input, oxygen content and ocean acidification, with potentially wide-ranging biological effects (Doney et al. 2012). An association of great interest is that between shifts in temperature and the presence of nutrients in the ocean, because it helps to identify changing trends in global marine phytoplankton.

This section analyzes satellite ocean data from MODIS/NASA. Specifically, we analyze the monthly average of chlorophyll concentration (micrograms per liter \((\upmu \mathrm{g/L})\)) and sea surface temperature \((^{\circ }\mathrm{K})\) during March 2011 in the north of Chile, observed on a regularly spaced grid with a resolution of 4 km, for a total of 35295 location sites, i.e., 70590 observations. ML and CT estimations are clearly infeasible in this case. It is well known that there is a negative correlation between the two variables (Behrenfeld et al. 2006; Boyce et al. 2010).

Fig. 2 Maps of residuals of chlorophyll concentration (left) and sea surface temperature (right) in the Chilean coast.

We exclude the location sites where the measurement of at least one of the two variables has not been recorded (598 location sites). We then use spline regression, as implemented in the R package mgcv (Wood 2006), with a fixed number of knots, in order to remove the cyclic pattern of both variables along the longitude and latitude directions. A preliminary analysis (see, for instance, the empirical cross-variograms in Fig. 3) suggests that the residuals can be considered as a realization of a zero mean weakly stationary and isotropic bivariate Gaussian random field. As pointed out by an anonymous referee, the empirical cross-variogram estimates the average variance of the first-order contrast, but it is not informative about possible local nonstationary behaviors. Figure 2 shows the maps of the residuals and, as expected, a local nonstationary behavior near the coast is apparent for both chlorophyll concentration and sea surface temperature, in particular at latitude \([-26,-22]\) and longitude \([-72,-71]\).

We specify the bivariate covariance model in Eq. (21) in increasing order of complexity (separable version, constrained version as in setting A of the simulation study, and unconstrained version). We then estimate through \(pl_1\), using the weight functions \(w_{11mn}=u_{11mn}(70) \), \(w_{22mn}=u_{22mn}(50) \) and \(w_{12mn}=k_{12mn}(60)\), as well as through \(pl_2\), using \(w_{12mn}=u_{12mn}(60) \). The choice of different compact supports in the weight functions for \(pl_1\) is due to the different spatial dependence of the two variables (see Fig. 3). Table 5 reports the \(pl_a\), \(a=1,2\), estimates with the associated standard errors, computed using the subsampling technique described in Sect. 3. Note that the estimates of the colocated parameter, the variances and the scale parameter(s) are very similar for both methods in each model considered. As expected, the estimate of the parameter \(\rho _{12}\), expressing the marginal correlation between chlorophyll concentration and sea surface temperature, is negative.

Table 5 also reports the value of the CL information criterion (CLIC) for \(pl_a\), \(a=1,2\), for each covariance model considered. In both cases, the model selected is Model 3. Note that the flexibility of this model makes it possible to estimate a smaller cross-scale parameter \(\alpha _{12}\) with respect to the marginal scale parameters \(\alpha _{ii}\), \(i=1,2\).

Fig. 3 Comparison of the empirical (cross-)variograms with the full exponential variogram model (Model 3) estimated with \(pl_1\). The first semivariogram is associated with chlorophyll, and the second semivariogram is associated with temperature.

Finally, Fig. 3 shows the comparison between the empirical semi(cross)-variograms and the estimated semi(cross)-variograms associated with Model 3 estimated using \(pl_2\). The comparison shows a good fit of the estimated Model 3 to the empirical (cross-)variograms.

Table 5 \(pl_a\), \(a=1,2\) estimates with associated standard errors for models 1, 2, 3 and associated CL information criteria (CLIC).

6 Concluding remarks

CL is an appealing method of estimation when dealing with large datasets. As outlined in Bevilacqua and Gaetan (2015), CL is a large class of estimating functions, and for a given estimation problem, it is not clear how to choose within this class. Some insights are given in Castruccio et al. (2016), where different types of marginal CLs are compared from a statistical efficiency viewpoint. In the Gaussian case, if the choice of the CL is driven by computational reasons only, then the CL based on pairs has clear computational advantages with respect to other types of CLs. For this reason, we consider this type of CL when estimating multivariate Gaussian random fields. In particular, in this paper we have proposed two possible approaches, based on weighted composite likelihood methods, for the estimation of MGRF. The first is based on cross-pairwise likelihood, while the second is based on pairs of marginal pairwise likelihoods. Through numerical examples, we have compared the two methods using maximum likelihood and covariance tapering as benchmarks when estimating different types of multivariate covariance models. The second version generally outperforms the first from a statistical efficiency point of view, but it cannot be used in the heterotopic case.

The numerical examples show that the weighted versions perform better from a statistical efficiency point of view and offer considerable computational gains with respect to the unweighted versions. CT is a good competitor of both methods; nevertheless, our methods are clearly preferable in terms of computational complexity when considering very large datasets. The benefit of our methods in terms of memory requirements with respect to ML and CT is apparent, since they require the storage of only very small covariance matrices. Moreover, since our methods are highly amenable to parallelization, further computational gains can be achieved using parallel computing techniques as described, for instance, in Lee et al. (2010) and Suchard et al. (2010).

Our methods can be easily extended to the estimation of multivariate space-time covariance models when considering fully symmetric covariance models. For the estimation of asymmetric covariance models (Stein 2005; Gneiting et al. 2007), asymmetry in time should be taken into account in CL following the lines of Sect. 3. Finally, since our methods require model assumptions only on the lower-dimensional marginal densities, they can be very useful for multivariate non-Gaussian random field estimation where, typically, the likelihood function does not have a simple form, as in Genton et al. (2015).