1 Introduction

A very rich collection of market models has been developed and thoroughly investigated for intraday financial data. Although univariate modeling is important for addressing certain kinds of problems, it is not enough to unveil the nature and dynamics of the financial market. Interactions between different financial instruments are left out of univariate studies. Such interactions can arise in a systematic way if the companies belong to related business sectors, are affected by the same socio-political conditions, or are owned by the same business house. High dependence between several constituent stocks of a portfolio can increase the probability of a large loss, so accurate estimation of the dependence between assets is of paramount importance. Correlation dynamics models, therefore, have become an important aspect of the theory and practice of finance. Correlation trading, a trading activity that exploits changes in the dependence structure of financial assets, and correlation risk, which captures the exposure to losses due to changes in correlation, have attracted the attention of many practitioners, see Krishnan et al. (2009).

Beyond these direct applications, accurate modelling of dependence is also important, indirectly, in a range of practical scenarios. For example, basket options are widely used because they are cheaper for portfolio insurance, although their accurate pricing is challenging; the cost-saving relies on the dependence structure between the assets, see Salmon et al. (2006). In the actuarial world, as shown in Embrechts et al. (2002), Monte Carlo-based approaches to joint modelling of risks, like Dynamic Financial Analysis, depend heavily on the dependence structure. Frey and McNeil (2002) and Breymann et al. (2003) showed that the choice of model and correlation have a significant impact on the tail of the loss distribution and on measures of extreme risks.

It follows from the above discussion that we need accurate multivariate modeling and analysis. In order to perform multivariate analysis, we need multivariate data: observations of all p (≥ 2) variables on n (sufficiently large) time points. For example, in the case of daily financial data we would expect to observe the price of all p stocks on a particular day. This kind of data is called synchronously observed data. On the other hand, if we do not have observations for one or several variables (or stocks) at a particular time point, we call the data nonsynchronous or asynchronous. An example of such data is intraday stock price data: within a particular day we cannot expect to observe transactions in all stocks simultaneously. In Fig. 1, we show the transaction/arrival times of two stocks within a small time interval. This cannot be handled as a missing data problem where a fraction of observations are missing; here, it is extremely rare to observe transactions in two stocks at the same point of time.

Figure 1

Transaction times for the two stocks Facebook and Apple for the first few ticks on a particular day (10/05/2017). The scale on the x-axis is in seconds

The effect of asynchronicity on the estimation of model parameters can be quite serious. One such phenomenon, reported by Epps (1979), is called the Epps effect. Empirical results reported in that paper showed that the realized covariance between stock returns decreases as the sampling frequency increases. The same phenomenon has since been reported in several other studies on different stock markets, see Zebedee and Kasch-Haroutounian (2009), and on the foreign exchange market, see Muthuswamy et al. (2001). It has also been shown empirically, see Renò (2003), that taking into account only the synchronous, or nearly synchronous, observations alleviates this underestimation problem.

Several studies have been devoted to the estimation of the covariance from intraday data. Mancino and Sanfelici (2011) analysed the performance of the Fourier estimator originally proposed by Malliavin et al. (2009). Peluso et al. (2014) adopted a Bayesian dynamic linear model and treated asynchronous trading as missing observations in an otherwise synchronous series. Corsi and Audrino (2012) proposed two covariance estimators, adapted to the case of rounding in the price time stamps, which can be seen as a general way of computing the Hayashi-Yoshida estimator (see Hayashi et al. 2005) at a lower frequency. Zhang (2011) proposed the two-scale realized covariance estimator (TSCV), which combines two-scale sub-sampling with the previous tick method and can simultaneously remove the bias due to microstructure noise and asynchronicity. Along similar lines, the average realized volatility matrix (ARVM) has been proposed as a modification of the TSCV estimator such that no bias-correction is required for the off-diagonal elements, see Hwang and Shin (2018). Fan et al. (2012) studied TSCV in a high-dimensional setting. Aït-Sahalia et al. (2010) proposed a quasi-maximum likelihood estimator of the quadratic covariance. Attempts have also been made to resolve the problems of high-frequency data by introducing appropriate filtering techniques, see Ho and Xin (2019).

The correlation coefficient only captures linear dependence. In this work, we focus on the estimation of the non-linear dependence structure through copulas. Apart from modelling the complete dependence structure, one of the many advantages of a copula is the flexibility it offers to model complex relationships between variables in a rather simple manner. It allows us to model the marginal distributions as needed and takes care of the dependence structure separately. It is also one of the most important tools to model tail dependence, i.e., the probability of an extremely large or small return on one asset given that the other asset yielded an extremely large or small return, see Xu and Li (2009). For this reason, the copula is also a useful tool for modelling the joint distribution of default times. It is used for pricing Credit Spread Basket, Credit Debt Obligation, First to Default, N-to Default and other credit derivative baskets, see Malgrat (2013), who also shows how copulas help relate systematic risk to idiosyncratic risk. Zhang and Zhu (2016) developed a class of copula structured multivariate maxima and moving maxima processes, which is a flexible and statistically workable model to characterize joint extremes.

In this paper, we discuss in detail the impact of asynchronicity on several measures of association in a general class of copulas. We explain why there is a serious underestimation of the measures of association and demonstrate the need for careful treatment. We propose an alternative method for the estimation of correlation. Moreover, we prescribe methods for accurate estimation of the associated copula. We also show that the estimation of some commonly used measures of association, like Kendall's tau, is challenging. The rest of the paper is organized as follows. In Section 2 we deal with elliptical copula parameter estimation for nonsynchronous data and prove the main theorems. Section 3 deals with a more general class of copulas. In Sections 4 and 5 the results of simulation and real data analysis are shown. We present the conclusions in Section 6. All the proofs are given in Appendix A.

2 Estimation of Elliptical Copula

Suppose there are two stocks whose log-prices at time t ∈ (0,T) are denoted by Xt and Yt. By \({R_{t}^{1}}\) and \({R_{t}^{2}}\) we denote the corresponding log-returns. Although in the ideal world of the Black-Scholes model the log returns are assumed to follow a Gaussian distribution, the stylized facts about financial markets suggest that a distribution with a heavier tail needs to be considered. In the multivariate scenario, the search for such a model is challenging. In such situations the copula appears to be a central tool at our disposal.

In Section 4, the results of a simulation study are reported in which the effect of asynchronicity on the estimation of the correlation coefficient is shown. The simulation results display severe underestimation. Before attempting to understand the problem and propose a remedy, we present an algorithm to synchronize the data to make it suitable for standard multivariate analysis. We should note that some studies (see Hayashi et al. 2005; Buccheri et al. 2020) attempt to calculate the integrated covariance without synchronizing the data.

2.1 Pairing Method

The prices of the stocks are observed at random times when transactions take place. As a transaction in one stock would not influence the transaction time in the other, it is reasonable to assume that the observation times of the two stocks are independent point processes. Therefore, if we have the log prices of the first stock along with their times of occurrence as \((X_{i},{t_{i}^{1}}), i=1,2,\dots,n_{1}\), and those of the second stock as \((Y_{j},{t_{j}^{2}}), j=1,2,\dots,n_{2}\), then the \({t_{i}^{1}}\)'s and \({t_{j}^{2}}\)'s are independent. Here n1 and n2 are the numbers of observations of the first and second stock, respectively, available on a particular day.

Before fitting a copula model, the observations of the two stock prices need to be paired so that they can be treated as synchronously observed. The conventional synchronizing methods, like previous tick sampling, require a set (or sample) of n time points τi, i = 1(1)n, at which we would like to observe a synchronized pair. For each stock, the tick observed just prior to each such sampled time point τi is chosen to construct the synchronized pair (\(X_{\tau _{i}},Y_{\tau _{i}}\)), yielding n such pairs.
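To make the mechanics concrete, here is a minimal Python sketch of previous tick sampling (the function name and array-based interface are our own illustration, not part of the original method):

```python
import numpy as np

def previous_tick(times, prices, taus):
    """Previous tick sampling: at each sampling time tau, take the most
    recent price observed at or before tau. Assumes `times` is sorted and
    that at least one tick precedes each tau."""
    idx = np.searchsorted(times, taus, side="right") - 1
    return prices[idx]

# Example: sample one stock on a grid of times 1, 2, 3, 4 (in seconds)
times = np.array([0.4, 1.7, 2.1, 4.9])
prices = np.array([100.0, 100.2, 100.1, 100.4])
taus = np.arange(1.0, 5.0)
print(previous_tick(times, prices, taus))  # [100.  100.2 100.1 100.1]
```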

It is evident from the above discussion that the number of synchronized pairs is less than both n1 and n2, unless we allow repetition. This means that many observations of each stock will be removed and not used for further analysis. Generalized sampling times are defined as follows.

DEFINITION 1

(Aït-Sahalia et al. 2005) Suppose we have M stocks. Let \({t_{k}^{i}}\) be the k-th arrival time of the i-th asset. Then {τj : 1 ≤ j ≤ n} are called generalized sampling times if

  1. 0 = τ0 < τ1 < ... < τn = T.

  2. \((\tau _{j-1},\tau _{j}]\cap \{{t_{k}^{i}}:k=1,...,n_{i}\}\neq \varnothing \) for some i = 1,...,M.

  3. \(\max \limits _{1\leq j\leq n}\delta _{j}\rightarrow 0\) in probability, where δj = τj − τj−1.

In the above-mentioned method, an observation is uprooted from its original time point and assigned to a sampled time point τj, for some j. In contrast, we want to retain the actual times of the prices that are chosen to be paired. In other words, instead of having a pair like \((X_{\tau _{j}},Y_{\tau _{j}})\), we want to have a pair \((X_{t_{{k_{i}^{1}}}^{1}},Y_{t_{{k_{i}^{2}}}^{2}})\), where \(t_{{k_{i}^{1}}}^{1}\text { and }t_{{k_{i}^{2}}}^{2}\) are the times at which the i-th pair of stock prices was observed. To emphasise this, we call the algorithm the 'pairing method' (in contrast to a 'synchronizing method'). The pairing method, followed throughout this paper, is described by the following algorithm (A0):

Algorithm A0 (pairing method)

The pairs created by this algorithm are identical to the pairs created by "refresh time sampling" (see Barndorff-Nielsen et al. 2011), but the method accommodates more information by retaining the transaction times. Instead of writing \((X_{t_{{k_{i}^{1}}}^{1}},Y_{t_{{k_{i}^{2}}}^{2}})\) we shall henceforth write \((X_{t({k_{i}^{1}})},Y_{t({k_{i}^{2}})})\).

In Fig. 2a and b, \({t_{j}^{1}}\) and \({t_{i}^{2}}\) are paired together as (\(t({k_{l}^{1}}),t({k_{l}^{2}})\)). The figures illustrate how the next pair is chosen by the algorithm. In Fig. 2a, \(t_{j+1}^{1}<t_{i+1}^{2}\), so \(t(k_{l+1}^{2})=t_{i+1}^{2}\) and \(t(k_{l+1}^{1})\) is chosen to be the largest of the arrival times in the first stock that are less than \(t_{i+1}^{2}\). In Fig. 2b, \(t_{j+1}^{1}>t_{i+1}^{2}\), so \(t(k_{l+1}^{1})=t_{j+1}^{1}\) and \(t(k_{l+1}^{2})\) is chosen to be the largest of the arrival times in the second stock that are less than \(t_{j+1}^{1}\). The pairs are represented by the arrows. A code sketch of this pairing logic is given below.
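The following Python sketch is our own illustrative implementation of algorithm A0 as we read it from the description above (a refresh-time scheme that returns the indices, hence the actual transaction times, of the paired ticks):

```python
import numpy as np

def pair_ticks(t1, t2):
    """Pair two sorted arrays of transaction times following algorithm A0.
    Returns index pairs (i, j) so that (t1[i], t2[j]) are the times
    t(k_l^1), t(k_l^2) of the l-th pair."""
    pairs = []
    tau = max(t1[0], t2[0])  # first refresh event
    i = np.searchsorted(t1, tau, side="right") - 1
    j = np.searchsorted(t2, tau, side="right") - 1
    pairs.append((i, j))
    while i + 1 < len(t1) and j + 1 < len(t2):
        # next refresh event: first instant at which both stocks have traded again
        tau = max(t1[i + 1], t2[j + 1])
        i = np.searchsorted(t1, tau, side="right") - 1  # latest tick of stock 1 at or before tau
        j = np.searchsorted(t2, tau, side="right") - 1  # latest tick of stock 2 at or before tau
        pairs.append((i, j))
    return pairs
```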

Figure 2

Two consecutive pairs represented by arrows. Price of the first stock observed at \({t^{1}_{j}}\) (red circle) is paired with the price of the second stock observed at \({t^{2}_{i}}\) (red square). For Panel (a), price of the first stock observed at \(t^{1}_{j+2}\) (blue circle) is paired with the price of the second stock observed at \(t^{2}_{i+1}\) (blue square). In Panel (b), price of the first stock observed at \(t^{1}_{j+1}\) (blue circle) is paired with the price of the second stock observed at \(t^{2}_{i+2}\) (blue square)

2.2 Estimation of Correlation Coefficient

Suppose we have n paired observations from the algorithm A0. Now we can proceed to calculate the correlation coefficient. We denote a pair of centered and scaled log price processes by (Xt,Yt) and assume that this pair is independent of the arrival processes. Since the log returns \(X_{t({k_{i}^{1}})}-X_{t(k_{i-1}^{1})}\) and \(Y_{t({k_{i}^{2}})}-Y_{t(k_{i-1}^{2})}\) are calculated over two nonidentical time-intervals, namely \((t(k_{i-1}^{1}),t({k_{i}^{1}}))\) and \((t(k_{i-1}^{2}),t({k_{i}^{2}}))\), the correlation between the returns depends heavily on the lengths of the overlapping and non-overlapping portions of these two time-intervals. To see this, first suppose \(X_{t({k_{i}^{1}})}-X_{t(k_{i-1}^{1})}={\sum }_{j=m}^{l}(X_{t_{j+1}}-X_{t_{j}})\) for some m and l. Here {ti : i = 1(1)(n1 + n2)} is the set of combined (ordered) time points at which a transaction (in any of the stocks) is noted. Then one of the following four configurations holds:

$$ \left[ \begin{array}{rcl}1. \quad Y_{t({k_{i}^{2}})}-Y_{t(k_{i-1}^{2})}&=&{\sum}_{j=m+1}^{l-1}(Y_{t_{j+1}}-Y_{t_{j}})\\ 2. \quad Y_{t({k_{i}^{2}})}-Y_{t(k_{i-1}^{2})}&=&{\sum}_{j=m-1}^{l-1}(Y_{t_{j+1}}-Y_{t_{j}})\\ 3. \quad Y_{t({k_{i}^{2}})}-Y_{t(k_{i-1}^{2})}&=&{\sum}_{j=m+1}^{l+1}(Y_{t_{j+1}}-Y_{t_{j}})\\ 4. \quad Y_{t({k_{i}^{2}})}-Y_{t(k_{i-1}^{2})}&=&{\sum}_{j=m-1}^{l+1}(Y_{t_{j+1}}-Y_{t_{j}}) \end{array}\right] $$
(1)

See Fig. 3a and b for illustrations of the first two configurations where two consecutive pairs of log-prices are \((X_{t(k_{i-1}^{1})},Y_{t(k_{i-1}^{2})})\) and \((X_{t({k_{i}^{1}})},Y_{t({k_{i}^{2}})})\) with their corresponding transaction times \((t(k_{i-1}^{1}),t(k_{i-1}^{2}))\) and \((t({k_{i}^{1}}),t({k_{i}^{2}}))\).

Figure 3

Two consecutive pairs of log-prices \((X_{t(k_{i-1}^{1})},Y_{t(k_{i-1}^{2})})\) and \((X_{t({k_{i}^{1}})},Y_{t({k_{i}^{2}})})\). Panel (a): corresponding transaction times \((t(k_{i-1}^{1}),t(k_{i-1}^{2}))\) and \((t({k_{i}^{1}}),t({k_{i}^{2}})).\) Panel (b): corresponding transaction times \((t({k_{i}^{1}}),t({k_{i}^{2}}))\) and \((t(k_{i+1}^{1}),t(k_{i+1}^{2}))\)

DEFINITION 2

We define a random variable Ii, denoting the length of the overlapping time interval of the i-th interarrivals corresponding to \(X_{t({k_{i}^{1}})}-X_{t(k_{i-1}^{1})}\) and \(Y_{t({k_{i}^{2}})}-Y_{t(k_{i-1}^{2})}\), as

$$ I_{i}=\begin{cases} t({k_{i}^{2}})-t(k_{i-1}^{2}) & \text{if}\ Y_{t({k_{i}^{2}})}-Y_{t(k_{i-1}^{2})}={\sum}_{j=m+1}^{l-1}(Y_{t_{j+1}}-Y_{t_{j}})\\ t({k_{i}^{2}})-t(k_{i-1}^{1}) & \text{if}\ Y_{t({k_{i}^{2}})}-Y_{t(k_{i-1}^{2})}={\sum}_{j=m-1}^{l-1}(Y_{t_{j+1}}-Y_{t_{j}})\\ t({k_{i}^{1}})-t(k_{i-1}^{2}) & \text{if}\ Y_{t({k_{i}^{2}})}-Y_{t(k_{i-1}^{2})}={\sum}_{j=m+1}^{l+1}(Y_{t_{j+1}}-Y_{t_{j}})\\ t({k_{i}^{1}})-t(k_{i-1}^{1}) & \text{if}\ Y_{t({k_{i}^{2}})}-Y_{t(k_{i-1}^{2})}={\sum}_{j=m-1}^{l+1}(Y_{t_{j+1}}-Y_{t_{j}}). \end{cases} $$

For example in Fig. 3a, \(I_{i}=t({k_{i}^{2}})-t(k_{i-1}^{2})\) and for Fig. 3b, \(I_{i}=t({k_{i}^{2}})-t(k_{i-1}^{1})\).

DEFINITION 3

We define \(\hat {\theta }\) as follows:

$$ \hat{\theta} = \hat{\rho}\frac{\sqrt{m_{1}.m_{2}}}{m(I)}, $$

where, for l = 1,2, \(m_{l}=\frac {1}{n}{\sum }_{i=1}^{n}(t({k_{i}^{l}})-t(k_{i-1}^{l}))\), \(m(I)=\frac {1}{n}{\sum }_{i=1}^{n}I_{i}\), and \(\hat {\rho }\) is the sample correlation coefficient based on the pairs obtained from algorithm (A0).
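Computationally, the overlap \(I_{i}\) of Definition 2 is simply the length of the intersection of the two return intervals, i.e. the smaller right endpoint minus the larger left endpoint; all four cases reduce to this. With that observation, a minimal sketch of the estimator of Definition 3 (our own code, assuming paired data from algorithm A0):

```python
import numpy as np

def corrected_correlation(x, y, tx, ty):
    """theta_hat = rho_hat * sqrt(m1 * m2) / m(I)  (Definition 3).

    x, y  : paired log-prices X_{t(k_i^1)}, Y_{t(k_i^2)} from algorithm A0
    tx, ty: the corresponding transaction times t(k_i^1), t(k_i^2)
    """
    rx, ry = np.diff(x), np.diff(y)            # paired log-returns
    rho_hat = np.corrcoef(rx, ry)[0, 1]        # naive sample correlation
    m1 = np.mean(np.diff(tx))                  # mean paired interarrival, stock 1
    m2 = np.mean(np.diff(ty))                  # mean paired interarrival, stock 2
    # I_i: overlap of (tx[i-1], tx[i]) and (ty[i-1], ty[i])
    m_I = np.mean(np.minimum(tx[1:], ty[1:]) - np.maximum(tx[:-1], ty[:-1]))
    return rho_hat * np.sqrt(m1 * m2) / m_I
```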

With these definitions and notations we are now ready to state the first theorem.

We consider the following assumptions (\(\mathcal {A}\)):

\(\mathcal {A}_{1}\): The log price processes have independent and stationary increments, and the increments of the two processes over the same time interval have correlation 𝜃.

\(\mathcal {A}_{2}\): The observation times (arrival processes) of the two stocks are independent renewal processes, and \(n\rightarrow \infty \) as \(n_{1},n_{2}\rightarrow \infty \).

\(\mathcal {A}_{3}\): Estimation is based on paired data obtained by algorithm A0.

Theorem 1.

Under the assumptions \(\mathcal {A}_{1}-\mathcal {A}_{3},\)

  1. \(\hat {\theta }\) is a consistent estimator of the true correlation coefficient 𝜃.

  2. Moreover,

    $$ \sqrt{n}(\hat{\theta}-\theta)\stackrel{d}{\rightarrow}N(0,\gamma^{2}(1-\rho^{2})^{2}+\rho^{2}{\sigma_{0}^{2}}), $$

where \({\sigma _{0}^{2}}\), defined in Appendix A, depends only on the distribution of the arrival times and

    $$ \gamma=\frac{\sqrt{E(t({k_{2}^{1}})-t({k_{1}^{1}}))E(t({k_{2}^{2}})-t({k_{1}^{2}}))}}{E(I_{1})}. $$

Proof of Theorem 1 is given in Appendix A.

According to this theorem, in order to get a consistent estimator, we need to multiply the usual sample correlation coefficient, based on the paired observations (from algorithm A0), by a correction factor. The correction factor is a function only of \(t({k_{i}^{1}})\) and \(t({k_{i}^{2}})\) for i = 1(1)n, i.e., it depends only on the arrival processes.

2.3 Nonlinear Dependence and Elliptical Copula

So far we were dealing with linear dependence through the correlation coefficient. In this section we will deal with nonlinear dependence through copula. A copula is formally defined as follows.

DEFINITION 4

A d-dimensional distribution function C(u1,u2,...,ud) : [0,1]d \(\rightarrow [0,1]\), whose margins satisfy Cj(uj) = C(1,1,...,uj,...,1) = uj for all uj ∈ [0,1] and j = 1,...,d, is called a copula.

It is clear from the definition that a copula is a distribution function with uniform margins. There are many families of copulas, such as elliptical copulas, Archimedean copulas, vine copulas, etc. In this section, we restrict our attention to elliptical copulas. An elliptical copula has the following form

$$ C_{R}(u_{1},u_{2},...,u_{d})=F_{R}(F^{-1}(u_{1}),F^{-1}(u_{2}),...,F^{-1}(u_{d})), $$

where FR(x1,x2,...,xd) is a d-dimensional elliptical distribution function with correlation matrix R and F(.) is the marginal distribution of FR (Hyrš and Schwarz, 2015). The Gaussian copula, which mimics the dependence structure of a multivariate Gaussian distribution, is the most widely used elliptical copula, but it does not capture nonlinear dependence. It is well known that a Gaussian copula with correlation coefficient zero reduces to the independence copula, but this is not true in general. For example, in the case of another common elliptical copula, the t copula, the parameter captures the linear dependence while the form of the copula function accommodates nonlinear dependence. We will now discuss the effect of asynchronicity on copula estimation.
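Before turning to that, a minimal illustration: a Gaussian copula can be evaluated by composing the joint and marginal Gaussian distribution functions exactly as in the display above. A sketch using SciPy:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def gaussian_copula(u, R):
    """C_R(u_1,...,u_d) = F_R(F^{-1}(u_1),...,F^{-1}(u_d)) with Gaussian F."""
    z = norm.ppf(u)                                   # marginal quantile transform
    mvn = multivariate_normal(mean=np.zeros(len(u)), cov=R)
    return mvn.cdf(z)

R = np.array([[1.0, 0.6],
              [0.6, 1.0]])
print(gaussian_copula([0.9, 0.7], R))                 # value of C_R at (0.9, 0.7)
```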

By Sklar's theorem (see Nelsen 2007), the distribution function of the log returns R1 and R2 can be expressed as F(r1,r2) = C(F1(r1),F2(r2);𝜃), where C is the unique copula associated with F, and F1, F2 are the distribution functions of the returns scaled to a unit interval, i.e.

$$ R^{2}_{{k^{2}_{i}}}=\frac{Y_{t({k^{2}_{i}})}-Y_{t(k^{2}_{i-1})}}{t({k^{2}_{i}})-t(k^{2}_{i-1})} \quad\text{and}\quad R^{1}_{{k^{1}_{i}}}=\frac{X_{t({k^{1}_{i}})}-X_{t(k^{1}_{i-1})}}{t({k^{1}_{i}})-t(k^{1}_{i-1})}. $$

Asynchronicity not only affects the estimation of 𝜃, but also the estimation of the copula function, because R1 and R2 are assumed to be observed synchronously. The convergence of \(\hat {C}(\hat {F}_{1}(r^{1}),\hat {F}_{2}(r^{2});\hat {\theta })\), where \(\hat {F}_{1}(.)\) and \(\hat {F}_{2}(.)\) are empirical distribution functions of R1 and R2, needs more than the convergence of \(\hat {\theta }\). The next theorem addresses this concern. Before stating the theorem, we make an additional assumption, which ensures that the probability of the missing value of the scaled return at \(t({k_{i}^{1}})\) and the observed value of the scaled return at \(t({k_{i}^{2}})\) both lying in an interval of length 2δ is of the order \(\frac {\delta }{n^{\psi }}\) for some ψ > 0.

\(\mathcal {A}_{4}\): \(P[\mid R_{{k^{1}_{i}}}^{l}-R_{{k^{2}_{i}}}^{l}\mid \leq 2\delta ]=O(\frac {\delta }{n^{\psi }})\) for l = 1,2 with ψ > 0, and \(\mid \mathrm {max_{i}}(R_{{k_{i}^{1}}}-R_{{k_{i}^{2}}})\mid <M\) for some positive real number M.

This assumption is reasonable for two reasons. First, as δ gets smaller, we expect the probability \(P[\mid R_{{k^{1}_{i}}}^{l}-R_{{k^{2}_{i}}}^{l}\mid \leq 2\delta ]\) to become smaller as well. Second, since we are considering high-frequency data, an increase in n implies higher liquidity, so the interarrival times between consecutive transactions shrink. As we assumed an underlying diffusion process driven by geometric Brownian motion, the fluctuations in price (and therefore in returns) are much smaller over a short interarrival. Therefore the probability is smaller for larger n.

Theorem 2.

If the true underlying copula is an elliptical copula then under \(\mathcal {A}_{1}-\mathcal {A}_{4}\), \(C(\hat {F}_{1}(r_{1}),\tilde {F}_{2}(r_{2});\hat {\theta })\) is uniformly convergent to the true copula, where \(\hat {F}_{1}(.)\) and \(\tilde {F}_{2}(.)\) are the empirical distribution functions of the marginals of R1 and R2 computed from the scaled paired data and \(\hat {\theta }\) is defined as in Theorem 1.

Proof of Theorem 2 is given in Appendix A.
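In practice, the empirical margins in Theorem 2 can be computed from the ranks of the paired scaled returns; a brief sketch (our own, with the usual n + 1 denominator to keep the values inside (0,1)):

```python
import numpy as np
from scipy.stats import rankdata

def empirical_margins(r1, r2):
    """Empirical CDF values of the paired scaled returns."""
    u1 = rankdata(r1) / (len(r1) + 1)
    u2 = rankdata(r2) / (len(r2) + 1)
    return u1, u2

# The plug-in copula estimate C(F1_hat(r1), F2_hat(r2); theta_hat) is then
# obtained by evaluating the fitted elliptical copula (e.g. gaussian_copula
# above) at (u1, u2), with the corrected theta_hat of Definition 3.
```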

2.4 γ and Expected Loss of Data

Recall that the correction factor γ in Theorem 1 is a function of the arrival processes only. It is worthwhile to express γ in terms of the underlying parameters of the arrival processes, which we do in the next theorem. But the implication of the theorem goes beyond this purpose. Recall that all the synchronization methods we discussed share one problem: loss of data, as is evident from Fig. 2. The second observation of the first stock will not be included in any of the pairs and therefore will be wasted. One can then ask what proportion of observations (of each stock) will be wasted by using our pairing method (A0). This can be answered by comparing the average interarrival length in a stock (for example \(E({t_{i}^{1}}-t_{i-1}^{1})\) for the first stock) with the average interarrival length formed by the pairs (\(E(t_{k_{i}}^{1}-t_{k_{i-1}}^{1})\) for the first stock). One important point to note here is that even if the two initial point processes \(\{{t_{i}^{1}}: i=1(1)n_{1}\}\) and \(\{{t_{i}^{2}}: i=1(1)n_{2}\}\) are independent, the point processes after pairing the observations, \(\{t_{k_{i}}^{1}: i=1(1)n\}\) and \(\{t_{k_{i}}^{2}: i=1(1)n\}\), are not independent. This is because the pairing method (A0) involves arrivals of both stocks. For this reason, we will see in the next theorem that both \(E(t_{k_{i}}^{1}-t_{k_{i-1}}^{1})\) and \(E(t_{k_{i}}^{2}-t_{k_{i-1}}^{2})\) involve λ1 and λ2, the parameters of the two point processes.

Theorem 3.

Suppose the two underlying point processes are Poisson processes with parameters λ1 and λ2. Then:

  1. \(E(I) = \frac {1}{2}{\sum }_{n=1}^{\infty }n\big[\big(\frac {1}{\lambda _{1}}+\frac {1}{\lambda _{2}}\big)(p_{n}+q_{n})\big],\)

  2. \(E(t_{k_{i}}^{1}-t_{k_{i-1}}^{1}) = \frac {\eta _{1}}{\lambda _{1}},\)

  3. \(E(t_{k_{i}}^{2}-t_{k_{i-1}}^{2}) = \frac {\eta _{2}}{\lambda _{2}},\)

where

$$ p_{n}=F_{B(1,n+1)}\left( \frac{\lambda_{2}}{\lambda_{1}+\lambda_{2}}\right)-F_{B(1,n)}\left( \frac{\lambda_{2}}{\lambda_{1}+\lambda_{2}}\right), $$
$$ q_{n}=F_{B(1,n+1)}\left( \frac{\lambda_{1}}{\lambda_{1}+\lambda_{2}}\right)-F_{B(1,n)}\left( \frac{\lambda_{1}}{\lambda_{1}+\lambda_{2}}\right), $$

and for i = 1,2

$$ \begin{array}{@{}rcl@{}} \eta_{i}&=&\sum\limits_{k=1}^{\infty}\left[\Big\{F_{B(1,k+1)}\big(1-\frac{\lambda_{i}}{\lambda_{1}+\lambda_{2}}\big)-F_{B(1,k)}\big(1-\frac{\lambda_{i}} {\lambda_{1}+\lambda_{2}}\big)\Big\}\right]\\ &&+\sum\limits_{k=1}^{\infty}\left[k\Big \{F_{B(1,k+1)}\big(\frac{\lambda_{i}}{\lambda_{1}+\lambda_{2}}\big)-F_{B(1,k)}\big(\frac{\lambda_{i}}{\lambda_{1}+\lambda_{2}}\big)\Big \}\right]. \end{array} $$

Here FB(a,b) denotes the cumulative distribution function (cdf) of the Beta(a,b) distribution.

Proof of Theorem 3 is given in Appendix A.

As a consequence of this theorem we have

$$ \gamma=\frac{\sqrt{\frac{\eta_{1}\eta_{2}}{\lambda_{1}\lambda_{2}}}}{\frac{1}{2}{\sum}_{n=1}^{\infty}n\Big[\big(\frac{1}{\lambda_{1}}+\frac{1}{\lambda_{2}}\big)\big(p_{n}+q_{n}\big)\Big]}. $$
(2)
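Equation 2 can be evaluated numerically by truncating the infinite series; a sketch (the truncation level nmax is our choice and should be increased until the result stabilizes):

```python
import numpy as np
from scipy.stats import beta

def gamma_factor(lam1, lam2, nmax=1000):
    """Numerical evaluation of the correction factor gamma in Eq. 2."""
    n = np.arange(1, nmax + 1)
    x1 = lam1 / (lam1 + lam2)
    x2 = lam2 / (lam1 + lam2)
    p = beta.cdf(x2, 1, n + 1) - beta.cdf(x2, 1, n)          # p_n
    q = beta.cdf(x1, 1, n + 1) - beta.cdf(x1, 1, n)          # q_n
    EI = 0.5 * np.sum(n * (1 / lam1 + 1 / lam2) * (p + q))   # E(I)

    def eta(lam_i):
        xi = lam_i / (lam1 + lam2)
        a = beta.cdf(1 - xi, 1, n + 1) - beta.cdf(1 - xi, 1, n)
        b = n * (beta.cdf(xi, 1, n + 1) - beta.cdf(xi, 1, n))
        return np.sum(a + b)

    return np.sqrt(eta(lam1) * eta(lam2) / (lam1 * lam2)) / EI
```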

Before moving to the next section, we would like to note two points. First, if one of the stocks has much lower liquidity than the other, then the number of paired observations will reduce significantly; the extent of this reduction can be obtained precisely from Theorem 3. Secondly, the parameter of an elliptical copula is the correlation matrix. As a result, all the methods described above are applicable to multivariate analysis with dimension greater than 2.

3 Extension to General Copula

In this section, we will deal with a more general class of copulas. As the argument in Section 2 is based entirely on the correlation coefficient, it cannot be directly extended to a larger class of copulas. This is precisely because for a general copula there is no direct relation between Pearson's correlation coefficient and the copula parameter.

We propose to use Kendall's tau to capture the copula dependence. Kendall's tau is defined as

$$ \rho_{\tau}(X,Y) := E(\text{sign}((X-\tilde{X})(Y-\tilde{Y}))), $$

where \(\tilde {X}\) and \(\tilde {Y}\) are identical but independent copies of X and Y. The relation between Kendall’s tau and the copula is captured through the following equation.

$$ \rho_{\tau}(X,Y)=4{{\int}_{0}^{1}}{{\int}_{0}^{1}}C(u_{1},u_{2})dC(u_{1},u_{2})-1. $$
(3)

If X and Y are random variables with an Archimedean copula C generated by ϕ in Ω, then

$$ \rho_{\tau}(X,Y)=1+4{{\int}_{0}^{1}}\frac{\phi(t)}{\phi'(t)}dt. $$
(4)

For elliptical copulas, a simplified form can be derived:

$$ \rho_{\tau}(X,Y)=\frac{2}{\pi}\text{arcsin}\rho. $$
(5)

So we can study how Kendall's tau is affected by asynchronicity, and thereupon gauge the impact on the copula parameter using the above-mentioned relations.
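For the elliptical case, Eq. 5 and its inverse give the conversion between the two scales directly; for instance:

```python
import numpy as np

def tau_from_rho(rho):
    """Eq. 5: Kendall's tau implied by the elliptical copula parameter rho."""
    return (2.0 / np.pi) * np.arcsin(rho)

def rho_from_tau(tau):
    """Inverse of Eq. 5: rho = sin(pi * tau / 2)."""
    return np.sin(np.pi * tau / 2.0)

print(tau_from_rho(0.8))                 # ~0.5903
print(rho_from_tau(tau_from_rho(0.8)))   # 0.8
```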

3.1 Underestimation of Kendall’s Tau

To have a closer look at the dependence, in this section we consider the conditional distribution of the return given the underlying configuration. The problem with nonsynchronous data is that two independent pairs of returns cannot be taken as identical copies of each other. To see this, consider Fig. 4, where arrival times of the first stock are denoted by triangles and arrival times of the second stock are denoted by circles. After applying the pairing method, suppose the first circle and first triangle represent the location of the first pair of prices. Similarly, the second circle and the second triangle represent the next pair. From the figure, it is evident that these two pairs form an example of the second configuration (see Eq. 1). Similarly, the 3rd and 4th pairs constitute an example of the 4th configuration. So the corresponding returns may not be considered identically distributed. In this subsection, we will measure Kendall's tau using only the returns with the same configuration. Figure 5 represents the arrival times of two pairs of the same configuration.

Figure 4

Two independent observations (pairs) of two different configurations

Figure 5

An example of arrival times of two pairs with same configuration

As illustrated in Fig. 5, suppose we have two non-overlapping inter-arrivals u1 and u2 for the first stock, and 𝜖1 + u1 + η1 and 𝜖2 + u2 + η2 for the second stock, with arrival times denoted by triangles and circles respectively. The log returns corresponding to the inter-arrivals of the first stock are given by \({R^{1}_{1}}=R^{1}(u_{1})\) and \({R^{1}_{2}}=R^{1}(u_{2})\). Similarly, the log returns corresponding to the intervals of the second stock are \({R^{2}_{1}}=R^{2}(\epsilon _{1}+u_{1}+\eta _{1})=R^{2}(\epsilon _{1})+R^{2}(u_{1})+R^{2}(\eta _{1})\) (due to the independent increment property) and \({R^{2}_{2}}=R^{2}(\epsilon _{2}+u_{2}+\eta _{2})=R^{2}(\epsilon _{2})+R^{2}(u_{2})+R^{2}(\eta _{2})\). In what follows, we will focus on the 1st and 4th configurations of Eq. 1.

Define,

$$ A=(R^{1}(I_{1})-R^{1}(I_{2}))(R^{2}(I_{1})-R^{2}(I_{2})) $$

and

$$ B=\begin{cases} (R^{1}(I_{1})-R^{1}(I_{2}))(R^{2}({I_{1}^{c}})-R^{2}({I_{2}^{c}})) & \mathrm{for\ 4th\ configuration}\\ (R^{1}({I_{1}^{c}})-R^{1}({I_{2}^{c}}))(R^{2}(I_{1})-R^{2}(I_{2})) & \mathrm{for\ 1st\ configuration} \end{cases} $$

where Ii and \({I_{i}^{c}}\) are respectively the overlapping and non-overlapping regions of the i-th pair of returns. In the above example, length(I1) = u1, length(I2) = u2, \(\text {length}({I_{1}^{c}})=\epsilon _{1}+\eta _{1}\) and \(\text {length}({I_{2}^{c}})=\epsilon _{2}+\eta _{2}\). Note that for the first and fourth configurations, E(sign(A)) gives us the true Kendall's tau. We cannot calculate E(sign(A)) because R2(I1) and R2(I2) are not observed. Instead, we observe \(R^{2}(I_{1}\cup {I_{1}^{c}})\) and \(R^{2}(I_{2}\cup {I_{2}^{c}})\). Therefore the observed Kendall's tau is E(sign(A + B)). In this section, we will try to find the relation between E(sign(A)) and E(sign(A + B)).
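The resulting attenuation can be seen numerically. The following Monte Carlo sketch mimics the 4th configuration with Gaussian increments (the value of θ and the unit variances of the overlapping and non-overlapping parts are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 0.7, 500_000
cov = np.array([[1.0, theta], [theta, 1.0]])

# returns over the overlapping regions I1, I2: correlated across the stocks
rI1 = rng.multivariate_normal([0, 0], cov, size=n)
rI2 = rng.multivariate_normal([0, 0], cov, size=n)
A = (rI1[:, 0] - rI2[:, 0]) * (rI1[:, 1] - rI2[:, 1])

# stock-2 returns over the non-overlapping regions: independent of stock 1
B = (rI1[:, 0] - rI2[:, 0]) * (rng.normal(size=n) - rng.normal(size=n))

print(np.mean(np.sign(A)))      # ~ (2/pi) arcsin(theta) = 0.494, the true tau
print(np.mean(np.sign(A + B)))  # noticeably smaller in absolute value
```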

In order to establish our result, we need some assumptions. Suppose X and Y are positively associated random variables. Let (X1,Y1) and (X2,Y2) be two identical copies of (X,Y). Then, given the information that Y1 − Y2 > 0, we would expect X1 − X2 to be more likely positive than negative. Intuitively, positive association would also suggest that given the information \(Y_{1}-Y_{2}\in S\subset \mathbb {R}^{+}\), X1 − X2 is more likely to be positive. This notion is not in general captured by any known measure of association. For each of the following we define (X1,Y1) and (X2,Y2) as two identical copies of (X,Y), U = X1 − X2 and V = Y1 − Y2.

Assumptions (\(\mathbf {{\mathscr{B}}}\)), stated below, try to capture the above idea. We expand on this further in Appendix A.

\(\mathbf {{\mathscr{B}}_{1}}\): If P(UV > 0) > 1/2 (or < 1/2) then for all M > 0,

P(U > 0∣0 < V < M) ≥ 1/2 (or < 1/2) and

P(V > 0∣0 < U < M) ≥ 1/2 (or < 1/2).

\(\mathbf {{\mathscr{B}}_{2}}\): If P(UV > 0) > 1/2 (or < 1/2) then for all M > 0,

P(UV > 0∣ ∣V ∣ > M) > 1/2 (or < 1/2) and

P(UV > 0∣ ∣U ∣ > M) > 1/2 (or < 1/2).

Before stating the main theorems, we first state some lemmas that will help us prove them.

Lemma 1.

E(sign(A)∣sign(A)≠sign(B)) = E(sign(A)).

Proof of Lemma 1 is given in Appendix A.

Lemma 2.

E(sign(A)∣sign(A)≠sign(B),∣A∣ < ∣B∣) = E(sign(A)∣∣A∣ < ∣B∣).

This is a straightforward consequence of Lemma 1 and the independence of {sign(A)≠sign(B)} and {∣A∣ < ∣B∣}.

Theorem 4.

Under the Assumption \({\mathscr{B}}\), for the pairs with 1st and 4th configuration,

$$ \mid \tilde{\rho_{\tau}}\mid >\mid \rho_{\tau}\mid, $$

where ρτ is the Kendall's tau calculated on the paired data with the 1st and 4th configurations, i.e. ρτ = E(sign((X1 − X2)(Y1 − Y2))), with (X1,Y1) and (X2,Y2) independent pairs of the same configuration, and \(\tilde {\rho }_{\tau }\) denotes the true Kendall's tau.

Proof of Theorem 4 is given in Appendix A. Now we will show that \(\text {sign}(\rho _{\tau })=\text {sign}(\hat {\rho }_{\tau })\).

Theorem 5.

For the pairs with 1st and 4th configuration,

$$ \rho_{\tau}=E\big[\text{sign}(A)\mid \text{sign}(A)\neq \text{sign}(B),\ \mid A\mid >\mid B\mid \big]P(\mid A\mid >\mid B\mid ), $$

where ρτ is the Kendall's tau calculated on the paired data with the 1st and 4th configurations, i.e. ρτ = E(sign((X1 − X2)(Y1 − Y2))), with (X1,Y1) and (X2,Y2) independent pairs of the same configuration.

Proof of Theorem 5 is given in Appendix A. By Lemma 2,

$$ \rho_{\tau}=E\big[\text{sign}(A)\mid \mid A\mid >\mid B\mid \big]P(\mid A\mid >\mid B\mid ). $$

This together with Assumption \({\mathscr{B}}\) implies that, \(\text {sign}(\rho _{\tau })=\text {sign}(\hat {\rho }_{\tau })\).

Theorems 4 and 5 together imply that the estimator of Kendall’s tau obtained after pairing the observed asynchronous data underestimates the true parameter under the assumption \({\mathscr{B}}\) for the 1st and 4th configurations. Similar results can be established under the other two configurations.

3.2 Corrected Estimator

As in Section 2, we would like to find a correction factor, depending only on the arrival times, for a more general class of copulas. For an elliptical copula, the value of the correction factor does not depend on the value of the parameter. This is evident from Fig. 6 (left panel), showing the true parameter against the mean estimated (uncorrected) parameter of the Gaussian copula for simulated nonsynchronous data. We generate the arrival times according to a pre-specified Poisson process. We can see that the true and estimated parameters lie along a regression line whose intercept term is insignificant. This suggests that the corrected estimator should be a constant times the uncorrected one; this constant is the correction factor derived in Theorem 1.

Figure 6

True vs mean estimated correlation obtained from 100 simulations for the Gaussian copula (left), and true vs mean estimated Kendall's tau obtained from 100 simulations for the Clayton copula (right). The margins in both cases are taken to be Gaussian with mean 0 and standard deviation 1

On the other hand, in the figure for the Clayton copula (right panel, Fig. 6), we can see that a straight line would not be a good candidate to model the relation between the true and the uncorrected estimated Kendall's tau. This means we should not aspire to find a simple multiplicative correction factor that would give us the value of the true parameter. On inspection, a second-degree polynomial seems to be a good model. Following the same procedure, a second-degree polynomial seems to be appropriate for the Gumbel copula as well. We therefore use a quadratic model to obtain the corrected estimator. The detailed steps are outlined below (a code sketch follows the list):

  1. From the observed data, estimate the two arrival processes independently.

  2. Estimate the univariate marginal distributions.

  3. Using the pairing algorithm described in Section 2.2, pair the observations.

  4. With the paired data, determine which copula fits the data best. This can be done through the AIC or BIC criterion.

  5. Estimate the (uncorrected) Kendall's tau from the paired data.

  6. Pre-specify K copula parameters (or, equivalently, values of Kendall's tau). For each parameter, with the information on the underlying copula, arrival processes, and marginals, simulate N nonsynchronous samples (the technique for generating nonsynchronous data is discussed in Section 4).

  7. For each sample, calculate the uncorrected estimate and plot the estimates against the true Kendall's tau in a plot like Fig. 8 (right panel).

  8. Fit a suitable quadratic regression to such a plot.

  9. From the regression equation, find the corrected Kendall's tau corresponding to the estimated value of Kendall's tau (obtained in step 5).

Note that the above procedure yields an interval estimator for Kendall’s tau by considering the confidence interval in the regression. In Section 4 we study the coverage probability of such intervals through simulations and compare them to other interval estimates.
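Steps 5-9 admit a direct implementation. The sketch below assumes a hypothetical helper simulate_nonsync_pairs(tau), which returns one nonsynchronous sample of paired returns generated under true Kendall's tau tau (using the scheme of Section 4):

```python
import numpy as np
from scipy.stats import kendalltau
from scipy.optimize import brentq

def corrected_tau(tau_hat_obs, simulate_nonsync_pairs,
                  true_taus=np.linspace(0.05, 0.6, 12), N=100):
    """Quadratic-regression correction of an uncorrected Kendall's tau."""
    mean_est = []
    for tau in true_taus:
        est = [kendalltau(*simulate_nonsync_pairs(tau))[0] for _ in range(N)]
        mean_est.append(np.mean(est))
    # step 8: regress the uncorrected estimate on the true tau (quadratic)
    a, b, c = np.polyfit(true_taus, mean_est, deg=2)
    # step 9: invert the fitted curve at the observed uncorrected estimate;
    # assumes the fitted curve brackets tau_hat_obs on the grid of true taus
    f = lambda t: a * t**2 + b * t + c - tau_hat_obs
    return brentq(f, true_taus[0], true_taus[-1])
```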

4 Simulation

We simulated data of synchronized log-returns of two stocks for n1 + n2 time points. The time points are generated by a Poisson process. The corresponding n1 + n2 returns are drawn randomly from a bivariate distribution determined by a pre-specified copula and margins. These n1 + n2 pairs are then transformed appropriately to represent log-prices on the corresponding interarrivals. From the first stock, we randomly delete n2 time points and their corresponding prices; the remaining n1 data points constitute the data for the first stock. For the second stock, we keep the time points that were deleted from the first stock and delete the rest. These time points, along with their corresponding log-prices, constitute the data for the second stock. We thus obtain nonsynchronous data for the two stocks.
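A sketch of this data-generating scheme, with a Gaussian copula and Gaussian margins as an example (the √dt scaling is our reading of the "appropriate transformation" to log-prices on the interarrivals; function and variable names are ours):

```python
import numpy as np

def simulate_nonsync(n1, n2, rho, lam=1.0, rng=None):
    """Generate nonsynchronous (time, log-price) data for two stocks."""
    rng = np.random.default_rng() if rng is None else rng
    n = n1 + n2
    t = np.cumsum(rng.exponential(1 / lam, size=n))   # Poisson arrival times
    # synchronized returns from a bivariate Gaussian copula with Gaussian margins
    z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
    dt = np.diff(np.concatenate(([0.0], t)))
    r = z * np.sqrt(dt)[:, None]                      # scale to the interarrivals
    x, y = np.cumsum(r[:, 0]), np.cumsum(r[:, 1])     # log-prices
    keep1 = np.sort(rng.choice(n, size=n1, replace=False))
    keep2 = np.setdiff1d(np.arange(n), keep1)         # the deleted ticks go to stock 2
    return (t[keep1], x[keep1]), (t[keep2], y[keep2])
```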

4.1 Estimation of the Copula Parameter of Elliptical Copulas

In the following simulation study, we test the performance of the method prescribed in Theorem 1 for estimating the copula parameter. To do so, we first choose a Gaussian copula and generate 100 instances of nonsynchronous data by the method mentioned above. Initially, n1 and n2 are taken to be equal. The mean, variance and mean square error of the 100 estimates are reported in Tables 1 and 2. In Fig. 7, we show the boxplots for ρ = 0.25 and ρ = 0.8. In each panel, the boxplot on the left corresponds to the corrected estimates, while those in the middle and on the right correspond to uncorrected estimates obtained from refresh time sampling and previous tick sampling, respectively. The horizontal line indicates the true parameter.

Table 1 Mean and standard deviation of estimates from 100 simulations for different ρ and sample size
Table 2 Calculated mean square error for the results in Table 1
Figure 7

Boxplots for ρ = 0.25 (left) and ρ = 0.8 (right); the first boxplot is for the corrected estimates of 100 simulations with sample size 2000. The middle and the last plots correspond to the estimates from refresh time sampling and previous tick sampling respectively

From the table, we see that both previous tick and refresh time sampling fail to capture the magnitude of the true dependence. In fact, the previous tick method is the worst choice for synchronization.

We carry out the same analysis with the t copula, with different marginal distributions and different degrees of freedom, which is a more realistic scenario for intraday financial data. The result is similar: our prescribed correction gives a good estimate, whereas the uncorrected method returns a significantly biased estimate. The results of 100 simulations with parameter −0.4 are summarized in Table 3.

Table 3 Simulation for t copula for different marginals with ρ = − 0.4

4.2 Interval Estimation of Kendall’s Tau in Non-elliptical Copula

We take three approaches to interval estimation of the true Kendall's tau and apply them to simulated data from several Archimedean copulas. In the first approach, we follow the method described in Section 3.2 and obtain the 95% confidence interval. Here we assume a quadratic relation between the true and the estimated parameters. The blue dotted lines in the right panel of Fig. 8 show the confidence intervals for the Clayton copula.

Figure 8

Left: Intervals in which estimated Kendall’s tau lies in 95% of times for Clayton copula, Right: Prediction interval for the regression line for Clayton copula

The second approach is similar to the first one, but we do not fit a regression line. Instead, for each true Kendall's tau, we plot the interval that contains the (under)estimated Kendall's tau 95% of the time. In the left panel of Fig. 8, we plot these (horizontal) intervals against the true Kendall's tau. We then calculate the confidence interval for the true Kendall's tau as the vertical interval corresponding to the estimated Kendall's tau (see the red vertical lines corresponding to 0.1, 0.2 and 0.32 in the figure). This is a completely non-parametric approach and relies on inverting the acceptance regions of hypothesis tests for Kendall's tau.

In the third approach, we deliberately misspecify the underlying copula as a Gaussian copula and use Theorem 1 to calculate the confidence interval (using the relation between the correlation coefficient and Kendall's tau for elliptical copulas, see Eq. 5). The results of these three approaches to interval estimation are given in Table 4.

Table 4 Coverage probability (CP) and Interval length (IL) for three methods

The coverage probabilities and interval lengths from the three methods of interval estimation described in Section 4.2 are shown in Table 4. An important takeaway comes from the last column, which demonstrates that the effect of model misspecification can be quite serious. Note that the second method, being completely non-parametric, does not assume anything about the shape of the dependence function between the true parameter and the uncorrected estimate. The first method assumes a quadratic model, which reduces computation substantially. From the table we see that the coverage probability of the first method is always at least the target value of 95%, so the assumption of a quadratic model does not compromise the coverage probability. Its intervals are a little wider than those of the second method, so the first method is more conservative. Another observation is that the lengths of the intervals do not depend much on the value of the underlying parameter.

5 Real Data Analysis

We analyze real financial intraday data to see which kind of copula is most likely to be encountered in practice and to obtain the corresponding parameter estimates. We use AIC to compare and select the best copula. In many cases, we find that the t copula is a good choice for modelling bivariate intraday data. To see the impact of asynchronicity on real data, we record the relative extent of the correction to be undertaken. The intraday data for the Apple and Facebook stocks are plotted in Fig. 9. These have been modelled by a bivariate t copula for three consecutive days. For all three days, both the uncorrected and the corrected estimates are reported in Table 5, with the percentage change between them in the third column. We notice that almost 30 to 35% of the data is lost after constructing the pairs by algorithm \(\mathcal {A}_{0}\).

Figure 9

Facebook (black) and Apple (gray) data after subtracting the mean for two consecutive days 10.05.2017 and 11.05.2017

Table 5 Copula estimation for the joint distribution of Apple and Facebook data

We also performed the same analysis for a few other stocks, with very similar results. For example, for Amazon and Netflix on three nearly consecutive days, the percentage changes in the t copula parameter are 41.75%, 39.84% and 42.76%, respectively.

6 Conclusion and Future Directions

Both the simulations and the real data analysis clearly show that the impact of asynchronicity can be very serious if not tackled properly. We discuss some methods to circumvent the problem; careful pre-processing of intraday data is necessary to model, or infer about, the underlying realities. We propose a consistent estimator of the correlation coefficient and, more generally, of the elliptical copula function. For a more general class of copulas, where there is a one-one relation between Kendall's tau and the copula, we suggest a way of estimating the copula parameter. Alongside the point estimates, three ways of interval estimation are discussed and compared. From the results it is evident that the impact of asynchronicity can be quite serious under model misspecification. The real data analysis corroborates our findings: for the two chosen stocks, as the correlation is quite low, the absolute change in the value after the correction is small, but the relative change is significantly high, as we expected.

There are several directions in which one can extend this work. Firstly, we did not assume the presence of microstructure noise; in the presence of noisy observations, the estimator may demand further modifications. The estimation procedure can be further challenging if the parameter is time-dependent. As time-dependent copula modelling is gaining popularity in financial data analysis, it is worthwhile to investigate the effect of asynchronicity on time-varying parameter estimation. Another question one can look into is how asynchronicity affects the estimation of popular risk measures, like the Value at Risk (VaR), of portfolios involving multiple assets.