1 Introduction

Gaia [5] is a European Space Agency (ESA) space mission, launched in December 2013, whose main objective is to compile a large-scale astronomical survey of about one billion stars (\({\approx }1\%\)) of our Galaxy and its Local Group. The satellite will scan the entire sky for about 5 years, yielding a catalog of positions, distances and proper motions that is unprecedented in both size and precision. Additionally, it will perform multi-epoch photometry (70 transits per object on average), which also makes the satellite suitable for studies of stellar variability. Amongst the many variability types present in the stellar zoo, one in particular is of paramount importance: the Classical Cepheids. Classical Cepheids represent the first calibrator in the cosmic distance ladder used to infer the structure and evolution of our Universe, and our current knowledge about the Big Bang, the inflationary period, the dark matter problem or dark energy relies on the period-luminosity relation for Classical Cepheids [14]. Therefore, a precise and accurate understanding of the population of Classical Cepheids is central to all cosmological studies.

In this paper we address the problem of inferring the true properties of this population of variable stars from the petabyte-size Gaia catalog. In order to populate the catalog, and as part of a much larger framework to deliver a data set of scientific quality, the Data Processing and Analysis Consortium (DPAC) developed a pipeline to characterize and classify the observed time series. A key element of this process is that the time sampling of the stellar brightness time series will carry the imprint of the satellite's intrinsic frequencies (amongst others, the spinning and precessing frequencies, a description of which is beyond the scope of this paper). As a consequence, some (but not all) of the derived frequencies will be affected by aliasing, which results in biased samples.

The objective is to characterize the phenomenon of aliasing in the Gaia catalog, correct for it, and reconstruct the real distribution of the properties of Large Magellanic Cloud (LMC) Classical Cepheids. In order to achieve these goals, we tackle the problem under the Bayesian paradigm [6, 7] and adopt the knowledge representation language of Bayesian Networks (BN) [8, 11]. This framework allows a hierarchical representation of the problem in which the time series gathered by Gaia are the product of a generative process which ultimately depends on the parameters of the population of stars. Given that the computation of the posterior probabilities of our model is analytically intractable, the inference mechanism of our proposal is based on Markov chain Monte Carlo (MCMC) simulation techniques [13].

We have validated our models using a database of 36688 synthetic time series of LMC Classical Cepheids, generated according to controlled prescriptions based on the current understanding of the true distributions and the satellite characteristics. Our results show that the approach is ready for the second Gaia data release, expected for 2018. This will be the first data release to include photometric time series (although this still needs to be confirmed).

The rest of the paper is structured as follows. In Sect. 2 we describe our model and the MCMC technique used for the inference of the parameters of interest. In Sect. 3 we validate the model with the simulated database in a scenario of extreme aliasing and describe the results of this validation procedure. Finally, in Sect. 4 we summarize the contributions of this work and some of its limitations, and give pointers to future developments.

2 Hierarchical Modelling of the Distribution of Pulsation Properties of Classical Cepheid Variable Stars

2.1 The Hierarchical Model

Figure 1 and Table 1 depict the structure of the directed acyclic graph (DAG) associated with the model and summarize the meanings of the nodes and the types of their distributions. We classify the nodes into a hierarchy of three levels. The hierarchy distinguishes between evidential nodes (observations), the remaining nodes inside the rectangle or plate, which is replicated N times (one per star), and the nodes outside the rectangle. In the following paragraphs we describe the parameters and probability distributions for each level and its contribution to the joint probability distribution.

Fig. 1. Graph structure of our proposed Bayesian Graphical Model (BGM). Most fixed parameters are not included in the graph, with the exception of those enclosed inside a square. See the text and Table 1 for node descriptions.

Table 1. Description of parameters. NI = non informative

2.1.1 Likelihood

In the bottom level of our graph we present the evidential nodes, that is, the variables measured directly or derived by the DPAC

$$\begin{aligned} \mathcal {D}=\left( \nu _{\mathrm {rec},i},A_{\mathrm {rec},i},m_{G_{\mathrm {rec}},i}\right) . \end{aligned}$$
(1)

These nodes, depicted by double circles, are the output/recovered frequency \(\nu _{\mathrm {rec},i}\), the amplitude \(A_{\mathrm {rec},i}\) and the apparent G-magnitude \(m_{G_{\mathrm {rec}},i}\) for the i-th star.

Recovered Frequencies. Most of the pairs \(\left( \nu _{\mathrm {input}},\nu _{\mathrm {rec}}\right) \) in the simulated database fall on straight lines of the form:

$$\begin{aligned} \nu _{\mathrm {rec}}=\pm \nu _{\mathrm {input}}\pm k_{1}\nu _{s}\pm k_{2}\nu _{p}, \end{aligned}$$
(2)

where \(k_{1}\in \left\{ 0,3,7\right\} \), \(k_{2}\in \left\{ 0,\ldots ,19\right\} \), \(\nu _{s}\approx \frac{1}{0.25}=4\,\mathrm {d}^{-1}\) is the rotational frequency of Gaia and \(\nu _{p}=\frac{1}{63}\,\mathrm {d}^{-1}\) is its precessional frequency. We refer to each line as a locus/category of recovered frequencies. Excluding the line \(\nu _{\mathrm {rec}}=\nu _{\mathrm {input}}\), all these loci correspond to spurious (aliased) frequencies. Based on that, we parameterize the i-th recovered frequency as the following mixture of Gaussian distributions

$$\begin{aligned} f\left( \nu _{\mathrm {rec},i}\mid \log (\nu _{i}),T_{\nu _{\mathrm {rec},i}}\right) = \sum _{j=1}^{M}\delta _{T_{\nu _{\mathrm {rec},i}}}^{j}\mathsf {N}\left( \left( -1\right) ^{j-1}10^{\log \left( \nu _{i}\right) }+b_{j},\tau _{\nu _{\mathrm {rec}}}\right) . \end{aligned}$$
(3)

In Eq. 3 the Kronecker deltas \(\delta _{T_{\nu _{\mathrm {rec},i}}}^{j}\) dictate the Gaussian component to which \(\nu _{\mathrm {rec},i}\) belongs according to the value of the categorical variable \(T_{\nu _{\mathrm {rec},i}}\) (described in Sect. 2.1.2). The mean of each component represents the locus in which the input frequency has been recovered, i.e. the identity locus, with \(b_{j}=0\) for \(j=1\), or some locus of spurious (aliased) frequencies for \(j>1\). We assume the same precision \(\tau _{\nu _{\mathrm {rec}}}=10000\) for all components.
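
As an illustration of Eqs. 2 and 3, the following Python sketch computes the offset of a locus and evaluates the Gaussian component selected by the categorical node. It is only a sketch under stated assumptions: the function names are hypothetical and not part of the DPAC pipeline, the sign of each locus is passed explicitly instead of being derived from the component index j, and the signs of Eq. 2 are carried by the integers `k1` and `k2`.

```python
import math

NU_S = 4.0             # Gaia rotational frequency in d^-1 (value quoted with Eq. 2)
NU_P = 1.0 / 63.0      # Gaia precessional frequency in d^-1 (value quoted with Eq. 2)
TAU_NU_REC = 10000.0   # precision shared by all components of Eq. 3


def locus_offset(k1, k2, nu_s=NU_S, nu_p=NU_P):
    """Offset b = k1*nu_s + k2*nu_p of one locus; signs are carried by k1, k2."""
    return k1 * nu_s + k2 * nu_p


def component_mean(log_nu, sign, offset):
    """Mean of one Gaussian component: sign * 10**log_nu + offset."""
    return sign * 10.0 ** log_nu + offset


def normal_pdf(x, mean, precision):
    """Gaussian density parameterized by its precision (inverse variance)."""
    return math.sqrt(precision / (2.0 * math.pi)) * math.exp(
        -0.5 * precision * (x - mean) ** 2
    )


def recovered_frequency_density(nu_rec, log_nu, sign, offset, tau=TAU_NU_REC):
    """Density of nu_rec for the component selected by the categorical node."""
    return normal_pdf(nu_rec, component_mean(log_nu, sign, offset), tau)


# Identity locus (b = 0) and the aliased locus +nu_in + 7*nu_s - 3*nu_p.
b_alias = locus_offset(7, -3)
mean_alias = component_mean(math.log10(0.2), +1, b_alias)
```

With these values the aliased locus is shifted by roughly 27.95 d\(^{-1}\) with respect to the identity locus, so the two components are well separated compared with the component width implied by the precision of 10000.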

Recovered Amplitudes. To gain insight into the form of the conditional distribution of the recovered amplitude given the input amplitude we have checked the hypothesis that recovered amplitudes are also biased by the aliasing phenomenon, just as recovered frequencies are. By analysing the relationship between loci of frequencies and pairs \(\left( A_{\mathrm {input}},A_{\mathrm {rec}}\right) \), we have discovered that for a perfect recovery the distribution \(A_{\mathrm {rec}}\mid A\) is skewed to lower amplitudes with a central parameter approximately equal to the input amplitude. Otherwise, for loci of aliased frequencies we have observed that the skewness of the recovered amplitude increases as the input amplitude does according to a certain slope to be determined as part of the model. To account for this fact we have fitted two linear regression models

$$\begin{aligned} A_{\mathrm {rec},i}=\beta _{1}^{j}A_{\mathrm {in},i}+\beta _{0}^{j}+\epsilon _{i}^{j},\; j=1,2, \end{aligned}$$
(4)

with \(j=1\) corresponding to the identity locus and \(j=2\) to the loci \(\nu _{\mathrm {rec}}=\pm \nu _{\mathrm {in}}+7\nu _{s}-3\nu _{p}\). For the identity locus, we have assumed a skewed Student t distribution [2] with one degree of freedom (skewed Cauchy) for the error component \(\epsilon _{i}^{1}\sim \mathsf {st}\left( 0,\omega ,\alpha ,1\right) \) where \(\omega \) and \(\alpha \) denote respectively the scale and shape parameters. For the locus \(\nu _{\mathrm {rec}}=\pm \nu _{\mathrm {in}}+7\nu _{s}-3\nu _{p}\) we have assumed that \(\epsilon _{i}^{2}\sim \mathsf {t}\left( 0,\omega ,1\right) \). Based on that, we model the conditional distribution for the recovered amplitude \(A_{\mathrm {rec},i}\) by means of the mixture of two skewed Student t distributions

$$\begin{aligned} \begin{aligned} f\left( A_{\mathrm {rec},i}\mid A_{i},T_{\nu _{\mathrm {rec},i}}\right)&= \delta _{T_{\nu _{\mathrm {rec},i}}}^{1}\mathsf {ST}\left( A_{i},0.020,-2.395,1\right) \\&+\sum _{j=2}^{M}\delta _{T_{\nu _{\mathrm {rec},i}}}^{j}\mathsf {ST}\left( 0.749\cdot A_{i},0.0266,0,1\right) , \end{aligned} \end{aligned}$$
(5)

where the location parameters \(\xi _{1}=A_{i}\), \(\xi _{j}=0.749\cdot A_{i},\forall j=2,\ldots ,M\), the scale \(\omega \) and the shapes \(\alpha \) have been obtained from the fitting of the two linear models of Eq. 4 and taken as constants in our BGM.
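
The two components of Eq. 5 can be sketched in Python as follows. The density below assumes that the skewed Student t distribution of [2] has the Azzalini-type form \(\frac{2}{\omega }\,t_{1}(z)\,T_{2}\left( \alpha z\sqrt{2/(1+z^{2})}\right) \) for one degree of freedom, with \(z=(x-\xi )/\omega \); this functional form and the function names are assumptions made for illustration, not a statement about the implementation used in the BGM.

```python
import math


def skew_cauchy_pdf(x, xi, omega, alpha):
    """Skew-t density with one degree of freedom (Azzalini-type form, assumed).

    xi: location, omega: scale, alpha: shape (alpha = 0 gives a plain Cauchy).
    """
    z = (x - xi) / omega
    cauchy = 1.0 / (math.pi * (1.0 + z * z))            # t density, 1 d.o.f.
    w = alpha * z * math.sqrt(2.0 / (1.0 + z * z))      # argument of the skewing CDF
    t2_cdf = 0.5 + w / (2.0 * math.sqrt(2.0 + w * w))   # t CDF, 2 d.o.f.
    return 2.0 / omega * cauchy * t2_cdf


def recovered_amplitude_pdf(a_rec, a_true, identity_locus):
    """Component of Eq. 5 selected by the locus of the recovered frequency."""
    if identity_locus:
        return skew_cauchy_pdf(a_rec, a_true, 0.020, -2.395)
    return skew_cauchy_pdf(a_rec, 0.749 * a_true, 0.0266, 0.0)
```

Under this assumed form, the negative shape of the identity component places more density below the input amplitude than above it, in line with the skew towards lower amplitudes described for a perfect recovery, while the zero shape of the aliased component reduces it to a symmetric Cauchy centred on \(0.749\cdot A_{i}\).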

Recovered Apparent Magnitudes. We parameterize the distribution of the i-th recovered apparent G magnitude by means of a Gaussian distribution with mean \(m_{G,i}\) and precision \(\tau _{G_{rec}}=2.5\times 10^{5}\) (to be adjusted when real Gaia data become available)

$$\begin{aligned} f\left( m_{G_{rec},i}\mid m_{G,i}\right) =\mathsf {N}\left( m_{G,i},\tau _{G_{rec}}\right) . \end{aligned}$$
(6)
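
Since the Gaussian distributions of Eqs. 3 and 6 are parameterized by precisions, it may help to translate the fixed values into standard deviations. A short worked conversion, assuming that the precision is the inverse variance and that frequencies are expressed in \(\mathrm {d}^{-1}\) and magnitudes in mag:

$$\begin{aligned} \begin{aligned} \sigma _{\nu _{\mathrm {rec}}}&=\tau _{\nu _{\mathrm {rec}}}^{-1/2}=\left( 10^{4}\right) ^{-1/2}=0.01\,\mathrm {d}^{-1},\\ \sigma _{G_{rec}}&=\tau _{G_{rec}}^{-1/2}=\left( 2.5\times 10^{5}\right) ^{-1/2}=\frac{1}{500}=0.002\,\mathrm {mag}. \end{aligned} \end{aligned}$$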

The conditional distribution of the data given their parents is then given by

$$\begin{aligned} \begin{aligned} p\left( \mathcal {D}\mid \varvec{\theta }_{1}\right) =&\prod _{i=1}^{N}f_{1}\left( \nu _{\mathrm {rec},i}\mid \log (\nu _{i}),T_{\nu _{\mathrm {rec},i}}\right) \cdot f_{2}\left( A_{\mathrm {rec},i}\mid A_{i},T_{\nu _{\mathrm {rec},i}}\right) \\ {}&\cdot f_{3}\left( m_{G_{\mathrm {rec}},i}\mid m_{G,i}\right) . \end{aligned} \end{aligned}$$
(7)

2.1.2 First Level Random Parameters

These are

$$\begin{aligned} \varvec{\theta }_{1}=\left( \log \left( \nu _{i}\right) ,A_{i},m_{G,i},T_{\nu _{\mathrm {rec},i}},T_{\nu _{i}}\right) . \end{aligned}$$
(8)

In \(\varvec{\theta }_{1}\), we distinguish two classes of nodes. The input nodes are, for the i-th star, the real frequency \(\log \left( \nu _{i}\right) \), the real amplitude \(A_{i}\) and the real apparent G-magnitude \(m_{G,i}\). The categorical nodes \(T_{\nu _{\mathrm {rec},i}}\) and \(T_{\nu _{i}}\) determine the component of a node modelled by a mixture of distributions. \(T_{\nu _{i}}\) and \(T_{\nu _{\mathrm {rec},i}}\) are respectively associated with the real frequency and the recovered frequency and amplitude. In Fig. 1 all the nodes at this level replicate with the plate. They depend on (amongst others) non informative orphan nodes outside the plate.

Categories of Recovered Frequencies. The node \(T_{\nu _{\mathrm {rec},i}}\) takes a value \(j\in \left\{ 1,\ldots ,M\right\} \) if the i-th frequency has been recovered in the j-th locus, which occurs with a probability \(\pi _{ij}\). In this paper we assume that the main factor determining the aliasing phenomenon in Gaia is the ecliptic latitude \(\beta \) of the stars. The influence of \(\beta \) over the rate of correct detections of periodic signals by Gaia has been studied in [4] where it is shown that for high values of \(\beta \), typical of LMC sources, the relation between the rate of correct detections and \(\beta \) is approximately linear with a negative slope. Based on that, we make \(\pi _{ij}\) depend on the ecliptic latitude \(\beta _{i}\) and parameterize this dependence by a multinomial logistic regression submodel with a softmax transfer function. We model the conditional distribution of \(T_{\nu _{\mathrm {rec},i}}\) as

$$\begin{aligned} p\left( T_{\nu _{\mathrm {rec},i}}\mid \left\{ \varvec{\lambda }_{j}\right\} _{j=2}^{M}\right) = \mathrm {\mathsf {Cat}}\left( M,\left\{ \pi _{ij}\left( \beta _{i}',\varvec{\lambda }_{j}\right) \right\} _{j=1}^{M}\right) , \end{aligned}$$
(9)

with

$$\begin{aligned} \pi _{ij}\left( \beta _{i}',\varvec{\lambda }_{j}\right) =\frac{e^{\varvec{\lambda }_{j}^{T}\cdot \left( 1,\beta _{i}'\right) }}{\sum _{l=1}^{M}e^{\varvec{\lambda }_{l}^{T}\cdot \left( 1,\beta _{i}'\right) }}, \end{aligned}$$
(10)

where we have rescaled the predictor \(\beta _{i}\) by subtracting the mean and dividing by two times the standard deviation, i.e. \(\beta _{i}'=\frac{\beta _{i}-\overline{\beta }}{2\cdot \mathrm {sd}\left( \beta \right) }\), which guarantees that the mean and the standard deviation are respectively 0 and 0.5.
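
The rescaling of the predictor and the softmax of Eq. 10 can be sketched in Python as follows. Two choices here are assumptions rather than statements from the model description: the coefficients of the first locus are fixed to \((0,0)\) as a reference category (Eq. 9 only lists \(\varvec{\lambda }_{j}\) for \(j\ge 2\)), and the population standard deviation is used for the rescaling. The function names are hypothetical.

```python
import math
import statistics


def rescale(betas):
    """Center and divide by two standard deviations, as for beta' in the text."""
    mean = statistics.fmean(betas)
    sd = statistics.pstdev(betas)  # population sd; an assumption of this sketch
    return [(b - mean) / (2.0 * sd) for b in betas]


def recovery_probabilities(beta_scaled, lambdas):
    """Softmax of Eq. 10 for one star.

    lambdas: list of (intercept, slope) pairs, one per locus. The first pair
    is expected to be (0, 0), the assumed reference category.
    """
    scores = [l0 + l1 * beta_scaled for l0, l1 in lambdas]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative ecliptic latitudes (degrees) and made-up coefficients.
scaled = rescale([-85.0, -86.5, -84.0, -87.2])
probs = recovery_probabilities(scaled[0], [(0.0, 0.0), (-1.0, 0.5)])
```

Subtracting the largest score before exponentiating leaves the probabilities unchanged and only guards against overflow.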

Input Frequencies and Categories. The marginal distribution of the (decadic) logarithm of the input frequency in the synthetic data set created by the DPAC Quality Assessment group was sampled from a mixture of five Gaussian distributions [1]. In our BGM, we parameterize it by the mixture of only three components

$$\begin{aligned} \begin{aligned} f\left( \log \left( \nu _{i}\right) \mid T_{\nu _{i}},\mu _{\nu },\varvec{\theta }_{\nu },\tau _{\nu },\varvec{\omega }_{\nu }\right)&= \delta _{T_{\nu _{i}}}^{1}\mathsf {N}\left( \mu _{\nu },\tau _{\nu }\right) \\ {}&+ \delta _{T_{\nu _{i}}}^{2}\mathsf {N}\left( \mu _{\nu }+\sqrt{\tau _{\nu }^{-1}}\theta _{\nu 1},\tau _{\nu }\omega _{\nu 1}^{-2}\right) \\&+\delta _{T_{\nu _{i}}}^{3}\mathsf {N}\left( \mu _{\nu }+\sqrt{\tau _{\nu }^{-1}}\theta _{\nu 1}+\sqrt{\tau _{\nu }^{-1}}\omega _{\nu 1}\theta _{\nu 2},\tau _{\nu }\omega _{\nu 1}^{-2}\omega _{\nu 2}^{-2}\right) . \end{aligned} \end{aligned}$$
(11)

In Eq. 11, \(\mu _{\nu }\) and \(\tau _{\nu }\) denote, respectively, the mean and the precision of the first component of the mixture. \(\left( \theta _{\nu 1},\theta _{\nu 2}\right) \) and \(\left( \omega _{\nu 1},\omega _{\nu 2}\right) \) denote, respectively, the perturbation parameters which affect the mean and the scale parameter of a given component to obtain the mean and scale parameter of the next component [12]. The Kronecker deltas \(\delta _{T_{\nu _{i}}}^{j}\) have the same role as in Eq. 3 but now the categorical variable \(T_{\nu _{i}}\) represents the class of the real frequency. For \(T_{\nu _{i}}\) we assign the distribution

$$\begin{aligned} p\left( T_{\nu _{i}}\right) =\mathrm {\mathsf {Cat}}\left( 3,w_{\nu 1},w_{\nu 2},w_{\nu 3}\right) , \end{aligned}$$
(12)

where \(w_{\nu j}\) are the mixing proportions of the mixture.
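
The perturbation parameterization of Eq. 11 is recursive: each component is obtained from the previous one by shifting its mean by \(\theta \) times the previous standard deviation and by rescaling that standard deviation by \(\omega \). A minimal Python sketch of this reading follows; the function name and the numerical values in the usage lines are illustrative assumptions.

```python
import math


def mixture_components(mu, tau, theta, omega):
    """Means and standard deviations of the three components of Eq. 11.

    theta = (theta_1, theta_2) shifts each mean in units of the previous
    component's standard deviation; omega = (omega_1, omega_2) rescales it.
    """
    means = [mu]
    sds = [1.0 / math.sqrt(tau)]
    for th, om in zip(theta, omega):
        means.append(means[-1] + sds[-1] * th)
        sds.append(sds[-1] * om)
    return means, sds


# Made-up values: first component N(-0.5, sd 0.1), then two perturbations.
means, sds = mixture_components(-0.5, 100.0, (2.0, -1.0), (0.5, 0.5))
# means is approximately [-0.5, -0.3, -0.35] and sds [0.1, 0.05, 0.025]
```

Expanding the recursion reproduces the means and precisions written out in Eq. 11, for instance the precision \(\tau _{\nu }\omega _{\nu 1}^{-2}\omega _{\nu 2}^{-2}\) of the third component.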

Input Amplitudes. This distribution has been simulated based on the OGLE III catalogue of Classical Cepheids [15], as

$$\begin{aligned} f\left( A\mid \log \left( \nu \right) \right) ={\left\{ \begin{array}{ll} \mathsf {N}\left( -0.5\cdot \log \left( \nu \right) +0.2,0.15\right) &{} \log \left( \nu \right) <-1\\ \mathsf {N}\left( 0.7,0.15\right) &{} \log \left( \nu \right) >-1 \end{array}\right. } \end{aligned}$$
(13)

In our BGM we parameterize this variable as

$$\begin{aligned} \begin{aligned} f\left( A_{i}\mid \log (\nu _{i}),a_{A},b_{A},\mu _{A},\tau _{A}\right)&= \varvec{1}_{\left\{ \log \left( \nu _{i}\right) <-1\right\} }\mathsf {N}\left( a_{A}\cdot \log \left( \nu _{i}\right) +b_{A},\tau _{A}\right) \\&+\varvec{1}_{\left\{ \log \left( \nu _{i}\right) >-1\right\} }\mathsf {N}\left( \mu _{A},\tau _{A}\right) , \end{aligned} \end{aligned}$$
(14)

where \(\varvec{1}_{S}\) denotes the indicator function of a subset S, \(a_{A}\) and \(b_{A}\) are, respectively, the slope and the intercept of the regression line of A on \(\log \left( \nu \right) \) when \(\log \left( \nu \right) <-1\), \(\mu _{A}\) denotes the mean of the amplitude when \(\log \left( \nu \right) >-1\), and \(\tau _{A}\) denotes the precision, which we take equal in both cases.
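
The piecewise mean of Eq. 14 can be sketched in Python as follows. The function name is hypothetical, and because Eq. 14 uses strict inequalities on both sides of the knot, assigning \(\log \left( \nu \right) =-1\) to the constant branch is an assumption of this sketch.

```python
def amplitude_mean(log_nu, a_A, b_A, mu_A, knot=-1.0):
    """Mean of the input amplitude given log10 of the frequency (Eq. 14)."""
    if log_nu < knot:
        return a_A * log_nu + b_A
    # log_nu == knot is not covered by Eq. 14; it is assigned here by assumption.
    return mu_A


# Values used to simulate the data (Eq. 13): slope -0.5, intercept 0.2, mean 0.7.
low_branch = amplitude_mean(-1.5, -0.5, 0.2, 0.7)   # 0.95
high_branch = amplitude_mean(-0.5, -0.5, 0.2, 0.7)  # 0.7
```

With the simulation values of Eq. 13 the two branches meet at the knot, since \(-0.5\cdot (-1)+0.2=0.7\).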

Input Apparent G Magnitudes. Based on Eqs. 12 and 13 of [14] and discarding the distance r to the sources, we parameterize this node as

$$\begin{aligned} \begin{aligned}&f\left( m_{G,i}\mid \log \left( \nu _{i}\right) ,a_{G1},b_{G1},a_{G2},b_{G2},\tau _{G}\right) \\&=\varvec{1}_{\left\{ \log \left( \nu _{i}\right) <-1\right\} }\mathsf {N}\left( a_{G1}\cdot \log \left( \nu _{i}\right) +b_{G1},\tau _{G}\right) \\ {}&+\varvec{1}_{\left\{ \log \left( \nu _{i}\right) >-1\right\} }\mathsf {N}\left( a_{G2}\cdot \log \left( \nu _{i}\right) +b_{G2},\tau _{G}\right) . \end{aligned} \end{aligned}$$
(15)

The conditional distribution of the first level of random parameters given the parameters of the top level is then

$$\begin{aligned} \begin{aligned} p\left( \varvec{\theta }_{1}\mid \varvec{\theta }_{2}\right) =&\prod _{i=1}^{N}g_{1}\left( T_{\nu _{\mathrm {rec},i}}\mid \left\{ \varvec{\lambda }_{j}\right\} _{j=2}^{M}\right) \cdot g_{2}\left( A_{i}\mid \log (\nu _{i}),a_{A},b_{A},\mu _{A},\tau _{A}\right) \\ {}&\cdot g_{3}\left( m_{G,i}\mid \log (\nu _{i}),\mathbf {a}_{G},\mathbf {b}_{G},\tau _{G}\right) \cdot g_{4}\left( \log \left( \nu _{i}\right) \mid T_{\nu _{i}},\mu _{\nu },\varvec{\theta }_{\nu },\tau _{\nu },\varvec{\omega }_{\nu }\right) \\ {}&\cdot g_{5}\left( T_{\nu _{i}}\mid \varvec{w}_{\nu }\right) . \end{aligned} \end{aligned}$$
(16)

2.1.3 Top Level Random Parameters

These hyperparameters are

$$\begin{aligned} \varvec{\theta }_{2}=\left( a_{A},b_{A},\mu _{A},\tau _{A},\mathbf {a}_{G},\mathbf {b}_{G},\tau _{G},\mu _{\nu },\varvec{\theta }_{\nu },\tau _{\nu },\varvec{\omega }_{\nu },\varvec{w}_{\nu },\varLambda \right) . \end{aligned}$$
(17)

\(\varvec{\theta }_{2}\) includes the orphan nodes in the graph. We only have a vague (or non informative) prior knowledge about their distributions. The nodes denoted by a and b represent the slopes and intercepts of the distributions of the real amplitude and apparent G-magnitude given the frequency. The nodes denoted by \(\tau \) and \(\mu \) represent precisions and means. The nodes denoted by \(\varLambda \) represent the coefficients of the logistic regression submodel of Eq. 10. The remaining nodes are associated with the parameterization of the real frequency of Eq. 11. For these latter hyperparameters we take the non informative priors

$$\begin{aligned} p\left( \varvec{w}_{\nu }\right)&=\mathsf {Dir}\left( 1,1,1\right) \end{aligned}$$
(18)
$$\begin{aligned} p\left( \mu _{\nu }\right)&=\mathsf {N}\left( 0,0.001\right) \end{aligned}$$
(19)
$$\begin{aligned} p\left( \theta _{\nu j}\right)&=\mathsf {N}\left( 0,0.01\right) \end{aligned}$$
(20)
$$\begin{aligned} p\left( \tau _{\nu }\right)&=\mathsf {Gamma}\left( 0.001,0.001\right) \end{aligned}$$
(21)
$$\begin{aligned} p\left( \omega _{\nu j}\right)&=\mathsf {U}\left( 0,1\right) \end{aligned}$$
(22)

For the hyperparameters of the logistic regression submodel of Eq. 10, \(\varvec{\lambda }_{j}=\left( \lambda _{0j},\lambda _{1j}\right) \) with \(j\in \left\{ 2,\ldots ,M\right\} \), we assign the weakly informative priors \(p\left( \lambda _{kj}\right) =\mathsf {t}\left( 0,\frac{1}{2.5^{2}},7\right) , k\in \left\{ 0,1\right\} \). This choice provides minimal prior information to constrain the range of the coefficients \(\lambda _{kj}\) once the covariate \(\beta _{i}\) has been rescaled [6]. This approximation is used to enhance the convergence rate of our model.

For the parameters \(a_{A}, b_{A}, \mu _{A}\) of the input amplitude distribution of Eq. 14 and the parameters \(a_{G1}, b_{G1}, a_{G2}, b_{G2}\) of the input apparent G magnitude of Eq. 15 we take \(\mathsf {N}\left( 0,0.001\right) \) non informative priors. For the precisions \(\tau _{A}\) and \(\tau _{G}\) we take \(\mathsf {Gamma}\left( 0.001,0.001\right) \) priors. For all these priors the full conditional distribution of the node is available in closed form.

The distribution (hyperprior) of the top level parameters is then

$$\begin{aligned} \begin{aligned} p\left( \varvec{\theta }_{2}\right)&=h_{1}\left( a_{A}\right) \cdot h_{2}\left( b_{A}\right) \cdot h_{3}\left( \mu _{A}\right) \cdot h_{4}\left( \tau _{A}\right) \cdot h_{5}\left( \mathbf {a}_{G}\right) \cdot h_{6}\left( \mathbf {b}_{G}\right) \cdot h_{7}\left( \tau _{G}\right) \\&\cdot h_{8}\left( \varvec{w}_{\nu }\right) \cdot h_{9}\left( \mu _{\nu }\right) \cdot h_{10}\left( \varvec{\theta }_{\nu }\right) \cdot h_{11}\left( \tau _{\nu }\right) \cdot h_{12}\left( \varvec{\omega }_{\nu }\right) \cdot h_{13}\left( \varLambda \right) . \end{aligned} \end{aligned}$$
(23)

2.1.4 Joint Distribution of the Parameters and Data

From Eqs. 7, 16 and 23 we formulate the joint PDF associated with the graphical model as

$$\begin{aligned} p\left( \varvec{\theta },\mathcal {D}\right) =p\left( \mathcal {D}\mid \varvec{\theta }\right) \cdot p\left( \varvec{\theta }\right) = p\left( \mathcal {D}\mid \varvec{\theta }_{1}\right) \cdot p\left( \varvec{\theta }_{1}\mid \varvec{\theta }_{2}\right) \cdot p\left( \varvec{\theta }_{2}\right) . \end{aligned}$$
(24)

2.2 Computation

The joint posterior distribution of the \(22+5N\) parameters of the model described in Sect. 2.1 is given by

$$\begin{aligned} \pi ^{*}\left( \varvec{\theta }\right) =\pi \left( \varvec{\theta }\mid \mathcal {D}\right) \propto \mathcal {L}\left( \varvec{\theta }_{1}\right) \cdot p\left( \varvec{\theta }_{1}\mid \varvec{\theta }_{2}\right) \cdot p\left( \varvec{\theta }_{2}\right) . \end{aligned}$$
(25)

Our goal is to infer the marginal a posteriori distribution \(\pi ^{*}\left( \varvec{\theta }_{2}\right) \) of the top level hyperparameters. The marginalization to obtain samples from \(\pi ^{*}\left( \varvec{\theta }_{2}\right) \) can be accomplished by a general MCMC procedure in which, once a sample for the joint posterior has been obtained, the procedure retains only the values of \(\varvec{\theta }_{2}\) and discards the rest. The joint posterior distribution of Eq. 25 can be efficiently sampled by means of a Gibbs sampling scheme (see Sect. 4.2 of [9]). To reduce our model to the programming language level we have used the BUGS [10] probabilistic language and the OpenBUGS software environment.
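
To illustrate the Gibbs scheme and the marginalization by discarding, the following Python sketch samples a deliberately small toy model, not the BGM itself: observations \(x_{i}\sim \mathsf {N}\left( \mu ,\tau \right) \) with a \(\mathsf {N}\left( 0,0.001\right) \) prior on the mean and a \(\mathsf {Gamma}\left( 0.001,0.001\right) \) prior on the precision. It assumes the BUGS conventions (second Gaussian parameter is a precision, second Gamma parameter is a rate); the function name and the simulated data are illustrative.

```python
import random


def gibbs_normal(data, n_iter=2000, burn_in=500, seed=0):
    """Gibbs sampler for x_i ~ N(mu, tau), with tau a precision.

    Priors: mu ~ N(0, precision 0.001), tau ~ Gamma(0.001, rate 0.001).
    Returns the post burn-in draws of (mu, tau).
    """
    rng = random.Random(seed)
    n, total = len(data), sum(data)
    prior_prec, shape0, rate0 = 0.001, 0.001, 0.001
    mu, tau = 0.0, 1.0
    draws = []
    for it in range(n_iter):
        # mu | tau, data is normal: precisions add, means are precision-weighted.
        post_prec = prior_prec + n * tau
        mu = rng.gauss(tau * total / post_prec, post_prec ** -0.5)
        # tau | mu, data is gamma: shape grows by n/2, rate by half the sum of squares.
        ss = sum((x - mu) ** 2 for x in data)
        tau = rng.gammavariate(shape0 + 0.5 * n, 1.0 / (rate0 + 0.5 * ss))
        if it >= burn_in:
            draws.append((mu, tau))
    return draws


# Simulated data with mean 2 and standard deviation 0.5 (precision 4).
data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 0.5) for _ in range(400)]
draws = gibbs_normal(data)
mu_samples = [mu for mu, _ in draws]  # marginal posterior sample of mu
```

Keeping only the first coordinate of each joint draw is exactly the marginalization step described above: the retained values are a sample from the marginal posterior of that parameter.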

3 Application to the Gaia Simulated Database of Classical Cepheids

In this section we evaluate the effectiveness of our model to infer the real distributions of hyperparameters in an extreme scenario of systematic biases in the recovered data. In order to do so, we have constructed a dataset \(\mathcal {T}=\left\{ \left( A_{\mathrm {rec},i},\nu _{\mathrm {rec},i},m_{G_{\mathrm {rec}},i}\right) \right\} _{1}^{854}\varsubsetneq \mathcal {D}\) composed of 500 randomly selected instances from the locus \(\nu _{\mathrm {rec}}=\nu _{\mathrm {in}}\) and all instances (354) from the locus \(\nu _{\mathrm {rec}}=\pm \nu _{\text {in}}+7\nu _{s}-3\nu _{p}\). Figure 2 shows the systematic biases for the empirical frequency distribution (histogram) vs the true probability density function (PDF) and for the empirical conditional distributions of the recovered amplitude given the input amplitude for the three loci (the identity locus and the \(\nu _{\mathrm {rec}}=\pm \nu _{\text {in}}+7\nu _{s}-3\nu _{p}\) loci), whose observed parameters are included in the training set.

Fig. 2. Biases in the frequencies (left) and amplitudes (right) present in the training set.

We have trained the model using the OpenBUGS MCMC engine. We have divided the training into two stages and generated three Markov chains (more properly, realizations), each with a total of 30000 iterations. We have used the first 20000 iterations of each chain as a burn-in phase, and discarded them after using them for convergence assessment. In the second stage we retain the remaining 10000 samples from each chain (30000 in total). We assume that these samples were drawn from the posterior distribution of the parameters of interest.

Table 2. Summary statistics of parameters of interest.

3.1 Convergence Analysis

To evaluate the convergence within and between the three chains we have selected the first 20000 iterations of the algorithm and computed the mean autocorrelation (ACR) (after 200 lags) and the upper bound of a credible interval (at 95%) for the corrected Gelman-Rubin (GR) statistic [3]. The results of the analysis are summarized in the second and third columns of Table 2. Since the ACR function should decrease to zero as the lag increases and the upper bound for the corrected scale reduction factor (CSRF) should approach unity if the chain is reaching its stationary distribution, we conclude that the worst scenario (high autocorrelation) is encountered in the chains of the parameters specifying the second Gaussian component of \(\log \left( \nu \right) \), namely the mixing proportion \(w_{\nu 2}\), the mean \(\mu _{\nu 2}\) and the standard deviation \(\sigma _{\nu 2}\). In particular, chains for \(\sigma _{\nu 2}\) show the worst behaviour with a mean ACR after 200 lags of about 0.8 and a CSRF upper bound of 1.25. In contrast, the best scenario is found in the chains of the parameters of the conditional distributions of apparent G-magnitude and amplitude (given the frequency) when \(\log \left( \nu \right) >-1\), and in the chains of logit coefficients. For the slope \(a_{G2}\), the intercept \(b_{G2}\), the mean \(\mu _{A}\) and the logit coefficients \(\lambda _{\beta j},\lambda _{0j}, j\in \left\{ 1,2\right\} \) the mean ACR is nearly zero for lags greater than 50 and the CSRF bound is close to unity.

3.2 Posterior Distributions and Comparison with Real Parameters

In this section we evaluate the ability of our model to retrieve the real distributions of the frequency, amplitude and apparent G-magnitude of the simulated Cepheids sample from the recovered values in the training set \(\mathcal {T}\). We first compute summary statistics (means and 2.5–97.5% percentiles) for the samples of the posterior distributions of the hyperparameters inferred by the model. Then, we compare the posterior means with the parameters of the real theoretical distributions used to generate the simulated sample. Finally, we construct theoretical distributions using the posterior means and compare them with the true theoretical distributions and the empirical distribution in the set \(\mathcal {I}=\left\{ \left( A_{\mathrm {in},i},\nu _{\mathrm {in},i},m_{G,\mathrm {in},i}\right) \right\} _{1}^{854}\).
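
The summary statistics of the first step can be sketched in Python as follows. The linear-interpolation rule for the percentiles and the function names are assumptions of this sketch; other percentile conventions give slightly different bounds for small samples.

```python
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 1]) of pre-sorted values."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def summarize(samples):
    """Posterior mean and central 95% interval of one hyperparameter."""
    ordered = sorted(samples)
    return {
        "mean": statistics.fmean(ordered),
        "p2.5": percentile(ordered, 0.025),
        "p97.5": percentile(ordered, 0.975),
    }


# Toy input: the integers 0..1000 in place of posterior draws.
summary = summarize(range(1001))
```

Applied to the pooled post burn-in draws of one hyperparameter, this returns the three quantities reported per parameter: the posterior mean and the bounds of the 2.5–97.5% interval.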

Fig. 3. Posterior versus real distributions.

The results of our analysis are shown in Table 2 and Fig. 3. We do not include in the table the parameters used to generate the real frequency \(\log \left( \nu \right) \), because the different number of Gaussian components makes it difficult to establish a correspondence with the inferred parameters. However, the comparison graph at the left of Fig. 3 shows that the fit of \(\log \left( \nu \right) \) with three components (dotted line) successfully reconstructs the real PDF (solid line).

For the parameters of the conditional distribution \(A_{\text {in}}\mid \log \left( \nu _{\text {in}}\right) \) we fitted the piecewise linear model of Eq. 14. The middle rows of Table 2 and the graph at the right of Fig. 3 show that the system underestimates the true value of the mean \(\mu _{A}\) when \(\log \left( \nu _{\text {in}}\right) >-1\).

4 Summary and Conclusions

We have presented a two-level BGM to infer the real distributions of amplitude, frequency and apparent G-magnitude of the Large Magellanic Cloud population of Classical Cepheids from the values recovered by the Gaia DPAC pipeline. We have modelled the real frequency by a mixture of three Gaussian distributions and used piecewise linear models (with a fixed knot value depending on the frequency) to model the dependency of the true amplitude and G-magnitude on the true frequency. We have tackled the problem of aliasing in the DPAC frequency recovery module, which arises as a result of the Gaia scanning law. We have modelled the recovery probabilities in various loci of aliased frequencies using a logistic regression submodel based on the ecliptic latitude predictor. We have modelled the recovered frequencies and amplitudes as generated from mixtures of distributions where the mixing proportions are the recovery probabilities. Although our model has not yet completely solved the aliasing problem (we have only used some predefined configurations of aliased data, and we have restricted the application to a very narrow range of ecliptic latitudes in which the relationship between the recovery probability of aliased frequencies and the ecliptic latitude is monotone), it represents a major step forward. The next step will necessarily consist of extending the analysis to the full celestial sphere by clustering the full variety of time samplings (and corresponding window functions) into discrete bands of ecliptic longitudes and latitudes.