1 Introduction

Linear mixed models (LMMs) are used extensively for repeated measurements, which arise in longitudinal studies and clustered data. This type of model includes both fixed and random effects. The expectation maximization (EM) algorithm is commonly used to estimate the parameters of an LMM, with its steps repeated until convergence. The many serially executed iterations of the algorithm create a computational burden and require efficient memory management, especially for big data sets. Recently, more sophisticated methods and speedup strategies have increased computational speed and decreased the required memory allocations.

Statistical methods have been implemented in parallel in some current studies, albeit a limited number. Renaut (1998) applies multi-splitting methods to least squares problems and solves the decomposed matrices in parallel. In a more recent study, several parallel statistical methods are reviewed and domain decomposition is used, with projection operators defined to map vectors onto splits of the data (Guo 2012). Maclaurin and Adams (2014) speed up MCMC by defining a binary auxiliary variable with the conditional likelihood to decrease the time spent evaluating likelihoods. Wolfe et al. (2008) introduce a fully distributed EM algorithm and apply it in three separate MapReduce topologies.

There are several studies involving parallel computing for basic statistics, but analysis of complicated models with big data remains limited because memory and time constraints are major concerns. Parallelizing statistical algorithms efficiently has become crucial for faster performance given the steady increase in data size and the availability of multi-core environments. Developments in message-passing clusters bring more programming opportunities for statistical analysis.

‘Divide and recombine’ strategies with the R package plyr are introduced in Wickham (2011). With this package, a data set is broken up into smaller pieces, a result is calculated independently for each piece, and the results are combined back in main memory. A list of R packages for high-performance and parallel computing can be found at https://cran.r-project.org/web/views/HighPerformanceComputing.html.

The parallel programming used in this study adheres to the ‘divide and recombine’ strategy. In the LMM context, Tran et al. (2016) used the same strategy from a different angle: they developed a hybrid algorithm in a Bayesian framework for generalized LMMs within a static setting. Broderick et al. (2013) developed a method similar to Tran et al. (2016), but with dynamic rather than static variables. Our study is similar to Tran et al. (2016) and Neiswanger et al. (2014) in that the full data is split into smaller subsets, the algorithm steps are run separately, and the results are finally combined.

We use the R language for the implementation, as R has fast vector calculations and numerous open-source statistical modeling routines already available. Larger data sets, however, require more attention to carry out the data manipulations and statistical analyses, and parallel computing is a strong candidate to handle the increase in data size. The proposed method is implemented in R in the lmmpar package, which was designed by the authors for this type of model and can be found at https://cran.r-project.org/package=lmmpar (Gokalp Yavuz and Schloerke 2017).

In the next section, we give a brief introduction to the LMM setting. One of the EM-type algorithms, ECME, is described in the third section. We then introduce parallel programming and our proposed parallel LMM algorithm in the fourth section. Simulation studies and the conclusion are discussed in the fifth and sixth sections, respectively.

2 Linear mixed models

The LMM is defined for a continuous response variable as follows

$$\begin{aligned} \begin{aligned} \mathbf {y}_i&= \mathbf {X}_i \mathbf {{\varvec{\beta }}}+ \mathbf {Z}_i \mathbf {u}_i+ \mathbf {e}_i , i = 1,2\ldots , n,\\ \mathbf {u}_i&\sim \mathrm {N}_q\left( \mathbf {0},\mathbf {{\varvec{\psi }}}\right) ,\\ \mathbf {e}_i&\sim \mathrm {N}_{n_i } \left( \mathbf {0},\mathbf {R}_i\right) ,\\ \end{aligned} \end{aligned}$$
(1)

where \(\mathbf {y}_i\) denotes an \(n_i\) dimensional vector of continuous responses for the \(i^{th}\) subject, \(\mathbf {{\varvec{\beta }}}\) denotes a p dimensional vector of unknown population parameters, \(\mathbf {u}_i\) denotes a q dimensional vector of unknown individual effects, \(\mathbf {X}_i\) is a known \((n_i \times p)\) design matrix, \(\mathbf {Z}_i\) is a known \((n_i \times q)\) design matrix, \(\mathbf {e}_i\) denotes an \(n_i\) dimensional vector of residual errors assumed to be independent of \(\mathbf {u}_i\). \({\varvec{\psi }}\) is a \((q \times q)\) covariance matrix of random effects; while \(\mathbf {R}_i\) is a \((n_i \times n_i)\) covariance matrix of residual errors (Laird and Ware 1982). It is often assumed that \(\mathbf {R}_i= \sigma ^2 \mathbf {I}_{n_i}\) for simplicity. \(\sigma ^2\) and \(\mathbf {{\varvec{\psi }}}\) are positive definite.

The joint distribution of \(\left( \mathbf {y}_i^T,\mathbf {u}_i^T \right) ^T\) is

$$\begin{aligned} \left[ \begin{array}{c@{\quad }c}\mathbf {y}_i\\ \mathbf {u}_i\end{array} \right] \sim \mathrm {N}_{n_i+q} \left( \left[ \begin{array}{c@{\quad }c}\mathbf {X}_i\mathbf {{\varvec{\beta }}}\\ 0 \end{array} \right] , \left[ \begin{array}{c@{\quad }c} \mathbf {Z}_i\mathbf {{\varvec{\psi }}}\mathbf {Z}_i^T+\mathbf {R}_i &{} \mathbf {Z}_i\mathbf {{\varvec{\psi }}}\\ \mathbf {{\varvec{\psi }}}\mathbf {Z}_i^T &{} \mathbf {{\varvec{\psi }}} \end{array} \right] \right) . \end{aligned}$$
(2)

The marginal distribution of the response variable is obtained as

$$\begin{aligned} \begin{aligned} \mathbf {y}_i \sim N_{n_i}\left( \mathbf {X}_i\mathbf {{\varvec{\beta }}}, \sigma ^2\mathbf {W}_i^{-1}\right) , \end{aligned} \end{aligned}$$
(3)

where \(\mathbf {W}_i=\left( \mathbf {Z}_i\mathbf {D}\mathbf {Z}_i^T+\mathbf {R}_i\right) ^{-1}\) and \(\mathbf {D}=\sigma ^{-2}{\varvec{\psi }}\). The maximum-likelihood (ML) estimates of \(\mathbf {{\varvec{\beta }}}\), \(\sigma ^2\), and \(\mathbf {D}\) are found by maximizing the following log-likelihood function of (3):

$$\begin{aligned} \begin{aligned} \mathrm {L}_0\left( \mathbf {{\varvec{\beta }}}, \sigma ^2, \mathbf {D}\right)&\propto \left( \sigma ^2\right) ^{-N/2} \prod _{i=1}^n |\mathbf {W}_i|^{1/2} \\&\quad \exp \left\{ -\frac{1}{2\sigma ^2} \sum _{i=1}^n \left( \mathbf {y}_i-\mathbf {X}_i\mathbf {{\varvec{\beta }}}\right) ^T\mathbf {W}_i\left( \mathbf {y}_i-\mathbf {X}_i\mathbf {{\varvec{\beta }}}\right) \right\} , \end{aligned} \end{aligned}$$
(4)

where \(N = \sum _{i=1}^n n_i\).
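
As a concrete illustration, the following sketch builds \(\mathbf {W}_i\) for one subject and evaluates that subject's contribution to the log of (4). It is a minimal example with arbitrary dimensions and parameter values, taking \(\sigma ^2 = 1\) and \(\mathbf {R}_i = \mathbf {I}\) for simplicity; it is not part of the lmmpar implementation.

```r
# Minimal sketch: W_i and one subject's contribution to the log of (4).
# Dimensions and parameter values are arbitrary example choices.
set.seed(1)
n_i <- 6; p <- 3; q <- 2
X_i <- cbind(1, matrix(rnorm(n_i * (p - 1)), n_i))   # n_i x p fixed-effect design
Z_i <- X_i[, 1:q]                                     # n_i x q random-effect design
beta   <- c(10, 2, -1)
sigma2 <- 1
D   <- diag(c(4, 0.5))                                # D = psi / sigma^2
R_i <- sigma2 * diag(n_i)                             # R_i = sigma^2 I

# W_i = (Z_i D Z_i^T + R_i)^{-1}
W_i <- solve(Z_i %*% D %*% t(Z_i) + R_i)

# Simulate y_i from model (1)
u_i <- rnorm(q, mean = 0, sd = sqrt(sigma2 * diag(D)))
y_i <- drop(X_i %*% beta + Z_i %*% u_i + rnorm(n_i, sd = sqrt(sigma2)))

# Subject i's contribution to the log-likelihood (4), up to an additive constant
r_i <- y_i - X_i %*% beta
loglik_i <- -n_i / 2 * log(sigma2) +
  0.5 * as.numeric(determinant(W_i, logarithm = TRUE)$modulus) -
  drop(t(r_i) %*% W_i %*% r_i) / (2 * sigma2)
loglik_i
```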

Given \(\mathbf {y}=\left( \mathbf {y}_1,\ldots , \mathbf {y}_n\right) \) and \(\left( {\varvec{\beta }}, \sigma ^2, \mathbf {D}\right) \), the random effects, \(\mathbf {u}_1,\ldots ,\mathbf {u}_n\) are independently and normally distributed with the following moments

$$\begin{aligned} \mathrm {E}\left( {\varvec{u}}_i | {\varvec{y}},{\varvec{\beta }}, \sigma ^2, {\varvec{D}}\right) = {\varvec{U}}_i {\varvec{Z}}_i^T {\varvec{R}}_i^{-1}\left( {\varvec{y}}_i - {\varvec{X}}_i {\varvec{\beta }}\right) , \end{aligned}$$
(5)
$$\begin{aligned} \mathrm {V}\left( {\varvec{u}}_i | {\varvec{y}},{\varvec{\beta }}, \sigma ^2, {\varvec{D}}\right) = \sigma ^2 {\varvec{U}}_i, \end{aligned}$$
(6)

where

$$\begin{aligned} {\varvec{U}}_i = \left( {\varvec{D}}^{-1} + {\varvec{Z}}_i^T {\varvec{R}}_i^{-1} {\varvec{Z}}_i\right) ^{-1}. \end{aligned}$$
(7)

Detailed proofs and empirical Bayes point and interval estimates for random effects are found in Schafer (1998).
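
A small helper shows how (5)-(7) translate into code; the function name and argument layout are illustrative choices, and it can be applied to the example quantities generated in the sketch above.

```r
# Minimal sketch of Eqs. (5)-(7): conditional mean and variance of the random
# effects u_i given y_i. The function name and argument layout are illustrative.
ranef_moments <- function(y_i, X_i, Z_i, beta, sigma2, D, R_i) {
  R_inv <- solve(R_i)
  U_i <- solve(solve(D) + t(Z_i) %*% R_inv %*% Z_i)                  # Eq. (7)
  list(
    mean = drop(U_i %*% t(Z_i) %*% R_inv %*% (y_i - X_i %*% beta)),  # Eq. (5)
    var  = sigma2 * U_i                                              # Eq. (6)
  )
}

# Applied to the quantities from the previous sketch:
# ranef_moments(y_i, X_i, Z_i, beta, sigma2, D, R_i)
```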

3 ECME algorithm

Two methods are commonly used to maximize (4): expectation maximization (EM) and Newton-Raphson (NR). EM (Dempster et al. 1977) is simple and stable, and the likelihood is proven to increase at each iteration of the algorithm. With the EM algorithm, the computations for the parameter estimates of this log-likelihood function remain tractable. Therefore, we use an EM-type algorithm to maximize the likelihood function of the LMM.

The EM algorithm is composed of two steps. The first step finds the conditional expectation of the complete-data log-likelihood with respect to the unknown data, given the observed data and the current parameter estimates. The second step maximizes this expectation. These two steps are repeated until the parameters converge.

Using the general EM scheme, the LMM random effects are treated as missing data and their conditional expectations are evaluated with (5). During the maximization step, some of the unknown parameters are held fixed while the remaining parameters are maximized first; once their optimal values are found, the parameters that were held fixed are maximized in turn. The resulting algorithm is no longer a pure EM algorithm; it is called Expectation/Conditional Maximization Either (ECME) (Liu and Rubin 1994).

The maximum likelihood estimates of the LMM parameters are obtained by maximizing the log-likelihood function in Eq. (4) with the ECME algorithm. The update equations, similar to those defined in Schafer (1998), are given below:

$$\begin{aligned} {\varvec{U}}_i^{(k)} = \left( {\varvec{D}}^{-1(k)} + {\varvec{Z}}_i^T {\varvec{R}}_i^{-1} {\varvec{Z}}_i\right) ^{-1}, \end{aligned}$$
(8)
$$\begin{aligned} {\varvec{W}}_i^{(k)} = {\varvec{R}}_i^{-1} - {\varvec{R}}_i^{-1} {\varvec{Z}}_i {\varvec{U}}_i^{(k)} {\varvec{Z}}_i^T {\varvec{R}}_i^{-1}, \end{aligned}$$
(9)
$$\begin{aligned} {\varvec{\beta }}^{(k)} = \left( \sum _{i=1}^n {\varvec{X}}_i^T {\varvec{W}}_i^{(k)} {\varvec{X}}_i \right) ^{-1} \left( \sum _{i=1}^n {\varvec{X}}_i^T {\varvec{W}}_i^{(k)} {\varvec{y}}_i \right) , \end{aligned}$$
(10)
$$\begin{aligned} \sigma ^{2(k+1)} = \frac{1}{\mathrm {N}}\sum _{i=1}^n \left( {\varvec{y}}_i-{\varvec{X}}_i {\varvec{\beta }}^{(k)}\right) ^T{\varvec{W}}_i^{(k)}\left( {\varvec{y}}_i-{\varvec{X}}_i {\varvec{\beta }}^{(k)}\right) , \end{aligned}$$
(11)
$$\begin{aligned} {\varvec{u}}_i^{(k)} = {\varvec{U}}_i^{(k)} {\varvec{Z}}_i^T {\varvec{R}}_i^{-1} \left( {\varvec{y}}_i - {\varvec{X}}_i {\varvec{\beta }}^{(k)}\right) , \end{aligned}$$
(12)
$$\begin{aligned} {\varvec{D}}^{(k+1)} = \frac{1}{n} \sum _{i=1}^n \left( \sigma ^{-2(k)} {\varvec{u}}_i^{(k)} {\varvec{u}}_i^{(k)T} + {\varvec{U}}_i^{(k)}\right) . \end{aligned}$$
(13)

The log-likelihood evaluated at cycle k is as follows

$$\begin{aligned} \begin{aligned} \mathrm {L}\left( {\varvec{\beta }}^{(k)}, \sigma ^{2(k)}, {\varvec{D}}^{(k)}\right)&= -\frac{N}{2} \log \sigma ^{2(k)} -\frac{n}{2}\log \left| {\varvec{D}}^{(k)}\right| \\&\quad +\frac{1}{2} \sum _{i=1}^n\log \left| {\varvec{U}}_i^{(k)}\right| -\frac{N}{2} \left( \frac{\sigma ^{2(k+1)}}{\sigma ^{2(k)}} \right) . \end{aligned} \end{aligned}$$
(14)

The ECME algorithm updates the unknown parameters \({\varvec{\beta }}, \sigma ^2, {\varvec{D}}\) at each step until (14) converges. The maximum likelihood estimates of \({\varvec{\beta }}, \sigma ^2\) and \({\varvec{D}}\) are consistent if \(n \rightarrow \infty \) and the \(n_i\) remain bounded, with these estimates differing from the true parameters by terms of size \(O_{p}(n^{-1/2})\) (Liu and Rubin 1994).
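
To make the updates concrete, the sketch below performs a single serial ECME pass over a list of subjects, following Eqs. (8)-(13). It treats \({\varvec{R}}_i\) as a known identity matrix and is only an illustrative reference version, not the lmmpar implementation.

```r
# Minimal serial sketch of one ECME pass, Eqs. (8)-(13). `dat` is a list whose
# elements hold y (vector), X and Z (matrices) for one subject; R_i is taken as
# the identity (treated as known). Illustrative only, not the lmmpar code.
ecme_step <- function(dat, beta, sigma2, D) {
  n <- length(dat)
  D_inv <- solve(D)
  per_subject <- lapply(dat, function(s) {
    R_inv <- diag(length(s$y))                                   # R_i^{-1} = I
    U <- solve(D_inv + t(s$Z) %*% R_inv %*% s$Z)                 # Eq. (8)
    W <- R_inv - R_inv %*% s$Z %*% U %*% t(s$Z) %*% R_inv        # Eq. (9)
    list(U = U, W = W,
         XtWX = t(s$X) %*% W %*% s$X, XtWy = t(s$X) %*% W %*% s$y)
  })
  beta_new <- solve(Reduce(`+`, lapply(per_subject, `[[`, "XtWX")),
                    Reduce(`+`, lapply(per_subject, `[[`, "XtWy")))  # Eq. (10)
  N <- sum(vapply(dat, function(s) length(s$y), numeric(1)))
  sigma2_new <- sum(mapply(function(s, ps) {
    r <- s$y - s$X %*% beta_new
    drop(t(r) %*% ps$W %*% r)                                        # Eq. (11)
  }, dat, per_subject)) / N
  D_new <- Reduce(`+`, mapply(function(s, ps) {
    u <- ps$U %*% t(s$Z) %*% (s$y - s$X %*% beta_new)                # Eq. (12), R_i^{-1} = I
    u %*% t(u) / sigma2 + ps$U                                       # Eq. (13), current sigma^2
  }, dat, per_subject, SIMPLIFY = FALSE)) / n
  list(beta = drop(beta_new), sigma2 = sigma2_new, D = D_new)
}
```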

4 Parallel programming

With parallel programming, a computation-intensive process is divided into several parts or tasks. The complete set of tasks can be processed at a faster rate in parallel, as the workload is spread across multiple cores simultaneously. A detailed overview of parallel methods from the perspective of statistical computing can be found in Kontoghiorghes (2005).

A setting in which each core does the same statistical job on a different subset of the data is called domain decomposition, whereas parallelizing across the tasks of a statistical method is called functional decomposition. We use domain decomposition and implement the computations in parallel. Since the calculations operate on a large set of matrices, it is faster to split each matrix into smaller pieces and run the required calculations on multiple cores. The achievable speedup using domain decomposition alone is limited, as the demand on the hardware is higher (Nagel and Rickert 2001).

Different worker processes can currently be managed by the parallel package (R Core Team 2017). Other common packages such as plyr and foreach (Ooi et al. 2019b) wrap around the parallel package with integration help from the doParallel package (Ooi et al. 2019a). plyr is used for its clean interface and foreach for determining which variables need to be exported for the algorithm to work properly.

One of the newer packages designed for big matrices is bigmemory (Kane et al. 2013). With bigmemory, data is not duplicated in each process. Instead, each matrix is placed into shared memory (with big.matrix()) that local cores may access. This memory-management style requires less overall memory when using many local processes. Converting the data into a big.matrix() reduces the communication cost of the algorithm. As our algorithm only needs to read the big.matrix() data, each core may work independently on its own section of the problem without fear of concurrency issues.

More specifically, we generate one large matrix from the \({\varvec{y}}\), \({\varvec{X}}\) and \({\varvec{Z}}\) matrices using the bigmemory package in R, similar to the MapReduce programming model. In our algorithm, the data is allocated to shared memory and partitioned by subject, so that one heavy matrix is decomposed into many sub-matrices for parallel processing. This part can be seen as the Map procedure. Following the decomposition, each sub-matrix is processed independently and the partial results are combined; this can be seen as the Reduce step.

It is worth noting that the worker processes do not communicate with each other in our proposed algorithm; each sends its partial results back to the main process. The final calculations are carried out in this main process using the received compound messages.
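
The sketch below illustrates this pattern with the bigmemory and parallel packages: the stacked data goes into shared memory once, each worker attaches to it and computes cross-products for its own subjects, and the master combines the returned messages. The column layout, chunk sizes, and the simple least-squares combine at the end are illustrative assumptions, not the lmmpar code.

```r
# Minimal sketch of the shared-memory 'divide and recombine' pattern:
# [subject | y | X] is placed in a big.matrix once; workers attach and read
# only their own subjects; the master combines the returned partial results.
library(bigmemory)
library(parallel)

n <- 100; n_i <- 5; p <- 3                                # subjects, repeats, fixed effects
X <- cbind(1, matrix(rnorm(n * n_i * (p - 1)), n * n_i))
y <- rnorm(n * n_i)
subject <- rep(seq_len(n), each = n_i)

A    <- as.big.matrix(cbind(subject, y, X))               # shared-memory copy of the data
desc <- describe(A)                                       # lightweight handle for workers

C      <- 4                                               # number of cores
chunks <- splitIndices(n, C)                              # subject ids handled by each core

partial <- mclapply(chunks, function(ids) {               # the 'Map' part
  M <- attach.big.matrix(desc)                            # attach: no data is copied
  XtX <- matrix(0, p, p); Xty <- rep(0, p)
  for (i in ids) {
    rows <- which(M[, 1] == i)
    X_i  <- M[rows, 3:(2 + p), drop = FALSE]
    y_i  <- M[rows, 2]
    XtX  <- XtX + crossprod(X_i)                          # per-subject cross-products
    Xty  <- Xty + crossprod(X_i, y_i)
  }
  list(XtX = XtX, Xty = Xty)                              # compound message to the master
}, mc.cores = C)                                          # use mc.cores = 1 on Windows

# The 'Reduce' part: the master combines the partial results
XtX <- Reduce(`+`, lapply(partial, `[[`, "XtX"))
Xty <- Reduce(`+`, lapply(partial, `[[`, "Xty"))
solve(XtX, Xty)                                           # e.g. an ordinary least-squares combine
```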

Before introducing our algorithm for the LMM, we give some details about one of the concepts used to assess the performance of parallel computing in the next subsection.

4.1 Speedup for parallel computing

An increase in the number of processors does not guarantee a linear increase in speedup for parallel computing, because of waiting time, communication costs, and contention for shared memory. Also, it may not be possible to parallelize the entire code, especially for complex models; the serial part of the code imposes limitations. The speedup measure is used to evaluate the performance advantage of parallel computing.

Let \(S_p\) denote the speedup:

$$\begin{aligned} S_p = \frac{T_s}{T_p}, \end{aligned}$$
(15)

where \(T_s\) is the sequential algorithm execution time, and \(T_p\) is the parallel algorithm execution time with p processors.

In parallel computing, there are theoretical upper limits on the speedup, and Amdahl’s law is one of them. It is used to predict the speedup attainable with multiple processors. Amdahl’s law states

$$\begin{aligned} S_p = \frac{1}{f_s + \frac{f_p}{C}}, \end{aligned}$$
(16)

where \(f_s\) is the serial fraction of the code, \(f_p\) is the parallel fraction of the code, and C is the number of processors.

The maximum number of cores used in this study is 16. Given that roughly 90% of our code executes in parallel, Amdahl’s law, which relates speedup to the number of processors for a given parallel fraction of the code, implies that we should not expect more than about a 9x speedup from the parallel calculations (for Amdahl’s law, see https://en.wikipedia.org/wiki/Amdahl%27s_law).
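
A small sketch of Eqs. (15)-(16) in R; the timings and the parallel fraction below are illustrative inputs, not measured values from this study.

```r
# Minimal sketch of Eqs. (15)-(16): measured speedup and Amdahl's bound.
speedup <- function(T_s, T_p) T_s / T_p                      # Eq. (15)
amdahl  <- function(f_p, C) 1 / ((1 - f_p) + f_p / C)        # Eq. (16), with f_s = 1 - f_p

speedup(T_s = 100, T_p = 20)                                 # hypothetical timings: 5x
amdahl(f_p = 0.9, C = c(2, 4, 8, 16, Inf))                   # theoretical upper bounds for increasing C
```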

4.2 Linear mixed models with parallel programming

The ECME algorithm introduced in the previous section is implemented in the R language with a ‘divide and recombine’ approach. Equations (8), (9), and (12) are evaluated for each subject, while (10), (11), and (13) mainly combine these subject-level quantities with the observed data through averages. We aim to divide these calculations and process the corresponding segments of the algorithm in parallel.

We perform the calculations in (8), (9), and (12) on each core, \(c = 1,2,\ldots ,C\), where C is the maximum number of cores and \(C < n\). In other words, the data is divided into C parts and the E-step is executed on different cores, since the subjects are independent given the current estimates. (Repeated measurements are dependent, since they are gathered from the same subject.)

Let \(\eta \) and \({\varvec{\theta }}= \left( {\varvec{\beta }}, \sigma ^2, {\varvec{D}}\right) \) denote the sufficient statistics and the parameters, respectively. The data \({\varvec{A}}\) is divided into C splits and the sufficient statistics for the given parameters are:

$$\begin{aligned} \eta ^{(j)} = \mathrm {E}_{\theta }\left( \eta | {\varvec{A}}_j\right) , \quad j = 1, \ldots , C. \end{aligned}$$
(17)

For a single subject i, the contribution to the first sufficient statistic is

$$\begin{aligned} \eta _{1,i} = {\varvec{X}}_i^T {\varvec{W}}_i^{(current)} {\varvec{X}}_i, \end{aligned}$$
(18)

where \({\varvec{W}}_i\) is defined following Eq. (3). The partial sum of the first sufficient statistic accumulated on the first core is

$$\begin{aligned} \eta _1^{(1)} = \sum _{i=1}^{N/C}{\varvec{X}}_i^T {\varvec{W}}_i^{(current)} {\varvec{X}}_i. \end{aligned}$$
(19)

Note that for the jth process, i ranges from \([(j-1)\frac{\mathrm {N}}{C}+1]\) to \([j\frac{\mathrm {N}}{C}]\). For ease of notation, we assume that the number of processes is an exact divisor of the total number of observations; if not, the split sizes are adjusted accordingly.
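
For instance, the index ranges can be generated as in the short sketch below; splitIndices() from the parallel package handles the case where the split is not exact (the values of N and C are illustrative).

```r
# Minimal sketch of the index ranges: an exact split when C divides the number
# of subjects, and parallel::splitIndices() for a balanced split otherwise.
library(parallel)
N <- 12; C <- 4
lapply(1:C, function(j) ((j - 1) * N / C + 1):(j * N / C))   # [(j-1)N/C + 1, ..., jN/C]
splitIndices(14, C)                                          # balanced split when C does not divide N
```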

The final combination for the first sufficient statistic is

$$\begin{aligned} \eta _1 = \sum _{j=1}^{C} \eta _1^{(j)}. \end{aligned}$$
(20)

Using the notation of (20), \(\hat{{\varvec{\beta }}}\) is

$$\begin{aligned} \hat{{\varvec{\beta }}} = \left( \eta _1\right) ^{-1}\eta _2, \end{aligned}$$
(21)

where

$$\begin{aligned} \eta _2 = \sum _{j=1}^{C} \eta _2^{(j)}, \quad \mathrm {and}\quad \eta _2^{(1)} = \sum _{i=1}^{N/C}{\varvec{X}}_i^T {\varvec{W}}_i^{(current)} {\varvec{y}}_i. \end{aligned}$$


More explicitly, the parallel maximum likelihood estimator (PMLE) in a linear mixed model is

$$\begin{aligned} \tilde{{\varvec{\beta }}} = \left( \sum _{j=1}^C \sum _{i=[(j-1)\frac{\mathrm {N}}{C}+1]}^{[j\frac{\mathrm {N}}{C}]}{\varvec{X}}_{ji}^T \hat{{\varvec{W}}}_{ji} {\varvec{X}}_{ji} \right) ^{-1} \left( \sum _{j=1}^C \sum _{i=[(j-1)\frac{\mathrm {N}}{C}+1]}^{[j\frac{\mathrm {N}}{C}]}{\varvec{X}}_{ji}^T \hat{{\varvec{W}}}_{ji} {\varvec{y}}_{ji} \right) . \end{aligned}$$
(22)

Similarly, the variance estimate for the error term and the variance-covariance matrix for the random effects are defined as follows:

$$\begin{aligned} \hat{\sigma }^2 = \frac{1}{N} \sum _{j=1}^C \sum _{i=[(j-1)\frac{\mathrm {N}}{C}+1]}^{[j\frac{\mathrm {N}}{C}]} \left( {\varvec{y}}_{ji}-{\varvec{X}}_{ji} \hat{{\varvec{\beta }}}\right) ^T \hat{{\varvec{W}}}_{ji} \left( {\varvec{y}}_{ji}-{\varvec{X}}_{ji} \hat{{\varvec{\beta }}}\right) , \end{aligned}$$
(23)
$$\begin{aligned} \hat{{\varvec{D}}} = \frac{1}{n} \sum _{j=1}^C \sum _{i=[(j-1)\frac{\mathrm {N}}{C}+1]}^{[j\frac{\mathrm {N}}{C}]} \left( \hat{\sigma }^{-2} {\varvec{u}}_i {\varvec{u}}_i^T + {\varvec{U}}_i\right) . \end{aligned}$$
(24)

Note that (22), (23), and (24) involve \({\varvec{U}}_i\), \({\varvec{W}}_i\), and \({\varvec{u}}_i\), which are defined in (8), (9), and (12), respectively. \({\varvec{U}}_i\) is used to find the variance-covariance matrix of the random effects, \({\varvec{W}}_i\) is the inverse of the scaled variance-covariance matrix of the response variable as in (3), and \({\varvec{u}}_i\) contains the random-effect predictions for the \(i^{th}\) subject.
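
The recombination pattern behind (22)-(24) is the same in each case: per-core partial sums are added and the final computation is done once on the master. A minimal sketch for \(\tilde{{\varvec{\beta }}}\), using random placeholder matrices in place of the quantities the workers would return, is:

```r
# Minimal sketch of the recombination behind Eqs. (20)-(22), shown for beta.
# The partial sums below are random stand-ins for the workers' results.
set.seed(3)
p <- 3; C <- 4
eta1 <- lapply(1:C, function(j) crossprod(matrix(rnorm(10 * p), 10)))  # p x p sums of X'WX
eta2 <- lapply(1:C, function(j) rnorm(p))                              # p-vector sums of X'Wy
beta_tilde <- solve(Reduce(`+`, eta1), Reduce(`+`, eta2))              # Eq. (22)
beta_tilde
```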

Assume that \(\tilde{{\varvec{\beta }}}\) in (22) can be written as

$$\begin{aligned} \tilde{{\varvec{\beta }}} = \frac{1}{C} \sum _{j=1}^C \hat{{\varvec{\beta }}_j} \end{aligned}$$
(25)

where C is the number of cores and \(\hat{{\varvec{\beta }}_j}\) is defined analogously to Eq. (10) for each core. The statistical properties of such an estimator can be found in the study of Guo et al. (2015).

As defined previously, the ECME algorithm is used for our estimation, and the parameter estimates are obtained from (22), (23), and (24). The general structure of the ECME algorithm for all of the parameters is given in the pseudo code below:

[Pseudo code for the parallel ECME algorithm]
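
As one possible realization of the pseudo code, the sketch below runs the ECME updates with parallel::mclapply over chunks of subjects, combining the per-chunk sums on the master at every iteration. The data layout, starting values, two-pass structure, and the simple parameter-change stopping rule (in place of the likelihood check in (14)) are illustrative assumptions; the lmmpar package may organize the computation differently. \({\varvec{R}}_i\) is again taken as the identity.

```r
# Illustrative parallel ECME loop: 'divide and recombine' with mclapply.
library(parallel)

set.seed(2)
n <- 200; n_i <- 4; p <- 3; q <- 2; C <- 4
dat <- lapply(seq_len(n), function(i) {
  X <- cbind(1, matrix(rnorm(n_i * (p - 1)), n_i))
  Z <- X[, 1:q]
  u <- rnorm(q, 0, sqrt(c(4, 0.5)))
  list(y = drop(X %*% c(10, 2, -1) + Z %*% u + rnorm(n_i)), X = X, Z = Z)
})
chunks <- splitIndices(n, C)
N <- n * n_i

beta <- rep(0, p); sigma2 <- 1; D <- diag(q)               # crude starting values
for (k in 1:100) {
  D_inv <- solve(D)
  # First parallel pass: Eqs. (8)-(9) and the partial sums needed for (10)
  part1 <- mclapply(chunks, function(ids) {
    XtWX <- matrix(0, p, p); XtWy <- rep(0, p)
    for (i in ids) {
      s <- dat[[i]]
      U <- solve(D_inv + crossprod(s$Z))                   # Eq. (8), R_i^{-1} = I
      W <- diag(length(s$y)) - s$Z %*% U %*% t(s$Z)        # Eq. (9)
      XtWX <- XtWX + t(s$X) %*% W %*% s$X
      XtWy <- XtWy + t(s$X) %*% W %*% s$y
    }
    list(XtWX = XtWX, XtWy = XtWy)
  }, mc.cores = C)
  beta_new <- solve(Reduce(`+`, lapply(part1, `[[`, "XtWX")),
                    Reduce(`+`, lapply(part1, `[[`, "XtWy")))  # Eq. (10)
  # Second parallel pass: partial sums for (11)-(13) given the updated beta
  part2 <- mclapply(chunks, function(ids) {
    ss <- 0; Dsum <- matrix(0, q, q)
    for (i in ids) {
      s <- dat[[i]]
      U <- solve(D_inv + crossprod(s$Z))                   # recomputed for clarity
      W <- diag(length(s$y)) - s$Z %*% U %*% t(s$Z)
      r <- s$y - s$X %*% beta_new
      ss <- ss + drop(t(r) %*% W %*% r)                    # pieces of Eq. (11)
      u <- U %*% t(s$Z) %*% r                              # Eq. (12)
      Dsum <- Dsum + u %*% t(u) / sigma2 + U               # pieces of Eq. (13)
    }
    list(ss = ss, Dsum = Dsum)
  }, mc.cores = C)                                         # mc.cores = 1 on Windows
  sigma2_new <- sum(vapply(part2, `[[`, numeric(1), "ss")) / N
  D_new <- Reduce(`+`, lapply(part2, `[[`, "Dsum")) / n
  converged <- max(abs(beta_new - beta)) < 1e-6            # simple stopping rule
  beta <- drop(beta_new); sigma2 <- sigma2_new; D <- D_new
  if (converged) break
}
round(c(beta, sigma2 = sigma2), 3)
```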

5 Simulation study

The aim of this section is to validate the efficiency of the proposed parallel computing approach for the LMM. The LMM is defined as:

$$\begin{aligned} \begin{aligned} \mathbf {y}_i&= \mathbf {X}_i \mathbf {{\varvec{\beta }}}+ \mathbf {Z}_i \mathbf {u}_i+ \mathbf {e}_i , i = 1,2,\ldots , n,\\ \mathbf {u}_i&\sim \mathrm {N}_q\left( \mathbf {0},\mathbf {{\varvec{\psi }}}\right) ,\\ \mathbf {e}_i&\sim \mathrm {N}_{n_i } \left( \mathbf {0},\mathbf {R}_i\right) .\\ \end{aligned} \end{aligned}$$
(26)

The definitions of all terms in this model are the same as in model (1). This is a subject-specific model including a random effect \({\varvec{u}}_i\), which is predicted (estimated) for each subject. In the model, i denotes a subject and the \(i^{th}\) subject has \(n_i\) repeated measurements. Subjects are independent of one another, while a dependency exists between the repeats of a subject. For ease of computation, all vectors and matrices are stacked vertically; for example, the response variable is defined as \({\varvec{y}}= ({\varvec{y}}_1^T, {\varvec{y}}_2^T, \ldots , {\varvec{y}}_n^T)^T\), where \({\varvec{y}}_i = (y_{i1}, y_{i2}, \ldots , y_{in_i})^T\).

Alternatively, the \({\varvec{X}}\) and \({\varvec{Z}}\) matrices can be defined as arrays for the fixed and random effects, respectively. With this design, they store n rectangular matrices, each with \(n_i\) rows and p or q columns for \({\varvec{X}}\) and \({\varvec{Z}}\), respectively. We prefer the stacked matrix format, as the bigmemory package is designed for matrices and vectors.
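
A small sketch of the stacked layout (dimensions are arbitrary example choices):

```r
# Minimal sketch of the stacked layout: subject-level pieces are bound row-wise
# into single matrices/vectors, the format used with bigmemory.
n <- 3; n_i <- 2; p <- 2; q <- 1
Xs <- lapply(1:n, function(i) cbind(1, rnorm(n_i)))      # per-subject X_i (n_i x p)
Zs <- lapply(Xs, function(X) X[, 1, drop = FALSE])       # per-subject Z_i (n_i x q)
ys <- lapply(1:n, function(i) rnorm(n_i))
X <- do.call(rbind, Xs)                                  # (sum n_i) x p
Z <- do.call(rbind, Zs)                                  # (sum n_i) x q
y <- unlist(ys)                                          # stacked response vector
subject <- rep(1:n, each = n_i)                          # recovers the per-subject split
```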

Simulation studies were conducted with the R package lmmpar, which was designed for this study. When the LMM was instead run with the ‘lmm’ package, the computations became infeasible as the number of parameters increased.

Dense and sparse \({\varvec{\beta }}\) parameters are used in the simulation model to assess performance in both situations. The dense vector of \({\varvec{\beta }}\) is generated from the normal distribution with mean 10 and variance 1, and the sparse vector of \({\varvec{\beta }}\) is generated from the binomial distribution with success probability 0.1. Both dense and sparse parameters are common in applications, so the performance in each situation should be tested. We set \(\mathbf {R}_i= \sigma ^2 \mathbf {I}_{n_i}\) and \(\sigma ^2 = 1\) for simplicity. The variance-covariance matrix of the random effects varies across scenarios; for example, for the \(2 \times 2\) case it is taken as

$$\begin{aligned} {\varvec{D}}= \left[ \begin{array}{c@{\quad }c} 16 &{} 0\\ 0 &{} 0.025 \end{array} \right] . \end{aligned}$$

Three main components affect the performance of parallel computing: the size of the vectors and matrices, the data layout, and the number of cores. Therefore, the number of cores, observations, and parameters are varied to observe the change in computation speed for model (26).

We generate the data under different conditions by assigning alternative values to n and p, namely \(n = 10^5, 10^6\), and \(2 \times 10^6\) and \(p = 5\) and 51 for each scenario. Our algorithm is expected to run faster than existing algorithms as the data size increases; the data size is enlarged by increasing the number of observations and parameters.
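
For concreteness, a sketch of the data-generating setup follows (one subject, dense \({\varvec{\beta }}\) shown); the values follow the description above, but the details of the actual simulation code in lmmpar may differ.

```r
# Minimal sketch of the simulated design: dense and sparse beta, the 2 x 2
# random-effect covariance, and one generated subject. Illustrative only.
set.seed(4)
p <- 5; q <- 2; n_i <- 10; sigma2 <- 1
beta_dense  <- rnorm(p, mean = 10, sd = 1)       # dense coefficients ~ N(10, 1)
beta_sparse <- rbinom(p, size = 1, prob = 0.1)   # sparse coefficients, mostly zero
D <- diag(c(16, 0.025))                          # random-effect covariance for q = 2

X_i <- cbind(1, matrix(rnorm(n_i * (p - 1)), n_i))
Z_i <- X_i[, 1:q]
u_i <- rnorm(q, 0, sqrt(diag(D)))
y_i <- drop(X_i %*% beta_dense + Z_i %*% u_i + rnorm(n_i, sd = sqrt(sigma2)))
```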

Figure 1 depicts the number of cores versus the elapsed time for dense and sparse \({\varvec{\beta }}\) and for different values of n and p. The columns of the panels correspond to the number of variables, while the rows correspond to the type of \({\varvec{\beta }}\): it is labeled dense if \({\varvec{\beta }}\) is generated from the normal distribution and sparse if it is generated from the binomial distribution.

Fig. 1 Number of cores versus elapsed time (log2 formatting on y-scales)

The decrease in elapsed time as the number of cores increases is evident in Fig. 1. The speedup graph, which shows the performance of the executed code for the different scenarios, is reported in Fig. 2. The purple line is the reference line showing the ideal situation; sub-linear speedup is the expected region for parallel computing. According to Amdahl’s law, no more than the roughly \(9\times \) speedup we achieve is expected, even for the best-case scenario.

Fig. 2 Speedup versus number of cores

Table 1 reports the system.time results and the settings used for the dense \({\varvec{\beta }}\) scenarios, most importantly the user time, user child time, elapsed time, number of cores, number of observations (n), and number of parameters (p). The elapsed time is used to compute the speedup and to draw Fig. 1.

In the simulation set with one million observations, the job takes just under 67 s with 16 cores, while it takes over 500 s to complete without parallel computing.

Table 1 Simulation results for dense beta

The initial values of \({\varvec{\beta }}\) and the parameter estimates are reported in Table 2 for the dense parameter scenario. These values are used to see how well the model fits the data; the estimated parameter values in Table 2 are within a reasonable tolerance of the true values.

Table 2 Parameter estimations for Dense \({\varvec{\beta }}\), \(p=5\)

Parameter estimates for model (26) with small dimensions are reported in Tables 2 and 5 for dense and sparse \({\varvec{\beta }}\), respectively. For the higher-dimensional parameters, we instead report the differences between the initial values and the estimates for ease of readability. Figure 3 depicts these differences for dense \({\varvec{\beta }}\) under three different numbers of observations; the spread of the differences shrinks as the number of observations increases.

Fig. 3 Difference between initials and estimations, dense beta

Table 3 includes the variance-covariance estimates for the random effects and the variance estimate for the error term for dense \({\varvec{\beta }}\).

Table 3 Variance-covariance estimations for dense and big \({\varvec{\beta }}\), \(p=51\)

The same scenarios are repeated for sparse \({\varvec{\beta }}\) to see how the algorithm performs with sparse parameter vectors, which are highly common in big data. The simulation results are given in Table 4 and the parameter estimates are reported in Table 5. As in the dense \({\varvec{\beta }}\) scenario, the differences between the initial and estimated parameters are plotted for three different numbers of observations; these differences are shown in Fig. 4. The variance estimates are given in Table 6 for readers interested in further detail.

Table 4 Simulation results for sparse beta
Table 5 Parameter estimations for Sparse \({\varvec{\beta }}\), \(p=5\)
Fig. 4 Difference between initials and estimations, sparse beta

Table 6 Variance-covariance estimations for sparse and big \({\varvec{\beta }}\), \(p=51\)

6 Conclusion and discussion

This study contributes to the application of parallel programming for complex statistical analyses of big data. We use an EM-type algorithm for the maximum likelihood estimation of an LMM with a continuous response variable, as it is simple and stable; however, it requires a large amount of memory for big data and is highly time consuming.

Simulation results show that the lmmpar function runs in less elapsed time than the classical single-core approach in all scenarios. This provides modeling flexibility for LMMs with big data, and the approach can easily be extended to other types of statistical models.

It is crucial to find better strategies for complex statistical models, since we observed throughout this study that parallel computing does not always guarantee a speedup in processing time or savings in memory usage; finding an adequate strategy is the main aim of parallel programming for complex models. A limitation of this study is that the algorithm requires initial values, which are gathered from a basic model fit. For the simulated data, we supply the initial parameter values directly.

This method is defined for normally distributed random effects and error terms, but it can easily be extended to more robust models such as the t-LMM and Laplace-LMM. We also plan to extend the study to different LMM settings (Yavuz and Arslan 2018; Pinheiro et al. 2001) to examine the robustness of the parameters, and to utilize the Apache Hadoop environment. Additionally, in the sparse \({\varvec{\beta }}\) scenarios the estimates corresponding to zero-valued parameters are negligible, which naturally suggests adding a variable selection method for sparse parameters. We plan to use a LASSO-type variable selection method in a follow-up study.