The time-varying (longitudinal) characteristics of large information flows represent a special case of the complexity and the dynamic multi-scale nature of big biomedical data that we discussed in the DSPA Motivation section. Previously, in Chap. 4, we saw space-time (4D) functional magnetic resonance imaging (fMRI) data, and in Chap. 16 we discussed streaming data, which also has a natural temporal dimension. Now we will go deeper into managing, modeling and analyzing big longitudinal data.

In this Chapter, we will expand our predictive data analytic strategies specifically for analyzing big longitudinal data. We will interrogate datasets that track the same type of information, for the same subjects, units or locations, over a period of time. Specifically, we will present time series analysis, forecasting using autoregressive integrated moving average (ARIMA) models, structural equation models (SEM), and longitudinal data analysis via linear mixed models.

19.1 Time Series Analysis

Time series analysis relies on models like ARIMA (autoregressive integrated moving average) that utilize past longitudinal information to predict near-future outcomes. Time series data tend to track univariate, sometimes multivariate, processes over a continuous time interval. The stock market, e.g., daily closing value of the Dow Jones Industrial Average index, electroencephalography (EEG) data, and functional magnetic resonance imaging provide examples of such longitudinal datasets (time series).

The basic concepts in time series analysis include:

  • The characteristics of (second-order) stationary time series (e.g., first two moments are stable over time) do not depend on the time at which the series process is observed.

  • Differencing – a transformation applied to time-series data to make it stationary. Differences between consecutive time-observations may be computed by \( {y_t}^{\prime }={y}_t-{y}_{t-1} \). Differencing removes the level changes in the time series, eliminates trend, reduces seasonality, and stabilizes the mean of the time series. Differencing the time series repeatedly may yield a stationary time series. For example, a second order differencing:

    $$ {\displaystyle \begin{array}{l}{y}_t^{{\prime\prime} }={y_t}^{\prime }-{y_{t-1}}^{\prime}\\ {}=\left({y}_t-{y}_{t-1}\right)-\left({y}_{t-1}-{y}_{t-2}\right)\\ {}={y}_t-2{y}_{t-1}+{y}_{t-2}\end{array}}. $$
  • Seasonal differencing is computed as a difference between one observation and its corresponding observation in the previous epoch, or season (e.g., annually, there are m = 4 seasons), like in this example:

    $$ {y_t}^{{\prime\prime\prime} }={y}_t-{y}_{t-m}\kern1em \mathrm{where}\ m=\mathrm{number}\ \mathrm{of}\ \mathrm{seasons}. $$
  • The differenced data may then be used to estimate an ARMA model.
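The differencing operations listed above can be tried on a toy example. The following is a minimal base R sketch; the short quadratic series is made up for illustration and is not the PM2.5 data.

```r
# Toy series with a quadratic trend (illustrative values, not the PM2.5 data)
y <- c(1, 4, 9, 16, 25, 36, 49, 64)

# First-order differencing: y'_t = y_t - y_{t-1}
d1 <- diff(y)                       # 3 5 7 9 11 13 15

# Second-order differencing: y''_t = y_t - 2*y_{t-1} + y_{t-2}
d2 <- diff(y, differences = 2)      # 2 2 2 2 2 2

# The closed-form expansion agrees with applying diff() twice
n <- length(y)
d2_manual <- y[3:n] - 2 * y[2:(n - 1)] + y[1:(n - 2)]

# Seasonal differencing with m = 4 seasons: y_t - y_{t-m}
ds <- diff(y, lag = 4)              # 24 32 40 48
```

Here the first difference still trends upward, while the second difference is constant, which illustrates why repeated differencing may be needed to reach stationarity.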

We will use the Beijing air quality PM2.5 dataset as an example to demonstrate the analysis process. This dataset records the concentration of an air pollutant, PM2.5 particles, in micrograms per cubic meter over a period of 8 years (2008–2016). Each record is the hourly average for particles of size 2.5 microns (PM2.5), measured in Beijing, China.

Let’s first import the dataset into R.

figure a

The Value column records PM2.5 AQI (Air Quality Index) for 8 years. We observe that there are some missing data in the Value column. By looking at the QC.Name column, we only have about 6.5% (4408 observations) missing values. One way of solving data-missingness problems, where incomplete observations are recorded, is to replace the absent elements by the corresponding variable mean.

figure b

Here we first reassign the missing values into NA labels. Then we replace all NA labels with the mean computed using all non-missing observations. Note that the floor() function casts the arithmetic averages as integer numbers, which is needed as AQI values are expected to be whole numbers.
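The two imputation steps described above may be sketched as follows. This is a hedged illustration on a made-up vector; the use of a negative code to flag missing readings is an assumption of this sketch, not a statement about how the original listing detects them.

```r
# Hypothetical hourly readings; negative codes stand in for missing values
value <- c(129, 148, -999, 159, 181, -999, 140)

# Step 1: recode missing entries as NA
value[value < 0] <- NA

# Step 2: replace NA with the floor of the mean of the observed entries
fill <- floor(mean(value, na.rm = TRUE))   # mean is 151.4, so fill is 151
value[is.na(value)] <- fill
```

Mean imputation keeps the overall average roughly unchanged but shrinks the variance, so it is best viewed as a simple first pass.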

Now, let’s observe the trend of hourly average PM2.5 across 1 day. You can see a clear pattern: the PM2.5 level peaks in the afternoons and is the lowest in the early mornings. It exhibits approximate periodic boundary conditions (these patterns oscillate daily) (Fig. 19.1).

figure c
Fig. 19.1
figure 1

Time course of the mean, top-20%, and bottom-20% air quality in Beijing (PM2.5)

Are there any daily or monthly trends? We can start the data interrogation by building an ARIMA model and examining detailed patterns in the data.

19.1.1 Step 1: Plot Time Series

To begin with, we can visualize the overall trend by plotting PM2.5 values against time. This can be achieved using the plyr package.

figure d

The dataset is recorded hourly, and the 8-year time interval includes about 69,335 h of records. Therefore, we start at the first hour and end with the 69,335th hour. Each hour has a univariate PM2.5 AQI value measurement, so frequency=1.

From this time series plot, Fig. 19.2, we observe that the data has some peaks, but most of the AQI values stay under 300, the level above which air quality is considered hazardous.

Fig. 19.2
figure 2

Raw time-series plot of the Beijing air quality measures (2008–2016)

The original plot seems to have no trend at all. Remember that our measurements are hourly. Will there be any difference if we use monthly averages instead of hourly reported values? In this case, we can use the Simple Moving Average (SMA) technique to smooth the original graph.

To accomplish this, we need to install the TTR package and utilize the SMA() method (Fig. 19.3).

figure e
Fig. 19.3
figure 3

Simple moving monthly average PM2.5 air quality index values

Here we chose n to be 24 ∗ 30 = 720, and we can see some pattern. It seems that for the first 4 years (or approximately 35,040 h), the AQI fluctuates less than in the remaining years. Let’s see what happens if we use an exponentially-weighted mean, instead of the arithmetic mean.

figure f

The pattern seems less obvious in this graph, Fig. 19.4. Here we used exponential smoothing ratio of 2/(n + 1).

Fig. 19.4
figure 4

Exponentially-weighted monthly mean of PM2.5 air quality
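To make the two smoothers concrete, here is a minimal base R sketch that does not rely on the TTR package. It uses a synthetic series and a short window, and the initialization of the exponential average is an assumption of this sketch, so the early values may differ from those returned by TTR.

```r
set.seed(1)
x <- cumsum(rnorm(200))      # synthetic series standing in for the hourly AQI
n <- 24                      # window length (the text uses n = 720)

# Simple moving average: equal weights 1/n over the trailing n observations
sma <- stats::filter(x, rep(1 / n, n), sides = 1)

# Exponentially-weighted moving average with smoothing ratio 2 / (n + 1)
alpha <- 2 / (n + 1)
ema <- numeric(length(x))
ema[1] <- x[1]               # simple initialization (an assumption of this sketch)
for (t in 2:length(x)) {
  ema[t] <- alpha * x[t] + (1 - alpha) * ema[t - 1]
}
```

The simple average weights all n observations in the window equally, whereas the exponential average gives the most recent observation weight 2/(n + 1) and discounts older values geometrically.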

19.1.2 Step 2: Find Proper Parameter Values for ARIMA Model

ARIMA models have 2 components: an autoregressive (AR) part and a moving average (MA) part. An ARIMA(p, d, q) model has p terms in the AR part, q terms in the MA part, and d representing the order of differencing. Differencing is used to make the original dataset approximately stationary. In terms of the lag operator L (where \( L{X}_t={X}_{t-1} \)), ARIMA(p, d, q) has the following analytical form:

$$ \left(1-\sum \limits_{i=1}^p{\phi}_i{L}^i\right){\left(1-L\right)}^d{X}_t=\left(1+\sum \limits_{i=1}^q{\theta}_i{L}^i\right){\epsilon}_t. $$
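As a worked illustration of this operator form, consider the special case p = d = q = 1 (chosen here only as an example). Expanding the product of the lag polynomials, with \( L^k X_t = X_{t-k} \), gives an explicit recursion:

$$ {\displaystyle \begin{array}{l}\left(1-{\phi}_1L\right)\left(1-L\right){X}_t=\left(1+{\theta}_1L\right){\epsilon}_t\\ {}\Rightarrow {X}_t-\left(1+{\phi}_1\right){X}_{t-1}+{\phi}_1{X}_{t-2}={\epsilon}_t+{\theta}_1{\epsilon}_{t-1}\\ {}\Rightarrow {X}_t=\left(1+{\phi}_1\right){X}_{t-1}-{\phi}_1{X}_{t-2}+{\epsilon}_t+{\theta}_1{\epsilon}_{t-1}.\end{array}} $$

In other words, the current value depends on the two previous values and on the current and previous noise terms.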

19.1.3 Check the Differencing Parameter

First, let’s try to determine the parameter d. To make the data stationary on the mean (remove any trend), we can use first differencing or second order differencing. Mathematically, first differencing is taking the difference between two adjacent data points:

$$ {y_t}^{\prime }={y}_t-{y}_{t-1}. $$

While second order differencing is differencing the data twice:

$$ {y}_t^{{\prime\prime} }={y_t}^{\prime }-{y_{t-1}}^{\prime }={y}_t-2{y}_{t-1}+{y}_{t-2}. $$

Let’s see which differencing method is proper for the Beijing PM2.5 dataset. The function diff() in base R can be used to calculate the differences. We can plot the differences using plot.ts() (Fig. 19.5).

figure g
Fig. 19.5
figure 5

First- and second-order differencing of the AQI data

Neither of them appears quite stationary. In this case, we can consider using some smoothing techniques on the data like we just did above (bj.month <- SMA(ts, n = 720)). Let’s see if smoothing by exponentially-weighted mean (EMA) can help make the data approximately stationary (Fig. 19.6).

figure h
Fig. 19.6
figure 6

Monthly-smoothed first- and second-order differencing of the AQI data

Both of these EMA-filtered graphs have tempered variance and appear pretty stationary with respect to the first two moments, mean and variance.

19.1.4 Identifying the AR and MA Parameters

To decide the auto-regressive (AR) and moving average (MA) parameters in the model, we need to create autocorrelation function (ACF) and partial autocorrelation function (PACF) plots. The PACF may suggest a value for the AR-term parameter p, and the ACF may help us determine the MA-term parameter q. We plot the ACF and PACF using the approximately stationary time series, bj.diff object (Fig. 19.7).

figure i
Fig. 19.7
figure 7

Autocorrelation function (ACF) and partial autocorrelation function (PACF) plots of bj.diff

  • Pure AR model, (q = 0), will have a cut off at lag p in the PACF.

  • Pure MA model, (p = 0), will have a cut off at lag q in the ACF.

  • ARMA(p, q) will (eventually) have a decay in both.

In the ACF plot, all spikes fall outside the (normal) insignificance band, while only two spikes are significant in the PACF plot. In this case, the best ARIMA model is likely to have both AR and MA parts.
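The cut-off rules above can be checked on simulated data where the true order is known. The sketch below uses a synthetic AR(1) series with an illustrative coefficient of 0.7, not the bj.diff object.

```r
set.seed(2023)
# Simulate a stationary AR(1) process with coefficient 0.7 (illustrative choice)
x <- arima.sim(model = list(ar = 0.7), n = 2000)

# Sample ACF and PACF, without drawing the plots
a <- acf(x, lag.max = 10, plot = FALSE)
p <- pacf(x, lag.max = 10, plot = FALSE)

# Approximate 95% significance band used in the default plots
band <- qnorm(0.975) / sqrt(length(x))

acf_vals  <- as.numeric(a$acf)[-1]   # drop lag 0, which is always 1
pacf_vals <- as.numeric(p$acf)       # pacf starts at lag 1

# Lags whose partial autocorrelation exceeds the band
which(abs(pacf_vals) > band)
```

For a pure AR(1) process we expect the ACF to decay gradually while the PACF has one dominant spike at lag 1, which is the pattern that suggests p = 1 and q = 0.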

We can examine seasonal effects in the data using stats::stl(), a flexible function for decomposing the series, which uses averaging to calculate the seasonal component of the series and then subtracts the seasonality. Decomposing the series and removing the seasonality can be done by subtracting the seasonal component from the original series using forecast::seasadj(). The frequency parameter in the ts() object specifies the periodicity of the data or the number of observations per period, e.g., 30, for monthly smoothed daily data (Fig. 19.8).

figure j
Fig. 19.8
figure 8

Trend and seasonal decomposition of the time-series
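The decomposition described above can be sketched with base R alone. In this hedged example the series is synthetic, and the seasonal adjustment is done by explicit subtraction rather than by calling forecast::seasadj().

```r
set.seed(7)
# Synthetic series: linear trend + period-12 seasonal cycle + noise
tt <- 1:120
x  <- ts(0.05 * tt + 2 * sin(2 * pi * tt / 12) + rnorm(120, sd = 0.3),
         frequency = 12)

# Seasonal-trend decomposition by loess with a fixed periodic seasonal pattern
fit <- stl(x, s.window = "periodic")
seasonal <- fit$time.series[, "seasonal"]

# Seasonally adjusted series: original minus the seasonal component
x_adj <- x - seasonal
```

The frequency = 12 argument tells stl() how many observations make up one seasonal cycle; a series with frequency = 1 cannot be decomposed this way.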

The augmented Dickey-Fuller (ADF) test, tseries::adf.test can be used to examine the timeseries stationarity. The null hypothesis is that the series is non-stationary. The ADF test quantifies if the change in the series can be explained by a lagged value and a linear trend. Non-stationary series can be corrected by differencing to remove trends or cycles.

figure k

We see that we can reject the null hypothesis; therefore, there is no statistically significant evidence of non-stationarity in the bj.diff time series.

19.1.5 Step 3: Build an ARIMA Model

As we have some evidence suggesting d = 1, the auto.arima() function in the forecast package can help us find the optimal estimates for the remaining pair of ARIMA model parameters, p and q.

figure l

Finally, the optimal model determined by the step-wise selection is ARIMA(1, 1, 4). The residual plot is shown in Fig. 19.9.

Fig. 19.9
figure 9

ACF of the time-series residuals

We can also use external information to fit ARIMA models. For example, if we want to add the month information, in case we suspect a seasonal change in PM2.5 AQI, we can use the following script.

figure m

We want the model AIC and BIC to be as small as possible. In terms of AIC and BIC, this model is not drastically different from the previous model without the Month predictor. Also, the coefficient of Month is very small and not significant (according to the t-test) and thus can be removed.

We can examine further the ACF and the PACF plots and the residuals to determine the model quality. When the model order parameters and structure are correctly specified, we expect no significant autocorrelations present in the model residual plots.
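The following base R sketch illustrates this kind of model checking on synthetic data. It uses stats::arima() rather than the forecast package, and the orders and coefficients are illustrative assumptions; a Ljung-Box test is used here as one numerical complement to inspecting residual ACF plots.

```r
set.seed(11)
# Synthetic ARIMA(1,1,1) series; the orders and coefficients are illustrative
x <- arima.sim(model = list(order = c(1, 1, 1), ar = 0.6, ma = 0.3), n = 500)

# Fit the matching model and a deliberately under-specified one
fit_good <- arima(x, order = c(1, 1, 1))
fit_poor <- arima(x, order = c(0, 1, 0))

# Smaller AIC/BIC indicates a better trade-off between fit and complexity
aic <- c(good = AIC(fit_good), poor = AIC(fit_poor))
bic <- c(good = BIC(fit_good), poor = BIC(fit_poor))

# Ljung-Box test on the residuals: a small p-value signals
# autocorrelation that the model has not captured
lb_good <- Box.test(residuals(fit_good), lag = 20, type = "Ljung-Box", fitdf = 2)
lb_poor <- Box.test(residuals(fit_poor), lag = 20, type = "Ljung-Box", fitdf = 0)
```

The under-specified model should show both a larger AIC and strongly autocorrelated residuals, which is the signature of a misspecified order.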

figure n

There is a clear pattern present in the ACF/PACF plots, Fig. 19.10, suggesting that the model residuals repeat with an approximate lag of 12 or 24 months. We may try a modified model with different parameters, e.g., p = 24 or q = 24. We can define a new displayForecastErrors() function to show a histogram of the forecast errors (Figs. 19.11 and 19.12).

Fig. 19.10
figure 10

ARIMA(1,1,4) model plot, ACF and PACF plots of the residuals for bj.month

figure o
Fig. 19.11
figure 11

An improved ARIMA(1,1,24) model plot, ACF and PACF plots of the residuals for bj.month

figure p
figure q
Fig. 19.12
figure 12

Diagnostic plot of the residuals of the ARIMA(1,1,24) time-series model for bj.month

19.1.6 Step 4: Forecasting with ARIMA Model

Now, we can use our models to make predictions for future PM2.5 AQI. We will use the function forecast() to make predictions. In this function, we have to specify the number of periods we want to forecast. Using the smoothed data, we can make predictions for the next month, July 2016. As each month has about 24 × 30 = 720 h, we specify a horizon h = 720 (Fig. 19.13).

figure r
Fig. 19.13
figure 13

Prospective out-of-range prediction intervals of the ARIMA(1,1,4) time-series model

When plotting the forecasted values with the original smoothed data, we include only the last 3 months of the original smoothed data to see the predicted values more clearly. The shaded regions indicate ranges of expected errors. The darker (inner) region represents the 80% confidence range and the lighter (outer) region bounds the 95% interval. Naturally, near-term forecasts have tighter ranges of expected errors than longer-term forecasts, where the variability expands. A live demo of US Census data is shown in Fig. 19.14.

Fig. 19.14
figure 14

Live Demo: Interactive US Census ARIMA modeling
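The forecasting step in this section can also be sketched in base R. This hedged example substitutes stats::predict() for the forecast() function and uses a synthetic series with a short horizon; the 80% and 95% bands are built from normal quantiles, which is an assumption of the sketch.

```r
set.seed(5)
# Synthetic ARIMA(1,1,1) series (illustrative orders and coefficients)
x   <- arima.sim(model = list(order = c(1, 1, 1), ar = 0.5, ma = 0.4), n = 300)
fit <- arima(x, order = c(1, 1, 1))

h  <- 24                                   # forecast horizon (the text uses h = 720)
fc <- predict(fit, n.ahead = h)            # point forecasts and standard errors

# Normal-theory prediction intervals at the 80% and 95% levels
z80 <- qnorm(0.90)
z95 <- qnorm(0.975)
lower80 <- fc$pred - z80 * fc$se
upper80 <- fc$pred + z80 * fc$se
lower95 <- fc$pred - z95 * fc$se
upper95 <- fc$pred + z95 * fc$se

# Interval width as a function of the horizon
width95 <- as.numeric(upper95 - lower95)
```

Because the model includes one order of differencing, the forecast standard error grows with the horizon, so width95 increases from the first to the last forecast step.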

19.2 Structural Equation Modeling (SEM)-Latent Variables

Timeseries analyses provide effective strategies to interrogate longitudinal univariate data. What happens if we have multiple, potentially associated, measurements recorded at each time point?

SEM is a general multivariate statistical analysis technique that can be used for causal modeling/inference, path analysis, confirmatory factor analysis (CFA), covariance structure modeling, and correlation structure modeling. This method allows separation of observed and latent variables. Other standard statistical procedures may be viewed as special cases of SEM, where statistical significance may be less important, and covariances are the core of structural equation models.

Latent variables are features that are not directly observed but may be inferred from the actually observed variables. In other words, a combination or transformation of observed variables can create latent features, which may help us reduce the dimensionality of data. Also, SEM can address multi-collinearity issues when we fit models because we can combine some highly collinear variables to create a single (latent) variable, which can then be included into the model.

19.2.1 Foundations of SEM

SEMs consist of two complementary components: (1) a path model, quantifying specific cause-and-effect relationships between observed variables, and (2) a measurement model, quantifying latent linkages between unobservable components and observed variables. The LISREL (LInear Structural RELations) framework represents a unifying mathematical strategy to specify these linkages, see Grace 2006.

The most general kind of SEM is a structural regression path model with latent variables, which account for measurement errors of observed variables. Model identification determines whether the model allows for unique parameter estimates and may be based on model degrees of freedom (dfM ≥ 0) or a known scale for every latent feature. If ν represents the number of observed variables, then the total degrees of freedom for a SEM, \( \frac{\nu \left(1+\nu \right)}{2} \), corresponds to the number of variances and unique covariances in a variance-covariance matrix for all the features, and the model degrees of freedom, \( d{f}_M=\frac{\nu \left(1+\nu \right)}{2}-l \), where l is the number of estimated parameters.

Examples include:

  • Just-identified model (dfM = 0) with unique parameter estimates,

  • Over-identified model (dfM > 0) desirable for model testing and assessment,

  • Under-identified model (dfM < 0) does not guarantee unique solutions for all parameters. In practice, such models occur when the effective degrees of freedom are reduced due to two or more highly-correlated features, which presents problems with parameter estimation. In these situations, we can exclude or combine some of the features, boosting the degrees of freedom.
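The degrees-of-freedom formula and the three identification cases above can be expressed as a small helper. This is an illustrative sketch; the function names sem_df() and classify() and the example counts are hypothetical.

```r
# Model degrees of freedom: df_M = v(v + 1)/2 - l, for v observed variables
# and l estimated parameters
sem_df <- function(v, l) v * (v + 1) / 2 - l

# Map the sign of df_M to the identification status
classify <- function(df) {
  if (df > 0) {
    "over-identified"
  } else if (df == 0) {
    "just-identified"
  } else {
    "under-identified"
  }
}

# Hypothetical example: 6 observed variables give 21 variances and covariances
sem_df(6, 13)              # 8
classify(sem_df(6, 13))    # "over-identified"
classify(sem_df(6, 21))    # "just-identified"
classify(sem_df(6, 25))    # "under-identified"
```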

The latent variables’ scale property reflects their unobservable, not measurable, characteristics. The latent scale, or unit, may be inferred from one of its observed constituent variables, e.g., by imposing a unit loading identification constraint fixing at 1.0 the factor loading of one observed variable.

An SEM model with appropriate scale and degrees of freedom conditions may be identifiable subject to Bollen’s two-step identification rule. When both the CFA and the path components of the SEM model are identifiable, then the whole structural regression (SR) model is identified, and model fitting can be initiated.

  • For the confirmatory factor analysis (CFA) part of the SEM, identification requires (1) a minimum of two observed variables for each latent feature, (2) independence between measurement errors and the latent variables, and (3) independence between measurement errors.

  • For the path component of the SEM, ignoring any observed variables used to measure latent variables, model identification requires: (1) errors associated with endogenous latent variables to be uncorrelated, and (2) all causal effects to be unidirectional.

The LISREL representation can be summarized by the following matrix equations:

$$ \mathrm{measurement}\ \mathrm{model}\ \mathrm{component}\ \left\{\begin{array}{l}x={\varLambda}_x\xi +\delta, \\ {}y={\varLambda}_y\eta +\epsilon .\end{array}\right. $$

And

$$ \mathrm{path}\ \mathrm{model}\ \mathrm{component}\kern0.5em \eta = B\eta +\varGamma \xi +\zeta, $$

where:

  • \( {x}_{p\times 1} \) is a vector of observed exogenous variables representing a linear function of \( {\xi}_{j\times 1} \), a vector of exogenous latent variables,

  • \( {\delta}_{p\times 1} \) is a vector of measurement errors, and \( {\varLambda}_x \) is a p × j matrix of factor loadings relating x to ξ,

  • \( {y}_{q\times 1} \) is a vector of observed endogenous variables,

  • \( {\eta}_{k\times 1} \) is a vector of endogenous latent variables,

  • \( {\epsilon}_{q\times 1} \) is a vector of measurement errors for the endogenous variables, and

  • \( {\varLambda}_y \) is a q × k matrix of factor loadings relating y to η.

Let’s also denote by \( {\varTheta}_{\delta } \) (p × p) and \( {\varTheta}_{\epsilon } \) (q × q) the variance-covariance matrices of the measurement errors δ and ϵ, respectively. The third equation, describing the LISREL path model component as relationships among latent variables, includes:

  • \( {B}_{k\times k} \), a matrix of path coefficients describing the relationships among endogenous latent variables,

  • \( {\varGamma}_{k\times j} \), a matrix of path coefficients representing the linear effects of exogenous variables on endogenous variables,

  • \( {\zeta}_{k\times 1} \), a vector of errors of endogenous variables, and

  • the corresponding two variance-covariance matrices, \( {\varPhi}_{j\times j} \) of the latent exogenous variables and \( {\varPsi}_{k\times k} \) of the errors of endogenous variables.

The basic statistic for a typical SEM implementation is based on covariance structure modeling and model fitting relies on optimizing an objective function, min{f(Σ, S)}, representing the difference between the model-implied variance-covariance matrix, Σ, predicted from the causal and non-causal associations specified in the model, and the corresponding observed variance-covariance matrix S, which is estimated from observed data. The objective function, f(Σ, S) can be estimated as shown below, see Shipley 2016.

In general, causation implies correlation, suggesting that if there is a causal relationship between two variables, there must also be a systematic relationship between them. Specifying a set of theoretical causal paths, we can reconstruct the model-implied variance-covariance matrix, Σ, from total effects and unanalyzed associations. The LISREL strategy specifies the following mathematical representation:

$$ \varSigma =\left[\begin{array}{cc}{\varLambda}_yA\left(\varGamma \varPhi {\varGamma}^{\prime }+\varPsi \right){A}^{\prime }{\varLambda}_y^{\prime }+{\varTheta}_{\epsilon }& {\varLambda}_yA\varGamma \varPhi {\varLambda}_x^{\prime}\\ {}{\varLambda}_x\varPhi {\varGamma}^{\prime }{A}^{\prime }{\varLambda}_y^{\prime }& {\varLambda}_x\varPhi {\varLambda}_x^{\prime }+{\varTheta}_{\delta}\end{array}\right], $$

where \( A={\left(I-B\right)}^{-1} \). This representation of Σ does not involve the observed and latent exogenous and endogenous variables, x, y, ξ, η. Maximum likelihood estimation (MLE) may be used to obtain the Σ parameters via iterative searches for a set of optimal parameters minimizing the element-wise deviations between Σ and S.

The process of optimizing the objective function f(Σ, S) can be achieved by computing the log likelihood ratio, i.e., comparing the likelihood of a given fitted model to the likelihood of a perfectly fit model. MLE estimation requires multivariate normal distribution for the endogenous variables and Wishart distribution for the observed variance-covariance matrix, S.

Using MLE estimation simplifies the objective function to:

$$ f\left(\varSigma, S\right)=\ln \mid \varSigma \mid + tr\left(S\times {\varSigma}^{-1}\right)-\ln \mid S\mid - tr\left(S{S}^{-1}\right), $$

where tr() is the trace of a matrix. The optimization of f(Σ, S) also requires independent and identically distributed observations and positive definite matrices, Σ, S. The iterative MLE optimization generates estimated variance-covariance matrices and path coefficients for the specified model. More details on model assessment (using Root Mean Square Error of Approximation, RMSEA, and Goodness of Fit Index) and the process of defining a priori SEM hypotheses are available in Lam & Maguire, 2012.
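The ML objective function above is straightforward to evaluate numerically. The sketch below is a direct base R transcription of the formula for a toy 2 × 2 covariance matrix; the function name f_ml() and the matrices are illustrative, and no optimization over model parameters is attempted.

```r
# f(Sigma, S) = ln|Sigma| + tr(S Sigma^{-1}) - ln|S| - tr(S S^{-1})
f_ml <- function(Sigma, S) {
  p <- nrow(S)                          # tr(S S^{-1}) equals the number of variables
  log(det(Sigma)) + sum(diag(S %*% solve(Sigma))) - log(det(S)) - p
}

# Toy observed covariance matrix
S <- matrix(c(2.0, 0.8,
              0.8, 1.5), nrow = 2, byrow = TRUE)

# A model-implied matrix that ignores the covariance between the variables
Sigma_bad <- diag(diag(S))

f_ml(S, S)          # 0 (up to rounding) when the model reproduces S exactly
f_ml(Sigma_bad, S)  # positive when Sigma deviates from S
```

An SEM optimizer searches over the free parameters that determine Σ to drive this discrepancy toward its minimum.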

19.2.2 SEM Components

The R Lavaan package uses the SEM syntax in Table 19.1 to represent relationships between variables. We can follow this table to specify Lavaan models:

Table 19.1 Lavaan syntax for specifying the relations between variables and their variance-covariance structure

For example, in R we can write the following model:

    model <- '
      # regressions
      y1 + y2 ~ f1 + f2 + x1 + x2
      f1 ~ f2 + f3
      f2 ~ f3 + x1 + x2

      # latent variable definitions
      f1 =~ y1 + y2 + y3
      f2 =~ y4 + y5 + y6
      f3 =~ y7 + y8 + y9 + y10

      # variances and covariances
      y1 ~~ y1
      y1 ~~ y2
      f1 ~~ f2

      # intercepts
      y1 ~ 1
      f1 ~ 1
    '

Note that the two single-quote symbols (') at the beginning and end of a model description are very important in the R syntax, as they delimit the model string.

19.2.3 Case Study – Parkinson’s Disease (PD)

Let’s use the PPMI dataset in our class file as an example to illustrate SEM model fitting.

19.2.3.1 Step 1 – Collecting Data

The Parkinson’s Disease Data represents a realistic simulation case-study to examine associations between clinical, demographic, imaging and genetics variables for Parkinson’s disease. This is an example of Big Data for investigating important neurodegenerative disorders.

19.2.3.2 Step 2 – Exploring and Preparing the Data

Now, we can import the dataset into R and recode the ResearchGroup variable into a binary variable.

figure s

This large dataset has 1,746 observations and 31 variables with missing data in some of them. Many of the variables are highly correlated. You can inspect high correlations using heat maps, which reorder the covariates according to their correlations to reveal clusters of highly correlated variables (Fig. 19.15).

figure t
figure u
Fig. 19.15
figure 15

Pair-wise correlation structure of the Parkinson’s disease (PPMI) data.

And here are some specific correlations

figure v

One way to solve this substantial multivariate correlation issue is to create some latent variables. We can consider the following model.

figure w

Here we try to create three latent variables: Imaging, DemoGeno, and UPDRS. Let’s fit a SEM model using cfa(), a confirmatory factor analysis function. Before fitting the data, we need to scale them. However, we don’t need to scale our binary response variable. We can use the following code for normalizing the data.

figure x

19.2.3.3 Step 3 – Fitting a Model on the Data

Now, we can start to build the model. The cfa() function we will use is part of the lavaan package.

figure y

Here we can see some warning messages. Both our covariance and error term matrices are not positive definite. Non-positive definite matrices can cause the estimates of our model to be biased. There are many factors that can lead to this problem. In this case, we might create some latent variables that are not a good fit for our data. Let’s try to delete the DemoGeno latent variable. We can add Weight, Sex, and Age directly to the regression model.

figure z

When fitting model2, the warning messages are gone. We can see that falsely adding a latent variable can cause those matrices to be not positive definite. Currently, the lavaan functions sem() and cfa() are the same.

figure aa
figure ab
figure ac

19.2.4 Outputs of Lavaan SEM

In the output of our model, we have information about how to create these two latent variables (Imaging, UPDRS) and the estimated regression model. Specifically, it gives the following information.

  1. The first six lines, called the header, contain the following information:

    • Lavaan version number.

    • Lavaan convergence information (normal or not), and the number of iterations needed.

    • The number of observations that were effectively used in the analysis.

    • The estimator that was used to obtain the parameter values (here: ML).

    • The model test statistic, the degrees of freedom, and a corresponding p-value.

  2. Next, we have the "Model test baseline model" section and the value for the SRMR.

  3. The last section contains the parameter estimates and standard errors (indicating whether the information matrix is expected or observed, and whether the standard errors are standard, robust, or based on the bootstrap). Then, it tabulates all free (and fixed) parameters that were included in the model. Typically, first the latent variables are shown, followed by covariances and (residual) variances. The first column (Estimate) contains the (estimated or fixed) parameter value for each model parameter; the second column (Std.err) contains the standard error for each estimated parameter; the third column (Z-value) contains the Wald statistic (which is simply obtained by dividing the parameter value by its standard error); and the last column contains the p-value for testing the null hypothesis that the parameter equals zero in the population.

19.3 Longitudinal Data Analysis-Linear Mixed Models

As mentioned earlier, longitudinal studies take measurements for the same individual repeatedly through a period of time. Under this setting, we can measure the change after a specific treatment. However, the measurements for the same individual may be correlated with each other. Thus, we need special models that deal with this type of internal multivariate dependencies.

If we use the latent variable UPDRS (created in the output of SEM model) rather than the research group as our response we can obtain a longitudinal analysis model. In longitudinal analysis, time is often an important model variable.

19.3.1 Mean Trend

According to the output of model fit, our latent variable UPDRS is a combination of three observed variables-UPDRS_part_I, UPDRS_part_II, and UPDRS_part_III. We can visualize how average UPDRS values differ among the research groups over time.

figure ad

The above code stores the latent UPDRS and Imaging variables into mydata. By now, we are experienced with using the package ggplot2 for data visualization. Now, we will use it to set the x and y axes as time and UPDRS, and then display the trend of the individual level UPDRS.

figure ae

This graph is a bit messy without a clear pattern emerging. Let’s see if group-level graphs may provide more intuition. We will use the aggregate() function to get the mean, minimum and maximum of UPDRS for each time point. Then, we will use separate color for the two research groups and examine their mean trends (Fig. 19.16).

figure af
Fig. 19.16
figure 16

Average UPDRS scores of the two cohorts in the PPMI dataset, patients (1) and controls (0)

Despite slight overlaps in some lines, the resulting graph better illustrates the mean differences between the two cohorts. The control group (1) appears to have relatively lower means and tighter ranges compared to the PD patient group (0). However, we need further data interrogation to determine if this visual (EDA) evidence translates into statistically significant group differences.

Generally speaking, we can always use the General Linear Modeling (GLM) framework. However, GLM may ignore the individual differences. So, we can try to fit a Linear Mixed Model (LMM) to incorporate different intercepts for each individual participant. Consider the following GLM:

$$ UPDR{S}_{ij}\sim {\beta}_0+{\beta}_1\ast {Imaging}_{ij}+{\beta}_2\ast {ResearchGroup}_i+{\beta}_3\ast {timeVisit}_j+{\beta}_4\ast {ResearchGroup}_i\ast {timeVisit}_j+{\beta}_5\ast {Age}_i+{\beta}_6\ast {Sex}_i+{\beta}_7\ast {Weight}_i+{\epsilon}_{ij}. $$

If we fit a different intercept, bi, for each individual (indicated by FID_IID), we obtain the following LMM model:

$$ UPDR{S}_{ij}\sim {\beta}_0+{\beta}_1\ast {Imaging}_{ij}+{\beta}_2\ast {ResearchGroup}_i+{\beta}_3\ast {timeVisit}_j+{\beta}_4\ast {ResearchGroup}_i\ast {timeVisit}_j+{\beta}_5\ast {Age}_i+{\beta}_6\ast {Sex}_i+{\beta}_7\ast {Weight}_i+{b}_i+{\epsilon}_{ij}. $$

The LMM actually has two levels:

Stage 1

$$ {Y}_i={Z}_i{\beta}_i+{\epsilon}_i, $$

where both Zi and βi are matrices.

Stage 2

The second level allows fitting random effects in the model.

$$ {\beta}_i={A}_i\ast \beta +{b}_i. $$

So, the full model in matrix form would be:

$$ {Y}_i={X}_i\ast \beta +{Z}_i\ast {b}_i+{\epsilon}_i. $$
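The step from the two stages to the full model can be made explicit by substituting the Stage 2 equation into Stage 1. This short derivation assumes that the fixed-effects design matrix is defined as \( X_i = Z_i A_i \), an identification the text leaves implicit:

$$ {\displaystyle \begin{array}{l}{Y}_i={Z}_i{\beta}_i+{\epsilon}_i\\ {}={Z}_i\left({A}_i\beta +{b}_i\right)+{\epsilon}_i\\ {}=\left({Z}_i{A}_i\right)\beta +{Z}_i{b}_i+{\epsilon}_i\\ {}={X}_i\beta +{Z}_i{b}_i+{\epsilon}_i.\end{array}} $$

The term \( X_i\beta \) carries the population-level (fixed) effects, while \( Z_i b_i \) carries the subject-specific (random) deviations.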

In this case study, we only consider a random intercept and avoid including random slopes; however, the model can indeed be extended. In other words, Zi = 1 in our simple model. Let’s compare the two models (GLM and LMM). One R package implementing LMM is lme4.

figure ag
figure ah
figure ai

Note that we use the notation ResearchGroup*time_visit, which is identical to ResearchGroup + time_visit + ResearchGroup:time_visit. Here R will include both terms and their interaction in the model. According to the model outputs, the LMM model has a smaller AIC. In terms of AIC, LMM may represent a better model fit than GLM.
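The expansion of the * operator in R formulas can be verified without fitting any model. The sketch below uses base R only; the formula mirrors the variable names in the text, and no data are required because terms() works on the formula itself.

```r
# Compact notation using the crossing operator *
f_star <- UPDRS ~ Imaging + ResearchGroup * time_visit

# Explicit notation: main effects plus the interaction term (:)
f_full <- UPDRS ~ Imaging + ResearchGroup + time_visit + ResearchGroup:time_visit

# Expand both formulas into their individual model terms
labels_star <- attr(terms(f_star), "term.labels")
labels_full <- attr(terms(f_full), "term.labels")

labels_star
# "Imaging" "ResearchGroup" "time_visit" "ResearchGroup:time_visit"
identical(labels_star, labels_full)   # TRUE
```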

19.3.2 Modeling the Correlation

In the summary of the LMM model, we can see a section called Correlation of Fixed Effects. The original model made no assumption about the correlation (unstructured correlation). In R, we usually have the following 4 types of correlation models.

  • Independence: No correlation:

    $$ \left(\begin{array}{ccc}1& 0& 0\\ {}0& 1& 0\\ {}0& 0& 1\end{array}\right). $$
  • Exchangeable: Correlations are constant across measurements:

    $$ \left(\begin{array}{ccc}1& \rho & \rho \\ {}\rho & 1& \rho \\ {}\rho & \rho & 1\end{array}\right). $$
  • Autoregressive of order 1 (AR(1)): Correlations are stronger for closer measurements and weaker for more distanced measurements:

    $$ \left(\begin{array}{ccc}1& \rho & {\rho}^2\\ {}\rho & 1& \rho \\ {}{\rho}^2& \rho & 1\end{array}\right). $$
  • Unstructured: Correlation is different for each occasion:

    $$ \left(\begin{array}{ccc}1& {\rho}_{1,2}& {\rho}_{1,3}\\ {}{\rho}_{1,2}& 1& {\rho}_{2,3}\\ {}{\rho}_{1,3}& {\rho}_{2,3}& 1\end{array}\right). $$

In the LMM model, the output also seems unstructured. So, we needn’t worry about changing the correlation structure. However, if the output under unstructured correlation assumption looks like an Exchangeable or AR(1) structure, we may consider changing the LMM correlation structure accordingly.
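The structured correlation matrices listed above are easy to construct for any number of repeated measurements. The following base R sketch defines small helper functions; the names independence(), exchangeable() and ar1() are hypothetical and chosen only for this illustration.

```r
# Working correlation matrices for n repeated measurements

# Independence: identity matrix
independence <- function(n) diag(n)

# Exchangeable: a common correlation rho between any two occasions
exchangeable <- function(n, rho) {
  R <- matrix(rho, n, n)
  diag(R) <- 1
  R
}

# AR(1): correlation rho^|j - k| decays with the distance between occasions
ar1 <- function(n, rho) rho^abs(outer(1:n, 1:n, "-"))

exchangeable(3, 0.5)
ar1(3, 0.5)          # entry [1, 3] equals 0.5^2 = 0.25
```

Comparing an estimated unstructured correlation matrix against these patterns is one informal way to judge whether a simpler structure is plausible.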

19.4 GLMM/GEE Longitudinal Data Analysis

If the response is a binary variable like ResearchGroup, we need to use a Generalized Linear Mixed Model (GLMM) instead of LMM. The marginal-model analogue of GLMM is the generalized estimating equations (GEE) approach; however, GLMM and GEE are actually different.

In situations where the responses are discrete, there may not be a uniform or systematic strategy for dealing with the joint multivariate distribution of \( {Y}_i={\left({Y}_{i1},{Y}_{i2},\dots, {Y}_{in}\right)}^T \). That’s where the GEE method comes into play, as it’s based on the concept of estimating equations. It provides a general approach for analyzing discrete and continuous responses with marginal models.

GEE is applicable when:

  1. β, a generalized linear model regression parameter, characterizes systematic variation across covariate levels,

  2. The data represents repeated measurements, clustered data, multivariate response, and

  3. The correlation structure is a nuisance feature of the data.

Notation

  • Response variables: \( \left\{{Y}_{i,1},{Y}_{i,2},\dots, {Y}_{i,{n}_t}\right\} \), where \( i\in \left[1,N\right] \) is the index for clusters or subjects, and \( j\in \left[1,{n}_t\right] \) is the index of the measurement within cluster/subject.

  • Covariate vector: \( \left\{{X}_{i,1},{X}_{i,2},\dots, {X}_{i,{n}_t}\right\} \).

The primary focus of GEE is the estimation of the mean model: \( E\left({Y}_{i,j}|{X}_{i,j}\right)={\mu}_{i,j} \), where

$$ g\left({\mu}_{i,j}\right)={\beta}_0+{\beta}_1{X}_{i,j}(1)+{\beta}_2{X}_{i,j}(2)+{\beta}_3{X}_{i,j}(3)+\dots +{\beta}_p{X}_{i,j}(p)={X}_{i,j}\times \beta . $$

This mean model can be any generalized linear model. For example, for a binary response, \( P\left({Y}_{i,j}=1|{X}_{i,j}\right)={\pi}_{i,j} \) (a marginal probability, as we don’t condition on any other variables), and the logit link gives:

$$ g\left({\mu}_{i,j}\right)=\ln \left(\frac{\pi_{i,j}}{1-{\pi}_{i,j}}\right)={X}_{i,j}\times \beta . $$
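As a small numerical illustration of this logit mean model, the sketch below maps a linear predictor to a marginal probability and back. The coefficient values and the helper names (expit, logit, marginal_probability) are assumptions chosen for illustration only.

```python
import math

def expit(eta):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    """Logit link g(p) = ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def marginal_probability(beta, x):
    """pi = expit(beta_0 + beta_1 * x_1 + ... + beta_p * x_p)."""
    eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return expit(eta)

# Hypothetical coefficients (intercept, slope) and a single covariate value.
beta = [-1.0, 0.5]
pi = marginal_probability(beta, [2.0])  # linear predictor is -1 + 0.5 * 2 = 0
print(pi)
```

Applying logit to the returned probability recovers the linear predictor, which is exactly the statement of the displayed equation.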

Since the data could be clustered (e.g., within subject, or within unit), we need to choose a correlation model. Let’s introduce some notation:

$$ {V}_{i,j}=\mathit{\operatorname{var}}\left({Y}_{i,j}|{X}_i\right), $$
$$ {A}_i=\mathit{\operatorname{diag}}\left({V}_{i,j}\right), $$

the pairwise correlations:

$$ {\rho}_{i,j,k}= corr\left({Y}_{i,j},{Y}_{i,k}|{X}_i\right), $$

the correlation matrix:

$$ {R}_i=\left({\rho}_{i,j,k}\right),\mathrm{for}\ \mathrm{all}\ j\ \mathrm{and}\ k, $$

and the working covariance matrix of the response vector for subject i is:

$$ {V}_i=\mathit{\operatorname{cov}}\left({Y}_i|{X}_i\right)={A}_i^{1/2}{R}_i{A}_i^{1/2}. $$
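The covariance decomposition above can be checked numerically. The following sketch assumes a binary response, so that each variance is π(1 − π), together with an exchangeable correlation with ρ = 0.3; these values and the helper name working_covariance are illustrative assumptions.

```python
import math

def working_covariance(variances, R):
    """Return V = A^{1/2} R A^{1/2}, where A = diag(variances)."""
    sd = [math.sqrt(v) for v in variances]
    n = len(sd)
    return [[sd[j] * R[j][k] * sd[k] for k in range(n)] for j in range(n)]

# Hypothetical marginal probabilities for three repeated binary measurements.
pi = [0.2, 0.5, 0.8]
variances = [p * (1.0 - p) for p in pi]  # Bernoulli variances pi * (1 - pi)

# Assumed exchangeable correlation matrix with rho = 0.3.
rho = 0.3
R = [[1.0 if j == k else rho for k in range(3)] for j in range(3)]

V = working_covariance(variances, R)
for row in V:
    print([round(v, 4) for v in row])
```

The diagonal of V reproduces the individual variances, while each off-diagonal entry is the correlation scaled by the two standard deviations.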

Assuming different correlation structures in the data leads to alternative models; see the correlation structure examples in Sect. 19.3.2.

Notes

  • GEE is a semi-parametric technique because:

    • The specification of a mean model, \( {\mu}_{i,j}\left(\beta \right) \), and a correlation model, \( {R}_i\left(\alpha \right) \), does not identify a complete probability model for \( {Y}_i \).

    • The model \( \left\{{\mu}_{i,j}\left(\beta \right),{R}_i\left(\alpha \right)\right\} \) is semi-parametric since it only specifies the first two multivariate moments (mean and covariance) of \( {Y}_i \). Higher order moments are not specified.

  • Without an explicit likelihood function, to estimate the parameter vector β (and perhaps the parameters α of the correlation matrix \( {R}_i\left(\alpha \right) \)) and perform a valid statistical inference that takes the dependence into consideration, we need to construct an unbiased estimating function. Its ingredients are:

  • \( {D}_i\left(\beta \right)=\frac{\partial {\mu}_i}{\partial \beta } \), the partial derivative, w.r.t. β, of the mean-model for subject i.

  • \( {D}_i\left(j,k\right)=\frac{\partial {\mu}_{i,j}}{\partial {\beta}_k} \), the partial derivative, w.r.t. the kth regression coefficient (\( {\beta}_k \)), of the mean-model for subject i and measurement (e.g., time-point) j.

Estimating (cost) function:

$$ U\left(\beta \right)=\sum \limits_{i=1}^N{D}_i^T\left(\beta \right){V}_i^{-1}\left(\beta, \alpha \right)\left\{{Y}_i-{\mu}_i\left(\beta \right)\right\}. $$

Solving the Estimating Equations leads to parameter estimating solutions:

$$ 0=U\left(\hat{\beta}\right)=\sum \limits_{i=1}^N\underset{\mathrm{scale}}{\underbrace{D_i^T\left(\hat{\beta}\right)}}\underset{\mathrm{variance}\ \mathrm{weight}}{\underbrace{{V}_i^{-1}\left(\hat{\beta},\alpha \right)}}\underset{\mathrm{model}\ \mathrm{mean}}{\underbrace{\left\{{Y}_i-{\mu}_i\left(\hat{\beta}\right)\right\}}}. $$

Scale: A change of scale term transforming the scale of the mean, μi, to the scale of the regression coefficients (covariates).

Variance weight: The inverse of the variance-covariance matrix is used to weight the data for subject i, i.e., giving more weight to differences between observed and expected values for subjects that contribute more information.

Model mean: Specifies the mean model, \( {\mu}_i\left(\beta \right) \), compared to the observed data, \( {Y}_i \). This fidelity term measures the difference between the actually-observed and the mean-expected responses (within the ith cluster/subject), and solving the estimating equations drives its weighted sum to zero. See also the SMHS EBook.
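The estimating equations can be solved iteratively. The sketch below is a minimal illustration under strong simplifying assumptions: a logit mean model with one binary covariate, an independence working correlation (so each V_i is diagonal), and a made-up toy dataset. Under these assumptions the Fisher-scoring update is β ← β + (Σ DᵀV⁻¹D)⁻¹ U(β), and the estimating equations reduce to the ordinary logistic-regression score equations. This is not the algorithm of any specific R package.

```python
import math

# Hypothetical toy data: 4 clusters (subjects), each with 3 repeated binary
# responses y and one binary covariate x per measurement.
clusters = [
    {"x": [0, 0, 1], "y": [0, 0, 1]},
    {"x": [0, 1, 1], "y": [0, 1, 1]},
    {"x": [0, 0, 1], "y": [1, 0, 0]},
    {"x": [0, 1, 1], "y": [0, 0, 1]},
]

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def estimating_function(beta):
    """Return U(beta) and sum_i D_i^T V_i^{-1} D_i for the independence case."""
    u = [0.0, 0.0]
    h = [[0.0, 0.0], [0.0, 0.0]]
    for c in clusters:
        for x, y in zip(c["x"], c["y"]):
            row = (1.0, float(x))          # design row (intercept, x)
            mu = expit(beta[0] + beta[1] * x)
            w = mu * (1.0 - mu)            # variance; D row equals w * row
            for a in range(2):
                # D^T V^{-1} (y - mu) = row * w * (1 / w) * (y - mu)
                u[a] += row[a] * (y - mu)
                for b in range(2):
                    h[a][b] += row[a] * w * row[b]
    return u, h

def solve_gee(n_iter=25):
    """Fisher scoring: beta <- beta + H^{-1} U(beta), using a 2 x 2 inverse."""
    beta = [0.0, 0.0]
    for _ in range(n_iter):
        u, h = estimating_function(beta)
        det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
        step0 = (h[1][1] * u[0] - h[0][1] * u[1]) / det
        step1 = (-h[1][0] * u[0] + h[0][0] * u[1]) / det
        beta = [beta[0] + step0, beta[1] + step1]
    return beta

beta_hat = solve_gee()
print(beta_hat, estimating_function(beta_hat)[0])
```

With a non-diagonal working correlation (exchangeable or AR(1)), the same update applies, but each cluster contributes through the full inverse of its working covariance matrix, and α is re-estimated from residuals between updates.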

19.4.1 GEE Versus GLMM

There is a difference in the interpretation of the model coefficients between GEE and GLMM. The fundamental difference is in the target of the inference: population-average (GEE) vs. subject-specific (GLMM). For instance, consider an example where the observations are dichotomous outcomes (Y), e.g., single Bernoulli trials or death/survival following a clinical procedure, that are grouped/clustered into hospitals and units within hospitals, with N additional demographic, phenotypic, imaging and genetics predictors. To model the difference in failure rates between genders (males vs. females) in a hospital, where all patients are spread among different hospital units (or clinical teams), let Y represent the binary response (death or survival).

In GLMM, the model is similar to the LMM model:

$$ \log \left(\frac{P\left({Y}_{ij}=1|{X}_{ij},{b}_i\right)}{P\left({Y}_{ij}=0|{X}_{ij},{b}_i\right)}\right)={\beta}_0+{\beta}_1{x}_{ij}+{b}_i. $$

The only difference between GLMM and LMM in this situation is that GLMM uses a logit link for the binary response. The linear predictor then models the log-odds directly, so there is no separate additive residual term.

With GEE, we don’t have random intercept or slope terms, and the probabilities are not conditioned on \( {b}_i \):

$$ \log \left(\frac{P\left({Y}_{ij}=1|{X}_{ij}\right)}{P\left({Y}_{ij}=0|{X}_{ij}\right)}\right)={\beta}_0+{\beta}_1{x}_{ij}. $$

In the marginal model (GEE), we ignore differences among hospital units and aim only to obtain population (hospital-wide) rates of failure (patient death) and their association with patient gender. The GEE model fit estimates the odds ratio representing the population-averaged (hospital-wide) odds of failure associated with patient gender.

Thus, parameter estimates (\( \hat{\beta} \)) from GEE and GLMM models may differ because they estimate different things.
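A small numerical experiment can illustrate why the two sets of coefficients differ. The sketch below assumes a random-intercept logistic model with a normally distributed intercept \( b_i \sim N(0, \sigma^2) \), averages the subject-specific probabilities over \( b_i \) with a simple grid sum, and then computes the implied population-averaged log-odds ratio. The coefficient values, σ, and the integration grid are all illustrative assumptions.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

def marginal_prob(eta, sigma, half_width=8.0, n_grid=4001):
    """Average expit(eta + sigma * z) over z ~ N(0, 1) using a normalized grid sum."""
    step = 2.0 * half_width / (n_grid - 1)
    num = 0.0
    den = 0.0
    for m in range(n_grid):
        z = -half_width + m * step
        w = math.exp(-0.5 * z * z)   # unnormalized standard normal density
        num += w * expit(eta + sigma * z)
        den += w
    return num / den

def marginal_log_odds_ratio(beta0, beta1, sigma):
    """Population-averaged log-odds ratio for x = 1 vs. x = 0."""
    p0 = marginal_prob(beta0, sigma)
    p1 = marginal_prob(beta0 + beta1, sigma)
    return logit(p1) - logit(p0)

# Hypothetical subject-specific (GLMM-type) coefficients.
beta0, beta1 = -1.0, 2.0
for sigma in (0.0, 1.0, 2.0):
    print(sigma, marginal_log_odds_ratio(beta0, beta1, sigma))
```

When σ = 0 the two interpretations coincide. As the between-cluster variability grows, the population-averaged log-odds ratio shrinks toward zero relative to the subject-specific coefficient, even though both describe the same data-generating process.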

Let’s compare the results of the GLM and GLMM models for our PPMI dataset.

figure aj
figure ak

In terms of AIC, the GLMM model is a lot better than the GLM model.

Try to apply some of these longitudinal data analysis techniques to the fMRI data we discussed in Chap. 4 (Visualization).

19.5 Assignment: 19. Big Longitudinal Data Analysis

19.5.1 Imaging Data

Review the 3D/4D MRI imaging data discussion in Chap. 4. Extract the time courses of several time series at different 3D spatial locations, some near-by, and some farther apart (distant voxels). Then, apply time-series analyses, report the findings, and determine whether near-by or farther-apart voxels are more correlated.

Example of extracting time series from 4D fMRI data:

figure al
figure am
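As a language-neutral illustration of the extraction step, the following sketch builds a small synthetic 4D array indexed as volume[x][y][z][t], pulls out voxel time courses, and compares their correlations. The synthetic signal, the array dimensions, and the helper names are assumptions for illustration; real fMRI data would be loaded from an imaging file as in Chap. 4.

```python
import math

# Assumed dimensions of a tiny synthetic 4D volume (x, y, z, time).
NX, NY, NZ, NT = 8, 2, 2, 120

def synthetic_volume():
    """Synthetic signal whose phase drifts with x, so near voxels look alike."""
    return [[[[math.sin(0.3 * t + 0.2 * x) + 0.01 * (y + z)
               for t in range(NT)]
              for z in range(NZ)]
             for y in range(NY)]
            for x in range(NX)]

def time_course(volume, x, y, z):
    """Extract the time series at one voxel (the last axis is time)."""
    return volume[x][y][z]

def pearson(a, b):
    """Pearson correlation of two equal-length series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

volume = synthetic_volume()
ref = time_course(volume, 0, 0, 0)
near = time_course(volume, 1, 0, 0)   # adjacent voxel
far = time_course(volume, 7, 0, 0)    # distant voxel

r_near = pearson(ref, near)
r_far = pearson(ref, far)
print(r_near, r_far)
```

In this synthetic example the adjacent voxel is more strongly correlated with the reference voxel by construction; with real data this is the empirical question the assignment asks you to investigate.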

19.5.2 Time Series Analysis

Use Google Web-Search Trends and Stock Market Data to:

  • Plot time series for the variable Job.

  • Apply TTR to smooth the original graph by month.

  • Determine the differencing parameter.

  • Decide the auto-regressive (AR) and moving average (MA) parameters.

  • Build an ARIMA model, forecast the Job variable over the next year and evaluate this model.

19.5.3 Latent Variables Model

Use the Hand written English Letters data to:

  • Explore the data and evaluate the correlations between covariates.

  • Justify the application of a latent variable model.

  • Apply proper data conversion and scaling.

  • Fit a Structural Equation Model (SEM) using the lavaan::cfa() function for these data by adding an appropriate latent variable.

  • Summarize and interpret the outputs.

  • Use the model you found above to fit GEE and GLMM models using the latent variable as response and compare the models using AIC. (Hint: add a fake variable as random effect for GLMM).