It is well known that longitudinal data can address different concepts than cross-sectional data (see Baltes & Nesselroade 1979; McArdle & Nesselroade 2014). That is, cross-sectional data offer many good opportunities to study "between-person differences," but they cannot deal with "within-person changes." The first dependency that is created and observed is that the same person is measured at multiple occasions. This dependency has been used a great deal in multivariate modeling. Because the same person has multiple inputs and outcomes, we can deal with this in different ways. All of the individual changes that can be examined are due to the longitudinal models (see McArdle 2008) allowing dependencies among the observed scores at various time points. This dependency is also responsible for the popularity of multi-level modeling (see Bryk & Raudenbush, 1987, 1992). It is demonstrated here that the statistical power to detect changes is an explicit function of the positive dependencies and the timing of the observations.

The typical lack of dependency is monitored in statistics by a careful assessment of the original scores, usually with a linear regression of an outcome score (Y_n) on a predictor score (X_n), usually written as

$$ Y_{n} = \beta_0 + \beta_1 X_{n} + e_{n}, $$
(1)

where the regression terms β_0 and β_1 are thought to apply to everyone, and the residual term (e_n) is an unmeasured individual characteristic that supposedly follows a normal distribution. This is an effort to find the relationship between some outcome Y and the input variable X. If X indicates a group, then this model provides a way to determine group differences on the outcome (the usual ANOVA as a between-groups t-test). But traditional regression analysis makes no effort to deal with observed dependency (see Fox 1999).

But some people noticed that having an individual measured more than once created a statistical virtue. Indeed, this was the stimulus for progressively repeated measures. One classical representation of longitudinal data can be found in the repeated measures model for the analysis of variance (RANOVA; see Fisher 1925). In this first model the individual score at any time point (Y[t]_n) is assumed to be decomposed as

$$ Y[t]_{n} = \beta_{0n} + \beta_1 X_{n} + e[t]_{n} $$
(2)

where the individual (n = 1 to N) is allowed to differ throughout the time series (t = 1 to T) in two ways: (1) individuals differ from one another at all times, and (2) there are random normal fluctuations at each time point (e[t]_n). The X-weighted function adjusts the mean of the scores for group differences in the trends over time. This model can give correct statistics for the mean of the individuals and the effect of X (assuming it is the same over all occasions) as long as the contrasts are "spherical" in shape (among others, see Davidson 1972; Huynh & Feldt 1976).

The repeated measures model permits the power to detect differences between treatment groups in means (or over time) as a function of the standard deviations of the scores (as usual, with the sample size entering as the square root of N). But in repeated measures, the variance of the difference at the second occasion also depends on the correlation of the observed scores over time:

$$ \mu_{d} = \left( m[1] - m[2] \right) \big/ \sqrt{\, s[1]^2 + s[2]^2 - 2\, s[1]\, s[2]\, r[1,2] \,} $$
(3)

where we have symbolized the estimated standardized mean difference as μ_d, using the two observed means as m[1] and m[2], the two observed variances as s[1]² and s[2]², and the observed correlation over time as r[1,2]. This is nothing more than the mean difference over a standard deviation, but here the standard deviation is that of the difference of the same measure at two occasions. So for the same mean difference (m[1] − m[2]) as found in a cross section, we are more likely to find a difference significantly different from zero if the correlation of the two measures is positive (which it typically is; see Bonate 2000; Cribbie & Jamieson 2004). For this reason, it is typically far better (depending on the sign of the correlation) to measure a person twice than to measure twice as many people just once. That is, the longitudinal case is far more powerful than the cross-sectional case. This is not the only issue of statistical power that could be considered (see Tu et al. 2005), but it is relevant here. Of course, when there are more than two time points over which change is to be measured, this typically increases our power further.
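As a concrete illustration of Eq. (3), the short Python sketch below (with purely hypothetical means, standard deviations, and correlations, not values from any study discussed here) computes this standardized mean difference and shows how a positive correlation over time shrinks the standard deviation of the change and therefore raises power.

```python
import math

def paired_effect_size(m1, m2, s1, s2, r12):
    """Standardized mean difference for one measure at two occasions (Eq. 3):
    the mean change divided by the standard deviation of the difference score,
    which shrinks as the over-time correlation r12 becomes more positive."""
    sd_diff = math.sqrt(s1**2 + s2**2 - 2.0 * s1 * s2 * r12)
    return (m1 - m2) / sd_diff

# Hypothetical values: the same mean difference and SDs, three different correlations.
for r12 in (0.0, 0.5, 0.8):
    d = paired_effect_size(m1=52.0, m2=50.0, s1=10.0, s2=10.0, r12=r12)
    print(f"r[1,2] = {r12:.1f}  ->  standardized difference = {d:.2f}")
# r = 0.0 mimics the independent (cross-sectional) case; r = 0.8 more than
# doubles the standardized difference for exactly the same mean change.
```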

The Move to a Latent Curve Model

A straightforward generalization of this RANOVA model allows the move to a latent curve model (LCM), and this move is not very hard to understand. The LCM was first used by Tucker (1958, 1960, 1966) and Rao (1958), and later Meredith and Tisak (1990) gave it a structural equation model (SEM) interpretation (also see McArdle 1986; McArdle & Epstein 1987) to determine the best fitting curve to the observed data. Basically, the slope is allowed to vary along with any way the individual changes. Each individual is assumed to have three latent variables, defined as

$$ Y[t]_{n} = L_{n} + S_{n}\,\Omega[t] + u[t]_{n} $$
(4)

so the three sources of variation in any response are: (1) a constant component for the individual over all times (the latent level = L), (2) a systematic change (based on a slope score = S, which operates through the set of basis coefficients = Ω[t]), and (3) a unique change = u[t], which is essentially random with respect to the other changes. We can examine the set of basis coefficients (Ω[t] is not necessarily linear) to determine the slope of the best fitting line or trajectory of the data, but this line supposedly has the same coefficients for everyone.

All sources of individual differences are indexed by variances (φ_L², φ_S², and ψ²). In addition, the constant level is allowed to have a covariance (φ_LS), or correlation (ρ_LS), with the systematic changes. The variance that remains (the uniqueness, ψ²) is assumed to be uncorrelated with the changes or the starting point and is furthermore assumed to be equal over time.

We can also include observed group effects on these individual coefficients, and we can model them in various ways. What is usually done follows the usual regression logic, with two of the latent variables as new outcomes:

$$ L_{n} = \alpha_0 + \alpha_1 X_{n} + e_{Ln} \qquad \mathrm{and} \qquad S_{n} = \beta_0 + \beta_1 X_{n} + e_{Sn} $$
(5)

in which case the e_L and e_S account for the residual variances and covariance. This kind of mixed-model function, including both fixed (α_0, α_1, β_0, β_1, and Ω[t]) and random (φ_L², φ_S², ψ², and φ_LS) effects, can be evaluated for goodness of fit using the standard SEM statistical logic (see Meredith & Tisak 1990; McArdle 1986). If the model fits the observed means and covariances, we assume that the score model (of Eqs. (4) and (5)) is reasonable.

The kind of change we will test depends largely on the set of basis coefficients we employ. We can force the systematic change to be linear with time simply by fixing the coefficients Ω[t] = [0, 1, 2, 3, …, T]. This is often done, but it is only one option, and there are many others. We can even estimate some of the coefficients (T − 2 in the one-factor case) so that they form an optimal curve for the data. This is basically what the earliest pioneers (Tucker, Rao, Meredith, etc.) did. But there are many more ways to examine the curves, and a lot can be done here. Using the same basic logic, we can also consider more than one curve for these data (as done in later chapters).
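To make the role of the basis coefficients concrete, here is a minimal simulation sketch of Eqs. (4) and (5) in Python, assuming a fixed linear basis Ω[t] = [0, 1, …, T − 1] and arbitrary illustrative parameter values (none of them taken from the chapters discussed here).

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 4, 500                       # occasions and persons (illustrative)
omega = np.arange(T, dtype=float)   # fixed linear basis: Omega[t] = 0, 1, 2, 3
x = rng.binomial(1, 0.5, size=N)    # a binary group variable X

# Eq. (5): latent level and slope regressed on X (arbitrary coefficients)
level = 50.0 + 2.0 * x + rng.normal(0.0, 5.0, size=N)    # L_n
slope = -1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=N)    # S_n

# Eq. (4): Y[t]_n = L_n + S_n * Omega[t] + u[t]_n, with equal unique variance
unique = rng.normal(0.0, 3.0, size=(N, T))
y = level[:, None] + slope[:, None] * omega[None, :] + unique

print("observed means by occasion:", y.mean(axis=0).round(2))
print("observed covariances over occasions:\n", np.cov(y, rowvar=False).round(1))
```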

The LCM is considered useful now because it can describe both group (i.e., fixed) and individual (i.e., random) effects. For this reason it is popular in psychology, where we are often interested in group effects and individual differences from the same perspective. We should note that it is not widely used in other areas of science (e.g., econometrics), where the dominant paradigm uses time as a causal hinge, so the measure that comes last in time is regressed on all prior instances. The same longitudinal data can be used in this way (see McArdle 2008; McArdle & Nesselroade 2014).

We note immediately that the LCM does not try to explain how the prior time points (if measured) impact the subsequent events. This makes the procedures of LCM more descriptive than inferential. But all is not lost because there is some savings in the number of parameters used to define these differences.

Model Fit and Model Selection

A good question to ask is, "Does the model fit the data?" This question can be answered in a number of ways. But what we want is a model that has easy-to-understand parameters and fits as well as or better than others of its kind. The approach known as the Bayesian Information Criterion (BIC) is used throughout this book, so it is useful to investigate it further now. According to Raftery (1996) and Nagin (2005, p. 64), the formula for the BIC can be written as

$$ \mathrm{BIC} = \log(L) - \tfrac{1}{2}\, p\, \log(n) $$
(6)

where the log is the natural logarithm, L is the model's maximum likelihood, and this is penalized (lowered) by p, the effective number of parameters used, and n, the sample size of individuals used. "If one is comparing several models we should prefer the one with the lowest BIC values" (Raftery 1996, p. 145). In this way, the BIC "counterbalances" a good-fitting model by the number of parameters and the sample size used. So, although it is not itself a measure of the fit of the model, it can help choose one model among many others. What we hope to obtain is the model with the most favorable BIC, although there are several ways to use this information. Several keen insights into how the BIC behaves are given in Nagin (2005), and these will not be repeated here, but the use of Bayes factors is illustrated there. The BIC is obviously Nagin's favored device for model selection with groups, but he does conclude that:

Such debate is important for advancing the theoretical foundations of model selection. However, disagreement about the technical merits of alternative criteria may obscure a fundamental point—there is no correct model. Statistical models are just approximations. The strengths and weaknesses of alternative model specifications depend upon the substantive questions being asked and the data available for addressing these questions. Thus the choice of the best model specification cannot be reduced to the application of a single test statistic. To be sure, the application of formal statistical criteria to the model selection process serves to discipline and constrain subjective judgment with objective measures and standards. However, there is no escaping the need for judgment; otherwise insight and discovery will fall victim to the mechanical application of method. In the end the objective of model selection is not the maximization of some statistic of model fit. Rather it is to summarize the distinctive features of the data in as parsimonious a fashion as possible (Nagin 2005, p.77).

I can easily say I am in complete agreement about these model-fitting issues.
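As a small numerical illustration of Eq. (6), the following Python sketch compares two candidate models; the log-likelihoods, parameter counts, and sample size are made-up placeholders, not values from any model discussed in this book.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Nagin's form of the BIC (Eq. 6): log(L) - (1/2) * p * log(n).
    Under this form, larger (less negative) values indicate better penalized fit;
    the more common -2*log(L) + p*log(n) form reverses the direction, so it
    matters which convention a given program reports."""
    return log_likelihood - 0.5 * n_params * math.log(n_obs)

# Hypothetical comparison of a simpler and a more complex model on the same data.
print("simpler model:", round(bic(-15640.0, n_params=8, n_obs=1125), 1))
print("complex model:", round(bic(-15610.0, n_params=12, n_obs=1125), 1))
```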

Potential Biases

Thus, the collection of longitudinal data is useful because: (1) it allows the study of the natural history of the development of problem behavior, such as externalizing behavior, including its onset and termination; and (2) it allows the study of trajectories or pathways. A pathway is defined as "when a group of individuals experience a behavioral development that is distinct from the behavioral development of another group of individuals" (Loeber & Farrington, 1994, p. 890). Trajectories or pathways provide information on processes of continuity and discontinuity and on inter-individual differences. In addition, Loeber and Farrington (1994) postulate that the best studies now rely on multiple informants. The chapter by Stemmler and Lösel (Chapter 4) meets all of these criteria and should be considered carefully.

But we need to be clear about the difference between a repeated measures design and a multivariate design because both allow correlation over time. For both, sample members are measured on several occasions, or trials. But in the repeated measures design, each trial represents the measurement of the same characteristic, in the same way, at a different time. In contrast, for the multivariate design, each trial represents the measurement of a different characteristic. It is generally inappropriate to test for mean differences between disparate measurements, so the difference score is useful (in contrast to what is stated in Cronbach & Furby 1970).

But the longitudinal method is not without some well-reasoned detractors (see Rogosa 1988). Among the many critiques of the longitudinal method: (1) it is hard to get the representative sample to come back for a second testing, and the people who do come back tend to have done very well at the first time (see McArdle 2012); (2) if they do come back, they have seen the measures before, so it is difficult to measure exactly the same constructs a second time without retest or practice effects; and (3) the construct we want to measure may have changed, and we will not know it by simply looking at the variance or taking the difference between measures. These are some of the many potential confounds of the longitudinal method.

These problems may lead us to think that a cross-sectional study has fewer potential confounds than a longitudinal study. That is hardly ever true, because the same conditions can occur in cross sections as well, and we may not know it.

Assumption 1: In the LCM, the Latent Scores Used Are Related to Latent Change Scores

It seems that all the prior work has focused on the “change” at the individual and group levels but very few researchers are willing to say so. Instead, words like “curve” or “slope” or “trajectory” are used. But there turns out to be an easy way to represent these basic change ideas and we will usually do so here.

We can define the basic model of change to isolate the functions as

$$ Y[t]_{n} = L_{n} + \sum_{i=1}^{t} \Delta y[i]_{n} + u[t]_{n} $$
(7)

so the changes are simply accumulated up to that time (i = 1 to t). This is not intended to be a controversial statement, and it leads to the same fit as the prior linear models, but it is really another way to consider the outcome at time t (after McArdle, 2008).

The change as an outcome can be strictly defined at that latent variable level (after McArdle & Nesselroade 2014) as

$$ \Delta y[t]_{n} = y[t]_{n} - y[t-1]_{n} \qquad \mathrm{or} \qquad y[t]_{n} = y[t-1]_{n} + \Delta y[t]_{n}, $$
(8)

so the latent score is the source of all inquiry. This can be useful in a number of interpretations, especially for the regression of latent changes. For example, we now can fit

$$ \Delta y[t]_{n} = \beta_0 + \beta_1 X_{n} + e_{\Delta n} $$
(9)

so the latent change score is modeled directly and has a residual (e_Δn). The LCS approach is entirely consistent with the LCM approach, as stated by McArdle (2008), and this is why the same values emerge for various estimates. The LCS model is largely a clearer change-based re-interpretation of the LCM, and it can be programmed and used efficiently (see McArdle 2008; McArdle & Nesselroade 2014).

Latent changes are apparent in this model. Much more could be said about this approach, but this is all that will be needed here.
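The accumulation in Eqs. (7) and (8) can be checked numerically: adding the successive latent changes to the initial level reproduces the later latent scores exactly. The Python sketch below uses an arbitrary made-up trajectory purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

T = 5
y = 50.0 + np.cumsum(rng.normal(0.5, 1.0, size=T))   # an arbitrary latent trajectory y[1..T]

# Eq. (8): latent changes are differences of successive latent scores
delta = np.diff(y)                                    # Delta y[t] = y[t] - y[t-1]

# Eq. (7): the score at time t equals the initial level plus the accumulated changes
rebuilt = y[0] + np.concatenate(([0.0], np.cumsum(delta)))

print(np.allclose(rebuilt, y))                        # True: accumulation reproduces y[t]
```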

Assumption 2: In the LCM, the Model Parameters Have the Same Shape for Everyone

This assumption also holds for all regression models (see Eq. (1)), but it is most clearly not appropriate here. That is, we can control the size and sign of some parameters of the trajectory with the means and the variances of the latent variables, but the shape of the latent change is a combination that is beyond the usual reach.

The chapters listed here do distinguish between these shapes using an unobserved difference between people. That is, this clear difference between individuals is recast as the main reason they are members of a latent grouping—a mixture of different distributions. This was evidenced in the brilliant early work of Tucker (1960, 1966; also see Tucker 1992), and the subsequent maximum-likelihood formalizations of Nagin (1999, 2005) and Muthén and Shedden (1999).

This logic of using multiple groups is indeed a good idea, because it is focused on different kinds of changes within the person. But Tucker (1960, 1966, 1992) seems to have found a way to differentiate people with standard methods of factor-cluster analysis. Perhaps the first time this procedure was used on real questions and stated clearly was by McCall, Applebaum, and Hogarty (1973, pp. 44–48), who suggest that there are five clusters of people based on their changes in IQ test scores over age (see Fig. 1).

Fig. 1 From McCall, Applebaum & Hogarty (1973, p. 48)

Now it is clear that Tucker (1958, 1960, 1966) did not have all the statistical tests (or MLE) to support these choices, nor did he develop the mixture model with the possibility of a person belonging to multiple clusters (thus allowing for a much better mixture), but he did distinguish large groups of persons by their trajectories using multiple factors and he resolved multiple clusters, so we will generally consider Tucker's (1958, 1966) work as pre-dating the more recent work of Nagin (1999, 2005) and Muthén and Shedden (1999).

Assumption 3: In the LCM, the Residuals Are Equal and Uncorrelated, and the Model Fits

There is much more that could be said about the equality of the unique variances (for details, see Grimm & Widaman 2010), but the basic idea is that one must have an a priori theory about why these kinds of unique but uncorrelated changes are needed. If we do have such ideas, we can relax the equality of the variance terms at each time and achieve a much better fit to the data. We will not deal with these issues much here. In this regard this is an unchallenged assumption that deserves much more scrutiny.

The simple fact that "everything else" is supposedly uncorrelated is actually never met, and yet this is what is tested by the model fit. The test of goodness of fit is supposed to test whether or not the LCM can be considered viable. But the way we typically test any hypothesis is to remove all other features until all that is left is random variation. This is primarily because we do know how to test for random events (usually with the χ² goodness-of-fit test; but see Raftery 1996).

Assumption 4: In the LCM, the Model Has the Properties of Invariant Measurement

In all cases, it is also necessary to illustrate the loss of fit due to "multiple factorial invariance over time" (MFIT) and how this invariance can be crucial to understanding changes. That is, some things may not change while others will. Here we will only use common factor analysis in a simple example. This is a second dependency, because the measures are somewhat the same within a time. Some of the early work on scales has been further developed for items. This is related to both "test bias" and "harmony." That is, if we assume that a test is a good measurement of a construct, it should behave the same way at all waves.

I do not view MFIT as a "testable hypothesis" as many others do (e.g., Meredith 1993); rather, I view it as a necessary feature of longitudinal data. That is, in the absence of MFIT it is not clear that we can take differences between successive occasions, and this is critical to almost any accumulation model. Thus, this test would be a useful foil against a measure, and we can use it to evaluate an existing measure. But to create one, we must be accumulating something, and that something is strictly defined as the object of our MFIT. Perhaps it is best to say we can evaluate the part of the MFIT that works the way we intended. At least our intentions for MFIT are clarified in this way.

Assumption 5: In the LCM, the Model Variables All Have Normal Properties

Another kind of dependency is due to variables that are mistakenly treated as normal. That is, we typically assume all variables are normally distributed, even when they are highly skewed. This is also the case for a variable that can reach an upper or lower limit and should be considered censored (see Wang et al. 2008). Although we do not illustrate it here, this can pose a major problem for our understanding of the changes (for an example, see Hishinuma et al. 2012; McArdle et al. 2014).

Assumption 6: In the LCM, the Individuals Have All Been Measured at Exactly the Same Developmental Time Periods

This is also probably never true in epidemiological and psychological studies. The problem arises because the model assumes it is true. In fact, the exact age-at-measurement is usually not given to the analyst. This means people may have been "measured on their birthdays" or at approximately yearly intervals of time, but we just never know. The word "approximate" is used here frequently, and many see this as a natural feature of longitudinal data. But it is not. The big problem this creates is that the correlations over time, if they do not reflect the proper sequential timing, can yield haphazard results. The timing is important for future studies, and not enough has been done about this issue yet.

The further assumption that we know the true developmental timing is quite absurd. We do not know this and we do not track it very well either. It could be age or it could be something else like puberty (see McArdle 2011), but we need to know it to state how the individuals form groups of people (see Nagin 2005). We often just use whatever longitudinal data we are given, because we are very happy to get some, and we assume we can do something with it, as is. But we cannot.

The Studies of the First Section of This Book

The studies of the first section of this book seem to criticize some of the basic assumptions of the standard LCM. This is fair, because the standard LCM is loaded with assumptions, and the linear LCM was designed to be just a starting point for future work. The concept of simultaneous estimation is also critical here for distinguishing what is being done.

The first study, by Paolo Ghisletta, Eva Cantoni, and Nadège Jacot, is an examination of more-than-linear relationships in psychological research, which they term an NGCM (nonlinear growth curve model). That is, they do not stop at the quadratic form of the prior LCM, and they do not consider the linear model to capture all the relevant variation in their outcomes (in their example, four blocks of 20 trials of time on task in a pursuit rotor task). Instead, they consider other terms (see their Eq. (6)) that are not a usual part of the basic model (our Eq. (4)).

These authors fit a wide variety of nonlinear models to these data, which is notable, and they compare each of them, which is also notable. But they drop linearity quickly as a possibility, and I think this is a mistake. That is, before we deal with how nonlinear a model can be, I think we ought first to see how well linearity works, in terms of explained variance at each time point (η²[t]) at least.

So I also think these claims can be examined from a different perspective. That is, an LCM with a different curve may capture some of these individual changes. The curve could obviously be defined using the last 18 measurements, but an exponential curve could be fitted with fewer parameters. Nevertheless, the model with the best fit for the fewest parameters is an obvious choice. This, at least, is how I would deal with all the nonlinearity that seems to be present here. I would like to see the linear LCM and the quadratic model as comparisons in their tables.

The second application, titled "Stage-sequential growth mixture modeling with criminological panel data," by Jost Reinecke, Maike Meyer, and Klaus Boers, does exactly what this title suggests. It uses general growth mixture modeling (GMM, from Muthén & Shedden 1999) within an LCM framework to empirically distinguish between people. Expanding upon the prior work of Kim and Kim (2012), they consider three distinctive types of stage sequences: (1) stage-sequential (and linear) growth mixture models, (2) traditional piecewise GMM, and (3) discontinuous piecewise GMM and sequential process GMM. These models are applied to a range of adolescence and young adulthood using data from the German panel study termed Crime in the modern City (CrimoC; Boers et al., 2014). In the case of count variables, a Poisson or negative binomial distribution (following the work of Hilbe, 2011, not Nagin 2005) can be considered, which gives a better model representation of the data. With the count data that criminologists typically have, the Poisson model for measurement is used because it is more appropriate. That is, a regular regression model (not evaluated here) may still work, but the Poisson measurement model used here is sensitive to the probability of an event. The zero-inflated Poisson (ZIP; see Nagin 2005) model may be an even better choice, because it essentially proposes that the reason for the zero counts (no criminal acts) is possibly different from the reasons for the rest of the counts (one, and so on). This can always be compared to the assumption of a continuous distribution in the LCM. And this can all be combined sequentially in a program like Mplus (Muthén & Muthén 2012).
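To show why a zero-inflated Poisson (ZIP) specification can suit offense-count data with many zeros better than a plain Poisson, the following sketch compares the implied probability of a zero count under both; the rate and zero-inflation values are purely illustrative and are not estimates from the CrimoC data.

```python
from scipy.stats import poisson

lam = 1.5       # illustrative Poisson rate of offenses per period
pi_zero = 0.4   # illustrative proportion of "structural" zeros (never offend)

p0_poisson = poisson.pmf(0, lam)
p0_zip = pi_zero + (1.0 - pi_zero) * poisson.pmf(0, lam)

print(f"P(count = 0) under Poisson: {p0_poisson:.3f}")   # about 0.22
print(f"P(count = 0) under ZIP:     {p0_zip:.3f}")       # about 0.53
# The ZIP mixture lets the zero counts arise from a different process than the
# positive counts, which is exactly the substantive argument made above.
```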

This chapter is notable in a number of ways. First, the authors use a three-part curve model, with knot points that are notable in terms of substance. This is a distinction that is worthwhile to make, and it could be pursued further. I do not see this as all that different from the typical LCM, so I would compare the fit of both. Second, they simultaneously use a measurement model based on a Poisson distribution for the scores. This is decidedly different and is most appropriate for data that come in the form of counts. But their justification for its use with real data is not presented clearly. Third, they simultaneously use a mixture model to examine the German crime data. This use of multiple groups is based on the trajectory differences, and they assume these cannot be accounted for otherwise. I would very much like to hear what Nagin (2005, p. 54) would say about this part of the analysis. In any case, any one of these three features would be a challenge to fit, but they proceed as if this is all standard. It is not standard, and what they do here is quite amazing, partly because it can be done at all.

The differences between the current versions of Mplus (Muthén & Muthén 2012) and SAS PROC TRAJ (Nagin 2005) are important here. Currently, in Mplus, we can ask if any parameter is invariant over groups, and we do not need to define the group membership in advance. This can be in terms of any mean, regression, or covariance component. But in this same sense the analysis is entirely exploratory. If we further assume that the factor loadings Ω[t], for at least t = 3 to T, are different, we can have different curves. This can be written with different means and variance terms, so the entire placement within groups can differ. This is somewhat different than assuming different linear or polynomial coefficients for the same data. Much more could be said here (see Nagin 2005, p. 54), but Mplus 7 (now used by almost everyone here) seems much more flexible to me now. But I fully expect the debate about "groupings" will go on, and this is productive.

The third application, by Mark Stemmler and Friedrich Lösel, is titled "Developmental pathways of externalizing behavior from preschool age to adolescence," and also uses general growth mixture modeling (GMM), this time with the BIC, to separate five categories of persons in their total sample of n = 541. The goal of this study is to analyze the data of the Erlangen-Nuremberg Development and Prevention Study (ENDPS; Lösel et al., 2009) for the first time with regard to different trajectories of externalizing behavior. ENDPS is a normative sample and is a combined experimental and longitudinal study of antisocial child behavior covering a time period of nearly ten years. Social behavior was rated by multiple informants such as the self, mothers, kindergarten educators, and school teachers. Using these longitudinal data, they seem to have found (1) the "high chronics" (2.4 %; n = 13), who receive the highest values for externalizing behavior from childhood up to adolescence; (2) the "low chronics" (58.8 %; n = 317), who are low on externalizing behavior throughout the years; (3) the "high reducers" (7.9 %; n = 43), who start out high in childhood but reduce their externalizing behavior monotonically over time; (4) the "late-starters-medium" (8.7 %; n = 47); and (5) the "medium reducers" (22.4 %; n = 121). The results stress the idea of a life-course perspective, which enables the study of the natural history of the development of externalizing behavior, its onset, and its termination.

In all, these authors give an excellent history of the GMM and demonstrate how it has been used in many criminological samples. They show that most studies report between three and five groups (with a total range of two to seven groups), and most use the BIC. Most studies find a group of life-course-persistent or chronic offenders and one group that does not exhibit violent, aggressive, or delinquent behavior; in addition, there are groups of late-onset or desisting offenders. Jennings and Reingle (2012) claim that the number and shape of the groups depend on the nature of the sample (high risk versus normative sample), the life course captured, the length of the observation, and the geographical context. Among the authors' conclusions, they postulate that further research should be based on multiple observations and multiple informants (e.g., child/youth reports, parent and teacher reports) to ensure the best results. Since this result requires expertise in criminology, we must leave it to the reader to make sense of these trajectories.

The fourth application, by Jang Schiltz, is a proposal for a potential extension of "the Nagin model" of multiple groups. This can be a quite useful technology, because in this representation we do not have to think everyone has the same general nonlinear slope for their trajectory. The problem with Nagin's original formulation is that he only determined trajectories for the mean level and a quadratic slope, and less effort was put into the variance terms or other forms of the slope (see Nagin 2005, p. 54). These changes are made, and the basic model is extended here to include group differences in the slopes and the error terms.

Since we all believe there will be substantial heterogeneity in real data—different change patterns for different groups—and that the LCM will not be capable of dealing with these based on the means and covariances alone, it is clear that this extended model is more correct. This and other examples of the use of the mixture model certainly make for a powerful latent variable modeling approach. But the latent variable model is not the only way to explore the groups—they can also be formed out of measured variables (see Brandmaier et al. 2013).

The exploratory use of measured rather than latent variables is attractive on a number of counts. First, there are usually many extra ancillary variables that are measured and used as covariates for no particular reason other than that they exist. As we will demonstrate, this typical usage can tell us something about their impact on mean differences or between-group effects. But what we are interested in when putting them into the analysis is to see if they also impact the variances and covariances. Second, there are always extra ancillary variables that are measured, and these could be selected for this exploration. That any mixture model is an exploration is obvious to anyone who uses them, and the selection of a group is complicated. So we do not try to handle all these assumptions at once, but instead refer to Nagin (2005) for details on this issue.

Our Cognition in the USA (CogUSA) Study

Our CogUSA study (see McArdle & Fisher 2015) was designed to do something different than the studies in this section—that is, the most notable feature of the design of this particular longitudinal study is the variation in age at the initial time, and the variation in time intervals between different waves of testing. As stated earlier under Assumption 6, this is a feature of many psychological measurements, although it is hardly ever dealt with on a formal basis.

Our ability to measure similar constructs in an in-person face-to-face (FTF) interview and over the telephone (TEL) is not the key issue here, but it is important. In prior surveys (including the HRS; see Juster & Suzman, 1995; Heeringa, Berglund, & Khan 2011) the only human abilities measured over the phone (say, using the Telephone Interview of Cognitive Status; TICS; Fisher et al. 2013) were the very simplest ones (Episodic Memory and Mental Status; see McArdle, Fisher, & Kadlec 2007). It is not too surprising that these simple variables could be measured in the same way in either modality (FTF or TEL) and still retain MFIT (see McArdle 2010; McArdle & Nesselroade 2014).

But when we consider measuring something as important in aging research as fluid intelligence (Gf) in a survey, we remain perplexed (see Lachman & Spiro 2002). This variable needs to measure "reasoning in novel situations," and this is fairly hard to do. One way this can be done in surveys is with indices that supposedly measure numerical reasoning (NR), a decided subset of all reasoning and thinking, and the measure of numeracy (NU) from the HRS is a good indicator of this. Another way to consider NR is to measure Serial Sevens (S7) from the HRS, because this takes some NR as well as holding specific but complex ideas in memory (see Blair 2006). Still another way to indicate NR is to measure something like Number Series (NS), because these are intended to be small puzzles in numerical form.

One adaptation was that we initially reasoned that people, especially older people, would not take all the test items necessary for a reliable score on anything, so the items administered had to be cut down. In the case of Immediate Recall (IR), Delayed Recall (DR), Numeracy (NU), and Serial 7's (S7), this work had already been done by the HRS staff. These were properly considered short forms due to the required telephone constraints on time.

The final telephone definitions are given in Table 1. All measures were administered over the telephone, and this is a limitation because we do not really know what the respondent is doing. These include definitions of IR and DR to measure a general memory or general retrieval (Gr) factor, and NU and NS to measure a general fluid (Gf) factor at each time ([1] or [3]). We will see if the fit of this specific two-factor model is different from a one general-intelligence (G) factor model, and we will examine the factor loadings. McArdle et al. (2007) found the first two scales (IR and DR) to be highly correlated (r ≈ 0.80) and suggested they be added up and calculated as a single score termed episodic memory (EM), to distinguish it from another scale of cognitive measurement from the TICS, mental status (MS; {BC + S7 + NA + DA} / 4); but the second factor here is much different. And we hope it is clear that several other cognitive measures obtained in CogUSA were not used here (see McArdle & Fisher 2015).

Table 1 Selected Telephone Measures used in CogUSA (McArdle & Fisher, 2015)

For common factors to retain their meaning over time, we required them to have "strict" invariance (Meredith 1993). In this case, this implies the factor loadings (Λ), unique variable intercepts (Ι), and unique variable variances (Ψ²) are all assumed to be invariant over time (for each measure). We also brought all mean differences to the factor score level. This is typically tested, but it is clear that any differences or changes over time must go through the common factors, or they are not worth using and summarizing at this level. This is basic or, indeed, fundamental to our definition of the latent variables. It does imply that the way we measure the common factors could change from time to time, but for now we assume they are identical at both occasions of measurement.

Many other researchers search for different forms of invariance (e.g., see Byrne, Shavelson, & Muthén 1989; Reise, Widaman, & Pugh, 1993; McArdle, Petway, & Hishinuma 2014), and this is now an evaluation of configural, metric, strong, or strict invariance constraints. We will not partake in this quest again here. This is primarily because we only want the number of factors (K) to be determined by what is comparable over time in measurement (as in McArdle & Cattell 1994; McArdle 2007), not by a lack of invariance. There is a prominent thought that the search for the type of invariance of a measure is crucial (see Byrne et al. 1989), but if invariance is not met, then the number (or type) of common factors needs to be altered to meet this criterion. That is, the criterion of invariance should always be met before we evaluate the latent changes (as in McArdle & Cattell 1994). This is only our belief system, and we use this belief on all occasions, but we should point out that it is not one used by many others.

Methods

Available Data

The data to be analyzed are a small subset (four) of scales from the recent Cognition in the USA study (CogUSA; see McArdle & Fisher 2015). These scales were chosen in a way that would be consistent with the principles of MFIT, but the result of the changes over the two time points (W1 and W3 here) is unknown. Basically, we first try to establish MFIT over all ages and then look for changes in these scales over age. We present only eight scores in all to consider (four scales at each of two longitudinal occasions).

At each occasion, the people who took the HRS (for details, see Fisher et al. 2013) were asked to fill out forms for all scales. More specifically, they were asked each time to fill out a questionnaire about their own health and well-being, and the full CES-D was included. We did not use the sessions at Wave 2; we only use Sessions 1 and 3, primarily because the shorter time lag did not offer enough age differences for age changes to be picked up. We also do not include all items in the analyses here, but only eight scales from the full set (of many). We note that 13 of 20 items (from the Center for Epidemiological Studies-Depression scale; CES-D) were previously analyzed by McArdle et al. (2014), who seemed to find MFIT in 13 items from the full set (of 20 items). But here there are several differences: (1) we deal with scales, not items; (2) the confusion of the usual testing of MFIT was emphasized in McArdle et al. (2014); and (3) in CogUSA the ages-at-testing at each occasion are substantially different.

The plots of Fig. 2a–d illustrate what we are trying to examine in the model. These are plots of the four manifest variables (IR, DR, NS, and NU) against the ages-at-testing for each person separately (i.e., joined by a line), and they illustrate lots of variability and only one kind of dependency among persons (that is, the people are largely the same ages when they are measured, but the scores do change over age). The scores could change for a number of other reasons as well (such as errors of measurement or practice effects; see McArdle & Woodcock, 1997).

Fig. 2 (a) Immediate recall (IR). (b) Delayed recall (DR). (c) Number series (NS). (d) Numeracy (NU)

Models

Figure 3a is an elaboration of a latent curve model with age differences treated as a double cross-sectional variable. The only variable used here is NS, measured at two occasions (1 and 3 for comparability), and the age-at-testing is also measured at each of these waves. The model uses the two occasions in a double cross-sectional mode in an effort to capture the means and covariation of the NS–age relationship. That is,

Fig. 3 (a) A path diagram of a one-variable model for multiple waves of measurement (W1 and W3) but, as usual, treated as a dual cross-section. (b) A path diagram of a one-variable LCM for multiple waves of measurement (W1 and W3). (c) A path diagram of a one-variable LCM for multiple waves of measurement (W1 and W3) with different ages of measurement

$$ \begin{array}{c} \mathrm{NS}[1]_{n} = \beta_{01} + \beta_{11}\, fun\{\mathrm{Age}[1]\}_{n} + e_{1n} \\ \mathrm{and} \\ \mathrm{NS}[3]_{n} = \beta_{03} + \beta_{13}\, fun\{\mathrm{Age}[3]\}_{n} + e_{3n} \end{array} $$
(10)

where some fixed function of age is used as a linear predictor (e.g., fun{Age[t]} = (Age[t] − 65)/10, so the intercept is at age 65 and the slope is the difference in score for each 10 years of age). But using SEM we can also test whether the equations are the same at each time of measurement (β01 = β03, β11 = β13, and the respective residual variances, ψ1 = ψ3). This could be useful, because we may find the model does not work the same way at Wave 1 and Wave 3, primarily because of the exposure at Wave 1. At the very least, the prespecified fun{Age} becomes a testable hypothesis.
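The age recentering in Eq. (10) can be written as a small helper function; fun_age below is a hypothetical name, and the centering at 65 and per-decade scaling simply follow the example given in the text.

```python
import numpy as np

def fun_age(age_years, center=65.0, per=10.0):
    """Prespecified age function of Eq. (10): zero at age 65, one unit per decade."""
    return (np.asarray(age_years, dtype=float) - center) / per

ages_w1 = np.array([62.0, 68.0, 75.0, 81.0])   # illustrative ages at Wave 1
print(fun_age(ages_w1))                        # [-0.3  0.3  1.   1.6]

# NS[1]_n = beta_01 + beta_11 * fun_age(Age[1]_n) + e_1n (and likewise at Wave 3);
# the SEM equality tests then ask whether beta_01 = beta_03, beta_11 = beta_13,
# and psi_1 = psi_3 hold across the two waves.
```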

Figure 3b is a path diagram of an LCM for the case of only one change. The variable here is NS measured at two occasions ([1] and [3] for comparability). But the model here is an effort to capture the mean and covariation of the NS test. We notice that this uses the leftover variation as the difference (or slope), and this simple representation can be credited to Jöreskog (1974).

Figure 3c answers a different question about how we would add age variation to this model. Recall that in CogUSA (Figs. 2, 3, 4, and 5) there is a lot of age variation at the beginning (Wave 1), and people are not measured over the same age interval over time. This variation in age was considered a random source of variation (and it was done on purpose), because we did not really know how to break up ages into groups. This is an expression of the earlier work of the primary author (see McArdle & Woodcock, 1998). For these models to work, some predefined fixed function of age (e.g., = fun(Age); it does not need to be linear, but it must be pre-specified) needs to be designated as a regression (or a factor loading) that is allowed to vary over individual cases (because of the different ages-of-measurement). This precise feature of individually varying factor loadings can be used in many current computer programs (see Appendices 1 and 2 here for Mplus code). The concept of the individual loading was used by McArdle (1998, pp. 390–406), fitted together with the concept of individual likelihoods (primarily to check on individual fit). This examination of age-variation is an important concept here, and we would use this representation for any departure from the average timing that is measured (see LCM Assumption 6). This is the same concept that was subsequently used by Mehta and West (2000) and Mehta and Neale (2005) in their description of "definition" variables.
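A minimal numpy sketch, under invented values, of what an individually varying factor loading means here: each person's slope loading at each wave is that person's own fun{Age}, rather than one basis value shared by everyone as in the fixed Ω[t] of Eq. (4). This is only an arithmetic illustration of the idea, not the Mplus specification given in the Appendices.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 6

age_w1 = rng.uniform(60.0, 85.0, size=N)   # individually varying age at Wave 1
lag = rng.uniform(0.8, 1.8, size=N)        # individually varying lag to Wave 3
age_w3 = age_w1 + lag

def fun_age(a):                            # the same hypothetical age function as above
    return (a - 65.0) / 10.0

# Person-specific loadings for the slope factor at the two waves ("definition variables")
loadings = np.column_stack([fun_age(age_w1), fun_age(age_w3)])

level = rng.normal(500.0, 30.0, size=N)    # f0_n (illustrative, roughly W-scale units)
slope = rng.normal(-5.0, 3.0, size=N)      # f1_n, change per decade of age

# Expected factor score at each wave: f[t]_n = f0_n + f1_n * fun_age(Age[t]_n)
expected = level[:, None] + slope[:, None] * loadings
print(np.round(expected, 1))
```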

Fig. 4 (a) A latent variable path diagram of the one-factor model at Wave 1. (b) A latent variable path diagram of the two-factor model at Wave 1

Fig. 5 (a) A latent variable path diagram of the invariant one-factor model for multiple waves of measurement (W1 and W3). (b) A full (means and covariances) latent variable path diagram of the invariant one-factor model for Wave 1 and Wave 3

Adding a Latent Variable Measurement Model

Needless to say, these are common factor models in which we assume a factor score for each person (f_n) is indicated by multiple measures at multiple occasions. This is an important addition, and it can be done with SEM. Following McArdle (2007), every variable (m = 1 to M) we measure at each time (t = 1 to T) can be decomposed as

$$ Y[t]_{m,n} = \lambda_{m}\, f[t]_{n} + u[t]_{m,n} $$
(11)

into a common part (the time-related common factor score f[t]_n multiplied by a time-invariant factor loading λ_m) and a unique part (the random or unique factor score u_m). We can think of the variable having an intercept or mean (ι_m) too, but this could just as well be a property of the unique factor score. This leads to a common factor model hypothesized for each time point.

The specific models fitted to the single-time-point data are equivalent to many others, so we will not belabor the process. Needless to say, these are factor models in which we assume a factor score for each person (f_n) is indicated by multiple measures at each occasion (t = 1 to T). This is presented in Fig. 4a, b for both one and two factors at one wave. We notice that the one factor (G) also has several demographic influences, including scaled versions of age (and education, sex, and race). That is, in addition to the requirement that this factor account for the covariation of all the internal variables, the G factor must also account for all the covariation of the demographic influences with these measured variables. In Fig. 4b, two common factors (Gf and Gr) are posited, and these two factors are allowed to be correlated above and beyond the external (demographic) influences. This relaxation of the factor pattern is not the only way two common factors could be fit here, but it should fit better (see McArdle & Prescott, 1992).

In the next model we consider a single latent variable, perhaps termed a G factor for general intelligence. This is a very popular model for a number of good reasons (see McArdle 2012), and it can be fitted here to the four variables. In this context, the model makes the additional assumption that all four variables have a common part and a unique part. The common part is not necessarily the same size for each score, and this size is indexed by a factor loading (λ_m) or by the size of its unique variance (ψ_m²). The size is only relative here, and this is made clear by the requirement that one of the factor loadings (or the common factor variance) needs to be fixed at some positive value (usually λ_1 = 1).

As an alternative, in the next model we consider multiple latent variables, termed Gf, for general fluid reasoning, and Gr, for a general memory or general retrieval function of memory. This multiple-factor model is also very popular for a number of good reasons (see McArdle 2012), and it can be tested (fitted) here with the four variables. It is not necessarily the best fitting model for these data (that is, other factor loadings can be estimated instead). The model requires that each factor have one loading fixed at a positive value (usually λ_1 = 1 and λ_3 = 1), but this is an arbitrary choice that can be altered and must be made by the investigator. Most critically, this is the same as the prior model if the correlation between the two separate factors is unity.

Again, the model assumes that all four variables have a common part and a unique part, with relative sizes indexed by the factor loadings (λ_m) and the unique variances (ψ_m²); identification requires that one loading be fixed at a positive value for each factor (usually λ_1 = 1 and, for the second factor, λ_3 = 1).

The same comparison of these two models can be examined over time, and models for this type of data are drawn in Fig. 5a, b. Here the factor or factors have to do two related things: (1) define the internal features of the covariation of the measures within a time point, and (2) account for the changes over time in the measures. Since we want the same factor at time 1 and time 2 (or W3 here), and since we define the factors by their factor loadings, we force the factor loadings to be exactly the same over time. Although it is not necessary for this problem, to simplify our presentation here we also constrain the unique variances to be the same over time.
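As an illustration of what these equality constraints imply, the following sketch builds the model-implied covariance matrix for four measures at two waves under an invariant one-factor model with equal unique variances; every parameter value here is invented for the example, and none is an estimate from CogUSA.

```python
import numpy as np

# Invariant measurement parameters (invented for illustration)
lam = np.array([1.0, 0.8, 1.2, 0.9])     # factor loadings, identical at Wave 1 and Wave 3
psi2 = np.array([4.0, 3.0, 6.0, 5.0])    # unique variances, held equal over time

# Latent variances and covariance of the common factor at the two waves (invented)
phi = np.array([[9.0, 7.0],
                [7.0, 10.0]])

# Stack loadings into an 8 x 2 matrix: Wave-1 measures load on factor 1, Wave-3 on factor 2
Lam = np.zeros((8, 2))
Lam[:4, 0] = lam
Lam[4:, 1] = lam

# Implied 8 x 8 covariance matrix: Lambda * Phi * Lambda' + diag(unique variances)
Sigma = Lam @ phi @ Lam.T + np.diag(np.tile(psi2, 2))
print(np.round(Sigma, 1))
```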

Of most importance here is change over age, and the common factor part can further be decomposed as

$$ f[t]_{n} = f_{0n} + f_{1n}\,\Omega[t]_{n} + e_{n} $$
(12)

where f_0n is the unobserved level or intercept of the factor score, f_1n is the unobserved slope of the changes due to a one-unit shift in Ω[t]_n, and e_n is the random noise or disturbance that is thought to be randomly distributed around the predicted value of the first two parts. In this way, the factor score can change, and this creates change in the observed variables even with an invariant measurement model.

The differences between this and other formulations of the more standard LCM (Meredith & Tisak 1990; McArdle 1986) are that (1) this is a curve-of-factors model (CUFFS; after McArdle, 1988) and (2) here we explicitly allow the assignment of a factor loading that varies across individuals (McArdle & Hamagami 1996, pp. 106–112; especially p. 108). Of course, individual fitting of likelihoods is a common feature of many fitting functions now (but see McArdle 1998, pp. 390–406), so we use the Mplus (M+) program here. The consistency assumption, that the individual looks like the group, is used to form the basic test statistics—this use of an individually measured score as a model parameter is sometimes called "adding definition variables" (from the Mx manual; Neale et al., 1993). Indeed, this kind of raw data procedure was previously only available in Mx, and it was based on the statistical concepts of unbalanced pedigree analysis (from Lange et al. 1976).

If aging impacts the latent score alone we would think that it impacts both the levels and slopes in some consistent fashion. To this we add that there can be age differences at Wave 1 and Wave 3, and these are summarized for each person as

$$ \begin{array}{c} f_{0n} = \beta_{01} + \beta_{11}\, fun\{\mathrm{Age}[1]\}_{n} + e_{0n} \\ \mathrm{and} \\ f_{1n} = \beta_{03} + \beta_{13}\, fun\{\mathrm{Age}[3]\}_{n} + e_{1n} \end{array} $$
(13)

so f_0 is a level over both occasions, and f_1 is a slope evaluated at two particular time points determined by the age at measurement. The terms are indicative of a level (where fun{Age} = 0, so at age 65 here) and a slope (for each unit—or decade of age—of fun{Age}) of the prespecified age function. That is, each person's unique contribution at the two ages is built up in this way. Each person has a level and a slope score, under the assumption that the same information about age changes is present in each variable. We can look further at this function. In other words, we have essentially taken the age model to the latent variable level.

Figure 5a is a path diagram of this one common factor model, and Fig. 5b is a more complete version (including means and unique covariances). We fit the latter one here.

Figure 6a extends this logic to two common factors at each measurement occasion, and Fig. 6b is a full mean and covariance path diagram of this extension. Here the levels and slopes are all allowed to be correlated, and the factor of levels does not have a mean difference (due to lack of identification), so we do not add one.

Fig. 6 (a) A latent variable path diagram of the two-factor invariant model for Wave 1 and Wave 3. (b) A full (means and covariances) latent variable path diagram of the two-factor invariant model for Wave 1 and Wave 3

In the final model (Fig. 6b) we use a two-factor solution, but we also include: (1) the means in the diagram (as regressions from a constant triangle), and (2) the covariance of any unique features of the data (ψ²[1,3]). This is simply a more complete picture of the model we will fit.

The same principles hold when we move to multiple occasions of data. The common factors are supposedly the same, but the age changes in the factor scores are examined. Here the models of Figs. 5a and 6a will be compared.

The key thing we will note about CogUSA is the staggered time lag of this longitudinal study. This is unusual for a longitudinal study (see McArdle & Woodcock, 1997), but we put in the time lag as a variable because we wanted to study it. That is, in most cases of experimental design we vary all the things that are important to us and leave the rest as fixed quantities.

Results

The summary statistics appear in Table 2 for the n = 1,125 people who supposedly took all four scales at both Wave 1 and Wave 3. These are full information maximum likelihood (FIML) estimates, because only about 98 % participated at all times. In using FIML we basically assume that there is nothing special about those who did not participate again, and we use their time 1 data assuming they follow the same general pattern as we observe in those who did come back. For example, NS is listed with a mean of 2.38 and a variance of 7.79 at Wave 1, and this can be recast (by the usual W Rasch-scale transformation, following McArdle & Woodcock, 1997—here a transformation that basically multiplies the score by a constant of about 10 and then adds 500) into a raw W-scale score of 524, while the same (or similar) test is listed with a mean of 2.83 and a variance of 9.46 at Wave 3, a raw W-score of 528. It is thought (by many others) that the W-score will have a linear relation with other scales while the raw score will not. The correlation over time (in Table 2b) is r = 0.520. Thus, the W-scores go up over time in a pattern that is related to time 1 (those who are high to begin with seem to get high scores here). The highest scores can be accounted for by the ceiling of 600 on both test forms (this is the highest we would go), and it appears (as we can see in Fig. 2a) people seemed to get this score with unusually high frequency.
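As a check on the arithmetic quoted above, the following sketch assumes a simple linear rescaling (a slope of about 10 and an offset of 500) that reproduces the W-scores reported in the text; the actual transformation follows the Rasch-based W scale of McArdle and Woodcock (1997), so this is only an approximation for illustration.

```python
def to_w(score, scale=10.0, offset=500.0):
    """Hypothetical linear approximation to the W rescaling described in the text."""
    return offset + scale * score

print(round(to_w(2.38)), round(to_w(2.83)))   # ~524 and ~528, matching the reported W-scores
```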

Table 2 Summary statistics for n = 1125 participants who were all considered on four scores at both Waves 1 and 3 (and using FIML)

The resulting age-based model #1 is based on the n = 1,125 who answered both scales, with starting ages having a mean of 68.83 (and a variance of 106.91) and a time lag of 1.21 years (with variance 0.23). The model is first fitted to Number Series (NS) at two waves (Wave 1 and Wave 3), although any one of the four scales could be used. The results are highlighted in the first column of Table 3 and all of Table 4. In addition, the computer script used for these analyses is in Appendix 1.

Table 3 Numerical results for selected univariate models of NS (fitted to the data of Table 2, with the M+ computer scripts of Appendices 1 and 2)
Table 4 Numerical results for selected multivariate models (fitted to the data of Table 2, with the M+ computer scripts of Appendices 3 and 4)

The numerical results of model #1 show that NS has a mean at age 65 of −1.36 and a slope per decade of age of 3.79. These are both significantly different from zero given their respective z-values, so we can talk about a W-score of 484 at age 65 with a positive increase of +3.87 (or +38.7 in W units) points per decade from that point onwards (and backwards). This should be contrasted with model #0, where no slope is assumed but the residual variance is assumed to be equal over time (in fact, an equal and fixed variance assumption had to be used, so this model has only two estimated parameters, the mean and variance of the level; see McArdle, 1998). The extra parameters estimated (beyond the 2 in #0) are the slope mean, the slope variance, and the covariance of levels and slopes. The fact that all variables have such large variance estimates (in #1) and that scores increase with increasing age (the slope is positive) is a surprise. The fact that model #0 has a much larger BIC is also a question that requires an answer. Perhaps the age changes are too small to count here, but the fact that they are positive is a definite difference from prior results (see McArdle et al. 2007).

The results for the one-factor G model of behavior are listed in the next two models of Table 3 (#2 and #3). In model #3 we fit a level-and-slope model with one common factor (as in Fig. 4b). In model #2 we fixed this changing-score part to zero (that is, we did not fit a slope to the G factor) and lost substantially in fit (on df = 4). The single factor, with the factor loading set at λ = 1 for NS and loadings estimated at 0.77 for NU, 2.25 for IR, and 1.91 for DR, seems to fit these data very well (with df = 3). The three extra parameters estimated (beyond the 10 in #2) are the slope mean, the slope variance, and the covariance of the levels and slopes. The invariant loadings do not add anything to the misfit here. But the mean intercept of this factor is 2.68 (indicating a W-score of about 527 at age 65), and the mean of its age slope is −0.24 and significant, indicating a downhill slide of 0.24 in this factor for every additional decade of age.

This is a decidedly different result. It is not at all what the observed data seem to show (for any subscale), but the raw comparison does not take the age-at-measurement into account, the model uses only the common (not the unique) variance, and the result is closer to what we expected. In fact, it suggests that (1) the common factor is episodic memory (because the loading of 2.25 for IR dominates the factor), and (2) there is some decline over age. This one-factor model is listed as #3, it is given more completely in Table 5, and the Mplus computer script that was used is in Appendix 2. We can see the BIC is a lot larger for the no-changes model (#2), but we do not yet assume that age changes modeled in this way are important. This is only one latent variable, after all.
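To make the role of these loadings concrete, the following sketch (again an illustration, not the authors' script) computes the change per decade that model #3 implies for each subscale, on the assumption that all mean change passes through the common factor, so that each subscale changes by its loading times the factor slope.

```python
# Arithmetic sketch for the one-factor model (#3): if each subscale's mean
# change is carried entirely by the common factor, the implied raw-score
# change per decade of age is the loading times the factor slope mean.
FACTOR_SLOPE_PER_DECADE = -0.24       # reported mean of the slope factor
LOADINGS = {
    "NS": 1.00,                       # fixed for identification
    "NU": 0.77,
    "IR": 2.25,
    "DR": 1.91,
}

for scale, loading in LOADINGS.items():
    implied_change = loading * FACTOR_SLOPE_PER_DECADE
    print(f"{scale}: about {implied_change:+.2f} raw units per decade")
# The large loading for IR is why the factor is read as episodic memory here.
```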

Table 5 Parameter estimates for the single-variable (number series) model with age changes
Table 6 Parameter estimates for the current invariant measurement model with two common factors of change

A specially selected (and highly restricted) two-factor alternative is listed as #4. This is a more complex two-factor model that allows for more covariation among the measures but also requires “strict” invariance of the loadings and the unique variances over time. In addition, all variable intercepts are set to zero here, so the mean changes have to go through the common factors (as in McArdle & Nesselroade 2014). In this approach the model has two common factors, with λ = 1 as the required fixed loading for NS and for IR and with loadings estimated at 0.74 (for NU) and 0.84 (for DR). This model has 22 parameters, largely due to the extra common-factor covariances, and these extra parameters are penalized heavily by the BIC. The estimates include level intercepts of 2.82 (or W = 528, for the NS-marked factor) and 6.04 (for the IR-marked factor). This model still seems to have the smallest BIC value, so it could be chosen on this basis.
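To spell out what “strict” invariance requires in this model, the sketch below (not the authors' Mplus script; the dictionary names are hypothetical) lists the factor structure and the constraints. The unique variances appear only symbolically, because the equality constraint, not their values, is the point.

```python
# Sketch of the constraints that define the strict-invariance two-factor
# model (#4).  Loadings are the values quoted in the text; unique variances
# are named but not valued, since only their equality over waves is assumed.
FACTOR_STRUCTURE = {
    "Gf": ["NS", "NU"],   # NS loading fixed at 1.00
    "Gr": ["IR", "DR"],   # IR loading fixed at 1.00
}
LOADINGS = {"NS": 1.00, "NU": 0.74, "IR": 1.00, "DR": 0.84}

# Constraints applied at both Wave 1 and Wave 3:
#   loading(v)    : equal over waves (values above)
#   uniqueness(v) : equal over waves (estimated, not reported in the text)
#   intercept(v)  : fixed at 0, so all mean change runs through Gf and Gr
for factor, indicators in FACTOR_STRUCTURE.items():
    for v in indicators:
        print(f"{factor} <- {v}: loading {LOADINGS[v]:.2f}, "
              f"invariant over waves; intercept fixed at 0")
```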

It appears that Gr, as measured by IR and DR, declines the most over age, at −0.49 per decade. The function termed Gf, indicated by NS and NU, also declines significantly, but only at −0.27 per decade. In contrast to the one-factor version, the BIC for this model is smaller, and, needless to say, this is far less decline than we initially expected from a normal aging population, so maybe we do not have the right factor yet. This model has the smallest BIC of all those fitted, so it could be chosen as the best model for these data. But this BIC (about 31,275) is not much smaller than the prior BIC (about 32,038), and the two-factor model has a lot of extra parameters, so this substantive model is not yet considered a large improvement.
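To see how heavily the BIC penalizes those extra parameters, the arithmetic below uses the standard form BIC = −2 lnL + k ln(N) together with the figures quoted above (N = 1,125; roughly 13 parameters in model #3 and 22 in model #4; BIC values of about 32,038 and 31,275). These inputs are taken from the text rather than recomputed from the data.

```python
import math

# Worked BIC arithmetic for the comparison above.  With BIC = -2*lnL + k*ln(N),
# every extra parameter costs ln(N) BIC points regardless of fit.
N = 1125
penalty_per_parameter = math.log(N)            # about 7.03 BIC points
extra_parameters = 22 - 13                     # model #4 versus model #3
extra_penalty = extra_parameters * penalty_per_parameter

bic_one_factor = 32038.0                       # approximate value from the text
bic_two_factor = 31275.0                       # approximate value from the text

print(f"penalty per extra parameter: {penalty_per_parameter:.2f}")
print(f"penalty for {extra_parameters} extra parameters: {extra_penalty:.1f}")
print(f"BIC difference favoring the two-factor model: "
      f"{bic_one_factor - bic_two_factor:.0f}")
```

By this accounting the improvement in fit outweighs the per-parameter penalty, although, as noted above, the substantive gain is still judged to be modest.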

Discussion

The final model chosen did not have the smallest BIC, but it seemed to fit the data the best, so there is much more to be done here. We fully realize the two-factor model did not fit as well as we would have liked. We would have liked to separate the aspects of Gf and Gr, but this proved difficult with only four measures. This is a complicated choice (made here only by BIC), because the number of measurements is small and we want to say a lot about aging. That is, the distinction was difficult to make because we were working with only four variables. But this SEM is a useful starting point, even if it only deals with LCM Assumption 6.

We tried to use the standard HRS procedures (e.g., Genesys surveys) to contact households with persons in the HRS age range (over 50). We succeeded in reaching over 3,000 people, but not everyone agreed to further testing. We can only suggest the reader look carefully at Heeringa et al. (2011) and McArdle and Fisher (2015) for details. At Wave 1 we actually measured over 1,500 people, recording age and other demographics (respondent education, sex, minority status, health, dyad status, nursing home residence, current employment) and administering telephone versions of standard cognitive tests (the TICS, defined here as BC, S7, IR, and DR), some additional HRS tests (NU), some new telephone-administered WJ-based adaptive tests (NS, RF), and the depression scale (CES-D). Our main goal here was to see if we could measure the same constructs as before, but this time over the telephone, and we basically found we could. But a number of selection issues are also involved (see McArdle 2013; Heeringa et al. 2011), and a small sample of 200 HRS respondents who scored at the top and the bottom of the HRS cognition measures (in 2008) was also re-measured on our instruments.

Perhaps with more measured variables we can also make finer distinctions among multiple factors. Or perhaps we can take into account the non-normality of the data (see Fig. 2a–d). In the same sense, we were evaluating only part of the measured scale and not all of it. For another example, take our previous analytic work in McArdle et al. (2015). In that research we evaluated 13 items from the CES-D. If we had been evaluating the CES-D for use in those grades (9th and 10th) we would have used the 20 items listed in the typical CES-D. But we were most interested in evaluating the concept of depression, the latent factor the scale represents, and we thought we could test this idea with only four items. In this chapter we looked at only 13 items, mainly because we were trying to indicate just one factor.

Indeed, an expansion or delineation of the purported common factors can be achieved with measured variables such as S7 (a telephone administration of the HRS Serial Sevens) and RF (a telephone administration of WJ-Retrieval Fluency). Each of these could expand the factor space in important ways and may lead to some stability. We can also add several occasions of measurement, before and after Wave 3. Wave 2 came only a few weeks after Wave 1, but there we gave a full battery of tests to about 1,200 people for 3 h in the home, including the WJ-R and the WASI, plus some personality scales (e.g., BFI, NCS) and some dispositional measures (RISK). This wave was mainly included so we could verify the telephone-adapted tests against their full face-to-face versions, and several such tests were administered here. There are many measures here, and they can be useful too.

At Wave 3 we went back to the original telephone forms after an average of about 1.21 years (with a standard deviation of 0.21). The time lag presumably differed for people of different ages, but other conditions need to be stated up front as well. For example, Rodgers et al. (2003) and McArdle and Fisher (2015) make it very clear that we naturally had eight groups of respondents by time lag (this was not by design), and it was hard to do a second telephone test in a short time if people had not completed the face-to-face (FTF) testing of Wave 2. This confound does not apply to any of the people selected, but it is there. We started this testing by verifying the birth date of the person being tested (the interviewee, or IV). We then tested the original people again on the same battery used at Wave 1 with, as stated, an average delay of about 1.2 years (i.e., Wave 3).

During the last 5 years we have measured the same people again as part of a second 5-year study. During this period (2009–2014) we were mainly interested in the differences (if there are any) between telephone testing and Internet testing. This issue will not be raised or resolved here. There are currently (as of 2015) no plans by us to re-measure these same people again, but we have let the CogECON and HRS teams at the University of Michigan contact them. In any case, our experiment is complex, as are our multiple assumptions, but this is the basic model of aging and invariance we will use in further analyses of these data, so comments are welcome.