In prevention studies, we often encounter proportion data. For example, instruments are often divided into several domains or subscales, and proportions of subscale scores are computed relative to the total scores (e.g., Chamberlain et al. 2008; Lindhiem et al. 2014; Schuhmann et al. 1998). Due to the intrinsic nature of proportions, it is critical to analyze such data properly. In this paper, we review existing methods for analyzing proportional data and provide practical statistical guidelines. Moreover, we illustrate how to apply linear and generalized mixed-effects models to longitudinal proportion data to model changes in proportion scores or frequencies over time. In order to enhance the integration of this special series, these concepts and analytic methods are illustrated using proportion data from the Services for Kids in Primary Care (SKIP2) study described in this special series (Shaffer et al. this issue).

Overview of Proportion Data

The main feature of proportion data is that they are bounded by the interval [0, 1]. Consequently, the data often exhibit heterogeneity in variance (i.e., the variability tends to be higher for data values in the middle range than that toward the boundaries). When studying the association with a continuous predictor, the relationship often deviates from linearity toward the boundaries. If one ignores this special feature of proportion data, the untransformed proportion data can be analyzed using standard methods such as the analysis of variance (ANOVA) and ordinary least squares regression. However, some statistical assumptions underlying these models, such as linearity, normality, and homogeneous errors, might be violated due to these special features of proportion data. Another issue with untransformed data is that the predicted values, as well as prediction intervals, could give rise to values outside of the range [0, 1]. A common practice to resolve these issues is to transform proportion data. While transformation can map [0, 1] data onto unbounded real values, this practice does not take the intrinsic probability characteristics of proportion data into account. Furthermore, studies have shown that data transformations might be inefficient with lower power, and, perhaps more importantly, the results are difficult to interpret (e.g., Warton and Hui 2011).

The generalized linear model, an alternative to the transformation methods, has become more prevalent in analyzing proportion data in clinical studies. The model naturally handles non-normality and heterogeneity issues, and the use of a link function guarantees that the fitted values will be exactly within the desired range [0, 1]. For example, logistic regression, a common generalized linear model, is suitable for a binomial outcome, where the proportion is computed as the ratio of the number of target events to the total number of trials, “n y out of n.” The fitted logistic model is often more interpretable than a transformed model and reflects the underlying characteristics of the data better than linear regression. For another type of proportional data, where the values are between 0 and 1 but are not exactly binomial, one may use an over-dispersed logistic model or beta regression model, which allows more flexible variance structures.

To illustrate the methods, throughout the paper, we will use sample proportion data from the SKIP2 study (Kolko et al. 2014; Shaffer et al. this issue). One of the goals of the SKIP2 program was to examine how a Doctor-Office Collaborative Care (DOCC) treatment improves parent-reported disciplinary practices as compared to Enhanced Usual Care (EUC). For illustration purpose, in this paper, we focus primarily on the proportion of using corporal punishment among all disciplinary practices, estimated from the Alabama Parenting Questionnaire (APQ; Shelton et al. 1996), as this proportion is in general smaller than 0.2 and serves to illustrate many of the concepts and guidelines. We will first examine the analysis of these proportion data using linear regression, the transformation methods, and the generalized linear models. Then, we will briefly discuss linear and generalized linear mixed-effects models for evaluating the changes in these proportion data over both the acute and maintenance phases of treatment.

Regression Analysis

Linear Models for Untransformed Data

Let us consider a common case, where for each subject (i), we observe some outcome or response (Y i ) ranging between 0 and 1 and explanatory variables (X i ). In this case, the linear model for our untransformed data would be

$$ {Y}_i={\boldsymbol{X}}_i^T\boldsymbol{\beta} +{\epsilon}_i, $$

where X i represents values for our explanatory variables, β the coefficients describing the linear relationship between X i and Y i , and ϵ i the error in predicting our outcome of interest. Error values are assumed to be independent and identically distributed, that is, values for particular individuals are not expected to be related to those of others (e.g., they are not influenced by a common cause such as systematic measurement error) and they are expected to share a common distribution. ANOVA is a special case of this kind of model, where the explanatory variables are categorical. The main concerns of modeling untransformed data are twofold. First, proportion data often violate model assumptions, including linearity, normality, and homogeneity of errors, especially for data points that are toward the boundaries. The distribution of Y is usually a mixture of distributions with different means for all the observations. Therefore, we would like to emphasize that the check of normality is for the residual values. Second, the model does not take into account the range restriction inherent in use of proportion scores and predicted values and intervals could fall outside of [0, 1]. In practice, one needs to first examine the histogram of the response data Y and check the residual plots after model fitting. Since the deviation from linearity and constant variance for proportion data is most severe toward boundaries 0 and 1, it is relatively safe to use untransformed normal model for datasets that values fall mainly between 0.2 and 0.8, although the residual distribution should also be examined. Plots describing the residual distribution can be obtained in most statistical packages, such as SAS, R, Stata, and SPSS. The Q-Q plot is a quantile plot, which compares the sorted obtained residual values with quantiles from a normal distribution. If we observe an approximate straight line using the Q-Q plot, we obtain support for the assumption that our residuals are normally distributed.

In addition to the Q-Q plot, we can also look at a residual plot compared to the predicted plot. This residual vs. predicted plot compares the value of our residuals to predicted values of our outcome variable and shows us whether there is greater variability in error at higher or lower values. Generally, as our model assumes that errors are identically distributed, we hope that the spread of error appears similar across predicted outcome values. Statistical tests of normality, such as Shapiro-Wilk (available in SAS, R, Stata, and SPSS), may also be performed to assess whether the distribution of error values deviates from normality. However, findings of statistical significance are more likely in larger samples, and visual examination of plots should also be conducted. If the assumptions described above are met, it is usually safe to proceed with untransformed data, but appropriate diagnostics (see examples below) are still necessary as in regular linear regression. If the residual distribution is substantially skewed, however, our statistical tests and confidence intervals (based on the normal assumption) will be invalid.

In our data example, there are three outcome values, PROPcp (proportions of corporal punishment), PROPpp (proportion of positive parenting), and PROPid (proportion of inconsistent discipline). By definition, the three proportions add up to 1. The outcomes were measured longitudinally at baseline and again at 6, 12, and 18 months after baseline. We first would like to evaluate the short-term treatment effect by focusing on the proportions at time 2 (i.e., after 6 months of treatment). Figure 1 shows the histograms of the three outcome variables (from left to right: PROPcp2, PROPpp2, and PROPid2). We can see that the distributions of PROPpp2 and PROPid2 are quite normal and their values range mainly from 0.2 to 0.8. As we mentioned earlier, it is relatively safe to proceed with untransformed normal models for PROPpp2 and PROPid2. We hence fit linear regression models for these two outcomes on the primary variable of interest, the group assignment, while accounting for age and gender, as well as the corresponding proportion score at baseline. The residual plots for these two outcome variables look quite reasonable (see Fig. 2).

Fig. 1
figure 1

Histograms for three outcomes. PROPcp2 corporal punishment proportion score at 6-month assessment, PROPpp2 positive parenting proportion score at 6-month assessment, PROPid2 inconsistent discipline proportion score at 6-month assessment

Fig. 2
figure 2

Residual plots for PROPpp2 (left) and PROPid2 (right). PROPpp2 positive parenting proportion score at 6-month assessment, PROPid2 inconsistent discipline proportion score at 6-month assessment

In contrast, PROPcp2 is obviously right skewed with most values between 0.05 and 0.2 (compared to PROPpp2 and PROPid2 in Fig. 1). Nevertheless, we performed linear regression for the untransformed data for illustration purpose (Table 1). One data point with PROPcp2 of 0.25 was identified as an outlier from regression diagnostic plots and was removed from the current analysis (further detail on data cleaning and examination for regression analyses can be found in Tabachnick and Fidell 2007). Information on residuals and predicted and observed values of PROPcp2 can be seen in Fig. 3. The residual vs. predicted value plot in the left panel of Fig. 3 suggests that as PROPcp2 values increase, so does their variance (termed a variance on mean relationship). The normal Q-Q plot for residuals (the middle panel of Fig. 3) shows a slight deviation from normality, although the Shapiro-Wilk test for normality of residuals is not significant.

Table 1 Estimation for untransformed normal model (left) and beta regression with identity link (right)
Fig. 3
figure 3

Untransformed normal model for corporal punishment proportion score at 6-month assessment (PROPcp2)

On the right panel of Fig. 3, we plot the observed and predicted PROPcp2 against the linear predictor \( \boldsymbol{X}\widehat{\boldsymbol{\beta}} \) (representing the effect of treatment). Large values among the observed and predicted PROPcp2 scores do not match well. After accounting for significant prediction by PROPcp1 and other variables, the influence of DOCC treatment on PROPcp2 is not significant compared to EUC.

Transformation Model

To partially overcome some of the data issues mentioned above, one possibility is to transform the data. The transformation model assumes that

$$ h\left({Y}_i\right)={\boldsymbol{X}}_i^T\boldsymbol{\beta} +{\epsilon}_i, $$

where h is some known transformation function (e.g., such as taking the square root of Y). In some fields, such as ecology, the arcsine square root transformation has long been a common practice to transform proportional data. Arcsine square root can stabilize the variance when the proportional data are of binomial distribution, computed as “n y out of n” (Gotelli and Ellison 2004; Sokal and Rohlf 1995; Zar 1998). However, the arcsine square root transformation is not monotone, which makes the interpretation of coefficients obscure. Monotone functions preserve the rank order of data points, and the absence of such ordering can make it difficult to draw conclusions from transformed data. Also, several recent papers (Warton and Hui 2011; Wilson and Hardy 2002) have reported that logistic regression (discussed in the next section) is more powerful than arcsine square root for binomial proportional data. Another popular transformation is the complementary log-log, i.e., log(−log(1 − p)), which is a monotone function widely used for transforming percentile data in the survival analysis literature.

For illustration, we applied the arscine square root and complementary log-log transformations to the PROPcp2 data. The residual plots and normal Q-Q plots for regression after the arcsine square root transformation are shown in the top left panel of Fig. 4. The residual vs. predicted value plot has improved slightly over the untransformed model (as reflected by less residual increase at higher predicted values, Fig. 4 and Table 2) but still shows some heteroscedasticity in variance. This is as expected, since our proportion data are not binomial. The complementary log-log transformation is found to be a little bit better than the arcsine square root transformation in terms of stabilizing variance, as seen in the residual plots and Q-Q normal plots for linear regression after the complementary log-log transformation (Fig. 4, right panel).

Fig. 4
figure 4

Results for transformation methods: arcsinesq (left) and complementary log-log (right)

Table 2 Results for transformation methods: arcsinesq (left) and complementary log-log (right)

In practice, it is often difficult to find a reasonably simple transformation that could simultaneously stabilize the error variance, ease the violation of normality for the error term, and maintain the modeling efficiency. Perhaps more importantly, coefficients may only be interpreted using the transformed values, a major consideration for drawing conclusions. Software outputs, such as predicted values, confidence intervals, and prediction intervals, are all in the transformed scale and need to be transformed back to the original scale. The estimated coefficients are shown in Table 2. The R-squared values for both transformed models are still 0.5. The significance results remain the same, while the fitted coefficients are not directly comparable to the untransformed normal model.

Generalized Linear Models

Generalized linear models (Dobson 2002; McCullagh and Nelder 1989) are a family of models that allow the response variables to have non-normal distributions, such as exponential, binomial, Poisson, gamma, and beta distributions. These can be thought of as a broader and more general application of linear regression, where a link function (an equation) is added to the general linear model to describe the relationship between the predictors and their coefficients and outcome (Olsson 2002).

The normal linear regression is a special case of generalized linear models with the identity link function and normal distribution. The identity link is a special kind of link function, used when data are suitable for the general linear model analysis, whereas in generalized linear models, others may be used depending upon distribution. Logistic regression is a popular type of generalized linear model, where the outcome distribution is binomial and the link function is the logit. The logistic regression model is arguably the best model for proportion data when it is computed as “n y out of n” with \( Y=\frac{n_y}{n} \). The number of events is then modeled as binomial, where the explanatory variables contribute to its prediction through use of a logit link function. The coefficients in logistic regression have a natural interpretation, where one unit increase in the predictor X leads to an increase in the odds (probability an event will occur/1 probability event will occur) by a factor of e β, where β is the regression coefficient for X. When the data under consideration are proportions of diagnoses or events out of the total of instrument items or trials, logistic regression correctly models binomial data and usually is more efficient and powerful than transformation methods (Warton and Hui 2011). The use of a logit link function guarantees that the fitted values will be exactly within the desired range [0, 1].

We would like to note that the use of a logit link function is different from linear regression after logit transformation. The logit link in logistic regression is an integrated modeling component, which allows the mean function to be non-linear. No transformation is performed on the actual data Y, and there is no requirement for normality in any forms. However, in the logit transformed linear regression, we still require approximate normality after transformation, and moreover, the logit transformation of data points might encounter problems when there are exact zeros or ones in the Y and adjustments are needed for these cases.

When the data are not exactly binomial and there is variability in outcome beyond that predicted by the model, data over-dispersion presents problems (Raim and Neerchal 2013). More specifically, over-dispersion for proportion data \( Y=\frac{n_y}{n} \) means that the variance is greater than \( \frac{\pi \left(1-\pi \right)}{n} \) as would be expected based on a binomial model, where we assume π = E(Y) has been correctly modeled. Over-dispersed binomial models account for this issue by including an additional parameter to more accurately describe outcome variance. On the other hand, some proportion data are continuous, such as our proportion score for corporal punishment, and are better modeled as a continuous variable rather than modeling n y as discrete binomial variable. This type of outcome may best be handled using a beta regression. Beta regression is a generalized linear model subtype which can handle outcomes ranging from 0 to 1 and uses a beta distribution to model outcome (Ferrari and Cribari-Neto 2004). If Y also assumes the extremes 0 and 1, a useful adjustment in practice is (Y × (n − 1) + 0.5)/n, where n is the sample size (Cribari-Neto and Zeileis 2010; Smithson and Verkuilen 2006). The flexibility inherent in the beta distribution makes this analytic technique particularly suitable for proportion data; this is because the shape of the beta distribution can vary widely depending upon two parameters (Ferrari and Cribari-Neto 2004). The beta distribution can be symmetric, asymmetric, including left- and right-skewed, flat-, or bell-shaped (similar to normal but restricted to values between 0 and 1). The flexibility makes the beta distribution an attractive alternative for modeling proportion data when the exact distribution is unknown.

When a logit link function is used, beta regression has the same mean and variance relationship as an over-dispersed logistic regression, and the coefficients of covariates follow a similar interpretation as in logistic regression. Moreover, fitting beta regression does not depend on counting n y out of n, making it suitable for more general proportional and percentile data. We fit the beta regression model with the logit link for our data example. In Fig. 5, top panel, we plot the Pearson residual (Cribari-Neto and Zeileis 2010) vs. predicted values. This is a residual plot that mimics the residual plot in linear regression and can be used for model diagnostics. In the bottom panel of Fig. 5, the predicted PROPcp2 is plotted as a function of the linear coefficients and predictor variables, overlaid with the observed PROPcp2. Pearson residual vs. predicted values in Fig. 5 can be compared with the PROPcp2 vs. predicted values shown in the right panel of Fig. 3 of the untransformed model. In the untransformed model, the linear predictor is the same as the predicted values of Y, since the link function is an identity function. We can see that both the residual plot and the fitted value plot are improved compared to the untransformed normal model. The fitted coefficients are shown in Table 3. The fitted coefficients are not directly comparable to the untransformed normal model due to use of the logit link function. Let μ = E(Y) be the expected proportion of corporal punishment. The model with logit link establishes that \( \log \frac{\mu_i}{1-{\mu}_i}={\boldsymbol{X}}_i^T\boldsymbol{\beta} \). Following the convention in the logistic regression, \( \frac{\mu_i}{1-{\mu}_i} \) is the odds of corporal punishment, and \( {e}^{\beta_j} \) represents the ratio of the odds of corporal punishment when the jth covariate takes on a value of x ij  + 1 to the odds when the covariate is x ij , while fixing other covariates. Table 3 shows that the estimated coefficient for the only significant predictor PROPcp1 is 1.167, which means that the odds of corporal punishment at 6 months is expected to increase by a factor e 1.167 = 3.2 for one unit increase in the baseline value of PROPcp. Usually, it makes more sense to use a link function that maps the linear predictor onto the desired range [0, 1]. However, for illustration purpose, we also fit beta regression with identity link and the fitted coefficients are shown in the right panel of Table 1. The estimated coefficients are similar to the untransformed normal model. Only the baseline PROPcp score is found to be significant. The generalized linear models can be fitted using proc glimmix in SAS by specifying appropriate link functions and distributions such as binomial or beta. R function glm (“stat”) can be used for logistic regression, and R function betareg (“betareg”) can be used for beta regression.

Fig. 5
figure 5

Beta regression with logit link function

Table 3 Coefficients for beta regression with logit link function

Analysis for Longitudinal Proportional Data

Linear Mixed Models

We now focus on the longitudinal feature of the proportion data. In our data example, the PROPcp score was administrated at baseline and again at 6, 12, and 18 months after baseline. The primary goal of this longitudinal analysis is to evaluate treatment efficacy over time. As one of the outcomes of interest, we would like to examine if parents in DOCC clinics reported more reduction in the relative frequency of using corporal punishment than those parents from EUC clinics (see Shaffer et al. this issue).

Linear mixed-effects models are often the primary analytical tools for analyzing repeated measures with normally distributed errors. The use of linear mixed models for proportion data is generally inappropriate for essentially the same reason we have discussed in previous sections. We start with the linear mixed model here mainly for comparative and pedagogical purposes. The more suitable generalized linear mixed-effects models will be discussed in the next section. We briefly describe the linear mixed-effects model in the following, and readers who are interested in further details are referred to Jiang (2007), Littell et al. (2006), and Pinheiro and Bates (2000). As suggested by the name, mixed-effects models include both “fixed” and “random” effects. The “fixed” terms are similar to explanatory variable terms in linear and generalized linear regression. “Random effects” terms are added to the model to handle the dependence among repeated measures and to model between- and within-subject response variability. Another way to think about these is that fixed effects are the same across subjects, but random effects can differ between individuals (Liu et al. 2012). Linear mixed-effects models are suitable for longitudinal data analysis, as they allow for both baseline and time varying variable influence to be modeled.

The mixed-effects models can also deal with missing data, by including all subjects in the analysis as long as they have at least one data point, under the assumption of missing at random (MAR). MAR requires that missingness of data be accountable for by information within the dataset but allows for missingness of data to depend upon available variables. Per recommendations by Schafer and Graham (2002) and Graham (2009), the MAR assumption is reasonable for this dataset. Therefore, we first fit the mixed-effects models using all available data points, assuming normal distributions for terms representing error. Fixed terms include group assignment (SettingFam1), time as a factor, and the time by group interaction, as well as baseline predictors including child age and child gender. Subject is included as a random term to account for the dependence among repeated measures from the same subject.

The type I ANOVA table from the above linear mixed model is given in Table 4. As in a typical longitudinal study, there are missing values in our data example. We chose to compute the denominator degrees of freedom using the Kenward-Roger method, since Alnosaier (2007) and Gregory (2011) demonstrated the superiority of the Kenward-Roger method over other popular methods, such as the Satterthwaite approximation and the containment method (interested readers can obtain further information in Alnosaier (2007) and Gregory (2011)). After adjusting for child age and child gender, there is a highly significant time effect, F(3, 843) = 23.7, p < 0.0001. However, neither the group nor the group by time interaction is significant. This suggests that parents in both the DOCC and EUC clinics improved over time almost equally in the use of corporal punishment (see Shaffer et al. this issue). The change from visit 1 to visit 2 (during the acute treatment period) is significant, t(849) = −6.57, p < 0.0001. The change from 6 to 18 months after baseline is not statistically significant, t(836) = −0.31, p = 0.76. Linear mixed models can be fitted by proc mixed or proc glimmix (with normal distribution and identity link) in SAS and function lmer (“lme4”) in R.

Table 4 Type 1 tests for linear mixed model

Generalized Linear Mixed Models

Generalized linear mixed models (or GLMMs) are an extension of linear mixed models to allow response variables from different distributions, such as binomial and beta distributions, which are useful for proportion data modeling. Alternatively, one could think of GLMMs as an extension of generalized linear models to include both fixed and random effects. The discussions about linear mixed models are all applicable to generalized linear mixed models. If the proportional data are computed from binomial variable, we can use mixed-effects logistic regression, which allows for random effects to be included in the logistic regression model and to account for correlations within subject data and over repeated measures. If the proportion data are not binomial, as discussed above, we could consider over-dispersed binomial, as well as the beta distribution.

We illustrate the mixed-effects beta regression with the logit link function using the proportions of corporal punishment as the outcome. The fitted coefficients should be interpreted in view of the logit link function and are not directly comparable to the untransformed linear mixed model. The significance tests of covariates are similar to the untransformed model (type I results shown in Table 5). Again, we see a significant time effect but no group effects. The change from baseline to 6 months is significant, t(853) = −6.71, p < 0.0001. The change from 6 to 18 months after baseline is not statistically significant, t(842) = −0.32, p = 0.75. As discussed above, the beta distribution models a variable that takes values on the interval [0, 1], which is conceptually more appropriate for proportion data. The generalized linear mixed models can be fitted by proc glimmix in SAS, with appropriately specified distributions and link functions. The mixed-effects beta regression currently is not available in R, and other generalized linear mixed models such as mixed-effects logistic regression can be fitted by function glmer (“lme4”).

Table 5 Type 1 tests for mixed-effects beta model with logit link function

Summary

Proportion data are not necessarily non-normal and do not always require transformations or special analytic techniques. However, it is important to test the underlying statistical assumptions underlying planned analyses. For instances in which statistical assumptions are violated, it may be important to apply transformations or use generalized linear models in order to draw valid inferences about the data from the results. Beta regression is particularly well suited to non-binomial proportion scores, as when proportions of subscale scores are computed relative to the total score. These types of proportion scores are relatively common in prevention trials and warrant special attention.