The body of mindfulness research has grown steadily over the past 30 years, with the methods and apparatus used in mindfulness studies developing accordingly (Krägeloh et al. 2019). Mindfulness has been used in the development of structured programs to treat psychological symptoms such as stress, anxiety, and chronic pain (Kabat-Zinn 1982). Early studies showed evidence of the effectiveness of mindfulness treatment based on changes in specific hypothesized outcomes, such as melatonin levels (Massion et al. 1995) or an increased effect of phototherapy and photochemotherapy in patients with the skin condition psoriasis (Kabat-Zinn et al. 1998). Later studies applying mindfulness-based interventions (MBIs) such as mindfulness-based cognitive therapy (MBCT; Segal et al. 2002) and mindfulness-based stress reduction (MBSR; Kabat-Zinn 1990) relied on self-report measures designed to evaluate the outcomes targeted by those interventions, such as burnout and life satisfaction (Shapiro et al. 2005) and depression (Ma and Teasdale 2004). However, these earlier studies could not demonstrate the expected changes in mindfulness levels to support their validity, which highlighted the need for reliable and valid instruments to assess the construct.

As evidence accumulated for the positive effects of MBIs in therapeutic settings (Bohlmeijer et al. 2010; Chang et al. 2004; Chiesa and Serretti 2009; Ledesma and Kumano 2009), research increasingly focused on the application of mindfulness practice in other contexts. Thus, the application of mindfulness in the workplace (Hyland et al. 2015), educational contexts (Bush 2011; Hwang et al. 2019), and sport (Birrer et al. 2012) involved measurement of both mindfulness and related outcomes. Although alternative mindfulness assessments such as experience sampling (Frewen et al. 2014) or breath counting (Levinson et al. 2014) have been proposed, self-report measures of mindfulness remain by far the most widely used method to assess mindfulness in research studies (Krägeloh et al. 2019). The prominence of self-report mindfulness measures may be explained by the subjective nature of human experience of the world, the self, and their interaction, and by the difficulty of deriving such experience from more objective (e.g., neurophysiological) measures (Libet 2004).

The Five Facet Mindfulness Questionnaire (FFMQ; Baer et al. 2006) is a widely used psychometric measure of mindfulness comprising five subscales: Act with Awareness, Describe, Nonjudge, Nonreact, and Observe. According to Google Scholar, the original FFMQ article has been cited over 5700 times. The growing popularity of the FFMQ may be explained by its ability to support exploration of specific aspects of mindfulness and by the growing body of validation studies supporting its robustness (Brown et al. 2015; Coffey et al. 2010; MacDonald and Baxter 2017; Medvedev et al. 2017b). A number of short versions of the FFMQ have been developed using the classical test theory (CTT) approach (e.g., Baer et al. 2012; Gu et al. 2016; Bohlmeijer et al. 2011), which could not address the limitations of ordinal scales, such as limited precision and questionable compatibility with parametric statistics (Allen and Yen 1979; Stucki et al. 1996). To address these problems, Medvedev et al. (2018) examined and compared the existing short versions of the FFMQ using Rasch analysis and proposed an 18-item version (FFMQ-18).

Mindfulness can be defined as either a state or a trait (Medvedev et al. 2017a). Growing evidence has shown that mindfulness practice produces both state and trait changes, and an inability to differentiate clearly between the two may confound the assessment of MBI outcomes (Tang et al. 2015). Trait or dispositional mindfulness is described as a relatively stable characteristic of an individual and reflects an ability to remain mindful across different situations and contexts (Baer et al. 2006; Davis et al. 2009). State mindfulness refers to the mindfulness displayed in a given situation or at a particular time (Bishop et al. 2006; Lau et al. 2006; Tanay and Bernstein 2013). While the FFMQ is widely considered a measure of dispositional (trait) mindfulness, its ability to differentiate between dispositional and dynamic (state-like) aspects of mindfulness has not been carefully investigated using appropriate methodology. Recently, Generalizability Theory (G-Theory) has been proposed as the most suitable method to distinguish between state and trait aspects of a measure, to evaluate various sources of error variance, and to establish the generalizability of assessment scores as well as the reliability of the instrument (Medvedev et al. 2017a; Paterson et al. 2017).

G-Theory was developed by Cronbach et al. (1963) and provides a more advanced statistical framework than CTT for evaluating the reliability of psychometric assessments such as rating scales and performance tests. G-Theory is able to evaluate specific sources of measurement error and the generalizability of assessment scores to all possible circumstances using data obtained from a specific testing situation (Cronbach et al. 1963). Thus, G-Theory considers and estimates unique sources of error variance affecting the main variable of interest (e.g., a mindfulness score), whereas CTT treats error variance as a single factor and postulates that any measurement consists of true variance and error variance (Allen and Yen 1979). However, in complex natural environments, there are multiple sources of error that potentially influence the accuracy of measurement. For instance, a generalizability analysis considers interactions between persons and different factors, both methodological (e.g., scale items) and situational (e.g., time of day), each of which may contribute to measurement error independently or through interactions. In summary, while CTT considers only one aspect of reliability (e.g., test-retest, inter-rater, internal consistency) at a time, G-Theory examines all of these influences on reliability (including their interactions) simultaneously, thus improving the methodology and precision of psychometric assessment.

The traditional CTT approach to the state/trait distinction examines test-retest reliability coefficients to investigate the temporal reliability of an instrument, which tends to be lower for a state measure (e.g., < 0.60) and higher for a trait measure (e.g., > 0.70) (Ramanaiah et al. 1983; Spielberger et al. 1970; Spielberger 1999). However, this method is based entirely on correlations between total scores at two time points (i.e., time 1 and time 2) and does not consider variability at the individual item level or interactions between person, item, and occasion. Robust estimation of reliability requires consideration of the contributions made by item, scale, person, and occasion effects to changes in the overall assessment score. Similarly, the intraclass correlation coefficient (ICC), which can be used to estimate temporal reliability, has limited accuracy because it does not account for the variability of individual items (Bloch and Norman 2012; Medvedev et al. 2017a).

G-Theory is a suitable approach to examine the distinction between trait and state components in an instrument and to comprehensively evaluate multiple sources of error variance (Medvedev et al. 2017a; Shavelson et al. 1989). A state is a dynamic aspect that emerges when a person interacts with an occasion and reflects the unique adaptation of an organism to its momentary environment (Spielberger et al. 1970). Reliable distinction between dynamic and stable patterns of a construct or condition is important in both clinical and research contexts. For example, the accuracy of assessment could be compromised if enduring characteristics of a person are evaluated without accounting for temporary changes (e.g., mood), which might lead to inappropriate conclusions. Any psychometric measure should therefore distinguish clearly between state and trait aspects of a person's presentation, which requires identification and consideration of the relevant sources of error variance using appropriate psychometric techniques such as G-Theory (Bloch and Norman 2012; Paterson et al. 2017).

G-Theory partitions the overall variance into components related to particular sources and examines their impact on the overall reliability (Cronbach et al. 1963). The proportions of these components can be used to quantify the contributions to the measurement of person variance, which reflects a trait, and of the person × occasion interaction, which reflects a state (Medvedev et al. 2017a). By computing the ratios of state variance or trait variance to the sum of state and trait variance, we can reliably distinguish between state and trait components in a measure (Medvedev et al. 2017a; Paterson et al. 2017). Therefore, the aim of the current study was to apply G-Theory to examine the reliability of the FFMQ and its short 18-item version over time, to distinguish between state and trait components of mindfulness items and subscales, and to identify sources of error that may affect the measurement. The research used a repeated-measures design with participants assessed on three occasions separated by equal two-week intervals. The application of G-Theory involved two parts, a Generalizability study (G-study) and a Decision study (D-study). The G-study examined the overall generalizability and evaluated sources of error variance for the original FFMQ and its short version, the FFMQ-18, as well as the subscale scores. The G-study computed a generalizability coefficient (G coefficient) for each scale under investigation, an overall measure of reliability representing the ratio of true person variance to the total variance of the data (Cardinet et al. 2011). The D-study was subsequently conducted to evaluate the psychometric properties of individual items and their combinations in order to optimize the reliability of the measurement and the distinction between state and trait (Shavelson et al. 1989; Medvedev et al. 2017a). Data from the D-study can be used to identify items that reflect state or trait aspects of mindfulness.

Method

Participants

The sample comprised 83 university students who took part in the study voluntarily and did not receive any payment or academic credit for their participation. The sample size satisfied requirements for reliability studies of this type (Shoukri et al. 2004). The sample included 22 males (26.5%) and 61 females (73.5%). Ten participants (12%) engaged in regular meditation practice. The age of participants ranged from 18 to 47 years, with a mean of 21.34 years (SD = 5.83). Ethnic groups were represented as 57% Caucasian, 11% Māori, 10% Pasifika, 6% Asian, and 17% other.

Procedures

Participants completed the FFMQ items in class before the lecture or during a break and were instructed to return the completed forms to the researcher, to submit them to a locked collection box at their faculty, or to post them to the researcher's university address using a self-addressed pre-paid envelope. Each participant completed the same questionnaire on three occasions separated by equal 2-week intervals. Respondents also provided demographic information (sex, age, and ethnic group) and, to ensure anonymity, were asked to include a personal code consisting of three letters and three numbers so that forms completed by the same participant across the three occasions could be matched. The research was not expected to involve any risk, discomfort, or harm, and participants were informed about the nature of the study. The study was approved by the authors’ university ethics committee.

Measures

The FFMQ (Baer et al. 2006) consists of 39 items that assess aspects of mindfulness grouped into five subscales: Act with Awareness, Describe, Nonjudge, Nonreact, and Observe. Each individual item uses a 5-point Likert scale with options ranging from 1 = “Never or very rarely true” to 5 = “Very often or always true”. There are 19 items that require reverse coding before conducting data analysis. After reverse coding, the total score and individual subscale scores are calculated by adding responses to the relevant items together (see Appendix A).
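As a minimal illustration of this scoring procedure (a sketch only; the `reverse_items` and `subscales` mappings below are placeholders, with the actual assignments given in Appendix A and Baer et al. 2006), the following function reverse-codes the relevant 1-5 responses and sums them into total and subscale scores:

```python
# Minimal scoring sketch; `reverse_items` and `subscales` are placeholder
# mappings to be filled in from Appendix A, not the actual assignments.
def score_ffmq(responses, reverse_items, subscales):
    """responses: dict mapping item number -> rating on the 1-5 Likert scale."""
    recoded = {i: 6 - r if i in reverse_items else r for i, r in responses.items()}
    total = sum(recoded.values())
    facet_scores = {name: sum(recoded[i] for i in items)
                    for name, items in subscales.items()}
    return total, facet_scores
```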

Data Analyses

IBM SPSS Statistics 25 was used to compute means, standard deviations (SD), Cronbach’s alpha, test-retest coefficients, and the ICC for the FFMQ, the FFMQ-18, and the individual subscales of both FFMQ versions. Missing data comprised 0.04% of responses, which was negligible, and missing values were replaced using mean imputation (Huisman 2000).
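As an illustrative, hedged sketch of these computations (not the SPSS procedure used by the authors), Cronbach's alpha and a test-retest correlation can be obtained as follows, with the few missing responses first replaced by item means:

```python
# Minimal sketch: Cronbach's alpha for one scale and a test-retest
# correlation between two occasions; missing values are mean-imputed.
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """items: rows = persons, columns = items of one scale."""
    items = items.fillna(items.mean())              # mean imputation
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

def test_retest(total_t1: pd.Series, total_t2: pd.Series) -> float:
    """Pearson correlation between total scores at two occasions."""
    return total_t1.corr(total_t2)
```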

Generalizability analyses were conducted using EduG 6.1-e software (Swiss Society for Research in Education Working Group 2006), following the guidelines described by Medvedev et al. (2017a). Both the G-study and the D-study used a random effects person (P) by item (I) by occasion (O) design, expressed as P × I × O, in which the P and O facets were treated as infinite and the I facet as fixed because the same set of items was used across all assessments with the FFMQ. In a G-study, all error variances are counted as 100% after controlling for person variance (P), which reflects true differences between persons. Person was the object of measurement (differentiation facet) and not a source of error, while I and O were instrumentation facets (Cardinet et al. 2011). For the G-study, the observed score X was decomposed into the effects of all facets as follows (Shavelson et al. 1989):

\( X=\mu+X_{\mathrm{p}}+X_{\mathrm{i}}+X_{\mathrm{o}}+X_{\mathrm{pi}}+X_{\mathrm{po}}+X_{\mathrm{io}}+X_{\mathrm{residual}} \); where \( \mu \) is the grand mean of X

\( X_{\mathrm{p}}={\mu}_{\mathrm{p}}-\mu \) (person effect)

\( X_{\mathrm{i}}={\mu}_{\mathrm{i}}-\mu \) (item effect)

\( X_{\mathrm{o}}={\mu}_{\mathrm{o}}-\mu \) (occasion effect)

\( X_{\mathrm{pi}}={\mu}_{\mathrm{pi}}-{\mu}_{\mathrm{p}}-{\mu}_{\mathrm{i}}+\mu \) (person × item effect)

\( X_{\mathrm{po}}={\mu}_{\mathrm{po}}-{\mu}_{\mathrm{p}}-{\mu}_{\mathrm{o}}+\mu \) (person × occasion effect)

\( X_{\mathrm{io}}={\mu}_{\mathrm{io}}-{\mu}_{\mathrm{i}}-{\mu}_{\mathrm{o}}+\mu \) (item × occasion effect)

\( X_{\mathrm{residual}}=X_{\mathrm{pio}}-{\mu}_{\mathrm{pi}}-{\mu}_{\mathrm{po}}-{\mu}_{\mathrm{io}}+{\mu}_{\mathrm{p}}+{\mu}_{\mathrm{i}}+{\mu}_{\mathrm{o}}-\mu \) (residual effect)

Each effect has an estimated variance component, representing a possible source of error that might affect the measurement; these components were calculated as follows:

Person variance component: \( {\sigma}_{\mathrm{p}}^2=\frac{MS_{\mathrm{p}}-MS_{\mathrm{pi}}-MS_{\mathrm{po}}+MS_{\mathrm{pio}}}{n_{\mathrm{i}}{n}_{\mathrm{o}}} \)

Item variance component: \( {\sigma}_{\mathrm{i}}^2=\frac{MS_{\mathrm{i}}-MS_{\mathrm{pi}}-MS_{\mathrm{io}}+MS_{\mathrm{pio}}}{n_{\mathrm{p}}{n}_{\mathrm{o}}} \)

Occasion variance component: \( {\sigma}_{\mathrm{o}}^2=\frac{MS_{\mathrm{o}}-MS_{\mathrm{io}}-MS_{\mathrm{po}}+MS_{\mathrm{pio}}}{n_{\mathrm{i}}{n}_{\mathrm{p}}} \)

Person × item variance component: \( {\sigma}_{\mathrm{pi}}^2=\frac{MS_{\mathrm{pi}}-MS_{\mathrm{pio}}}{n_{\mathrm{o}}} \)

Person × occasion variance component: \( {\sigma}_{\mathrm{po}}^2=\frac{MS_{\mathrm{po}}-MS_{\mathrm{pio}}}{n_{\mathrm{i}}} \)

Item × occasion variance component: \( {\sigma}_{\mathrm{io}}^2=\frac{MS_{\mathrm{io}}-MS_{\mathrm{pio}}}{n_{\mathrm{p}}} \)

Residual (person × item × occasion) variance component: \( {\sigma}_{\mathrm{pio}}^2=MS_{\mathrm{pio}} \); where MS stands for the mean square of the corresponding effect and n represents the facet sample size
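For illustration, a minimal sketch of these variance-component estimates is given below (an assumption about implementation rather than the EduG procedure itself). It applies the expected-mean-square equations above to a fully crossed P × I × O design with one observation per cell; the array name and shape are illustrative:

```python
# Variance components for a fully crossed person x item x occasion design
# with one observation per cell, following the equations above.
import numpy as np

def variance_components(x):
    """x: array of shape (n_p, n_i, n_o) holding item scores."""
    n_p, n_i, n_o = x.shape
    grand = x.mean()

    # Marginal means for each facet and each two-way combination
    m_p = x.mean(axis=(1, 2)); m_i = x.mean(axis=(0, 2)); m_o = x.mean(axis=(0, 1))
    m_pi = x.mean(axis=2);     m_po = x.mean(axis=1);     m_io = x.mean(axis=0)

    # Mean squares from the usual crossed-ANOVA sums of squares
    ms_p = n_i * n_o * ((m_p - grand) ** 2).sum() / (n_p - 1)
    ms_i = n_p * n_o * ((m_i - grand) ** 2).sum() / (n_i - 1)
    ms_o = n_p * n_i * ((m_o - grand) ** 2).sum() / (n_o - 1)
    ms_pi = n_o * ((m_pi - m_p[:, None] - m_i[None, :] + grand) ** 2).sum() / ((n_p - 1) * (n_i - 1))
    ms_po = n_i * ((m_po - m_p[:, None] - m_o[None, :] + grand) ** 2).sum() / ((n_p - 1) * (n_o - 1))
    ms_io = n_p * ((m_io - m_i[:, None] - m_o[None, :] + grand) ** 2).sum() / ((n_i - 1) * (n_o - 1))
    resid = (x - m_pi[:, :, None] - m_po[:, None, :] - m_io[None, :, :]
             + m_p[:, None, None] + m_i[None, :, None] + m_o[None, None, :] - grand)
    ms_pio = (resid ** 2).sum() / ((n_p - 1) * (n_i - 1) * (n_o - 1))

    # Variance components from the expected mean squares (negative estimates
    # are conventionally truncated at zero)
    comp = {
        "p":   (ms_p - ms_pi - ms_po + ms_pio) / (n_i * n_o),
        "i":   (ms_i - ms_pi - ms_io + ms_pio) / (n_p * n_o),
        "o":   (ms_o - ms_io - ms_po + ms_pio) / (n_i * n_p),
        "pi":  (ms_pi - ms_pio) / n_o,
        "po":  (ms_po - ms_pio) / n_i,
        "io":  (ms_io - ms_pio) / n_p,
        "pio": ms_pio,
    }
    return {k: max(v, 0.0) for k, v in comp.items()}
```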

Generalizability analysis estimates reliability using the relative G coefficient (Gr) and the absolute G coefficient (Ga) for the object of measurement (person). The relative model of measurement interprets test scores in a norm-referenced manner, in which the score of a person is compared against the scores of others (Suen and Lei 2007; Vispoel et al. 2018). Gr accounts for the relative error variance, \( {\sigma}_{\delta}^2=\frac{\sigma_{\mathrm{pi}}^2}{n_{\mathrm{i}}}+\frac{\sigma_{\mathrm{po}}^2}{n_{\mathrm{o}}}+\frac{\sigma_{\mathrm{pio}}^2}{n_{\mathrm{i}}{n}_{\mathrm{o}}} \), where \( n_{\mathrm{i}} \) is the number of items and \( n_{\mathrm{o}} \) is the number of occasions. This error variance comprises the sources of error directly related to the object of measurement that may influence a relative measurement (e.g., the person × occasion and person × item interactions), each divided by the relevant facet sample size (Shavelson et al. 1989; Shavelson and Webb 1991):

$$ {G}_r=\frac{\upsigma_{\mathrm{p}}^2}{\upsigma_{\mathrm{p}}^2+{\upsigma}_{\updelta}^2} $$

The absolute model of measurement interprets test scores in a criterion-referenced manner, where the score of a person is compared against some agreed-upon absolute standard. Ga is equivalent to the phi (Φ) coefficient, which is obtained after applying Whimbey’s correction. It accounts for the absolute error variance, \( {\sigma}_{\Delta}^2=\frac{\sigma_{\mathrm{o}}^2}{n_{\mathrm{o}}}+\frac{\sigma_{\mathrm{i}}^2}{n_{\mathrm{i}}}+\frac{\sigma_{\mathrm{pi}}^2}{n_{\mathrm{i}}}+\frac{\sigma_{\mathrm{po}}^2}{n_{\mathrm{o}}}+\frac{\sigma_{\mathrm{io}}^2}{n_{\mathrm{i}}{n}_{\mathrm{o}}}+\frac{\sigma_{\mathrm{pio}}^2}{n_{\mathrm{i}}{n}_{\mathrm{o}}} \), which additionally includes the item, occasion, and item × occasion sources of error that may influence an absolute measure indirectly (Cardinet et al. 2010; Shavelson and Webb 1991):

$$ {G}_{\mathrm{a}}\backsimeq \Phi =\frac{\sigma_{\mathrm{p}}^2}{\sigma_{\mathrm{p}}^2+{\sigma}_{\Delta}^2} $$

Both Gr and Ga estimate the reliability of a trait measure when the object of measurement is the person. A Gr of 0.80 or higher reflects good reliability of an assessment score (Cardinet et al. 2010); similar criteria are generally applied to Ga, although coefficients above 0.70 have been considered reliable in some studies (Arterberry et al. 2014).
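Continuing the earlier sketch (again illustrative rather than the EduG output), both coefficients follow directly from the estimated variance components and the numbers of items and occasions in the design:

```python
# Relative and absolute G coefficients from the variance-component dictionary
# `vc` returned by variance_components(); n_i and n_o are the numbers of
# items and occasions in the design.
def g_coefficients(vc, n_i, n_o):
    rel_error = vc["pi"] / n_i + vc["po"] / n_o + vc["pio"] / (n_i * n_o)
    abs_error = rel_error + vc["i"] / n_i + vc["o"] / n_o + vc["io"] / (n_i * n_o)
    g_rel = vc["p"] / (vc["p"] + rel_error)
    g_abs = vc["p"] / (vc["p"] + abs_error)  # phi coefficient
    return g_rel, g_abs
```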

A state component index (SCI) and trait component index (TCI) were obtained, which reflect the proportion of variance attributed to a dynamic (state) and an enduring (trait) component in a measure. The formulae used were developed by Medvedev et al. (2017a):

$$ \mathrm{SCI}=\frac{\sigma_{\mathrm{p}\mathrm{o}}^2}{\sigma_{\mathrm{p}\mathrm{o}}^2+{\sigma}_{\mathrm{p}}^2};\mathrm{TCI}=\frac{\sigma_{\mathrm{p}}^2}{\sigma_{\mathrm{p}\mathrm{o}}^2+{\sigma}_{\mathrm{p}}^2} $$

SCI and TCI values of 0.50 mean that equal amounts of variance are attributed to state and trait. An SCI above 0.60 (TCI < 0.40) indicates that the majority of variance reflects a state; conversely, a TCI of 0.60 or higher (SCI < 0.40) indicates that the majority of variance reflects a trait. These coefficients can be interpreted in a similar way to other reliability coefficients, where a higher value reflects a higher proportion of variance attributed to a state (SCI) or a trait (TCI) (Medvedev et al. 2017a).
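As a small extension of the sketch above (illustrative only), the SCI and TCI follow directly from the person and person × occasion variance components:

```python
# State and trait component indices from the person (trait) and
# person x occasion (state) variance components.
def state_trait_indices(vc):
    total = vc["p"] + vc["po"]
    sci = vc["po"] / total
    tci = vc["p"] / total
    return sci, tci
```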

In the D-study, variance components were obtained for each individual item and SCI values were calculated using the formula described above. Items with a high SCI (i.e., ≥ 0.80) are highly sensitive to changes over time and can be considered state items, whereas items with a low SCI (i.e., < 0.30) can be considered as reflecting trait mindfulness (Medvedev et al. 2017a).
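A hedged sketch of this item-level step is shown below, under the assumption that each item is analysed separately as a person × occasion design in which, with one observation per cell, the person × occasion interaction is confounded with residual error; the array `x` carries over from the earlier sketch:

```python
# Illustrative per-item D-study sketch: for each item, estimate person and
# person x occasion variance from a two-facet (P x O) analysis and compute
# SCI and TCI. The P x O component here includes residual error
# (a simplifying assumption).
def item_state_trait_indices(x):
    n_p, n_items, n_o = x.shape
    results = []
    for item in range(n_items):
        xi = x[:, item, :]                      # persons x occasions for one item
        grand = xi.mean()
        m_p = xi.mean(axis=1)
        m_o = xi.mean(axis=0)
        ms_p = n_o * ((m_p - grand) ** 2).sum() / (n_p - 1)
        ms_po = ((xi - m_p[:, None] - m_o[None, :] + grand) ** 2).sum() / ((n_p - 1) * (n_o - 1))
        var_p = max((ms_p - ms_po) / n_o, 0.0)  # person (trait) variance
        var_po = ms_po                          # P x O (state) variance, incl. residual
        sci = var_po / (var_po + var_p)
        results.append((item + 1, sci, 1.0 - sci))
    return results                              # list of (item number, SCI, TCI)
```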

Results

Descriptive statistics for the 39-item FFMQ, its subscales, and the FFMQ-18 at the three occasions are presented in Table 1. Cronbach’s alpha for the total FFMQ across the three occasions ranged between 0.89 and 0.92. The test-retest reliability coefficients for Occasion 2 and Occasion 3 (with reference to Occasion 1) were 0.92 and 0.83, respectively, and were reflected by an ICC of 0.83. These reliability values were overall higher than those of the FFMQ-18 and the individual subscales of the FFMQ. The mean scores of both FFMQ versions and the individual subscales did not differ significantly across occasions, as evidenced by paired t tests (all p values above 0.05). The Nonjudge and Describe subscales obtained the highest Cronbach’s alpha and ICC values compared with the other subscales. Overall, all assessed FFMQ scales and subscales showed acceptable internal consistency and the temporal reliability expected of a trait measure. An exception was the Nonreact subscale, which displayed the lowest Cronbach’s alpha (0.69, at Occasion 1) and the lowest test-retest coefficient (0.64, at Occasion 3).

Table 1 Means, standard deviations (SD), Cronbach’s alpha, test-retest coefficients, and intraclass correlation coefficients (ICC) for the FFMQ total, its short version (FFMQ-18), and the five facet subscales (n = 83 × 3 occasions)

G-Study

Table 2 presents the variance components attributed to person (P), item (I), and occasion (O) and their interactions (P×I, P×O, I×O, P×I×O), together with the generalizability coefficients and the state and trait component indices for the FFMQ, its five subscales, and the FFMQ-18. The best reliability and generalizability of scores across persons and occasions was found for the total FFMQ, with both the relative and absolute G coefficients (Gr and Ga) equal to 0.89; the main source of error variance was the P×O interaction, which accounted for 98.2% of the total error. Slightly lower but still acceptable Gr and Ga values of 0.76 and 0.75, respectively, were observed for the FFMQ-18, with measurement error mainly explained by the P×O and P×I×O interactions, which together accounted for 79% of the error variance. The TCI, which reflects the ability of an instrument to reliably assess a trait, was 0.90 for both the FFMQ and the FFMQ-18. These TCI values, together with the reliability estimates, indicate that both the FFMQ and the FFMQ-18 are consistent with expectations for a valid trait measure. In contrast, Gr and Ga for all individual subscales of the FFMQ were below 0.45, meaning that none of the subscales met expectations for a reliable trait measure (Shavelson et al. 1989). The SCI values, which reflect the ability of a measure to reliably assess state changes, were below expectations for a valid state measure for all individual FFMQ subscales (all SCI < 0.40). Even though the TCI values for all five FFMQ subscales were high, ranging from 0.64 (Nonreact) to 0.89 (Observe), all subscales were affected by measurement error due to the interaction between person, item, and occasion. This resulted in low reliability of all subscales in measuring trait (all Gr < 0.50), meaning that the FFMQ subscales cannot be considered to measure either state or trait mindfulness reliably.

Table 2 G-study estimates for the FFMQ, the FFMQ-18, and the five FFMQ subscales, including the relative G coefficient (Gr), absolute G coefficient (Ga), trait component index (TCI), state component index (SCI), and variance components (in %) for the person (P) × occasion (O) × item (I) design including interactions (n = 83)

D-Study

Individual item analysis was conducted to obtain variance components for each item analysed separately (i.e., excluding all other items). The estimated variance components for person, occasion, and the person × occasion interaction, together with the computed SCI values, are presented in Table 3. Nine items (1, 2, 4, 12, 15, 18, 28, 30, and 38) presented with a high SCI (≥ 0.80), reflecting high sensitivity to state changes over time. At the other end, nine items with a low SCI (≤ 0.50) were least sensitive to state changes and reflected predominantly trait mindfulness. All other items had SCI values between these benchmarks (0.50 < SCI < 0.80) and could not be clearly classified as reflecting either state or trait.

Table 3 Variance components of person (P), occasion (O) and P×O interaction together with state component index (SCI) for each individual item of the FFMQ (n = 83 × 3)

Furthermore, a series of generalizability analyses was conducted by combining the most dynamic items (those with the highest SCI), because we expected that this would result in a reliable state measure. Table 4 shows the D-study results for these analyses, including reliability estimates and variance components attributed to person, item, and occasion and their interactions. The first analysis combined the most dynamic item from each of the five subscales, namely items 1, 4, 12, 30, and 38 (Table 4, analysis a). In analysis b (Table 4), the five items with the highest SCI in the total scale (items 1, 12, 15, 30, and 38) were combined, and each subsequent analysis added the next most dynamic item from the remaining items (4, 18, and 28). The results showed that the person × item × occasion interaction was the main source of error variance across all these analyses, ranging from 76.50 to 91.40% of the total error variance. As expected, Gr and Ga for all analyses of the most dynamic items were below the acceptable generalizability for a trait measure (0.70). However, all SCI values for these analyses were lower than 0.19, which is far below expectations for a state measure (i.e., an SCI above 0.60 is required to consider a measure as assessing state). These findings mean that none of the tested item combinations can be used reliably for the assessment of state mindfulness. Further analyses were conducted to test whether removing items with higher SCI from each subscale would improve its reliability in measuring trait mindfulness. Items with the highest SCI were removed one at a time, starting with the highest, and the G coefficients of the relevant subscale were examined. However, no improvement in reliability was achieved for any of the FFMQ facets (all Gr < 0.60).

Table 4 D-study reliability estimates and variance components for the person (P) × occasion (O) × item (I) design including interactions for combined FFMQ items with the highest state component index (SCI)

Discussion

The aim of this study was to distinguish between the state and trait components of the FFMQ and to examine the temporal reliability and generalizability of the scale using G-Theory. The results show that the total 39-item FFMQ and the FFMQ-18 are reliable in measuring trait mindfulness, with G coefficients of 0.89 and 0.75, respectively, meaning that their scores are generalizable across persons and occasions. All five individual subscales of the FFMQ were found to measure trait mindfulness (TCI above 0.60; SCI below 0.40), but they appeared less reliable (G coefficients below 0.45) than the total FFMQ and the FFMQ-18. Our results indicated that individual subscale scores were affected by measurement error due to interactions between person, item, and occasion, which accounted for the highest percentage of error variance, ranging from 43 to 64% across subscales. Individual subscales were also affected by error due to the person × item interaction, which was especially evident in the Describe (34%), Observe (31.2%), and Nonreact (27.8%) subscales. In contrast, the FFMQ total scores contained a state component (the person × occasion interaction) that constituted 98% of the total error variance, but its influence on the overall reliability of measurement was negligible, with G ≥ 0.80 (Shavelson et al. 1989).

A D-study was conducted in an attempt to develop a subscale measuring mindfulness as a state by combining the FFMQ items identified as the most dynamic over time, but this did not result in a sensitive state measure, as reflected by low SCI values. It is possible that dynamic changes in specific aspects of mindfulness do not occur simultaneously and cancel each other out when different state items are combined. For example, item 38 (“doing things without paying attention”) and item 30 (“I think my emotions are bad or inappropriate”) had SCI values of 0.95 (TCI = 0.05) and 0.98 (TCI = 0.02), respectively, which indicates that they largely measure state aspects of mindfulness. However, combining these items may counterbalance state changes in each aspect over time because such changes are unlikely to occur at the same time. This notion is supported by the results in Table 4, where combining state items resulted in lower SCI values. These findings are consistent with psychometric studies that demonstrated a reduction in measurement error due to individual items when items are combined into super-items or parcels (Medvedev et al. 2018; Taylor et al. 2017).

We note that each of the FFMQ subscales, except for Nonjudge, included both state and trait items. Although all Nonjudge items were sensitive to change over time, the overall subscale sensitivity was low (SCI = 0.19; TCI = 0.81), meaning that this subscale does not reflect state changes. This could be explained by the fact that the different aspects of a non-judgmental attitude captured by individual items (e.g., towards the self, emotions, and thoughts) may not co-occur in time. Therefore, combining Nonjudge items may reduce the overall subscale sensitivity to change because state-related variances may cancel each other out (Medvedev et al. 2018; Taylor et al. 2017). Nevertheless, these findings indicate that various aspects of a non-judgmental attitude are very dynamic and should be a primary focus of MBIs because they are more amenable to change and have consistently been found to be strong predictors of psychological symptoms (Baer et al. 2008; Medvedev et al. 2018).

In the Observe subscale, only three items (“I pay attention to sensations”, “I notice the sensations of my body moving,” and “I notice how emotions affect thoughts and behaviour”) clearly measured state, as indicated by their high SCI and low TCI values (SCI of 0.89, 0.88, and 0.75; TCI of 0.11, 0.12, and 0.25, respectively). When aiming to develop mindful observing, focusing first on emotions, sensations, and thoughts may therefore be helpful, as these are the most amenable features. The results also show that “I pay attention to sounds”, “I notice the smells and aromas of things,” and “I stay alert to the sensations of water” obtained lower SCI values, reflecting more stable, trait-like aspects of a person.

The Describe subscale showed psychometric patterns comparable to those of the Observe subscale. Only two items (“I’m good at finding words to describe my feelings” and “It’s hard for me to find the words to describe”) clearly displayed high sensitivity to change (state), with SCI values of 0.81 and 0.89 (TCI of 0.19 and 0.11), respectively. The remaining items in this facet reflected predominantly enduring patterns. Although Describe had more trait-like items than items reflecting a state, this facet still cannot be regarded as a reliable trait measure of mindfulness according to our results (Gr = 0.40). This may be explained by the fact that the individual items measure the ability to describe unobservable experiences such as feelings, sensations, and thoughts, which change over time; this is reflected in the high measurement error due to interactions between person, item, and occasion.

In the Nonreact subscale, four items with SCI > 0.60 indicated high sensitivity to change, with the most sensitive item being “I perceive my emotions without reacting to them” (SCI = 0.80; TCI = 0.20). The remaining three items in this subscale can be psychometrically classified as measuring a trait. Although the Nonreact subscale included items sensitive to change over time, the overall SCI was low (0.36; TCI = 0.64), meaning that this subscale did not reliably reflect dynamic aspects of mindfulness when the items were combined. Similar to the other FFMQ subscales, Nonreact was affected by measurement error due to interactions between person, item, and occasion. This indicates that people may respond to the same item differently on different occasions because individual thoughts and feelings vary over time.

There was an obvious imbalance between items reflecting state and trait mindfulness in the Act with Awareness facet. Only two items, “I am easily distracted” (SCI = 0.45; TCI = 0.55) and “I do jobs or tasks automatically” (SCI = 0.48; TCI = 0.52), were less sensitive to changes across occasions. The remaining six of the eight items in this subscale reflected state aspects of mindfulness, with three items showing high SCI values ranging from 0.83 to 0.95 (TCI ranging from 0.05 to 0.17). However, combining these items did not result in a sensitive state measure.

Limitations and Future Research

Some limitations need to be acknowledged. The study was conducted with university students, a relatively homogeneous sample with a large proportion of females, and the results should be replicated in more diverse samples. The gender imbalance may have influenced the results, and it would be beneficial for future studies to replicate this analysis with a more balanced sample and to analyze genders separately. The FFMQ-18 was analyzed using data from the full scale, which is a potential limitation because responding to items presented in a different order may influence the results. Although the FFMQ contains 19 reverse-scored items designed to reduce response bias, these items may affect the reliability of the scale, meaning that the obtained G coefficients could be higher if there were no reverse-scored items.

In the current study, we found 25 items (i.e., 1, 2, 3, 4, 5, 8, 9, 11, 12, 14, 15, 17, 18, 23, 25, 28, 29, 30, 31, 33, 35, 36, 38, and 39) with a high SCI (≥ 0.60), reflecting high sensitivity to state changes over time. The remaining fourteen items had SCI values between the benchmarks (0.30 < SCI < 0.60) and could not be clearly classified as reflecting either state or trait because they measure both aspects. This means that there were no items with a low SCI (≤ 0.30), that is, items least sensitive to state changes and reflecting predominantly trait mindfulness. These findings should be replicated in future research using different samples.

In conclusion, the findings of this study indicate that reliable measurement of trait mindfulness can be achieved using the full FFMQ or its short version, the FFMQ-18, with scores generalizable across the sample population and occasions. Scores obtained on the individual facet subscales of the FFMQ predominantly measure trait mindfulness, but their reliability is affected by measurement error due to the interaction between person, item, and occasion. The robust psychometric properties of the full FFMQ and the FFMQ-18 permit assessment of trait mindfulness reflecting long-lasting effects of MBIs and evaluation of their long-term effectiveness. The state items identified in this study reflect dynamic components of mindfulness that are the most amenable to change and should be a primary target of MBIs.