Background

Anxiety disorders (AD) have become the most common type of mental disorder in the population, often leading to chronic illness and disability [1]. They are characterized by excessive and persistent fear, anxiety, or avoidance of perceived threats, and may include panic attacks [2]. Social pressure on adults in China is increasing along with the country’s rapid social and economic development. According to research by Huang, the China Mental Health Survey in 2012 showed that anxiety disorders were the most common class of disorders both in the 12 months before the interview (weighted prevalence 5.0%, 4.2–5.8) and over the lifetime (7.6%, 6.3–8.8) [3]. The COVID-19 pandemic also affected the mental health of the community population, primarily in terms of depressive and anxiety symptoms [4]. At the same time, an American study showed that anxiety disorders have the highest estimated lifetime prevalence rates of any psychiatric disorder (18.0–3.7%) [5]. A survey found that the prevalence of anxiety symptoms among Chinese older adults (≥ 60 years) was 12.15% (1751/14,417), and that the prevalence of anxiety disorders among older adults had nearly tripled in six years [6]. The average annual family medical cost of mental illness has increased from $1094.8 to $3665.4 [7], straining health care resources and increasing the socioeconomic burden on families. In addition, because assessment criteria are lacking, many people classify anxiety disorders as depression, which leads to later worsening of the illness and makes differential diagnosis increasingly difficult [8].

Although the Generalized Anxiety Disorder 7-item scale (GAD-7) has been used in clinical practice in China, we found that it focuses only on psychological aspects and does not cover physical condition or social support, so it is not well suited to the Chinese context. Meanwhile, in China the SF-36 and WHOQOL-BREF are the instruments most often used to measure the quality of life (QOL) of patients with anxiety disorders, but we consider these generic instruments insufficiently specific to the disease. Some scholars believe that QOL assessment should combine generic and specific instruments to maximize both sensitivity and generalizability [9].

Although Chinese versions of Western scales can be developed through a rigorous translation process, they hardly reflect Chinese characteristics because of the differences between Western and Chinese cultures and the strong cultural dependence of quality of life. Considering the cultural dependence and disease specificity of QOL, we systematically developed a QOL instrument system called QLICD (V2.0) (Quality of Life Instruments for Chronic Diseases) [10,11,12]. Within this system, QLICD-AD (V2.0) is the specific scale for anxiety disorders, composed of the 28-item general module and a 12-item specific module. Preliminary validation showed that it has good psychometric properties [13, 14].

The quality of the items is an important aspect of the quality of a scale, and item analysis is an integral part of scale development, application and simplification. Classical test theory (CTT) evaluates assessments from a macro perspective; it has low sample size requirements, and its model parameters are simple and conceptually intuitive to estimate. The development of the whole scale system is still mainly based on CTT, which has some obvious shortcomings, such as the sample dependence of its statistics, the ambiguity of the error and the imprecision of the reliability estimate, and the inconsistency between the ability and difficulty scales [15]. Item response theory (IRT), by contrast, is widely used at the micro level, for example for item analysis in psychological and educational measurement. Its advantages of sample independence and accurate results allow a deeper, more detailed and more standardized analysis of scale quality [16, 17]. However, IRT is more computationally intensive, and results from small samples may be unstable [18]. Combining the two methods to analyze the items can compensate for their respective shortcomings and greatly improve the level and scientific rigor of scale development and evaluation. Therefore, in our study, CTT and IRT were used together to analyze the items from both macro and micro perspectives, thus avoiding the errors caused by relying only on statistical analysis and improving the representativeness and reliability of the items.

The purpose of this study was to systematically evaluate the items of the QLICD-AD (V2.0) based on classical test theory and item response theory, in order to provide a basis and reference for further optimization and application of the scale. It will also help to evaluate the applicability of the QLICD-AD (V2.0) to hospitalized patients with anxiety disorders, and thereby facilitate the assessment of quality of life in patients with AD.

Methods

Participants

We recruited participants at the Affiliated Hospital of Guangdong Medical University in China using the following inclusion and exclusion criteria. The diagnosis was fully supported by the Department of Psychiatry at the Affiliated Hospital of Guangdong Medical University.

Inclusion Criteria: ① Participants should meet the diagnostic criteria of ICD-10 (International Classification of Diseases). ② Participants should have clear consciousness and a stable condition. ③ Participants should be able to complete the questionnaire on their own. ④ Participants should be willing to participate in this research and have signed an informed consent form.

Exclusion Criteria: ① Participants with anxiety disorders caused by organic and somatic brain diseases. ② Participants whose anxiety disorder was caused by the use of psychoactive substances or who had a history of using psychoactive substances. ③ Participants who were delirious and in the acute phase of an anxiety disorder. ④ Participants who had been diagnosed with any other mental illness.

After explaining the study procedure to eligible patients, we obtained their signed informed consent. The study protocol and informed consent form were approved by the Institutional Review Board (IRB) of the investigator’s institution.

Measurement tools

QLICD-AD (V2.0): The second version of the Quality of Life Instruments for Chronic Diseases-Anxiety Disorder (QLICD-AD, V2.0) combines a general module with an anxiety disorder module, for 40 items in total [14, 19]. The general module includes 3 domains, namely physical function (GPH1-GPH9), social function (GSO1-GSO8) and psychological function (GPS1-GPS11), covering 9 facets and 28 items. The anxiety disorder module includes 12 items. Each item is rated on five levels (possible score range: 1 to 5, from 1 = no problem to 5 = extreme problem). Following the scoring rules, standard scores can be calculated for each domain, each facet and the total scale. Standard scores range from 0 to 100, with higher scores indicating better QOL. Details of the items are presented in Table 1.
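The scoring rule itself is not reproduced in this section. As a minimal sketch, assuming the usual linear rescaling of a raw sum to the 0 to 100 range, and assuming the item scores passed in have already been recoded so that higher values mean better QOL, a standard score could be computed as follows. The function name `standard_score` and its arguments are hypothetical and are not taken from the scale's manual.

```python
def standard_score(item_scores, low=1, high=5):
    """Rescale a raw sum of item scores to the 0-100 range.

    Assumption (not stated in the paper): a linear transformation
    (raw - min) * 100 / (max - min), applied to items already recoded
    so that a higher score means better quality of life.
    """
    n = len(item_scores)
    if n == 0:
        raise ValueError("at least one item score is required")
    raw = sum(item_scores)
    raw_min = n * low    # lowest possible raw sum
    raw_max = n * high   # highest possible raw sum
    return (raw - raw_min) * 100.0 / (raw_max - raw_min)


# Hypothetical respondent: 12 specific-module items, all answered 3
print(standard_score([3] * 12))
```

Under this assumption, a 12-item module can only take values in steps of 100/48, which is one way to sanity-check reported module scores.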

Table 1 Items of the QLICD-AD (V2.0)

Statistical analysis

After the completed scales were collected and the data organized, the demographic profile was described first. The statistical indicators of the CTT were then calculated, and the graded response model (GRM) of the IRT was used to derive the average information amount, difficulty coefficients, and discrimination of each item. All analyses were performed in RStudio.

Classical test theory (CTT)

CTT is founded on the proposition that measurement error, a random latent variable, is a component of the observed score random variable [19, 20]. It is a traditional quantitative approach to testing the reliability and validity of a scale based on its items [21].

Reliability and validity were analyzed under the CTT, and the scale items were evaluated using four statistical methods: Cronbach’s coefficient method, the variability method, the correlation coefficient method, and the factor analysis method. Items that satisfied at least three of these methods were rated as good items. In RStudio, the ltm package was used to calculate Cronbach’s coefficient and the bruceR package was used for exploratory factor analysis; the degree of variability and the correlation coefficients were calculated using the appropriate formulas.

(1) Cronbach’s coefficient method: to analyze the items from the perspective of internal consistency, Cronbach’s coefficient α1 is calculated for each domain and compared with the coefficient α2 of the same domain after the item is deleted; if α1 ≥ α2, the item is evaluated as good. A subscale Cronbach’s α above 0.7 indicates good reliability, a value between 0.6 and 0.7 is acceptable, and a value below 0.6 suggests that the scale should be modified.

(2) Degree of variability method: to analyze the items from a sensitivity perspective, calculate the standard deviation of each item, and evaluate those with a large degree of dispersion (> 0.90) as good items.

(3) Correlation coefficient method: to evaluate the independence or representativeness of the items, the correlation coefficients of each item with the scale scores are calculated. If the correlation coefficients of an item with the score of the domain to which it belongs and with the total scale score are both > 0.5, the item is highly correlated with its domain and with the total scale and is rated as good.

(4) Exploratory factor analysis: to evaluate the representativeness of the items, principal component analysis is performed with the eigenvalue > 1 criterion, and the factor loading of each item is calculated after orthogonal varimax rotation. An item with a factor loading > 0.5 is considered good; a factor loading < 0.5 means that the item has little influence on the latent variable to be measured. In addition, exploratory factor analysis (EFA) with minimum residual extraction was used to test the unidimensionality of the scale. It is generally accepted that the unidimensionality assumption is largely met when the first factor explains more than 20–40% of the variance and the ratio of the first to the second eigenvalue is greater than three [22].
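The first three criteria above can be sketched in a few lines of Python. This is only an illustration of the decision rules for a single domain, not the R code used in this study: the function names, the toy data, and the use of an uncorrected item-total correlation (the item is included in the total) are assumptions, and the factor-loading criterion is omitted because it needs an eigendecomposition.

```python
from statistics import mean, stdev, variance


def cronbach_alpha(items):
    """Cronbach's alpha; `items` is a list of columns, one per item."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var / variance(totals))


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def screen_domain(items, sd_cut=0.90, r_cut=0.5):
    """Apply criteria (1)-(3) to every item of one domain."""
    alpha_all = cronbach_alpha(items)
    totals = [sum(row) for row in zip(*items)]
    flags = []
    for j, col in enumerate(items):
        rest = items[:j] + items[j + 1:]
        flags.append({
            # (1) alpha of the domain >= alpha after deleting the item
            "alpha_ok": alpha_all >= cronbach_alpha(rest),
            # (2) standard deviation of the item above the cut-off
            "sd_ok": stdev(col) > sd_cut,
            # (3) item-total correlation above the cut-off (uncorrected)
            "r_ok": pearson(col, totals) > r_cut,
        })
    return alpha_all, flags


# Toy data: three consistent items and one inconsistent item
good = [1, 2, 3, 4, 5]
bad = [5, 1, 4, 2, 3]
alpha, flags = screen_domain([good, good, good, bad])
print(round(alpha, 3), flags[3])
```

In the toy data the inconsistent item fails criteria (1) and (3) but passes (2), which illustrates why the study combines several methods rather than relying on a single one.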

Item response theory (IRT)

Unlike the CTT, the IRT directly models the relationship between the response to an item and the corresponding underlying trait, overcoming the shortcoming that CTT parameter estimates depend on the sample. Compared with the CTT, it can accurately estimate the measurement error of each item and each participant [18].

QLICD-AD (V2.0) is divided into four domains: the physical functioning domain, the psychological functioning domain, the social functioning domain, and the specific module. Each item is scored on a five-point Likert scale, which yields ordered polytomous responses, so in this study we used the graded response model (GRM) of the IRT [23]. The formula of the graded response model [24] is as follows:

$$ P\left({v}_{i}=k \mid \theta =t\right)=\frac{1}{1+\exp\left[-1.7{a}_{i}\left(t-{b}_{i,k}\right)\right]}-\frac{1}{1+\exp\left[-1.7{a}_{i}\left(t-{b}_{i,k+1}\right)\right]} $$

The graded response model treats each item as a series of dichotomies (the number of categories minus one) and estimates a two-parameter model for each dichotomy; at the boundaries of the lowest and highest categories, the cumulative probabilities are fixed at 1 and 0, respectively. Here \( {v}_{i}\) is the response to the polytomously scored item \( i\), \( k\) indicates a response option, \( \theta \) (theta) is the latent variable measured by the item, \( {a}_{i}\) is the discrimination parameter, and \( {b}_{i,k}\) is the threshold parameter.
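The formula above can be evaluated directly. The following minimal Python sketch computes the five category probabilities of one item at a given trait level; the function names and the parameter values are hypothetical, and the constant 1.7 is taken from the formula as printed (estimation software may fit the model without this scaling constant, which should be checked before comparing parameters).

```python
import math

D = 1.7  # scaling constant as it appears in the formula above


def boundary_prob(theta, a, b):
    """Cumulative probability of responding above threshold b."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def grm_category_probs(theta, a, thresholds):
    """Category probabilities under the graded response model.

    `thresholds` must be increasing (b1 < b2 < ...). The cumulative
    probabilities are padded with 1 (lowest boundary) and 0 (highest).
    """
    cum = [1.0] + [boundary_prob(theta, a, b) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical five-category item: a = 1.2, thresholds -2, -1, 0, 1
probs = grm_category_probs(theta=0.0, a=1.2, thresholds=[-2.0, -1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```

Because each category probability is a difference of adjacent cumulative probabilities, the five values always sum to one, and they are all non-negative only when the thresholds increase monotonically.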

The information amount at different positions, the average information amount, the difficulty coefficients, and the discrimination of each item were calculated for the micro-level evaluation of the scale items. We also estimated the test information function (TIF) and the associated standard error of measurement (SE), which indicates the precision of the entire scale [25], to determine the trait level at which the QLICD-AD (V2.0) provided the most information. The parameters were estimated using the Marginal Maximum Likelihood Estimation (MMLE) method and the Expectation Maximization (EM) algorithm [26]. The computation and plotting for the IRT were done in RStudio with the mirt and purrr packages.

(1) Information amount of the items: this reflects the amount of information that each item provides in estimating the respondent’s ability; the larger the information amount, the smaller the standard error of measurement. In this paper, five points with \( \theta \) values of -2, -1, 0, 1, and 2 were selected, and the item information function was evaluated at these five points and averaged. A scale information amount > 25 indicates good measurement quality, 16–25 indicates acceptable measurement, and < 16 indicates poor measurement [14, 19]. The QLICD-AD (V2.0) has 40 items, so dividing 25 and 16 by 40 gives per-item cut-offs: items with an average information amount > 0.63 (25/40) are judged to be excellent, and those < 0.40 (16/40) are judged to be poor. However, we believe that this criterion is too strict. In this study, the total information amount of the scale was set to 5, corresponding to a reliability of 0.8 (reliability = 1 − 1/information for a latent trait with unit variance, so 1/(1 − 0.8) = 5), which gives an average information amount of 0.125 (5/40) per item. Accordingly, an item was evaluated as “good” when its mean information amount was greater than 0.125 and as “poor” when it was less than 0.125.

(2) Difficulty coefficient b: the scale uses five-point equidistant scoring, so each item has four difficulty coefficients, b1, b2, b3 and b4. As the difficulty level increases (b1→b4), the difficulty coefficients of each item should increase monotonically, and items whose coefficients fall within [-4, 4] are considered good. Discrimination a: the greater the discrimination, the greater the amount of information the item provides, and items with discrimination > 0.5 are considered good.

(3) Item Characteristic Curve (ICC): this describes the functional relationship between a subject’s latent trait and the probability of each response. The Item Information Curve (IIC) shows the information an item provides across trait levels; a larger area under the curve indicates higher measurement accuracy. The Test Information Function (TIF) reflects the precision of the test at various levels of the trait being measured. In general, the quality of the scale was considered high when the total information was 25 or more and acceptable when the total information was between 16 and 25 [27, 28]. In addition, a conversion table between raw total scores and IRT trait scores was calculated using the Expected A Posteriori (EAP) method of Bayesian estimation [20]. The IRT scores were calculated by integrating the parameter estimates (a, b) for each item, which means that the same total score corresponds to an interval of IRT scores.
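To make the averaging rule in (1) concrete, the sketch below evaluates the standard GRM item information function (the sum over categories of the squared derivative of the category probability divided by that probability) at the five trait points and compares the mean with the 0.125 cut-off. It is an illustrative Python sketch, not the mirt computation used in this study; the function names and the item parameters are hypothetical, and the constant 1.7 follows the GRM formula given earlier.

```python
import math

D = 1.7                          # scaling constant from the GRM formula
THETA_POINTS = (-2, -1, 0, 1, 2)  # evaluation points used in this study
MEAN_INFO_CUTOFF = 5 / 40         # total information 5 spread over 40 items


def boundary_prob(theta, a, b):
    """Cumulative probability of responding above threshold b."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, thresholds):
    """GRM item information at one trait level."""
    cum = [1.0] + [boundary_prob(theta, a, b) for b in thresholds] + [0.0]
    info = 0.0
    for k in range(len(cum) - 1):
        p = cum[k] - cum[k + 1]  # category probability
        # derivative of the category probability with respect to theta
        slope = D * a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        if p > 0:                # skip empty categories (disordered thresholds)
            info += slope ** 2 / p
    return info


def mean_information(a, thresholds, points=THETA_POINTS):
    """Average item information over the chosen trait points."""
    return sum(item_information(t, a, thresholds) for t in points) / len(points)


# Hypothetical item with a = 1.5 and evenly spaced thresholds
avg = mean_information(1.5, [-1.5, -0.5, 0.5, 1.5])
print("good" if avg > MEAN_INFO_CUTOFF else "poor")
```

For a two-category item the function reduces to the familiar two-parameter form, which gives a quick check that the implementation is consistent.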

Results

Patient’s characteristics

A total of 120 patients with anxiety disorders aged 15–63 years agreed to participate in the study. Of these, 74 (61.7%) were male and 46 (38.3%) were female; 30% were unmarried, and 1.7% and 2.5% were divorced and widowed, respectively. Family economic status was predominantly middle class (67, 55.8%). About half of the patients were farmers (30, 25.0%) or laborers (29, 24.2%). The total detection rate of complete anxiety symptoms was 61.7%. See Table 2 for details.

Table 2 Socio-demographic characteristics of the participants (N = 120)

Scores of the QLICD-AD (V2.0)

The overall mean score of the QLICD-AD (V2.0) was 58.44 ± 15.06 (range 24.47 to 91.49); the mean score of the general module was 57.52 ± 15.24 (range 20.31 to 90.63); and the mean score of the specific module was 58.58 ± 20.69 (range 12.50 to 95.83). For the general module, skewness was -0.564 and kurtosis was 0.232 (skewness z-score 2.058, kurtosis z-score 0.429), indicating a negatively skewed, peaked distribution. For the specific module, skewness was -0.567 and kurtosis was -0.241 (skewness z-score 2.069, kurtosis z-score 0.445), indicating a negatively skewed, flat distribution. For the whole QLICD-AD (V2.0), skewness was -0.602 and kurtosis was 0.194 (skewness z-score 2.197, kurtosis z-score 0.359), indicating a negatively skewed, peaked distribution. There was no “floor effect” or “ceiling effect” in the overall score or in the scores of each domain/module. See Fig. 1 for details.

Fig. 1 Histogram of total and module scores of the QLICD-AD (V2.0)

Classical test theory analyses

Based on the CTT analysis, Cronbach’s α of the QLICD-AD (V2.0) scale was 0.931. Cronbach’s α of the physical functioning domain was 0.706, and the α of its 9 items after deletion of one item ranged from 0.655 to 0.692. Cronbach’s α of the psychological functioning domain was 0.855, and the α of its 11 items after deletion of one item ranged from 0.825 to 0.866; GPS3 and GPS10 did not meet the criterion. Cronbach’s α of the social functioning domain was 0.758, and the α of its 8 items after deletion of one item ranged from 0.699 to 0.774; GSO6 did not meet the criterion. Cronbach’s α of the specific module was 0.865, and the α of its 12 items after deletion of one item ranged from 0.847 to 0.863.

All 40 items satisfied the degree of variability method. The correlation coefficients between the items and the total scale score ranged from 0.321 to 0.711, with 10 items < 0.5 and the other 30 items > 0.5. The factor analysis showed a KMO value of 0.804 and a Bartlett’s test of sphericity of \( {\chi }^{2}\) = 2618.627, \( P \) < 0.001, and 30 items met the factor loading criterion.

Table 3 Results of QLICD-AD (V2.0) items analysis based on four methods under CTT

In summary, GSO6 satisfied only one statistical method and requires major revision. GPH1, GPH2, GPH4, GPS3, GPS10, and GSO4 satisfied only two statistical methods and need appropriate adjustment. A total of 33 items satisfied at least 3 statistical methods. See Table 3 for details.

The unidimensionality test showed that the ratio of the first to the second eigenvalue was > 3. See Fig. 2 for details.

Fig. 2 Scree plot of QLICD-AD (V2.0)

IRT analyses

In this study, the GRM of the IRT was used to calculate the discrimination, difficulty coefficients and average information amount of each item.

Discrimination and difficulty

As shown in Table 4, the discrimination of the 40 items ranged from 0.35 to 1.94; 38 items had a discrimination > 0.50, and 2 items (GPH6 and GPS3) had a lower discrimination. The difficulty coefficients ranged from −12.134 to 5.072; 32 items fell within −4 to 4 and showed a monotonically increasing trend, while GPH6, GPH8, GPS3, GSO2-GSO4, AD2, and AD5 did not meet these requirements.

Table 4 Estimates of discrimination and difficulty parameters of QLICD-AD(V2.0) based on IRT GRM

Average information amount

In this study, 35 of the 40 items had a mean information amount > 0.125; 11 of them were judged as excellent and 24 as fair, and the remaining 5 (GPH1, GPH6, GPH8, GPS3, GSO2) were judged as poor. See Table 5 for details.

Table 5 Information amount at different points (\( \theta \)) of items of the QLICD-AD (V2.0)

Item characteristic/ information curve

The Item Characteristic Curve (ICC) expresses the probability of each response option being selected as a function of the latent trait. Figure 3 shows the ICC and the Item Information Curve (IIC) for all items. In the IICs shown on the left, the area under the curve is smallest for GPH1, GPH6, GPH8, GPS3, GSO1-GSO3, AD2, and AD5, indicating low measurement accuracy. In the ICCs on the right, the curves P1-P5 show the different response options. For GPS3, GPS8, GPH6, GPH8, GSO4, and AD2, the response probabilities are similar across categories, and one response always has the highest probability at higher levels of the continuum.

Fig. 3 Item information curve (IIC) and item characteristic curve (ICC) of QLICD-AD (V2.0)

Test information function

Figure 4 shows the test information function and the measurement error. Information is highest (and the standard error lowest) in the range of -1 to 0 on the z-score metric, and all marginal reliabilities for this scale were > 0.8.
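The link between test information, standard error and reliability can be written out in a few lines. The sketch below uses the standard relations SE = 1/√information and reliability = 1 − 1/information, which assume a latent trait scaled to unit variance; the function names are hypothetical, and the information values are the cut-offs quoted in this paper rather than values read from Fig. 4.

```python
import math


def standard_error(information):
    """Standard error of measurement at a given test information."""
    return 1.0 / math.sqrt(information)


def reliability(information):
    """Reliability implied by test information (unit-variance trait)."""
    return 1.0 - 1.0 / information


# Cut-offs quoted in the text: 5 (this study), 16 (acceptable), 25 (good)
for info in (5, 16, 25):
    print(info, round(standard_error(info), 3), round(reliability(info), 3))
```

Read this way, the conventional cut-offs of 16 and 25 correspond to reliabilities well above 0.9, whereas a total information of 5 corresponds to a reliability of 0.8.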

Fig. 4 Test information function (TIF) and reliability of QLICD-AD (V2.0)

Discussions

For many years, CTT and IRT have been the two major methods used for test and scale construction and development in the educational, behavioral and social sciences [29]. They are also the two most classical theories in the field of scale development and are commonly used for item analysis and screening. CTT evaluates a scale from a macro perspective [30]. It is accurate enough in most cases, but its theoretical assumptions are weak and its error index is general and single. Its biggest disadvantage is its strong dependence on the sample, which IRT overcomes. IRT calculates the discrimination, difficulty and information of each item at the micro level. Its item parameter estimates are independent of the sample, so it can accurately estimate the measurement error of each item and of the test for each subject, and evaluate items more accurately. Because the QLICD-AD (V2.0) items have five response levels, IRT can analyze and estimate the non-linear model more accurately and better meet the needs of modern analysis [31]. CTT and IRT complement each other, and combining the two allows a better assessment of the items.

In the CTT analysis, seven items (GPH1, GPH2, GPH4, GPS3, GPS10, GSO4, GSO6) did not satisfy at least three of the statistical methods. The correlation coefficients of GPH1, GPH2, GPH4, GPS3, GPS10, GSO4, and GSO6 are small, so the representativeness and independence of these items are poor. The factor loadings of GPH1, GPH2, GPH4, GSO4, and GSO6 are small, so the representativeness of these items is poor. GPS3, GPS10, and GSO6 reduce the internal consistency of their domains. For GSO6, in one study, more than half of those who remained disordered at follow-up had significant health care costs, treatment-resistant symptoms, and severely impaired quality of life [32]. However, because an item is rated as good only when it satisfies at least three of the four statistical methods, the CTT analysis identifies these seven items as requiring further optimization.

In the IRT analysis, the average information amount of GPH1, GPH6, GPH8, GPS3, and GSO2 was too low. For GPH6 and GPS3, neither the difficulty nor the discrimination met the judging criteria. The difficulty coefficients of GPH6, GPH8, GPS3, GSO2-GSO4, AD2, and AD5 were not within the range of the judging criteria. The IICs of GPH1, GPH6, GPH8, GPS3, GSO1-GSO3, AD2, and AD5 and the ICCs of GPH8, GPS3, GPS8, GSO4, and AD2 were also unsatisfactory. The discrimination of the other 38 items met the judging criterion, and these items provided a greater amount of information.

Table 6 CTT and IRT unlisted items summary

In summary (Table 6), combining the results of the CTT and IRT analyses, 32 of the 40 items of the QLICD-AD (V2.0) performed well, 6 items (GPH1, GPH8, GSO2, GSO4, AD2, AD5) need to be further optimized, and items GPH6 and GPS3 should be deleted because they failed too many of the tests. The remaining items are of better quality. Although the results showed that the QLICD-AD (V2.0) can be used effectively to measure patients with anxiety disorders, the anxiety disorder experts in the group discussed the statistical results for the items to be modified or deleted and suggested modifications, to avoid errors caused by relying solely on statistical analysis and to improve the representativeness and reliability of the items.

This study used two theories to evaluate the items of the QLICD-AD (V2.0) in a relatively comprehensive and complementary way, but the sample size and the scope of data collection are still limited. IRT analysis of items generally requires a sample of 250 cases [33]. Because of limited time, manpower and other constraints, this study does not meet the requirement for a large sample. To make the scale evaluation more accurate and reliable, the sample size can be increased for further analysis and evaluation. In addition, the subjects in this study were selected only from hospital inpatients. Further large-scale research is needed in other settings and populations, such as outpatients in hospitals or local clinics. The next step is to adjust the QLICD-AD (V2.0) based on the above results. In the future, we will work with psychiatric departments of hospitals in different provinces of China and with local communities to expand the population coverage, so that the QLICD-AD (V2.0) can become a suitable scale for measuring quality of life in patients with anxiety disorders in China.