Introduction

Test anxiety (TA), first described in the psychological literature by Mandler and Sarason (1952), is characterized by a heightened state of anxiety that occurs before or during tests (Sommer and Arendasy 2015). It has also been described as a set of physiological and behavioral responses that accompany concerns about failing a test or performing poorly on it (Zeidner 1998). TA is a serious and pervasive problem among students (Bodas and Ollendick 2005; Ergene 2003), and students with TA feel nervousness, fear, and worry in evaluative situations (Spielberger et al. 1979; Spielberger and Vagg 1995). Studies that correlate TA with academic achievement suggest that high levels of TA are associated with lower levels of learning and performance (Sub and Prabha 2003). At all levels of education, students who often feel test-anxious perform poorly on standardized tests (Everson et al. 1991a, b) and receive poorer grades (Chapell et al. 2005), mainly because anxiety and other test-taking deficiencies interfere with their performance either directly or indirectly (Efklides et al. 1997, 1999; Lowe et al. 2008; Metallidou and Vlachou 2007). Accordingly, it is critical to assess and diagnose TA accurately and to provide timely treatment. Measuring TA with self-report scales has been a common method over the past several decades. A number of self-report scales have been used in previous studies, including the Test Anxiety Inventory (TAI; Spielberger 1980), the Test Anxiety Scale (TAS; Sarason 1978), the Friedman-Bendas Test Anxiety Scale (FAT; Friedman and Bendas-Jacob 1997), and the State-Trait Anxiety Inventory (STAI; Marteau and Bekker 1992). 
In spite of some differences in the number of items, the symptom severity covered, the time period assessed, and so forth, each scale measures the same general construct, TA (Friedman and Bendas-Jacob 1997; Sarason 1978; Spielberger 1980; Umegaki and Todo 2017). In the past, the psychometric properties of most self-report scales have been assessed with classical test theory (CTT), which focuses on reliability, validity, and norms (Hunsley and Mash 2007, 2008). Validity and reliability are two important characteristics of measurement instruments (Devellis 2005). Reliability captures the consistency of scores obtained from applications of the instrument; commonly used reliability indices are test-retest reliability, the split-half coefficient, and Cronbach’s alpha. Validity consists of a complex set of criteria, including convergent validity, divergent validity, and factorial validity, used to judge the extent to which inferences based on scores derived from the application of an instrument are warranted. A norm is a reference system for locating an individual’s test score within a reference group, that is, the index against which the test score is evaluated. For convenience, we have summarized these properties in an overview table in Appendix 1. Moreover, knowledge about the range of severity evaluated by an instrument is critically important for tailoring measurement to specific questions and specific settings (Embretson and Reise 2000, 2012; Olino et al. 2012). Such knowledge can be obtained by applying the approaches of item response theory (IRT).

Existing research on TA scales has been based mainly on CTT and has focused on three topics. (1) Analyzing the psychometric properties of TA scales across different cultures (Manavipour et al. 2013; Mowbray et al. 2015; Lowe et al. 2011a, b; Sebastian et al. 2012). Bi (2002) first translated the FAT into Chinese and reported that the Chinese version had good reliability (> .85; see Table 1) and that its convergent validity with the Spielberger TAI was .84 for boys and .82 for girls. For the Greek version of Spielberger’s (1980) TAI, the well-established two-factor structure was verified (Dimitra et al. 2011). Raju et al. (2010) translated Sarason’s Test Anxiety Scale into an Ethiopian language and reported confirmatory factor analyses with four extracted factors; the Ethiopian version of the Test Anxiety Scale as a whole could be considered reliable and useful for Ethiopian students. (2) Revising a scale and developing a short form. Taylor and Deane (2002) reported that a 5-item short form of the TAI produced optimal reliability (> .80; see Table 1) and validity with a balance of items from the Worry and Emotionality subscales. This 5-item short form shows promise, particularly for contexts in which time demands preclude the use of longer versions. A brief version of the FRIEDBEN Test Anxiety Scale (B-FTAS) has also been investigated; it has the unique strength of measuring test anxiety using a contemporary biopsychosocial model. Exploratory and confirmatory factor analyses identified a brief, 12-item, 3-factor test anxiety assessment consistent with a biopsychosocial model including social, cognitive, and physiological factors. The results provided sufficient evidence for the internal reliability (> .80; see Table 1) and validity of this brief measure of test anxiety (Dave et al. 2013). (3) Using TA scales to conduct related research. 
Yazici (2017) found that competitive and cooperative learning styles had positive, low-level, and significant relationships with the emotionality sub-dimension of the TAS, and that the same relationship held between the competitive learning style and the worry sub-dimension. Lori and Richard (1998) found no significant differences among age groups with respect to test anxiety; poor study behavior was related to higher levels of test anxiety, and better study behavior to lower levels. Their multiple regression analysis also revealed that test anxiety, gender, age, and ethnicity were all statistically significant predictors of study behavior. Everson et al. (1991a, b) reported that the traditional two-factor structure was invariant across males and females and that the TAI had acceptable reliability (> .60; see Table 1). Another study cautioned that researchers should be careful when drawing conclusions based on the original TAI norms, especially in the case of female undergraduates (Szafranski et al. 2012). As this literature shows, almost all existing research on TA scales has been based on the framework of CTT. First, however, CTT methods cannot offer specific information on the severity of TA symptomatology at different trait levels. Second, unidimensionality is an important assumption in IRT, and it is difficult to satisfy for most scales; if a unidimensional model is applied to estimate the item parameters of a multi-dimensional instrument, the parameter estimates are likely to be inaccurate. Third, although plenty of instruments are available, the agreement between them is less than optimal and no scale can be considered a gold standard (Umegaki and Todo 2017). Therefore, it may be difficult for researchers and clinicians to choose an optimal instrument when assessing TA. 
To address this gap, approaches suited to analyzing scales with a multi-dimensional structure are needed and should be applied to reanalyze the TA scales. CTT alone is not sufficient to show how accurately a measure assesses symptoms of varying severity. Item response theory (IRT) is a newer psychometric theory that was developed to overcome the limitations of CTT. IRT methods model the probabilities of individual response options, estimate TA independently of the particular selection of test items, and indicate the position on the latent trait (theta level; i.e., test anxiety) at which each item or inventory provides the most information (Olino et al. 2012).

Table 1 Previous studies on psychometric properties of TAI, TAS and FAT

This study aims to address these issues by (1) investigating the structures of some commonly used scales and (2) simultaneously comparing their psychometric properties within the framework of a Bifactor multi-dimensional IRT approach. The TA scales compared here are the TAI, TAS, and FAT. These three were chosen for the following reasons. (1) The three instruments are widely used in several fields of psychology. The TAI and TAS are widely used in research and practical settings and have particular application to the assessment and treatment of TA in student populations (Song and Zhang 2008; Zhu et al. 2019). The FAT has also been the subject of validation and standardization research (Fereshteh et al. 2012). (2) Evidence indicates that the three scales have high reliability and validity. For example, Song and Zhang (1987) reported that the total scale and each subscale of the TAI had a good Cronbach’s alpha (α = .90; α1 = .80; α2 = .84) and high construct validity. Wang (2001) suggested that the TAS had good Cronbach’s reliability (α > .60) and high concurrent validity. Bi (2002) found that each subscale of the FAT had good internal consistency (range between 0.85 and 0.91) and high construct validity. (3) The three instruments are scored in the same direction (the higher the score, the more serious the test anxiety; Bi 2002; Newman 1996; Song and Zhang 1987; Wang 2001), which allows their psychometric properties to be compared fairly. To the best of our knowledge, no study has compared the psychometric properties of different TA scales within the framework of IRT. To address this issue and take full advantage of IRT, this study compares the psychometric properties of three commonly used TA scales in Chinese university students. 
This study is expected to provide suggestions that help researchers with different study purposes select and apply the most suitable and precise measures (Umegaki and Todo 2017). For instance, a scale may be chosen for studies in which it provides the most information at the lower TA severity level; it may be useful for assessing changes in TA severity in treatment studies, where it can measure severity around the mean most precisely; or it may be chosen to inform a clinical diagnosis, where the best assessment is needed at the higher TA severity level. Furthermore, a multi-dimensional approach (the Bifactor multi-dimensional item response theory model) is used here for the first time to analyze and compare three widely used TA scales, and it is expected to yield more appropriate parameter estimates for items and individuals than unidimensional approaches. This article may therefore contribute to the selection, development, and revision of TA measures.

Method

Sample

A total of 790 university students from China were recruited. Participants were mainly from two general universities in Jiangxi province. Their ages ranged from 18 to 23 years, with a mean of 19.40 (SD = 1.51). The proportions of male and female participants were 57.2% and 42.8%, respectively. In terms of region, 57.5% of the students were from the countryside and 42.5% were from cities.

Procedure

Data were collected across multiple sessions ranging in size from 10 to 30 participants. Three TA instruments were administered before participants’ academic examination. Participants also provided demographic information, including age, gender, class level (freshman, sophomore, junior or senior) and region (city or countryside) prior to completing the questionnaires. All participants in the study agreed to participate and were informed about the purpose of this research. Furthermore, the study was conducted anonymously, and no information that could identify individuals was collected.

Measures

TAI (Spielberger 1980; Chinese Version: Song and Zhang 1987)

The Test Anxiety Inventory (TAI) is a self-report inventory designed to measure test anxiety (TA) as a situation-specific personality trait. The TAI consists of 20 items rated on a 4-point Likert-type scale ranging from 1 (rarely or none of the time) to 4 (always). The TAI provides a measure of total TA (TAI-T) as well as measures of two TA components, emotionality (E) and worry (W). Emotionality refers to perceived autonomic reactions (physiological arousal) evoked by evaluative stress (Spielberger and Vagg 1995), whereas worry refers to cognitive concerns about the consequences of failure (Morris and Liebert 1969). Worry tends to be associated with performance decrements on cognitive and intellectual tasks, but emotionality is not (Hembree 1988; Hong 1998; Spielberger et al. 1979). The Chinese version of the TAI was first tested by Song and Zhang (1987); in Chinese university students, the Cronbach’s alpha values of the total scale and the two subscales were .90, .80, and .84. The items describe phenomena associated with TA, for example, “I feel confident and relaxed when I take the exam” and “At the exam, I was upset.” Of the 20 items, one is a positively worded statement. The TAI has been used extensively, and the manual indicates that “most high school and college students complete the inventory in 8 to 10 minutes” (Spielberger et al. 1980).

TAS (Sarason 1978; Chinese Version: Wang 2001)

The TAS is a unidimensional self-report scale. It comprises 37 statements, each of which requires a yes or no answer. The Chinese version of the TAS was first tested by Wang (2001); for university students, the test-retest reliability was .62 and the Cronbach’s alpha was .64. The statements reflect common symptoms of TA, for example, “When a major exam is coming, I always think of others smarter than me” and “If I was to attend a large exam, I would be very anxious before starting.” Of the 37 items, 5 are positively worded statements. Newman (1996) suggested that a TAS total score of 12 points or below indicates a low level of TA, 12 to 20 points a moderate level, and 20 points and above a high level.

FAT (Friedman and Bendas-Jacob 1997; Chinese Version: Bi 2002)

The Friedman-Bendas Test Anxiety Scale (FAT) contains 23 items rated on a 5-point Likert scale ranging from 1 (not at all) to 5 (completely suitable). The FAT measures three subscales: Social Derogation (worries of being socially belittled and deprecated by significant others following failure on a test), Cognitive Obstruction (poor concentration, failure to recall, and difficulties in effective problem solving before or during a test), and Tenseness (bodily and emotional discomfort) (Friedman and Bendas-Jacob 1997). For the Chinese version of the FAT (Bi 2002), the Cronbach’s alpha values of the total scale and the subscales are .81, .91, .86, and .85. The items are similar in content to those of the other two scales, for example, “Even if I’m well prepared, I will be nervous before the exam” and “If the test is not good, I am worried that the teacher will torment me.” Of the 23 items, 5 are positively worded statements.

Analysis

Description of Total Scores and Reliability

First, the total scores of each scale, the correlations among the scales, and the CTT-based reliability of each scale were reported.
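The two CTT reliability indices reported in this study, Cronbach’s alpha and the Spearman-Brown split-half coefficient, can be illustrated with a minimal Python sketch. The function names, the odd/even split, and the small response matrix are assumptions made for illustration only; they are not the SPSS procedure or the data used in the study.

```python
from statistics import mean, variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(scores[0])
    item_variances = [variance([row[j] for row in scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown_split_half(scores):
    """Split-half reliability using an odd/even item split (an assumed split)."""
    odd = [sum(row[0::2]) for row in scores]
    even = [sum(row[1::2]) for row in scores]
    r = pearson_r(odd, even)
    # Spearman-Brown correction steps the half-test correlation up to full length.
    return 2 * r / (1 + r)


# Hypothetical responses: 4 respondents x 3 items (not data from the study).
demo = [[2, 3, 3], [4, 4, 5], [1, 2, 2], [3, 3, 4]]
print(round(cronbach_alpha(demo), 3))
print(round(spearman_brown_split_half(demo), 3))
```

Other ways of splitting the items (for example, random halves) would give somewhat different split-half values, which is one reason alpha is usually reported alongside it.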

Factor Analysis

Holzinger and Swineford (1937) introduced the Bifactor model, which is also referred to as a general-specific model. A Bifactor measurement model allows all items to load onto a common general dimension of psychopathology in addition to any specific symptom domains or “group” factors. The Bifactor model assumes that (1) there is a general factor (for example, a general ability factor) that explains the common variance of all items, and (2) there are multiple specific factors (for example, special ability factors), each of which, after controlling for the general factor, additionally explains the common variance of a subset of items (Chen et al. 2006). If a multi-dimensional test consists of p items \( x_1, x_2, \dots, x_p \) that measure a general factor G and specific factors \( F_1, F_2, \dots, F_n \), then item \( x_i \) can be expressed as (Ye and Wen 2012):

$$ x_i = a_i G + \sum_{j=1}^{n} b_{ij} F_j + \delta_i, \quad i = 1, 2, \dots, p. $$

Here \( a_i \) is the loading of item \( x_i \) on the general factor G, \( b_{ij} \) is the loading of item \( x_i \) on the specific factor \( F_j \), and \( \delta_i \) is the measurement error of item \( x_i \). It is generally assumed that the general factor, the specific factors, and the errors are mutually uncorrelated (Chen et al. 2012). The Bifactor model integrates the unidimensional and multi-dimensional views of a multi-dimensional test and can simultaneously test the common effect and the unique effects of each dimension. The loading pattern and factor structure of a Bifactor model consisting of nine items and three specific factors is shown as an example in Fig. 1.

Fig. 1 A Bifactor model with three specific factors
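To make the Bifactor equation concrete, the following Python sketch computes the covariance matrix it implies for a nine-item, three-specific-factor layout like the example figure. It assumes unit-variance factors and mutually uncorrelated factors and errors; the loading values (0.6 and 0.5) and error variances (0.39) are hypothetical and are not estimates from this study.

```python
def bifactor_implied_covariance(general, specific, error_var):
    """Model-implied covariance for x_i = a_i*G + sum_j b_ij*F_j + delta_i.

    Assumes G, every F_j, and every delta_i are mutually uncorrelated and
    that all factors have unit variance.
    """
    p = len(general)
    cov = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for k in range(p):
            # Covariance shared through the general factor.
            shared = general[i] * general[k]
            # Covariance shared through each specific factor.
            shared += sum(bi * bk for bi, bk in zip(specific[i], specific[k]))
            # Error variance contributes only on the diagonal.
            cov[i][k] = shared + (error_var[i] if i == k else 0.0)
    return cov


# Hypothetical loadings: nine items, three specific factors, three items each.
a = [0.6] * 9
b = [[0.5 if i // 3 == j else 0.0 for j in range(3)] for i in range(9)]
err = [0.39] * 9
sigma = bifactor_implied_covariance(a, b, err)

# Items 1 and 2 share a specific factor; items 1 and 4 share only G.
print(round(sigma[0][1], 2), round(sigma[0][3], 2))
```

Under these assumed values, items that share a specific factor covary more strongly than items linked only through the general factor, which is the pattern a Bifactor CFA tries to reproduce.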

Confirmatory factor analysis (CFA) was carried out to investigate the structure of the three scales in Chinese university students. Three types of structure were considered: a unidimensional structure, the initial multi-dimensional structure of each scale, and the initial multi-dimensional structure combined with a Bifactor structure. The comparative fit index (CFI), the Tucker-Lewis index (TLI), and the root mean square error of approximation (RMSEA) were used to evaluate whether the proposed structures fitted the data well.
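As a rough illustration of how these three fit indices relate to chi-square statistics, the sketch below uses their common textbook definitions. It is not the Mplus computation used in the study (for example, some programs divide by N rather than N − 1 in the RMSEA), and the chi-square values in the example are hypothetical.

```python
import math


def rmsea(chi2, df, n):
    """Root mean square error of approximation (N - 1 convention)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index relative to the baseline (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base


def tli(chi2_model, df_model, chi2_base, df_base):
    """Tucker-Lewis index; unlike the CFI it is not bounded at 1."""
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    return (ratio_base - ratio_model) / (ratio_base - 1.0)


# Hypothetical chi-square values for a sample of 790 (not results of the study).
print(round(rmsea(100.0, 50, 790), 3))
print(round(cfi(100.0, 50, 1000.0, 66), 3))
print(round(tli(100.0, 50, 1000.0, 66), 3))
```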

If none of the above three structures fitted the data well, the structure of the scale needed to be re-explored and re-confirmed. In this situation, the sample was randomly split into two halves; exploratory factor analysis (EFA) with a Bifactor structure was used on one half to explore the structure of the scale, and CFA with a Bifactor structure was used on the other half to confirm it.

The above statistical analyses were conducted with SPSS (23.0) and Mplus (7.4).

Item Response Theory Analysis

Three commonly used polytomous multi-dimensional models, namely the multi-dimensional Generalized Partial Credit Model (mGPCM; Muraki 1992), the multi-dimensional Graded Response Model (mGRM; Samejima 1969), and the multi-dimensional Rating Scale Model (mRSM; Muraki 1992), were used to analyze the data. Three test-level model-fit criteria, namely Akaike’s information criterion (AIC; Akaike 1974), the Bayesian information criterion (BIC), and −2 times the log-likelihood (−2*Log-Lik), were used to select the most suitable IRT model for the data. Smaller values of these criteria indicate better model fit.
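The selection rule can be sketched in a few lines of Python using the standard definitions AIC = −2*Log-Lik + 2k and BIC = −2*Log-Lik + k ln(N), where k is the number of estimated parameters and N the sample size. The log-likelihoods and parameter counts below are hypothetical placeholders, not values from this study.

```python
import math


def information_criteria(log_lik, n_params, n_obs):
    """Return -2*Log-Lik, AIC, and BIC for one fitted model."""
    deviance = -2.0 * log_lik
    return {
        "-2LL": deviance,
        "AIC": deviance + 2 * n_params,
        "BIC": deviance + n_params * math.log(n_obs),
    }


def best_model(fits, criterion="BIC"):
    """Name of the model with the smallest value of the chosen criterion."""
    return min(fits, key=lambda name: fits[name][criterion])


# Hypothetical log-likelihoods and parameter counts for N = 790.
candidates = {
    "mGRM": information_criteria(-1000.0, 10, 790),
    "mGPCM": information_criteria(-1010.0, 10, 790),
    "mRSM": information_criteria(-1030.0, 6, 790),
}
for criterion in ("-2LL", "AIC", "BIC"):
    print(criterion, best_model(candidates, criterion))
```

The three criteria need not agree in general, because the BIC penalizes additional parameters more heavily than the AIC once N exceeds about 8.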

IRT analyses were conducted using R (Version 3.1.2; http://www.R-project.org/) and the R package psych (Version 1.5.1; http://CRAN.R-project.org/package_psych).

Results

Description of Total Scores and Reliability

Table 2 documents the descriptive statistics, internal consistency, and mutual correlations of the summed scores of the three scales based on classical test theory (CTT).

Factor Analysis

First, the unidimensionality of each scale was tested using CFA with all items loading on a single dimension. Unexpectedly, no scale showed a good fit. This result indicated that the FAT, TAI, and TAS were not adequately described as unidimensional measures.

Because the one-factor CFA (structure A) did not provide a close fit, the initial structure of each scale (structure B) was then tested via CFA. After that, a Bifactor CFA based on the initial structure of each scale (structure C) was fitted to investigate whether a Bifactor structure fitted better. The results are displayed in Table 3.

Table 2 Descriptive statistics, internal consistency and mutual correlations of summed scores of the different scales

Table 3 shows that the TAI fitted the Bifactor CFA of its initial structure (structure C) well, whereas the FAT and TAS did not. Therefore, a Bifactor EFA (structure D) was performed for the FAT and the TAS to find a better-fitting Bifactor structure. Both the FAT and the TAS showed a good fit with four specific factors, which explained 56% and 50% of the variance, respectively. Their corresponding Bifactor structures were then confirmed by Bifactor CFA (structure E). The results in Table 3 indicate that the RMSEAs were less than 0.05 and that the CFI and TLI were approximately 0.9 for the FAT and TAS, which showed that both scales fitted structure E (i.e., a Bifactor structure with four specific factors) moderately well.

Overall, the TAI fitted a Bifactor structure with two specific factors very well, while the FAT and TAS moderately fitted a Bifactor structure with four specific factors. More details can be found in Table 4. On the whole, the new scale structures based on the Bifactor model fit well.

Table 3 Fit indices of each scale for structures A to E

Item Response Theory Analysis

IRT Model Comparison and Selection

Three multi-dimensional IRT models with a Bifactor structure were used for the IRT analysis, and the model-fit indices are documented in Table 5. As shown in Table 5, the multi-dimensional GRM had the smallest values of the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and −2*Log-Lik for all three scales, which indicated that the mGRM fitted the data of the three scales best. Therefore, the mGRM was chosen to estimate the item parameters of the three scales and to analyze their psychometric properties.

Table 4 Factor loading for FAT, TAS and TAI
Table 5 Comparison of mGRM, mGPCM, and mRSM

Psychometric Properties of the Three Scales

The factor analysis showed that the TAI fitted a Bifactor structure with two specific factors very well, that the FAT and TAS moderately fitted a Bifactor structure with four specific factors, and that a general factor, test anxiety, was extracted from all three scales. In addition, the correlations of the test scores among the three scales ranged from 0.5 (p < .01) to 0.6 (p < .01), which showed that the three scales measured a similar latent trait, test anxiety. The psychometric properties of the different self-report TA scales were therefore further investigated on the basis of this general test anxiety factor.

One of the advantages of IRT is that it can provide a measurement precision specific to each respondent. First, the test information of each scale was calculated. Test information (TI) is the inverse of the squared standard error of measurement (SE), that is, \( SE\left({\theta}_{\alpha}\right)=\frac{1}{\sqrt{I\left({\theta}_{\alpha}\right)}} \).

Test information (TI) is an important index of measurement precision in IRT. Because test information increases with scale length, the test information of each scale was divided by the number of its items to obtain the average test information, which denotes the test information per item and enables measurement precision to be compared among scales of different lengths. The average test information curves of the three scales are shown in Fig. 2. The average test information of the FAT and the TAI exceeded that of the TAS, and the TAI had the highest average test information from −1 to +3 (i.e., −1 < θ < +3) of the standardized θ scale. In almost all of the remaining range (i.e., −3 < θ < −1), the FAT had the highest average test information among the three scales. In contrast, the average test information of the TAS was almost always lower than that of the other scales. These results indicated that the TAI provided good information across various degrees of TA severity.

Fig. 2 Average test information curves of FAT, TAI, and TAS
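The idea of average test information can be sketched in Python for a unidimensional graded response model, in which item information at θ is the sum over response categories of the squared derivative of the category probability divided by that probability. This is a simplified illustration: the study fitted a Bifactor mGRM in R, and the discrimination and threshold values below are hypothetical.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def grm_item_information(theta, a, thresholds):
    """Item information of a unidimensional GRM item (thresholds ascending)."""
    # Cumulative probabilities P*(X >= k), padded with 1 and 0 at the ends.
    cum = [1.0] + [logistic(a * (theta - b)) for b in thresholds] + [0.0]
    info = 0.0
    for upper, lower in zip(cum[:-1], cum[1:]):
        p = upper - lower  # probability of responding in this category
        # Derivative of the category probability with respect to theta.
        dp = a * (upper * (1.0 - upper) - lower * (1.0 - lower))
        if p > 0.0:
            info += dp * dp / p
    return info


def average_test_information(theta, items):
    """Test information divided by the number of items."""
    total = sum(grm_item_information(theta, a, bs) for a, bs in items)
    return total / len(items)


# Hypothetical 4-category items given as (discrimination, thresholds).
scale = [(1.5, [-1.0, 0.0, 1.0]), (2.0, [-0.5, 0.5, 1.5])]
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(average_test_information(theta, scale), 3))
```

With a single threshold the formula reduces to the familiar two-parameter logistic information a²P(1 − P), which offers a quick sanity check of the implementation.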

Because this study compares three scales, it is necessary to compare the measurement performance of any two of them at a given point or interval of the θ scale, to examine which test is most precise at that point or interval, and to quantify how efficient it is relative to the other tests. This makes it easier to decide which test to choose. The ratio of the test information functions of two tests at a specified trait level θ = θ0 is called the relative efficiency of the two tests:

$$ RE\left(\theta \right)={I}_A\left(\theta \right)/{I}_B\left(\theta \right) $$

Here RE(θ) is the relative efficiency, and \( I_A(\theta) \) and \( I_B(\theta) \) are the test information functions of tests A and B, respectively.
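A minimal Python sketch of this ratio is given below. The two information functions are hypothetical stand-ins for the test information functions of tests A and B, chosen only to show how a relative efficiency curve can be evaluated over a grid of θ values.

```python
import math


def relative_efficiency(info_a, info_b):
    """RE(theta) = I_A(theta) / I_B(theta) at a single trait level."""
    if info_b <= 0:
        raise ValueError("information of test B must be positive")
    return info_a / info_b


def relative_efficiency_curve(info_fn_a, info_fn_b, thetas):
    """Evaluate RE over a grid of trait levels."""
    return [relative_efficiency(info_fn_a(t), info_fn_b(t)) for t in thetas]


# Hypothetical information functions (not estimates from the study).
def info_a(theta):
    return 6.0 * math.exp(-0.5 * theta ** 2) + 2.0


def info_b(theta):
    return 3.0 * math.exp(-0.5 * theta ** 2) + 1.0


grid = [-3.0, -1.5, 0.0, 1.5, 3.0]
print(relative_efficiency_curve(info_a, info_b, grid))
```

Because test information is additive over items, RE(θ) = 2 can be read as test B needing roughly twice as many comparable items to match the precision of test A at that trait level.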

Next, given that the three scales measure test anxiety as a whole, the relative efficiency curves of the three scales were plotted (see Fig. 3). The relative efficiency of the TAI compared to the FAT was greater than 0.2 from approximately −3 to +3 (i.e., −3 < θ < +3) of the standardized θ scale. That is, the FAT could match the measurement precision of the TAI only if its original number of items were extended by 0.2 times. Because the test information of the TAI was slightly higher than that of the FAT, the TAI discriminated better among students with test anxiety around or above the average, while the FAT discriminated slightly better among students with test anxiety below the average. Furthermore, the relative efficiency of the TAI compared to the TAS was higher than 2 from −3 to −1 (i.e., −3 < θ < −1) and from +1 to +3 (i.e., +1 < θ < +3) of the standardized θ scale. In terms of test information, the TAI is therefore 100% stronger than the TAS in these ranges, and the number of TAS items would need to be doubled to reach the precision of the TAI. Because the TAI’s test information was more than four times as much as that of the TAS, the TAI provides more information than the TAS for students who have test anxiety. In addition, the relative efficiency of the FAT compared to the TAS was greater than 4 when θ was lower than approximately −1 or greater than +1 (i.e., θ < −1, θ > +1). Although the TAS (37 items) is considerably longer than the FAT (23 items), the FAT provided more information than the TAS, but only for students whose TA severity (θ) was below −1 or above +1 (i.e., θ < −1, θ > +1).

Fig. 3 Relative efficiency curves of FAT, TAI, and TAS

Overall, the relative efficiency curves show that the TAI provides the most test information across the entire interval, although across the whole θ range the test information of the TAI and the FAT does not differ greatly. In addition, the test information of the TAI and the FAT is much higher than that of the TAS.

Finally, the standard error of measurement (SEM) and the marginal reliability were calculated from SE(θα). As the formula for SE(θα) shows, the larger the test information at a given θ, the smaller the standard error of the scale at that θ, and hence the more accurate and more reliable the measurement.
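The link between test information, the standard error, and reliability can be sketched as follows. The reliability formula 1 − SE² is a common convention that assumes the latent trait is scaled to unit variance; the source does not state which formula was used, so this is an assumption, and the information values are hypothetical.

```python
import math


def standard_error(information):
    """SE(theta) = 1 / sqrt(I(theta))."""
    return 1.0 / math.sqrt(information)


def conditional_reliability(information):
    """Reliability at a trait level, assuming a unit-variance latent trait."""
    return 1.0 - standard_error(information) ** 2


def information_needed(reliability):
    """Test information required to reach a target reliability."""
    return 1.0 / (1.0 - reliability)


# Hypothetical test information values at three trait levels.
for info in (2.0, 5.0, 10.0):
    se = standard_error(info)
    rel = conditional_reliability(info)
    print(info, round(se, 3), round(rel, 3))
```

Under this convention, a reliability of 0.8 corresponds to a test information of about 5, which gives a reference line for reading information curves.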

In Fig. 4, the curve shows that for the FAT, the marginal reliability is higher than 0.8 when θ exceeds −1 (i.e., θ > −1), which means that the FAT had good reliability for participants whose θ was greater than −1. For the TAI, the precision of the whole scale is high and the curve is relatively flat; that is, for standardized θ values greater than −1.5 (i.e., θ > −1.5), the TAI is a good choice because of its high reliability (marginal reliability > 0.8) (Fig. 5). As shown in Fig. 6, the TAS has good reliability for participants whose standardized θ is between −1 and +2 (i.e., −1 < θ < +2). In general, in terms of marginal reliability, the FAT and the TAI not only have higher reliability but also maintain relatively precise measurement at both ends of the θ scale. The precision of the TAS is lower than that of the other two scales.

Fig. 4 Standard error of measurement and marginal reliability curves of FAT

Fig. 5 Standard error of measurement and marginal reliability curves of TAI

Fig. 6 Standard error of measurement and marginal reliability curves of TAS

Conclusions and Discussion

Using a Bifactor approach with a large sample of Chinese university students, the current study investigated the structures and simultaneously compared the psychometric properties of three commonly used self-report TA instruments: the TAI, the TAS, and the FAT.

Past research has found that the TAI scores of female university students are consistently higher than those of male university students. In this study, the TAI score of female university students on the emotionality subscale (mean = 18.02) was also significantly higher than that of male university students (mean = 16.39), with t = −2.57, df = 788, and p < 0.05, which is consistent with Benson and Tippets (1990) and Everson et al. (1991a, 1991b). For the TAS, 32% of participants had a low level of test anxiety (score below 12), 51% had moderate TA (score between 12 and 20), and 17% had severe test anxiety (score above 20) (Newman 1996). For the FAT, the mean score of female university students (mean = 38.12) was significantly higher than that of male university students (mean = 33.47), with t = −4.89, df = 788, and p < 0.01. Descriptive results showed that both the Cronbach’s alpha and the Spearman-Brown split-half coefficient of each scale were acceptable in the Chinese university sample. The correlations of the test scores among the three scales ranged from 0.5 to 0.6 and were significant (p < 0.01) and moderate to high, which showed that the scales measured a similar latent trait; that is, the three scales are comparable. In addition, the dimensionality and factor analyses showed that the TAI fitted a Bifactor structure with two specific factors very well and that the FAT and TAS moderately fitted a Bifactor structure with four specific factors. A correlated factors model does not include a general factor and attributes all explanatory variance to first-order factors (Morgan et al. 2015). A correlated factors model is conceptually ambiguous because it cannot separate the specific or unique contributions of a factor from the effect of the overall construct shared by all interrelated factors (Chen et al. 2012), whereas a Bifactor model contains a general factor (G) and multiple specific factors (S). 
Because G and S are independent, a Bifactor model can disentangle how each factor contributes to the systematic variance in each item. The possibility of partitioning the variance into independent sources is one of the primary advantages of the Bifactor model (Reise 2012). In addition, the Bifactor structure has consistently provided superior model fit for TA symptoms across measures in large samples; this finding lends further confidence to the view that the Bifactor solution represents the data better than any of the previously suggested correlated-factors structures.

Additionally, the psychometric properties of the three instruments under the Bifactor IRT approach showed that the three scales had both high reliabilities and low SEMs over a broad range of TA severity, which indicated that the three scales performed well overall. The findings also provide suggestions for determining which scale to use in a given study design. The TAI evaluated TA along a wider range of severity and with more precision than the other two scales. The TAI can also be used to measure either trait or state test anxiety, depending on the time of administration: if it is used outside examination situations, trait test anxiety is measured; if it is administered immediately after an examination, state test anxiety is measured (Dong et al. 2011). The present results suggest that the TAI is a better instrument for trait test anxiety. The TAI and the FAT evaluated information over largely overlapping ranges; however, the TAI, which performed a little better than the FAT at the same levels of TA severity, may be a good choice for ensuring high precision when recruiting participants with various levels of TA severity. The FAT may be a good choice for measuring those with moderate TA severity. Meanwhile, the TAS provided more information at the lower level of TA symptomatology; that is, the TAS is suitable for epidemiological TA studies and for measuring those with lower TA severity. Of note, the current study focused on comparing the general factor (i.e., TA) in the Bifactor multi-dimensional IRT model and did not examine the specific factors of the three scales. 
The FAT performed worse than the TAI only on the psychometric properties of the general TA factor; the psychometric properties of the specific factors of the three scales, including the reliability, the SEM, the TI, and the RE, were not investigated. Thus, it remains unclear whether the TAI is better or worse than the FAT on the psychometric properties of the specific factors.

Another contribution of this study is that a Bifactor IRT model was used to fit the multidimensional structures of TA scales, whereas almost all prior studies used CTT approaches (which cannot offer specific information on the severity of TA symptomatology at different trait levels) or unidimensional IRT methods (whose unidimensionality assumption is difficult to satisfy for TA scales). In a Bifactor IRT model, each item of the scale loads not only onto one specific factor but also onto a general factor (Osman et al. 2012), so that researchers can derive more information about items and participants for both the general factor and the specific factors. Therefore, compared with CTT and unidimensional IRT approaches, the Bifactor multi-dimensional IRT (MIRT) approach has natural advantages for analyzing psychological scales with multidimensional structures. There are some recommendations for fitting a Bifactor MIRT model. For example, the sample size needs to be large enough to calibrate the item parameters accurately (Gignac 2016; Umegaki and Todo 2017). In addition, the Bifactor MIRT model requires two or more specific factors in the structure (Cai et al. 2011; Li and Rupp 2011), and each specific factor needs to contain more than two items (Gomez and McLaren 2015; MacCallum et al. 1999; Velicer and Fava 1998; Zwick and Velicer 1986).

Although the IRT approach yielded relatively good results, the study has some limitations. First, the sample was neither comprehensive nor representative, because it was drawn from only a few universities; whether the better-fitting models replicate across different samples, such as primary school students and adolescents, remains to be examined. Second, a unidimensional IRT model is robust to moderate degrees of multi-dimensionality (Drasgow and Parsons 1983; Olino et al. 2012); therefore, retaining the unidimensional structure or the initial structure of the scales may yield more information and provide more detailed and accurate suggestions for a given study. Third, including other commonly used self-report test anxiety scales (e.g., the State scale of the State-Trait Anxiety Inventory; STAI) may provide further suggestions for determining the usability of different self-report TA scales. Finally, using the TA instruments before an examination may help ensure the reliability and validity of the scales, and the potential to use TA scales in pre-post examination situations has been supported by previous research (Zeidner 1991). Developing a novel inventory that covers a wider range of TA severity and provides the largest amount of test information at any point on the continuum, or integrating the existing TA instruments, is also a direction for future work.