Introduction

The development of reliable, effective, and short tools or scales is an important goal of assessment, especially in educational, psychological, and clinical evaluation. In general, shorter scales enhance assessment efficiency: they save response time and energy, minimize respondent burden, increase response rates, and reduce fatigue effects. Green and Frantom (2002) pointed out that reducing the length of a scale can reduce variance and thereby affect its reliability and validity, so a short yet high-quality scale is the ideal. In contrast to classical test theory (CTT), item response theory (IRT) provides techniques for generating sufficient information at both the item and scale levels; hence, it can be used to develop and evaluate short versions of tests, instruments, or scales with precision.

Emotional intelligence (EI) is considered an influential construct in personality and social psychology (Extremera et al., 2011). EI is defined as a set of trainable abilities (Hodzic et al., 2017) by which people obtain information from their emotions and use it to guide their thinking and behavior for optimal adaptation (Salovey & Mayer, 1990). Scientific interest in the role of EI across various domains of life is increasing (Blasco-Belled et al., 2019; Joseph & Newman, 2010; Martins et al., 2010). Specifically, EI has shown positive correlations with physical and mental health, task performance, and the quality of social contact (Blasco-Belled et al., 2019; Joseph & Newman, 2010; Lopes et al., 2004; Martins et al., 2010; Tejada-Gallardo et al., 2020). As the field advances, researchers are increasingly interested in the processes that underlie the positive effects of EI (Lievens & Chan, 2017). An important question is therefore whether dealing with one's own emotions and dealing with the emotions of others are equally important for predicting criteria (Brasseur et al., 2013; Zeidner et al., 2008). Pekaar et al. (2018) propose that both EI dimensions (i.e., dealing with one's own emotions and dealing with others' emotions) may have positive effects, but in different life domains. For example, while effectively handling one's own emotions presumably plays an important role in staying mentally and physically healthy, effectively handling others' emotions may be more important for facilitating smooth social interactions. Because the positive effects of EI may thus reflect different processes, it is relevant to differentiate self-focused from other-focused EI. Previous EI scales did not clearly distinguish between self and others, leaving the influence of each EI dimension on these two aspects unclear. The Rotterdam Emotional Intelligence Scale (REIS; Pekaar et al., 2018), in contrast, balances self- and other-focused emotions and thereby solves the problem that earlier scales could not clearly separate self-focused from other-focused EI.

There are multiple reasons to develop a short version of the REIS. First, a short version of the original scale saves response time and improves the response rate, which maximizes the representativeness of examinees. Second, a well-constructed short form of a psychological scale measures the same overall constructs as the original version and retains acceptable psychometric quality, especially validity and reliability (Thalmayer et al., 2011). Compared with a long scale, a short scale also reduces respondent fatigue effects. Furthermore, a short form is more easily adopted in clinical research and in studies where multiple scales must be completed. In fact, many authors consider scales of fifty or more items to be of little benefit, especially in studies with multiple scales, repeated measures, or populations who might become bored or disengaged (Austin et al., 2018). In theory, a balance must be struck between response time and content coverage (long scales generally have preferable psychometric properties and broader coverage) and the demands placed on examinees (e.g., long or repeated responses). A shortened form of the REIS therefore has the advantage of minimizing missing data while preserving the clear distinction between self- and other-focused EI.

Based on the above analysis, the main purpose of this research is to develop a short version of the REIS. To do this well, we first examine the measurement characteristics of the original scale, which provides a basis for shortening it. We then develop the short version and verify its structure and measurement characteristics.

Methods

Participants

A total sample of 1,086 college students participated in this research by responding to the scales. Before any statistical analysis, the response data were screened for outliers and extreme values (i.e., respondents whose responses were missing for 5 consecutive items or for 10 consecutive items within the same administration). Ninety-five respondents were eliminated from the dataset on account of missing data or extreme values, giving a rejection rate of 8.75% and an effective rate of 91.25%. The final sample comprised 574 (57.92%) females and 417 (42.08%) males (the datasets are available at https://osf.io/2rus5/). The respondents' mean age was 19.25 years (SD = 1.23), with a range of 16–25 years. Table 1 shows the distribution of the sample across the demographic variables.

Table 1 Demographic characteristics of the participants in the development sample and the verification sample

In the current study, the total sample was randomly divided into two parts: a development sample (494 examinees) and a validation sample (497 examinees), each of approximately 500. Wolf et al. (2013) suggested that 500 respondents are sufficient for confirmatory factor analysis (CFA); Embretson and Reise (2000) proposed 500 examinees for precise parameter estimates under the GRM; and Tsutakawa and Johnson (1990) likewise regarded sample sizes of this magnitude as sufficient for precise IRT parameter estimates.

The development sample was used to screen items and validate the structure of the original REIS. Descriptive statistics and CFA were first computed for the original scale, which was then shortened according to criteria of item fit, differential item functioning, and other conditions. The verification sample was used to validate the short version of the REIS derived from the development sample; for example, it was used to verify the structure of the short scale and to conduct comparative validity analyses against the criterion scales (i.e., the EIS and the WLEIS).

As Table 1 shows, the development sample included more females (57.49%) than males (42.51%), and the majority of participants were science majors (61.34%) from rural areas (64.17%). The average age of respondents was approximately 19 years. The validation sample showed the same pattern of demographic characteristics with respect to gender, major, household registration, and age.

Instrument

The Rotterdam Emotional Intelligence Scale (REIS)

The REIS is a 28-item Likert scale whose responses are rated on a 5-point scale from 0 (strongly disagree) to 4 (strongly agree); higher scores imply higher levels of emotional intelligence. Factor analysis of the original scale indicated that the REIS conforms to a hierarchical four-factor model comprising self-focused emotion appraisal (Cronbach’s α = 0.82), other-focused emotion appraisal (Cronbach’s α = 0.85), self-focused emotion regulation (Cronbach’s α = 0.80), and other-focused emotion regulation (Cronbach’s α = 0.82) (Pekaar et al., 2018). The full scale’s internal consistency reliability was 0.86 (Pekaar et al., 2018).

Following the principles of preserving the original meaning and being easy to understand, two psychology graduate students translated the scale into Chinese to form a first draft. The first draft was then back-translated and discussed by two graduate students majoring in English, and we revised it according to their suggestions. Next, two psychology experts were consulted, and their modifications to the translation produced a second draft. Finally, ten graduate students were invited to cognitive interviews to test the Chinese version of the REIS for ambiguity, difficulty of understanding, and other problems; based on these interviews we revised some ambiguous item wordings and formed the final draft. In our sample, the internal consistency reliabilities of the whole Chinese-version scale and its four subscales were 0.92, 0.83 (self-focused emotion appraisal), 0.76 (other-focused emotion appraisal), 0.77 (self-focused emotion regulation), and 0.80 (other-focused emotion regulation), respectively.

Emotional Intelligence Scale (EIS)

The EIS consists of 33 items used to evaluate emotional intelligence (e.g., “I have control over my emotions”). The scale is rated on a 5-point Likert scale ranging from 0 (strongly disagree) to 4 (strongly agree), with higher scores implying higher levels of emotional intelligence. The scale was found to be internally consistent (Cronbach’s α = 0.90; Schutte et al., 1998). The Chinese version used here was translated by Professor Wang (2002) of South China Normal University (Cronbach’s α = 0.84). The EIS’s internal consistency reliability in this study is 0.90.

The Wong-Law Emotional Intelligence Scale (WLEIS)

The WLEIS, developed by Wong and Law (2002), is a 16-item Likert scale whose responses are rated on a 7-point scale from 0 (strongly disagree) to 6 (strongly agree). The Chinese version was translated by Wang (2010) of Central South University (Cronbach’s α = 0.90). The WLEIS’s internal consistency reliability in this study is 0.93.

IRT in brief

Compared to test-level analysis in the CTT framework, IRT comprises methods that yield richer information and greater convenience at the item level (Embretson, 1996). A crucial drawback of CTT is that test and item statistics depend on the particular sample of candidates. Another disadvantage is that CTT is a test-oriented theory that provides little item-level information for a specific group of respondents (Hambleton et al., 1991). IRT models overcome these shortcomings by operating at the item level and providing sample-free measurement. Next, we introduce the IRT model and the item-selection indicators used in this study.

The REIS is a polytomously scored Likert scale, and candidate polytomous IRT models include the graded response model (GRM), the generalized partial credit model (GPCM), and the nominal response model. We therefore compared the three models and selected the most suitable one for this study. The results are shown in Table 3: the GRM has the smallest AIC, AICc, SABIC, BIC, and -2logLik, so we selected the GRM as the scoring model.
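As a minimal sketch of this comparison (not the authors' exact code), the candidate models can be fitted and compared with the R mirt package; the data frame `reis` (28 columns of 0–4 responses) is hypothetical, and a unidimensional calibration is used here purely for brevity.

```r
# Fit the three candidate polytomous models and compare information
# criteria (AIC, SABIC, BIC, log-likelihood).
library(mirt)

grm  <- mirt(reis, 1, itemtype = "graded")   # graded response model
gpcm <- mirt(reis, 1, itemtype = "gpcm")     # generalized partial credit model
nrm  <- mirt(reis, 1, itemtype = "nominal")  # nominal response model

anova(grm, gpcm)  # prints AIC, SABIC, BIC, and logLik for both models
anova(grm, nrm)
```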

Graded response model

The graded response model (GRM; Samejima, 1969) extends the two-parameter logistic model to polytomous response data such as letter grades and Likert-type items. In the bifactor case, the graded response model is:

$${P}^{*}\left({u}_{ij}\ge k|{\theta }_{0i},{\theta }_{si}\right)=\frac{1}{1+\mathrm{exp}\left[-\left({d}_{k}+{a}_{0j}{\theta }_{0i}+{a}_{sj}{\theta }_{si}\right)\right]},$$

where \({P}^{*}\) is the probability that participant \(i\), with general trait level \({\theta }_{0i}\) and specific trait level \({\theta }_{si}\), gives a response in category \(k\left(k=0, 1, 2,\dots ,K-1\right)\) or above on item \(j\) (Gibbons et al., 2007; Mao et al., 2018). The probability of responding exactly in category \(k\) therefore equals the difference between the cumulative probabilities of two adjacent categories:

$$P\left({u}_{ij}=k|{\theta }_{0i},{\theta }_{si}\right)={P}^{*}\left({u}_{ij}\ge k|{\theta }_{0i},{\theta }_{si}\right)-{P}^{*}\left({u}_{ij}\ge k+1|{\theta }_{0i},{\theta }_{si}\right).$$

The model permits items to have different numbers of categories; each item is characterized by a general slope parameter \(\left({a}_{0j}\right)\), a specific slope parameter \(\left({a}_{sj}\right)\), and a set of threshold parameters \(\left({d}_{k},{d}_{k}=-\left({a}_{0j}{b}_{jk}+{a}_{sj}{b}_{jk}\right)\right)\), numbering one fewer than the item’s categories (Gibbons et al., 2007; Mao et al., 2018). Each threshold parameter marks the location on the θ scale at which the probability exceeds 50% that the response falls in the corresponding category or a higher one.
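To make the two equations concrete, the following sketch computes bifactor GRM category probabilities directly from them; all parameter values are hypothetical illustrations, not estimates from this study.

```r
# Cumulative probability P*(u >= k) for each threshold d[k],
# following the bifactor GRM equation above.
grm_cum <- function(theta0, theta_s, a0, a_s, d) {
  1 / (1 + exp(-(d + a0 * theta0 + a_s * theta_s)))
}

# Category probabilities P(u = k) as differences of adjacent cumulative
# probabilities, with P*(u >= 0) = 1 and P*(u >= K) = 0.
grm_cat <- function(theta0, theta_s, a0, a_s, d) {
  pstar <- c(1, grm_cum(theta0, theta_s, a0, a_s, d), 0)
  pstar[-length(pstar)] - pstar[-1]
}

# A hypothetical 5-category item: four decreasing thresholds yield
# five category probabilities that sum to 1.
grm_cat(theta0 = 0.5, theta_s = -0.2, a0 = 1.8, a_s = 0.9,
        d = c(4.0, 2.2, 0.1, -1.5))
```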

Item fit

The S-X² statistic (Orlando & Thissen, 2000) was applied to check item fit. In this method, the item-fit statistic compares the observed and expected response frequencies within each summed-score group. The prominent advantage of this approach is that the expected frequencies can be directly compared with the frequencies observed in the data. The significance level for S-X² can be set at 0.01; when an item’s S-X² p value is below this level, the item fits poorly and should be considered for deletion.
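A minimal sketch of this check (assumed workflow; `grm` is the hypothetical fitted mirt model from the earlier sketch):

```r
# S-X2 item-fit statistics; items with p < 0.01 are candidates for deletion.
fit_sx2 <- itemfit(grm, fit_stats = "S_X2")
fit_sx2[fit_sx2$p.S_X2 < 0.01, ]  # list poorly fitting items, if any
```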

Test and item information function

In the IRT framework, reliability (the degree of error-free measurement) is evaluated through the item or test information function (IIF or TIF). The amount of information an item provides is a function of the respondent’s trait level: the IIF and TIF vary across the trait continuum, and an item provides the most information where the respondent’s trait level matches the item’s threshold values. The overall level of information an item provides at a given trait level is determined by the item’s own parameters, of which the most influential is the discrimination parameter: the larger the discrimination parameter, the more information the item provides. The item information function is:

$${\text{IIF}}\left(\theta \right)=\frac{{\left[{p}_{j}^{\prime}\left(\theta \right)\right]}^{2}}{{p}_{j}\left(\theta \right)\left(1-{p}_{j}\left(\theta \right)\right)},$$

where θ is the measured trait, \({p}_{j}\left(\theta \right)\) is the response function of item \(j\), and \({p}_{j}^{\prime}\left(\theta \right)\) is its first derivative. The IIF shows the range of \(\theta\) over which an item is most useful for measuring respondents (Reeve & Fayers, 2005) and indicates the item’s contribution to the evaluation of the trait. The IIF is also an index of item quality (how well the item distinguishes respondents of different levels) and is therefore used to appraise the relative performance of each item. The TIF is the sum of the information functions of the items in the test.
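A minimal sketch of how these functions can be obtained from a fitted mirt model (assumed workflow; `grm` is the hypothetical fit from the earlier sketch):

```r
# Item information for item 1 and test information (the sum of all IIFs)
# evaluated on a grid of trait values.
theta <- matrix(seq(-4, 4, length.out = 81))
iif1 <- iteminfo(extract.item(grm, 1), theta)
tif  <- testinfo(grm, theta)
plot(theta, tif, type = "l", xlab = "theta", ylab = "information")
```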

Differential item functioning

DIF analysis evaluates whether an item is biased against a certain group and thus bears on the overall fairness of a test. DIF affects not only the estimation of respondents’ trait levels but also the validity and fairness of the test (Kim, 2001). When members of different groups respond differently to an item, the differences should be due to variation in the underlying trait, not to unintended, construct-irrelevant factors. In the case of gender, for example, men and women with the same trait level should have the same expected response; when the expected response differs for groups at the same trait level, the item shows DIF and is biased in favor of one group (Penfield & Camilli, 2006). Criteria based solely on statistical significance can flag DIF items that are not practically meaningful (Crane et al., 2007). We therefore adopted an effect-size criterion for meaningful DIF: the change in McFadden’s pseudo-R² across nested ordinal logistic regression models (Crane et al., 2006; Prati & Pietrantoni, 2016). When the change in R² exceeds 0.02, the item is considered biased and should be deleted (Crane et al., 2006; Gu & Wen, 2017). In the current study, we conducted DIF analysis by gender to exclude the influence of personal-background factors on the items.
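A minimal sketch of this screening with the R lordif package (assumed workflow; `reis` and the grouping vector `gender` are hypothetical objects):

```r
# Gender DIF via ordinal logistic regression, flagging items whose
# McFadden pseudo-R^2 change exceeds the 0.02 criterion.
library(lordif)
dif <- lordif(reis, gender, criterion = "R2",
              pseudo.R2 = "McFadden", R2.change = 0.02)
dif$flag  # TRUE for items showing meaningful DIF
```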

Confirmatory factor analysis

We used confirmatory factor analysis to verify the structure of the original REIS in the development sample and the structure of the short version in the validation sample. Using CFA, we compared the fit of six models: the original hierarchical four-factor structure (Pekaar et al., 2018), a hierarchical five-factor model, a hierarchical six-factor model, and their corresponding bifactor models. The bifactor model allows researchers to determine empirically which items reflect the general dimension and which reflect conceptually narrower dimensions after controlling for the general factor (Gibbons et al., 2007; Li et al., 2021; Reise et al., 2010). That is, in the bifactor model, each item loads on both the general dimension and one specific dimension, and all dimensions are mutually orthogonal. Model fit was evaluated with four indices: the comparative fit index (CFI), the Tucker–Lewis index (TLI), the root mean squared error of approximation (RMSEA), and the standardized root mean squared residual (SRMR). CFI and TLI values close to 0.95 or higher indicate excellent fit and values above 0.90 acceptable fit, while RMSEA values near 0.06 and SRMR values near 0.08 indicate good fit (Hu & Bentler, 1999; MacCallum et al., 1996).
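As a minimal sketch of a bifactor specification (illustrative only, not the authors' Mplus setup), using lavaan with 12 hypothetical items x1–x12 spread over four specific factors:

```r
# Bifactor CFA: every item loads on the general factor g and on exactly
# one specific factor; orthogonal = TRUE keeps all factors uncorrelated.
# `dat` is a hypothetical data frame of item responses.
library(lavaan)

model <- '
  g  =~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12
  f1 =~ x1 + x2 + x3
  f2 =~ x4 + x5 + x6
  f3 =~ x7 + x8 + x9
  f4 =~ x10 + x11 + x12
'
cfa_fit <- cfa(model, data = dat, orthogonal = TRUE, std.lv = TRUE)
fitMeasures(cfa_fit, c("cfi", "tli", "rmsea", "srmr"))
```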

In addition, three reliability indicators can be calculated from the bifactor model: omega \(\left(\omega \right)\), omega subscale \(\left({\omega }_{s}\right)\), and omega hierarchical \(\left({\omega }_{H}\right)\). Coefficient omega \(\left(\omega \right)\), also known as internal consistency reliability, reflects the correlation among all items; when \(\omega\) exceeds 0.70, the reliability of the composite score of a multidimensional test is acceptable. The omega subscale \(\left({\omega }_{s}\right)\) is interpreted analogously: the higher the \({\omega }_{s}\), the higher the reliability of the subscale composite score. Coefficient omega hierarchical \(\left({\omega }_{H}\right)\) reflects the extent to which all items measure the same trait; the larger the \({\omega }_{H}\), the more homogeneous the test. When \({\omega }_{H}\) exceeds 0.50, homogeneity can be considered high, in which case it is meaningful to form a total test score (Gu & Wen, 2017; Rodriguez et al., 2016). Both the CFA results and the omega coefficients are considered in this study.
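A minimal sketch of \(\omega\) and \({\omega }_{H}\) computed from standardized bifactor loadings (following the formulas in Rodriguez et al., 2016; \({\omega }_{s}\) follows the same logic at the subscale level, and all loading values below are hypothetical):

```r
# omega: proportion of total-score variance due to all common factors;
# omega_H: proportion due to the general factor alone.
omega_bifactor <- function(lam_g, lam_s) {
  err  <- sum(1 - lam_g^2 - unlist(lam_s)^2)        # item unique variances
  gen  <- sum(lam_g)^2                              # general-factor variance
  spec <- sum(sapply(lam_s, function(l) sum(l)^2))  # specific-factor variance
  c(omega   = (gen + spec) / (gen + spec + err),
    omega_H = gen / (gen + spec + err))
}

# Hypothetical loadings: 12 items, general loadings 0.6, specific 0.4
omega_bifactor(lam_g = rep(0.6, 12),
               lam_s = list(rep(0.4, 3), rep(0.4, 3),
                            rep(0.4, 3), rep(0.4, 3)))
```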

Procedure

Before the study, the purposes of the scales and the rules for completing them were explained, and we obtained the consent of students’ counselors to conduct the study in their classes. We assured respondents that all answers were confidential and that nobody in their classes would have access to individual data. In addition, respondents were notified that they could withdraw from the study at any time and could leave items blank if they did not want to fill out the scales.

Statistical analyses were performed in R (R Core Team, 2014) and Mplus (Wang, 2014). We conducted confirmatory factor analysis in Mplus and fitted the GRM with the mirt package (Chalmers, 2012). We analyzed item fit, discrimination, and item/test information functions to evaluate the psychometric characteristics of the items and the test. Gender-based DIF was examined with the lordif package (Choi et al., 2011).

The scale was shortened as follows: first, items that misfit according to the item-fit statistic were deleted; next, items showing DIF were deleted; then, items failing the item discrimination and IIF requirements were deleted. Finally, the test information of the short version and its correlations with two related scales (i.e., the EIS and the WLEIS) were analyzed to verify its performance.
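Combining the earlier sketches, the first two screening steps might look as follows (assumed workflow; `reis`, `fit_sx2`, and `dif` are the hypothetical objects defined above):

```r
# Step 1: drop items whose S-X2 p value falls below 0.01 (misfit).
keep <- setdiff(colnames(reis), fit_sx2$item[fit_sx2$p.S_X2 < 0.01])

# Step 2: drop items flagged for gender DIF (R^2 change > 0.02).
keep <- setdiff(keep, colnames(reis)[dif$flag])

# Steps 3 and 4 (slope parameters and item information) additionally
# involve judgment about content overlap, applied dimension by dimension.
reis_short <- reis[, keep]
```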

Results

Descriptive statistics and CFA of the original REIS

Table 2 shows descriptive statistics for each item of the original REIS in the development sample (N1 = 494), including means, standard deviations, and item response frequencies. For all items, the largest share of respondents (more than 40%) chose agree (option 4), and for some items (e.g., items 5 and 20) a relatively large proportion chose strongly agree, while few chose strongly disagree (option 1). The full-scale mean was 71.51, with a standard deviation of 13.11 and a range from 11 to 110.

Table 2 Abbreviated item content and response percentages of the 28 Items of the REIS (N1 = 494)

To better explore the structure of the Chinese version of the REIS, we randomly divided the development sample into sub-Sample 1 (N = 244) and sub-Sample 2 (N = 250). First, exploratory factor analysis (EFA) was performed on sub-Sample 1. The Kaiser–Meyer–Olkin (KMO) measure and Bartlett’s test of sphericity indicated suitability for factor analysis (KMO = 0.91), and a hierarchical six-factor structure was extracted according to the eigenvalue-greater-than-1 criterion, explaining 59.65% of the total variance. We then performed CFA on the hierarchical six-factor structure in sub-Sample 2, and the result was not ideal (RMSEA = 0.07, CFI = 0.82, TLI = 0.80, SRMR = 0.07). We attributed this partly to two factors having too few items (only 3 and 2 items, respectively), so we merged these two factors, which had similar content, into one and refitted a hierarchical five-factor model; the result was still unsatisfactory (RMSEA = 0.07, CFI = 0.82, TLI = 0.80, SRMR = 0.08). In addition, CFA of the original hierarchical four-factor structure (Pekaar et al., 2018) in sub-Sample 2 also fit poorly (RMSEA = 0.07, CFI = 0.78, TLI = 0.76, SRMR = 0.08).
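A minimal sketch of these EFA checks with the R psych package (assumed workflow; `sub1` is a hypothetical data frame of the sub-Sample 1 responses):

```r
library(psych)
KMO(sub1)                                    # sampling adequacy
cortest.bartlett(cor(sub1), n = nrow(sub1))  # Bartlett's test of sphericity
sum(eigen(cor(sub1))$values > 1)             # factors by the eigenvalue rule
efa <- fa(sub1, nfactors = 6, rotate = "oblimin")
print(efa$loadings, cutoff = 0.3)            # inspect the six-factor pattern
```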

Because bifactor models are widely used to resolve dimensionality problems in the behavioral sciences, we then constructed bifactor models with four, five, and six specific factors. Among them, the bifactor model with four specific factors was the most satisfactory (RMSEA = 0.07, CFI = 0.83, TLI = 0.80, SRMR = 0.07), whereas the five- and six-specific-factor bifactor models could not be identified. The model comparisons are shown in Table 3. Compared with the other structures, the bifactor model with four specific factors had the best fit indices and the smallest AIC, SABIC, BIC, and -2logLik, indicating that the data fit this model best. The following analyses are therefore based on the bifactor model with four specific factors.

Table 3 Comparison of multidimensional item response theory models using confirmatory factor analyses (sub-Sample 2)

Table 4 presents the coefficient omega, coefficient omega hierarchical, and omega subscale values of the REIS based on the bifactor model. The coefficient omega \(\left(\omega \right)\) of the whole scale is 0.92, above the 0.70 benchmark and thus psychometrically acceptable. The omega subscales \(\left({\omega }_{s}\right)\) of the four subscales are 0.81, 0.75, 0.79, and 0.76, all greater than 0.70, indicating a highly reliable multidimensional composite. The coefficient omega hierarchical \(\left({\omega }_{H}\right)\) is 0.83 (above 0.50), indicating that the total score is meaningful.

Table 4 Omega \(\left(\omega \right)\), Omega subscale \(\left({\omega }_{s}\right)\) and Omega hierarchical \(\left({\omega }_{H}\right)\) for the full scale (sub-Sample 2)

Development and verification of the short version of REIS

Development of the REIS short form

The GRM with a bifactor structure was used to calibrate the 28 REIS items with the development sample data. The factor loading path diagram of the original scale appears in the Appendix. The GRM item parameters, item information, item information function curves (IIFCs), model fit, and DIF results are displayed in Table 5 and Fig. 1. Five slope parameters \(\left(a, {a}_{1}, {a}_{2}, {a}_{3}, {a}_{4}\right)\) and four threshold parameters \(\left({b}_{1}, {b}_{2}, {b}_{3}, {b}_{4}\right)\) were estimated for each five-category item. The discrimination parameter measures the strength of the relationship between an item and the construct being measured: the stronger the relationship, the better the item. It also serves as an indicator of item quality, with higher slopes indicating more discriminating items. The general-factor discrimination (a) of the original scale ranges from 0.97 to 2.33; the slope parameter \({a}_{1}\) ranges from -0.09 to 1.68, \({a}_{2}\) from -0.20 to 1.37, \({a}_{3}\) from 0.24 to 1.37, and \({a}_{4}\) from 0.64 to 1.33, showing substantial variation in the items’ discrimination values.

Table 5 IIF, item parameters, model fit (S-X², df, p) and R² for the 28 items of the REIS (N1 = 494)
Fig. 1 Item information function curves of the items (N1 = 494)

The threshold parameters \(\left({b}_{1}, {b}_{2}, {b}_{3}, {b}_{4}\right)\) represent four locations on the \(\theta\) scale (REIS). The parameter \({b}_{1}\) is the location at which the probability exceeds 50% that the response is in the category disagree or higher rather than strongly disagree; \({b}_{2}\) is the location at which the probability exceeds 50% that the response is in the category not sure or higher; \({b}_{3}\) is the corresponding location for agree or higher; and \({b}_{4}\) is the location for strongly agree. The four threshold parameters followed a very similar pattern across all items. According to Dodeen and Darmaki (2016), if the location (‘difficulty’) parameters of all items show a consistent trend, these parameters are not helpful in distinguishing between items. Parameter \({b}_{1}\) ranges from -4.58 to -2.01 across items; \({b}_{2}\) is greater than \({b}_{1}\) for all items (ranging from -2.84 to -1.06); \({b}_{3}\) ranges from -0.82 to 1.04; and \({b}_{4}\) ranges from 0 to 2.57. These results show how the scale items work: respondents with high EI tend to select strongly agree, while respondents with lower EI tend to select strongly disagree. Because these values were similar across items, the location (‘difficulty’) parameters had little influence on item selection. The difficulty of the 28 items, shown in Fig. 2, ranges from 0.42 to 0.71, meaning that the test can effectively differentiate individuals.

Fig. 2 Item difficulty distribution curve of 28 items (N1 = 494)

The means of the item IIFs ranged from 0.16 to 1.23 and their standard deviations from 0.02 to 0.70. S-X² values ranged from 54.06 to 113.59, with p values between 0.04 and 0.83, and the R² change from the DIF analysis was between 0.00 and 0.02. In the self-focused emotion appraisal dimension, items 2, 4, 5, and 7 provide notably more information than items 1, 3, and 6. In the other-focused emotion appraisal dimension, items 8 and 9 provide more information than the other items. In the self-focused emotion regulation dimension, items 18, 19, and 21 clearly provide more information than the other items. In the other-focused emotion regulation dimension, items 24, 25, 26, and 27 provide more information than the remaining items (see Table 5).

Based on this information, four indicators were used to select items for the short form: item fit (the p value of S-X²), DIF, item discrimination \(\left(a\right)\), and the IIF. First, all items fit the GRM well, with S-X² p values above the 0.01 criterion, so no items were deleted on the basis of item fit. Second, only item 15 showed DIF, with an R² change greater than 0.02, and it was therefore deleted.

Item discrimination is a key index of item quality under the IRT framework. Considering the slope parameters of the general and specific factors together, the items with the highest slopes were retained: items 2, 4, 5, and 7 for the self-focused emotion appraisal dimension; items 8 and 9 for the other-focused emotion appraisal dimension; items 18, 19, and 21 for the self-focused emotion regulation dimension; and items 23–28 for the other-focused emotion regulation dimension. Existing studies indicate that each dimension should retain at least three items to ensure the reliability and validity of the subscale (Prati & Pietrantoni, 2016), and Hou et al. (2004) likewise showed that at least three items per dimension are needed for model identification. Therefore, for the other-focused emotion appraisal dimension we relaxed the slope criterion to 0.72 (below 0.80) and retained item 10, which contributes relatively more information. Thus the other-focused emotion appraisal and self-focused emotion regulation dimensions each contain three items (items 8–10 and items 18, 19, and 21), and no further items were deleted from them according to the amount of information.

Next, we removed redundant items by considering the IIF and the item information function curves. In general, the more information an item provides, the better; we therefore selected items so as to maximize the total information across the entire continuum with minimal content overlap. Specifically, where content was redundant, we kept items that reached high levels of information, or that were most informative at the higher or lower end of the continuum and were therefore useful for differentiating participants with high or low scale scores. As Fig. 1 and Table 5 show, items 2, 4, 5, and 7 provide the most information across the ability range, so these four items were retained for the self-focused emotion appraisal dimension. In the other-focused emotion regulation dimension, Fig. 1 shows that the information provided by items 24–27 overlaps with that provided by items 23 and 28; combined with the item information in Table 5, we chose items 24–27, which provide more information. Based on this analysis, we retained 14 items (2, 4, 5, 7, 8, 9, 10, 18, 19, 21, 24, 25, 26, and 27) for the short version of the REIS.

Validation of the REIS short form

According to the above analysis, the final short REIS contains fourteen items. The short scale fits the bifactor model well (CFI = 0.96, TLI = 0.94, RMSEA = 0.05, SRMR = 0.04), better than the full scale (CFI = 0.88, TLI = 0.86, RMSEA = 0.06, SRMR = 0.05). Table 6 reports the short-scale reliability coefficients. The coefficient omega \(\left(\omega \right)\) of the whole scale is 0.90 (above 0.70), meeting psychometric standards. The omega subscales \(\left({\omega }_{s}\right)\) of the four subscales are 0.80, 0.71, 0.77, and 0.73, all greater than 0.70, indicating a highly reliable multidimensional composite. The coefficient omega hierarchical \(\left({\omega }_{H}\right)\) is 0.77 (above 0.50), indicating that the total score is meaningful.

Table 6 Comparison of Omega \(\left(\omega \right)\), Omega subscale \(\left({\omega }_{s}\right)\), Omega hierarchical \(\left({\omega }_{H}\right)\) and CFA of short version (N2 = 497)

The standardized factor loadings of the short scale in the validation sample are shown in Fig. 3; all loadings on the general factor exceeded 0.40. The general-factor discrimination (a) of the short scale ranges from 1.23 to 2.67; the slope parameter \({a}_{1}\) ranges from 0.69 to 1.89, \({a}_{2}\) from 0.65 to 1.65, \({a}_{3}\) from 1.46 to 1.81, and \({a}_{4}\) from 0.78 to 1.54, showing that the short-scale items are highly discriminating and can distinguish well among respondents of different levels. The four threshold parameters \(\left({b}_{1}, {b}_{2}, {b}_{3}, {b}_{4}\right)\) again followed a very similar pattern across items (see Table 7). The difficulty of the short scale, shown in Fig. 4, ranges from 0.46 to 0.69, meaning that the test can effectively differentiate individuals. As with the full scale, respondents with high EI tend to select strongly agree, while respondents with lower EI tend to select strongly disagree.

Fig. 3 Path coefficients of the 14 items of the REIS (N2 = 497). Note: All path coefficients in the figure are statistically significant at the 0.01 level. f1 = Self-focused emotion appraisal; f2 = Other-focused emotion appraisal; f3 = Self-focused emotion regulation; f4 = Other-focused emotion regulation

Table 7 Means and standard deviations of IIF, Slope (\(a\)) Parameter and Location (\(b\)) Parameter Estimates for the 14 Items of the REIS (N2 = 497)
Fig. 4 Item difficulty distribution curve of 14 items (N2 = 497)

The means of the item IIFs ranged from 0.50 to 1.57 and their standard deviations from 0.08 to 1.00. In addition, we recalculated S-X² and DIF for the short scale, and all short-scale items met both standards.

To compare the long and short versions of the REIS, we calculated the TIF for both forms. The results are plotted as two curves in Figs. 5 and 6, each representing the information the corresponding version provides across emotional intelligence levels (\(\theta\)). The figures clearly show that, across all theta values, the long version of the REIS supplies more information than the short form; the mean TIF of the long form was 18.13, versus 12.29 for the short form. This is expected: the TIF is the sum of all IIFs, and the long version contains more items, so it yields more information.

Fig. 5 Test information function curves of the original version and the short version of the REIS for the general factor (N2 = 497)

Fig. 6 Test information function curves of the original version and the short version of the REIS for the four special factors (N2 = 497)
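A minimal sketch of this comparison (assumed workflow; `grm_long` and `grm_short` are hypothetical fitted mirt models for the 28- and 14-item forms, calibrated unidimensionally for brevity):

```r
# Test information for both forms on a common theta grid, plus the share
# of the long form's information retained by the short form.
theta <- matrix(seq(-4, 4, length.out = 81))
tif_long  <- testinfo(grm_long, theta)
tif_short <- testinfo(grm_short, theta)
matplot(theta, cbind(tif_long, tif_short), type = "l",
        xlab = "theta", ylab = "test information")
mean(tif_short) / mean(tif_long)  # proportion of information retained
```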

Finally, two criterion scales, the EIS and the WLEIS, were used to validate the short version of the REIS. Based on the coefficient omega results, we conducted correlation analyses only on the total scores of the REIS, EIS, and WLEIS, without comparisons among subscales. The correlations of the EIS and WLEIS with the long REIS were compared with their correlations with the short form; the correlation values (see Table 8) were very close, which supports the performance of the short version.

Table 8 Correlations between the REIS (Original and Short Form), EIS, and WLEIS
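A minimal sketch of this convergent-validity check (all data objects hypothetical; `short_items` holds the names of the 14 retained items):

```r
# Correlations among total scores for the long REIS, the short REIS,
# and the two criterion scales.
totals <- data.frame(reis_long  = rowSums(reis),
                     reis_short = rowSums(reis[, short_items]),
                     eis        = rowSums(eis),
                     wleis      = rowSums(wleis))
round(cor(totals), 2)
```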

Discussion

The objective of this research was to develop a short version of the REIS and to evaluate its performance. First, we explored a new EI structure, the bifactor model, which had the best CFA fit indices. The bifactor model consists of a general factor and four conceptually distinct EI dimensions (as in the original scale): self-focused emotion appraisal, other-focused emotion appraisal, self-focused emotion regulation, and other-focused emotion regulation; thus, the short version covers all four dimensions of the original. Next, the IRT method was used to shorten the REIS, for two reasons: IRT plays an important role in the development of modern psychometric tests and scales (Mungas & Reed, 2000), and it is especially suited to scale construction and refinement, including scale reduction (Petersen et al., 2006).

By jointly considering S-X², DIF, item discrimination, and the IIF, this study obtained a 14-item short scale. Although it has fewer items than the original, the short scale shows excellent fit indices, which matters because shortening a scale should not come at the cost of losing important content. In addition, the S-X² values of the short scale were acceptable, and no item showed DIF. The TIF shows that the short scale distinguishes individuals well across the range of the latent trait. More importantly, the convergent validity of the short REIS was established by its high correlations with other EI scales: the REIS total score was moderately to highly positively correlated with the total scores of the EIS and the WLEIS, indicating sufficient convergent validity.

Regarding the usefulness of the scale, given the small number of items in each subdimension, we believe that, although the subscale reliabilities reach 0.70 and thus meet psychometric standards (Gu & Wen, 2017), caution should be exercised when using a subscale as a stand-alone measure. In terms of test-information coverage, the short scale retains 66.96% of the original scale’s information, indicating that the short REIS captures self- and other-focused EI much as the original does and covers the full range of EI dimensions. In conclusion, compared with the original REIS, the short version uses half as many items while offering competitive validity and reliability.

The current study is not without limitations. First, the REIS is a self-report instrument and is therefore susceptible to social desirability bias: participants can disguise or deliberately distort their responses to present themselves favorably. Second, this study focused on college students, whereas the scale has been applied to many kinds of populations. The results should therefore be interpreted with caution, because respondents were not recruited through probability sampling and were concentrated among college students, which limits the generalizability of our findings; sampling bias may also have obscured how different groups respond to the REIS, given group differences in emotional intelligence. Third, the sample was skewed toward female participants (58% women); although most items showed no gender DIF, the gender ratio was unbalanced. The last limitation concerns the use of the subscales: because each subscale contains only a few items, caution should be taken when using a subscale as a stand-alone measure, and we suggest using the short scale as a whole.

Despite these limitations, we believe the development of the scale has considerable value for future research. Although previous studies have shown that the REIS is multidimensional, the current data clearly favor a bifactor model, and this finding provides a basis for current practice.