Emotional intelligence (EI) has become an alluring construct for the wider public, as it is believed to encompass personal abilities that can be advantageous in a variety of settings (e.g. intra- and interpersonal relationships with family and peers, in schools or workplace environments). However, most scientific evidence supporting this notion is correlational in nature, consisting of a plethora of studies relating EI to several overlapping psychological constructs (e.g. personality, intelligence) and to outcomes that, at best, suggest a small improvement in academic or professional success and quality of life (Côté and Miners 2006; Gardner and Stough 2002; Joseph and Newman 2010; Parker et al. 2004; Power and Dalgleish 2007; Schutte et al. 2007; Zeidner et al. 2004). Research also tends to overlook hypotheses related to negative outcomes of high EI skills (Côté et al. 2011). Furthermore, the concept of EI still raises considerable debate within the scientific community over different conceptualizations of the construct, and criticisms and cautionary notes about the pitfalls of, and overreliance on, existing EI assessments abound (Fan et al. 2010; Joseph and Newman 2010; Roberts et al. 2001b). While some authors regard EI as a set of personality traits or behavioural tendencies, others propose that EI is a set of cognitive abilities or competencies (Evans et al. 2019; MacCann et al. 2014; MacCann and Roberts 2008; Mayer et al. 2000, 2008a, b; Mayer and Salovey 1997). These theoretical stances have direct implications for the measurement of EI and, consequently, for any theoretical formulation of models and empirical evidence involving the construct.

Meta-analytic studies indicate that, despite the long history of theoretical and empirical research in ability EI in the psychological fields, evidence for criterion validity is still very scarce for ability models (Joseph and Newman 2010). Empirical studies also highlight the scarcity and inconsistency of findings regarding gender and race differences in EI ability, and the need to include community samples and address possible biases in ability testing for different sociodemographic groups (Allen et al. 2014; Fan et al. 2010; MacCann and Roberts 2008; Sharma 2017). In an examination by Petrides (2013), it was argued that a clearer distinction between the different theoretical proposals is necessary, because many assessment measures present convergence problems and show interrelationships with different constructs (intelligence, effectiveness and declarative knowledge, and other cognitive abilities) (Barchard and Hakstian 2004; Mayer et al. 1999; Petrides 2013; Roberts et al. 2001a). This issue highlighted the need to rethink assessment methods of emotional intelligence components, whether by making them similar to traditional intellectual and cognitive abilities tests, or by creating novel approaches that are more suitable for the measurement of this construct and related personality and cognitive abilities.

We adopt the conceptualization of EI as a set of abilities, similar to the theoretical model initially proposed by Mayer and colleagues (Mayer and Salovey 1997). The hierarchical model of emotional abilities includes skills related to how individuals perceive and express their emotions, how emotions are integrated into other thought processes, the understanding of the complex interplay between emotions and context, and the management or regulation of emotional experiences. While early formulations postulated a 4-branch model, empirical evidence has pointed to a 3-factor hierarchical structure (perception, understanding and management) as part of a broader second-stratum ability-EI factor within a general intelligence structure (Evans et al. 2019; Fan et al. 2010; MacCann et al. 2014). Furthermore, these components of emotional processing skills also fall within the wider domain of social cognition, which is related to, but distinct from, other non-social cognitive abilities (da Motta et al. 2019; Frajo-Apor et al. 2017; Lysaker et al. 2014; Phillips et al. 2003; Schmidt et al. 2011; Singer 2012).

The focus of the current study is the evaluation of emotion understanding, a skill that can contribute to the measurement of the general concept of EI. To date, the most widely used measure has been the MSCEIT, which has received considerable criticism regarding its construction, scoring methods that yield inconsistent results, and validity issues (Fan et al. 2010; MacCann and Roberts 2008; Petrides 2010; Roberts et al. 2006). In an attempt to increase the diversity of EI measurement and to overcome these known shortcomings, MacCann and Roberts (2008) developed the Situational Test of Emotion Understanding (STEU), which presented promising results for the evaluation of the third branch of Emotional Intelligence.

Few psychometrically sound EI measures have been made freely and directly accessible to researchers and clinicians. Although most EI assessment tools are proprietary measures subject to considerable reproduction and administration fees, permission to reproduce and use the STEU can be easily obtained from APA PsycTests (see Method section). The STEU is available in long (42 items) and brief (19 items) forms (Allen et al. 2014; MacCann and Roberts 2008). The test items were devised within Roseman's appraisal theory (Roseman 2001) and consist of several scenarios for which participants indicate, in a multiple-choice format, which emotion is most likely to be felt in the given situation. Responses are considered correct or incorrect according to that underlying theoretical model. This method has proved to be a good approach to maximum performance testing of emotional abilities (Austin 2010; MacCann and Roberts 2008), and previous studies suggested that emotion understanding ability holds the strongest relationship to cognitive abilities, supporting the idea that emotion understanding involves conceptual knowledge and representations about emotions (Allen et al. 2014; MacCann 2010; MacCann et al. 2014). While both forms presented good psychometric properties in their initial studies, the STEU presented slightly lower reliability in its shortest form than in the longer version. This is a common trade-off when retaining the smallest pool of items (less than 50%, in this case) that preserves the largest possible amount of information in a measure (Allen et al. 2014). The use of brief measures, whether in clinical practice or in research, is particularly emphasized due to their cost-effectiveness, especially when working with more vulnerable individuals or when avoiding bias due to respondent fatigue or carelessness.
In this regard, previous research also points to the advantages of using Item Response Theory (IRT) to explore estimates of ability and differential item functioning according to sociodemographic characteristics, since studies of this measure (and of EI measures in general) in independent samples are still lacking (Allen et al. 2014; Barchard and Hakstian 2004; Barchard and Russell 2006; Fan et al. 2010; MacCann and Roberts 2008; Sharma 2017). To the best of our knowledge, no further psychometric studies have been carried out with the brief version of the STEU. Thus, the goal of the current study is twofold: to translate and adapt the STEU-B to make it available to researchers and practitioners working with Portuguese-speaking communities, and to study the psychometric properties of the STEU-B using item response theory, along with its relationship with more basic EI processes and cognitive abilities. In addition, we expand previous work by studying differential item functioning and suggesting scoring techniques that can more precisely locate an individual's skill level across the Emotional Understanding construct.

Method

Participants

A total of 899 participants took part in this study. Three-hundred sixty participants (40%) were males and 539 (60%) were females, with ages between 14 and 72 years old (M = 22.46; SD = 9.98). Most participants were single (N = 748; 83.2%), followed by 125 (13.9%) participants who were married or in a civil union, 21 participants who were divorced (2.3%) and 2 widowed (0.2%). Three participants (0.3%) did not report their civil status. Most participants (N = 388; 43.2%) were attending or completed elementary school and 344 (38.3%) were attending or completed mandatory education (high school). The remaining 151 (16.7%) participants had a higher education (college, masters or doctoral degree) or alternative curricula (e.g. professional school), while 16 participants did not report their education level (1.7%).

Measures

Situational Test of Emotional Understanding – Brief (STEU-B)

(MacCann and Roberts 2008, Portuguese version by da Motta et al. 2016). The STEU-B is a 19-item version of the STEU (42 items), obtained from IRT analysis performed in the studies by Allen et al. (2014), that provides the maximum amount of information with the least testing time. The items depict different interpersonal scenarios, with different degrees of difficulty, to which participants are asked to choose the more likely emotional response resulting from each situation (e.g. Clara receives a gift. Clara is most likely to feel? Response options: a) happy; b) angry; c) frightened; d) bored; e) hungry). Items were scored dichotomously, rated as correct or incorrect according to the key provided by the authors (C. MacCann, personal communication, March 22nd, 2016) and available as a supplemental file in the paper by MacCann and Roberts (2008). The STEU-B is the shortest available ability-based Emotional Intelligence test of emotion understanding, with increased utility for assessment time restrictions and suitable for the general public (Allen et al. 2014).
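
The dichotomous scoring described above can be sketched in a few lines. This is only an illustration: the function name `score_responses` and the three-item key in the usage note are hypothetical, and the actual STEU-B key is not reproduced here. Treating a missing answer as incorrect is also an assumption of this sketch.

```python
from typing import List, Optional, Sequence


def score_responses(responses: Sequence[Optional[str]],
                    key: Sequence[str]) -> List[int]:
    """Score each multiple-choice response as 1 (matches key) or 0.

    A missing response (None) is scored 0; this is an assumption of the
    sketch, not a rule stated in the text.
    """
    if len(responses) != len(key):
        raise ValueError("responses and key must have the same length")
    return [1 if given == correct else 0
            for given, correct in zip(responses, key)]


def raw_score(item_scores: Sequence[int]) -> int:
    """Raw score is the number of correct responses."""
    return sum(item_scores)
```

For instance, with a made-up key `["a", "c", "b"]`, the responses `["a", "b", "b"]` would be scored item by item and then summed with `raw_score`.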

Penn Emotion Recognition Task

(da Motta et al. 2019; Gur et al. 2010) is a measure of social cognition that tests ability to decode and identify specific emotions presented in 40 faces (fear, anger, sadness, joy or no emotion), balanced by gender and extreme or mild emotional expressions.

Short Penn Verbal Reasoning Test

(da Motta et al. 2019; Gur et al. 2010) is a measure of verbal intellectual ability, comprising the 8 best predictors of the full 29-questions version of the test (Gur et al. 2001). It consists of verbal analogies that are answered in a multiple-choice format.

Short Raven Progressive Matrices

(da Motta et al. 2019; Gur et al. 2010) comprises a computerized version of 9 questions from the 60-question regular Raven test, chosen according to a previous analysis that demonstrated their predictive power of the scores from the 60-questions version (Gur et al. 2001). The task measures non-verbal reasoning for solving each matrix problem.

Procedures

Prior to data collection, the authors obtained permission to translate, reproduce and use the Situational Test of Emotion Understanding – Brief (see Measures section) from the authors and APA. We followed several guidelines for adequate cross-cultural adaptation of psychological assessment measures (Beaton et al. 2000; Sousa and Rojjanasrirat 2011). The only change in the content was made in some of the first names of the persons described in STEU scenarios, which were changed to names or spellings more familiar to the Portuguese native-speakers (e.g. Susan was replaced by Suzana). The first translation of the contents to Portuguese was carried out by a bilingual psychologist (first author). The back translation was performed by a U.S. resident bilingual teacher who was blind to the study goals. No significant deviations were found between the original English version and the back-translated version, and the final Portuguese translation was revised by a senior psychologist. Changes made at this stage referred to grammar/punctuation, or deciding the best term whenever an English word could be translated in more than one way in Portuguese. Prior to its administration to a wider sample, the Portuguese STEU-B was administered to 5 adult participants from the general population. In the pilot administration, participants were asked to carefully read all the scenarios and response options before responding, and to mark any items in which they felt something was wrong, difficult to read or had words they did not understand. After the test was completed, the researcher asked each participant about any issues found in the STEU-B and participants did not report difficulties regarding items’ clarity or comprehension.

A convenience sample of 899 participants was administered a research protocol including the STEU-B as a part of a wider research project. Participants were adolescents and adults from the Portuguese mainland and Azores islands communities. All participants were informed about the research goals, warranting the anonymity and the voluntary character of participation, and only participants who provided their written informed consent were administered the measures included in the research protocol. Adult participants were recruited through non-probabilistic sampling method (e.g., the authors’ contact networks, advertisements in local institutions, newspapers and other media). Underage participants were contacted in local schools and a signed informed consent was obtained from their legal representatives prior to participation.

Statistical Analyses

The statistical procedures were computed using WINSTEPS Rasch Analysis (version 3.93, SWREG Inc., 2017), MPLUS (Muthén and Muthén 2010) and IBM SPSS Statistics (version 23 for Microsoft Windows, IBM Inc. Armonk, NY) software.

The large and diverse sample in the current study supports the robustness of statistical analyses using Item Response Theory and Classical Test Theory. The STEU-B was analysed through Rasch Model analysis (RM; Rasch 1961), as responses differed from item to item in terms of content, and so that items could be evaluated individually and independently. The one-dimensional structure of the STEU-B was first assessed with a Confirmatory Factor Analysis (CFA) and then through Principal Component Analysis of Residuals (PCAR), or contrasts of standardized residual variance (Aryadoust and Raquel 2019). Eigenvalues of unexplained variance obtained on the second contrast should not exceed 2 (Linacre 2011; Raîche 2005); larger values indicate multidimensionality problems. Local independence, another assumption of RM, was examined by calculating residual correlations between items, and Yen’s Q3 was used to identify possible dependent pairs of items. Although no critical values are defined for Yen’s Q3, the largest observed correlation of residuals between pairs of items should not exceed an absolute value of .20 (Christensen et al. 2017; Linacre 2011).
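
As an illustration of the local independence check, the following minimal sketch computes Yen's Q3 as the correlation between Rasch residuals for each pair of items and flags pairs whose absolute value exceeds .20. It assumes that person measures and item difficulties have already been estimated elsewhere (e.g. exported from Rasch software); the function names are hypothetical and this is not the routine used by WINSTEPS.

```python
import math
from itertools import combinations


def rasch_probability(theta, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def pearson(x, y):
    """Pearson correlation; returns NaN if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def yen_q3(responses, thetas, difficulties):
    """Q3 for every item pair.

    responses[p][i] is the 0/1 score of person p on item i; thetas and
    difficulties are previously estimated measures in logits.
    """
    n_items = len(difficulties)
    residuals = [
        [responses[p][i] - rasch_probability(thetas[p], difficulties[i])
         for p in range(len(thetas))]
        for i in range(n_items)
    ]
    return {(i, j): pearson(residuals[i], residuals[j])
            for i, j in combinations(range(n_items), 2)}


def flag_dependent_pairs(q3, cutoff=0.20):
    """Item pairs whose absolute residual correlation exceeds the cutoff."""
    return [pair for pair, r in q3.items() if abs(r) > cutoff]
```

Item indices in the returned dictionary are zero-based positions in `difficulties`, not STEU-B item numbers.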

Person and item parameters in RM are expressed in a common unit called “measure” (represented by θ, theta) that is distributed along a continuum. The units of θ are logits (log-odds units), a scale that theoretically ranges from −∞ to +∞ but typically spans about ±5 (Prieto and Velasco 2006), with 0 set as the average item difficulty.
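
To make the logit scale concrete, the sketch below shows the dichotomous Rasch model, in which the probability of a correct response depends only on the difference between the person measure and the item difficulty, both in logits. The function names are illustrative.

```python
import math


def rasch_probability(theta, difficulty):
    """P(correct) for a person at `theta` on an item at `difficulty` (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def log_odds(p):
    """Convert a probability back to logits (log-odds)."""
    return math.log(p / (1.0 - p))
```

When the person measure equals the item difficulty, the probability of success is .50; each additional logit of ability multiplies the odds of a correct response by e (about 2.72).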

The Infit Mean Square (Infit MNSQ) and Outfit Mean Square (Outfit MNSQ) were calculated to evaluate the fit of the data to the model. The Infit is an information-weighted mean square that provides information regarding items and possible structural problems (Baker 2001; Prieto and Velasco 2006). Infit values between 0.5 and 1.5 are productive for measurement; values between 1.5 and 2.0 are regarded as unproductive for measurement construction, but not degrading; values greater than 2.0 can degrade the measurement; and values smaller than 0.5 are less productive for measurement, but not degrading (Linacre 2011). The Outfit is an unweighted mean square and is sensitive to outliers (Linacre 2011). Both indices are produced for items and persons.
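
The two fit indices can be sketched for a single item as follows, assuming person measures and the item difficulty are already known. Outfit averages the squared standardized residuals, while Infit divides the summed squared residuals by the summed model variances, which is why a single surprising response from a person far from the item's difficulty affects Outfit more. The function names and interpretation labels are illustrative, and the cut-offs simply restate the ranges given above.

```python
import math


def rasch_probability(theta, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def item_fit(item_responses, thetas, difficulty):
    """Return (infit_mnsq, outfit_mnsq) for one item.

    item_responses[p] is the 0/1 score of person p; thetas[p] is that
    person's measure in logits.
    """
    sq_residuals, variances = [], []
    for x, theta in zip(item_responses, thetas):
        p = rasch_probability(theta, difficulty)
        sq_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))  # model variance of a 0/1 response
    outfit = sum(r / w for r, w in zip(sq_residuals, variances)) / len(sq_residuals)
    infit = sum(sq_residuals) / sum(variances)
    return infit, outfit


def interpret_mnsq(value):
    """Label a mean-square value using the ranges cited in the text."""
    if value > 2.0:
        return "degrading"
    if value > 1.5:
        return "unproductive"
    if value >= 0.5:
        return "productive"
    return "less productive"
```

The same computation applies to person fit by iterating over the items answered by one person instead of the persons answering one item.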

To provide initial evidence of the validity of the STEU-B, we assessed possible item bias through Differential Item Functioning (DIF) analyses. We inspected possible changes in items’ difficulty parameters across gender, educational level and age while keeping ability constant. Items are expected not to differ in difficulty (i.e. not to favour one group over another) between male and female participants, educational levels or age groups. Whenever significant DIF is found with Rasch-Welch and Mantel-Haenszel tests, DIF contrasts that yield a logit difference below .43 are considered negligible, values above .44 are considered slight to moderate, and values above .64 logits indicate a large difference in item functioning across groups (Zwick et al. 1999).
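
The magnitude rule above can be expressed as a small helper. This is a sketch under stated assumptions: the DIF contrast is taken as the difference between an item's difficulty estimated in two groups, and because the text does not say how values exactly at a boundary are treated, the code assigns .43 and .64 to the higher category. The function names are hypothetical, and the significance tests themselves are not implemented here.

```python
def dif_contrast(difficulty_group_a, difficulty_group_b):
    """Difference in an item's difficulty (logits) between two groups."""
    return difficulty_group_a - difficulty_group_b


def classify_dif(contrast):
    """Classify the absolute DIF contrast using the thresholds in the text.

    Boundary handling (>= rather than >) is an assumption of this sketch.
    """
    size = abs(contrast)
    if size >= 0.64:
        return "large"
    if size >= 0.43:
        return "slight to moderate"
    return "negligible"
```

The classification is only meaningful for items that first reach statistical significance, as stated above.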

Lastly, we complemented these analyses with Classical Test Theory practices by calculating convergent validity and exploring gender differences in STEU-B scores, and the relationship between STEU-B scores with age and performance in computerized neurocognitive tests. The scores from the computerized tasks were obtained from a sample of 170 adult participants and calculated from the total number of correct responses in each task (accuracy scores). We reported Pearson product-moment correlations with bootstrap resampling method set to 2000 samples and 95% bias-corrected Confidence Intervals.
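
For readers unfamiliar with bootstrap intervals, the following sketch resamples cases with replacement and builds a bias-corrected interval for Pearson's r. It is a simplified illustration, not the SPSS routine: it applies the bias correction without an acceleration term, uses a fixed random seed for reproducibility, and all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation; returns NaN if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def bc_bootstrap_ci(x, y, n_boot=2000, level=0.95, seed=0):
    """Return (observed r, lower bound, upper bound)."""
    rng = random.Random(seed)
    n = len(x)
    observed = pearson_r(x, y)
    boots = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases
        r = pearson_r([x[i] for i in idx], [y[i] for i in idx])
        if not math.isnan(r):
            boots.append(r)
    boots.sort()
    b = len(boots)
    # Bias correction: share of bootstrap estimates below the observed value,
    # clamped away from 0 and 1 so the inverse normal CDF is defined.
    prop = sum(1 for r in boots if r < observed) / b
    prop = min(max(prop, 1.0 / (b + 1)), b / (b + 1.0))
    normal = NormalDist()
    z0 = normal.inv_cdf(prop)
    alpha = 1.0 - level
    lower_p = normal.cdf(2 * z0 + normal.inv_cdf(alpha / 2))
    upper_p = normal.cdf(2 * z0 + normal.inv_cdf(1 - alpha / 2))

    def quantile(p):
        k = min(b - 1, max(0, int(round(p * (b - 1)))))
        return boots[k]

    return observed, quantile(lower_p), quantile(upper_p)
```

With small samples the interval can be wide, which is why the sample size used for the neurocognitive tasks matters for interpreting the reported intervals.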

Results

Dimensionality

Results from the unidimensional CFA model yielded adequate fit to the data: χ2(152) = 328.975, p < .001; CFI = .92; TLI = .91; RMSEA = .036, P[RMSEA < .05] = 1. At the item level, loadings ranged between −.16 and .75, and items 3, 4, 5, 16 and 18 presented individual loadings below .25. A closer inspection revealed that these items had the highest percentage of incorrect responses (over 55% of participants receiving 0 points on each item). The value obtained through PCAR analysis was 1.43, which also provided support for a unidimensional model. Because Rasch model analyses provide information on item difficulty and no multidimensionality problems were identified in this phase, items with lower loadings were kept in the model for further analyses.

Low residual correlations were observed between items, with values ranging from r = −.18 to r = −.13. Yen’s Q3 values were also weak for all item pairs (r = −.19 to r = −.13), remaining below the absolute value of .20 and indicating local independence for all pairs of items.

Global Fit Statistics

The model initially tested involved all participants and items from the STEU-B. Persons’ and items’ global fit measures are presented in Table 1. Most items presented adequate fit, with average Infit and Outfit values of .98 (SD = 0.09) and 1.22 (SD = .66), respectively. The maximum Outfit value was 3.61, which suggests the presence of outliers or at least one item with poor fit. The item measures ranged from −1.58 to 2.41 logits and their standard errors were low (between .07 and .11, M = 0.08; SD = 0.01). Person fit showed appropriate average Infit (M = .98, SD = 0.25) and Outfit (M = 1.19, SD = 1.11) values. The maximum Outfit values also indicate the existence of participants whose results did not fit the model. The inspection of extreme Infit and Outfit values revealed that 32 participants presented response patterns matching “careless” or “lucky guessing” responding (Linacre and Wright 1994). All participants who presented this pattern were adolescents (under 18 years old). Because these response patterns are a possible cause of the distortions observed in items’ fit statistics, the model was re-analysed excluding these 32 participants.

Table 1 Global fit statistic of STEU-B (N = 899)

Global fit statistics presented in Table 2 showed that the current set of items has the necessary conditions to produce an assessment tool with adequate metric properties. Items’ fit statistics now fall within adequate values of Infit and Outfit, without any indicators of measure degradation within the model (Infit and Outfit < 2, Table 3). Person reliability and separation values were .60 and 1.23 (lower bound), indicating a lower ability to discriminate participants in the sample according to their level of performance, while the item reliability and separation were .99 and 14.07 (lower bound), suggesting a good item difficulty hierarchy. Cronbach’s alpha (KR-20) for the scale was .63.
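
The relationship between reliability and separation reported above can be illustrated with a short sketch. It uses the standard Rasch conversions, separation G = sqrt(R / (1 − R)) and number of strata H = (4G + 1) / 3, together with a plain KR-20 computation for 0/1 data. The function names are hypothetical, and these formulas are offered as a general illustration rather than as the exact computations performed by the software used in the study.

```python
import math


def kr20(responses):
    """KR-20 for a persons-by-items matrix of 0/1 scores."""
    n_persons = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_persons
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_persons
    if var_total == 0:
        raise ValueError("total scores have zero variance")
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in responses) / n_persons
        sum_pq += p * (1.0 - p)
    return n_items / (n_items - 1) * (1.0 - sum_pq / var_total)


def separation(reliability):
    """Separation index implied by a reliability coefficient."""
    return math.sqrt(reliability / (1.0 - reliability))


def strata(sep):
    """Approximate number of statistically distinct performance levels."""
    return (4.0 * sep + 1.0) / 3.0
```

Applying `separation` to a reliability of .60 gives roughly 1.22, close to the reported 1.23 (the small gap is consistent with rounding of the reliability), and `strata(1.23)` is just under 2, in line with a measure that distinguishes about two levels of performance.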

Table 2 Global fit statistic of STEU-B (N = 867)
Table 3 Total score, θ, Standard Error, Infit and Outfit by Item, and Point-biserial between item and total score of STEU-B (n = 867)

Items-Person Map

The distribution of persons and items across the measure (θ) is represented in Fig. 1. Most items align vertically across the logits scale, with only 3 pairs of items appearing parallel. In other words, the items in parallel positions represent a similar degree of ability (emotional understanding), and these items could be excluded to make the measure shorter. However, no item was excluded from the analysis because those items did not present any fit problems and it is preferable to keep the STEU-B identical across all versions. The items also referred to distinct interpersonal scenarios that may be relevant to the theoretical construct at hand and screening for careless/random response patterns (e.g. a participant with a certain degree of emotional understanding ability has the same probability of choosing the correct response irrespective of the contexts in which a situation takes place).

Fig. 1 Item-person map of STEU-B (n = 867)

The average θ for persons was .06 logits, with the average θ for items being zero by convention. Most participants fall within a range of 4 logits (between −2 and 2), with items 3 and 4 presenting the lowest rates of correct responses. This suggests the items can evaluate a higher capacity for emotion understanding than what was observed in most participants. The items measuring the highest degrees of emotional understanding are, thus, items 3 and 4 (placed near point 3 of the logits scale), and the items measuring the lowest degrees are items 6 and 13 (placed near point −2 of the logits scale). Finally, each item of the STEU-B presents adequate Infit and Outfit values, low standard errors and adequate point-biserial correlations with the total measure, as presented in Table 3.

Differential Item Functioning

To test for differences in item functioning, we carried out Differential Item Functioning (DIF) analysis of items across groups according to gender (males vs. females), age (participants under vs. over 18 years old) and years of education (mandatory education or below vs. higher education). As shown in Fig. 2a, no statistically significant differences in item difficulty were found in pairwise comparisons between male and female participants, except for item 17, which reached the significance threshold and presented a contrast of .42 logits favouring males, a DIF value within the negligible range according to Zwick et al. (1999). Regarding educational level, 5 items presented statistically significant differences in measures in pairwise comparisons between participants with mandatory education or less and those with higher education. As shown in Fig. 2b, 3 items favoured those with higher education levels, while 2 favoured the group with fewer years of education. The magnitude of this difference was considered moderate to large, as the DIF contrasts of the 5 items had values above .64 logits (or below −.64 logits, when items favoured participants with fewer years of education). Finally, when comparing groups regarding age (below or above 18 years old), nearly half the items presented large differences in item difficulty measures, as presented in Fig. 2c and Table 4. A more thorough analysis of the plot and the table shows that items favouring youths were located among the easiest and medium difficulty items on the DIF measure plot, and the more difficult items tended to favour the older participants. Differential Group Functioning analysis was non-significant, indicating that the observed differences in item difficulty between age groups did not impact the total score of the STEU-B measure.

Fig. 2 Changes in item difficulty between male and female participants (a), level of education (b) and age (c) (n = 867); **: p ≤ .001

Table 4 Differential item functioning between youths and adults (n = 867)

Convergent and Discriminant Validity

To assess convergent and divergent validity, we explored the relationships between test scores and age, as well as sex differences. A weak but statistically significant positive correlation was found with age, indicating that older participants achieved higher scores on the STEU-B (r = .222, p < .001, 95% CI [.166, .277]). Females showed a significantly higher average score (M = 50.69; SD = 9.81) than male participants (M = 48.46; SD = 10.47; t(897) = −3.241, p = .001), but the gender effect was small (Cohen’s d = .22).
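
The reported effect size can be checked from the summary statistics alone. The sketch below computes Cohen's d with a pooled standard deviation; it assumes the group sizes given in the Participants section (539 females, 360 males), which match the 897 degrees of freedom of the t test. The function names are illustrative.

```python
import math


def pooled_sd(n1, sd1, n2, sd2):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference between two independent groups."""
    return (mean1 - mean2) / pooled_sd(n1, sd1, n2, sd2)


# Females vs. males, using the means and SDs reported in the text
# and group sizes assumed from the Participants section.
d = cohens_d(50.69, 9.81, 539, 48.46, 10.47, 360)
```

Under these assumptions, `d` rounds to .22, which agrees with the value reported above.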

Additionally, a sample of 170 participants responded to the STEU-B and to the computerized Emotion Recognition, Verbal Reasoning and Raven’s Matrices tasks, which recorded accuracy (number of correct responses), as a collateral source of information. Correlation analysis showed significant associations with accuracy in measures of complex cognition assessed by the Verbal Reasoning task (r = .370, p < .001, 95% CI [.218, .498]), with the Emotion Recognition task (r = .259, p < .001, 95% CI [.049, .438]) and with the Raven’s Progressive Matrices test (r = .338, p < .001, 95% CI [.186, .480]). Results indicate the convergence of scores involving more sophisticated cognitive processes required to appraise an emotion-eliciting scenario, and a weaker correlation with more elementary emotional processes, such as the ability to recognize emotional states from facial stimuli, and with non-verbal abstract reasoning.

Scoring

To take into account the differences in item difficulty more accurately and to avoid several psychometric issues that arise from summing raw correct responses, test administrators are encouraged to convert raw test scores into logits or into a scaled linear score. Table 5 presents a score conversion to both logits and a linear scale (ranging from 0 to 100, in order to resemble percentages), and indicates the respective percentiles to more precisely locate the respondent’s actual measure and level of performance (Boone 2016). For example, a respondent with 12 correct responses should be assigned a score of .64 logits or 56 points, and their performance is slightly above the 70th percentile. In the current sample, 138 participants (15.9%) scored 12 points and almost 80% of participants had 12 or fewer correct responses.
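
The two conversions described above can be sketched as follows. The first function rescales a logit measure linearly onto a 0 to 100 range; the second derives, from a frequency table of raw scores, the cumulative percentage of respondents at or below each score. The endpoints of the logit range and the frequencies in the usage note are made up for illustration; they are not the values behind Table 5, so this sketch will not reproduce the table's entries.

```python
def rescale_logit(theta, theta_min, theta_max, low=0.0, high=100.0):
    """Linearly map a logit measure onto the [low, high] scale."""
    if theta_max <= theta_min:
        raise ValueError("theta_max must be greater than theta_min")
    return low + (high - low) * (theta - theta_min) / (theta_max - theta_min)


def cumulative_percent(frequencies):
    """Percentage of respondents scoring at or below each raw score.

    `frequencies` maps raw score -> number of respondents with that score.
    """
    total = sum(frequencies.values())
    running = 0
    result = {}
    for score in sorted(frequencies):
        running += frequencies[score]
        result[score] = 100.0 * running / total
    return result
```

For example, with the hypothetical frequencies `{0: 1, 1: 3, 2: 4, 3: 2}`, `cumulative_percent` reports that 80% of respondents scored 2 or below, which is the same kind of statement as "almost 80% of participants had 12 or fewer correct responses".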

Table 5 Table of sample norms (500/100) and frequencies corresponding to complete test

Discussion

Emotion understanding is a skill of relative complexity, which is why it is considered a component that occupies a higher branch in hierarchical Emotional Intelligence (EI) ability models. It encompasses the interplay between cognitive and non-cognitive abilities to allow the understanding of the complex interplay between emotions and context. Notwithstanding the extensive theoretical debate and empirical research on models of EI ability and the importance of its outcomes to several fields of research and practice (e.g. occupational, educational, clinical, etc.), the issues related with assessment and measurement validity of EI abilities are still unsettled, and new measures and assessment paradigms have arisen in the past decade to address the existing methodological gaps (MacCann et al. 2011; MacCann and Roberts 2008; Roberts et al. 2008).

The current work sought to provide data on the psychometric properties of the brief version of the Situational Test of Emotion Understanding, a test devised to assess the ability to understand contextual cues that can elicit different emotional responses. We used the Rasch Model (Item Response Theory), a suitable approach to ability measurement in dichotomous scales (e.g. 0 = incorrect/1 = correct), as it allows the creation of a measurement unit that considers each item’s difficulty within the same construct continuum and provides a qualitative assessment of a scale relevant for construct validity. The Rasch model is a useful approach to psychometrics because it conforms to measurement theory rather than focusing on a measure’s performance in a particular sample. In the field of psychology, it offers additional advantages in terms of the generalizability of measurement models to distinct samples and populations, and in the identification of extreme or unexpected individual responses and item particularities that are unavailable to Classical Test Theory (Fox and Jones 1998).

While studies on the reliability and (uni)dimensionality of the scale provide statistical validity, differential item functioning was used to control for possible item bias toward specific groups defined by gender, educational level and age. The absence of bias in item difficulty between males and females provides evidence for the application of the STEU-B to the wider public. After ascertaining the invariance of the model, females scored significantly higher than males, similar to previous studies using EI ability measures (e.g. Brackett and Mayer 2003). This is possibly due to females being more stimulated to develop certain social skills based on cultural and/or social pressures, although differences in emotional processing skills tend to fade in some contexts (see Brody and Hall 2010). However, insufficient evidence has been gathered regarding gender differences in emotion understanding ability, specifically. The differences in item functioning were more prominent regarding age, and findings demonstrated that some items could be somewhat biased toward older participants, and others toward younger participants. Nevertheless, DIF cancellation was observed in the DIF analysis regarding possible bias according to education or age: items favouring one group or the other were balanced and did not compromise the overall test score (Wyse 2013). Irrespective of academic level, older participants can be ahead in terms of developmental processes relevant to social skills (e.g. complete brain maturation and development of important cognitive functions underlying emotion understanding). Furthermore, due to the relationship between Emotion Understanding and crystallized intelligence, which tends to increase over the lifespan, it is possible that this ability may also increase with age (Phillips et al. 2002).
Older participants are expected to have accumulated experiences and knowledge over their lifespan that may be fundamental to the development of the capacity to more precisely estimate the emotions that can arise from a wide range of interpersonal contexts (Kafetsios 2004; Sharma 2017). However, when holding person’s abilities constant, it is important to ensure the measure is invariant across participants’ characteristics. This aspect can be particularly relevant when devising new items or instruments and when assessing emotion understanding ability in groups with a wide age range.

The reliability (KR-20) obtained in the current study was only adequate, but higher than that reported in the studies of the 42-item version (MacCann and Roberts 2008) and equal to that of the brief version reported by Allen et al. (2014). Lower reliability is expected in RM analyses because of the local independence of items; in other words, the responses are not similar, nor do they depend on or correlate with each other. The more typical reliability indices (e.g. Cronbach’s alpha, KR-20) are prone to be artificially inflated by local dependence, and these same indices can also be affected by the number of alternatives in multiple-response forms (Jensen 1980). The possibility of studying the items’ properties and the ability range of respondents is an advantage of IRT over Classical Test Theory. According to Linacre (2011), a Rasch reliability above .5 can suffice to distinguish at least 2 levels of performance (e.g. high/low performers on a given measure), which is satisfactory for a brief evaluation of emotion understanding ability. The item map showed that the items could be precisely located across the latent variable in terms of difficulty hierarchy, but some gaps were wider in the top half (more difficult items). It is possible that the lower person reliability and separation values found in the current study are due to the reduced number of items or to the sample presenting a narrower ability range, as it included only participants from the general population and no extreme groups (e.g. clinical samples, participants with deficits in cortico-limbic functions). The ceiling effects found in the original studies of the STEU-B (Allen et al. 2014) and a similar limitation found in the study of the longer version of the STEU, which included a sample with a higher average education and age (MacCann and Roberts 2008), were not found in the current study.
For this reason, the former explanation (that the lower person reliability and separation values are simply due to the small pool of items included in the brief version) seems more likely. Hypothetically, the longer version (more items) could be preferable for studies that need to discriminate individuals by more nuanced levels of performance, as the larger number of items may increase the sensitivity of the STEU to intermediate levels of ability. Conversely, the shorter version would be more advisable to avoid participant fatigue, when working with more vulnerable individuals, or in studies in which a rougher distinction between low and high ability would suffice (Allen et al. 2014; Boone 2016). Future research on this measure should explore how well the longer and shorter forms discriminate between different levels of performance in order to test this hypothesis.
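As a rough illustration of the point about test length, the sketch below computes KR-20 from a matrix of correct/incorrect scores and then applies the Spearman-Brown prophecy formula to project reliability for a longer form. This is not the analysis carried out in the study: Spearman-Brown is a classical test theory approximation that assumes the added items are parallel to the existing ones, and it is used here only as a stand-in for the Rasch person reliability discussed above. The function names, the toy response matrix and the reliability value of .60 are hypothetical and are not taken from the study data; only the item counts (19 and 42) come from the text.

```python
from statistics import pvariance


def kr20(responses):
    """KR-20 for a persons x items matrix of 0/1 (incorrect/correct) scores."""
    n_persons = len(responses)
    n_items = len(responses[0])
    # Variance of the total scores (population variance, matching p*q below).
    total_var = pvariance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    # Sum of item variances p*(1-p), where p is the proportion correct.
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_persons
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / total_var)


def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical toy data: 5 persons x 4 items, NOT data from the study.
demo_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
demo_kr20 = kr20(demo_scores)  # approximately 0.8 for this toy matrix

# Illustrative projection from a 19-item form to a 42-item form,
# starting from an assumed (made-up) reliability of .60.
projected = spearman_brown(0.60, 42 / 19)
print(round(demo_kr20, 3), round(projected, 3))
```

Under these assumptions, the projection only shows the direction of the effect (more parallel items, higher expected reliability); whether the longer STEU actually separates more ability levels remains an empirical question.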

Findings regarding convergent and divergent validity showed that emotion understanding correlated positively with verbal reasoning, emotion recognition and general intelligence tests. The moderate and positive association with verbal ability is understandable, since the ability to comprehend (e.g. read and/or interpret the descriptions of a scenario) and correctly provide a verbal label for an emotion can be associated with emotion understanding, especially when tests are presented in a written format (MacCann et al. 2014). While it would be expected that EI abilities converge strongly with each other and less with cognitive abilities, the findings are congruous with studies demonstrating that the associations between emotion understanding and emotion perception can be significantly lower when tests consist of visual stimuli in comparison to other types of stimuli (Elfenbein and MacCann 2017). Previous findings also showed that emotional understanding had stronger relationships with cognitive abilities than other EI branches did, and that these relationships were stronger for crystallized intelligence than for fluid intelligence, supporting the idea that emotion understanding involves conceptual knowledge and representations about emotions (Allen et al. 2014; MacCann 2010; MacCann et al. 2014). Nevertheless, studies need to further clarify how emotion understanding relates to other cognitive abilities within existing intelligence frameworks. The sample used in these analyses was small and the bias-corrected confidence intervals were relatively wide. Future studies should therefore examine the convergence of the STEU-B further, preferably in larger samples for more consistent estimates, and explore the measure’s temporal stability.

Overall, the 19 items provide a robust assessment of the emotional understanding construct, and findings are indicative of a substantial equivalence to the original version of the STEU-B (Allen et al. 2014; MacCann and Roberts 2008). A shorter form of the measure with a simple scoring procedure (correct/incorrect) is particularly useful when working with specific populations (e.g. clinical samples) or when researchers face assessment time constraints. Test scores diverged from scores in other relevant cognitive processing abilities. Thus, the measure provides an important contribution to the assessment of EI abilities, as the field can benefit from the diversification of assessment measures and methodologies that can improve the operationalization of EI-related constructs. The use of novel tests with demonstrated construct validity can fill the existing methodological gaps left by the overuse of the MSCEIT and overcome some difficulties related to access to commercial measures for EI research. Furthermore, because the STEU-B focuses specifically on the Emotion Understanding construct, it can provide important insights about this EI component and inform research aimed at refining existing models of EI.

Emotional understanding (or the ability to adequately elicit/predict affect and emotional states) is a higher-order process that requires both social and cognitive processing abilities to collect information and form global appraisals from social, contextual and/or motivational cues. Ability-based tools can help further clarify the relationship between specific higher-order EI skills and more complex cognitive abilities, and can be used in efficacy studies of programs or interventions that promote different emotional skills and more adjusted behaviour (e.g. Cognitive Behavioural Therapy, Mindfulness-based cognitive therapies, Social-Emotional Learning programs, interventions for patients’ deficits in empathy, etc.; Dattilio et al. 2009; Liotti and Gilbert 2011; Wölwer et al. 2005). Furthermore, the use of more robust assessment tools of EI ability may be beneficial to managers, educators and practitioners acting in several settings (e.g. schools, workplace, healthcare), as well as to researchers carrying out empirical studies with cross-sectional and longitudinal designs in a large variety of areas that can address the potentialities and controversies around Emotional Intelligence.

Conclusion

The Portuguese version of the STEU-B is a novel and cost-effective measure available for professionals working with Portuguese-speaking communities. The dissemination of this tool adds value, considering the high number of native Portuguese speakers worldwide (Lewis et al. 2016) and the possibility of increasing the diversity of assessment measures available for cross-cultural studies or empirical research in the EI and social cognition domains. In addition, it may aid the refinement of existing models of EI and may be a useful tool to clarify the interdependence between social cognition, cognitive ability and related constructs in future studies. Finally, it should be highlighted that the use of specific tests of emotional skills, including emotion understanding, can help devise more targeted evaluation and intervention strategies in clinical, occupational or educational settings, and identify the mechanisms related to adjusted interpersonal functioning and psychological well-being.