Introduction

The System Usability Scale (SUS), developed in 1986 by Digital Equipment Corporation©, is a ten-item scale giving a global assessment of usability, operationally defined as the subjective perception of interaction with a system (Brooke 1996). The SUS items were developed according to the three usability criteria defined by ISO 9241-11: (1) the ability of users to complete tasks using the system, and the quality of the output of those tasks (i.e., effectiveness), (2) the level of resources consumed in performing tasks (i.e., efficiency), and (3) the users’ subjective reactions to using the system (i.e., satisfaction).

Practitioners have considered the SUS unidimensional (Brooke 1996; Kirakowski 1994), since its scoring system yields a single summated rating of overall usability. This scoring procedure rests on the assumption that a single latent factor loads on all items. So far, tests of this assumption have produced inconsistent results. Whereas Bangor et al. (2008) retrieved a single principal component of SUS items, Lewis and Sauro (2009) suggested a two-factor orthogonal structure, which practitioners may use to score the SUS on independent Usability and Learnability dimensions. This latter finding is inconsistent with the unidimensional SUS scoring system, as items loading on independent factors of Usability and Learnability cannot be summated according to classical test theory (Carmines and Zeller 1992). Furthermore, these factor analyses of the SUS were carried out with exploratory techniques, which lack the formal machinery needed to test which of the two proposed factor solutions best accounts for the collected data.
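To make the two scoring schemes concrete, the following is a minimal sketch, not part of the original study. It assumes the conventional SUS scoring (odd items contribute the response minus 1, even items 5 minus the response, responses on a 1 to 5 scale) and the subscale assignment usually attributed to Lewis and Sauro (2009), with items 4 and 10 forming Learnability and the remaining eight forming Usability. The function names and multipliers below are illustrative assumptions rather than text taken from this paper.

```python
from typing import List, Sequence, Tuple

# Assumed item assignment (1-based item numbers), following Lewis and Sauro (2009).
LEARNABILITY_ITEMS = (4, 10)
USABILITY_ITEMS = (1, 2, 3, 5, 6, 7, 8, 9)


def item_contributions(responses: Sequence[int]) -> List[int]:
    """Convert ten raw 1-5 responses into 0-4 contributions.

    Odd-numbered items are positively worded (response - 1);
    even-numbered items are negatively worded (5 - response).
    """
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("expected ten responses, each an integer from 1 to 5")
    return [(r - 1) if i % 2 == 1 else (5 - r)
            for i, r in enumerate(responses, start=1)]


def overall_sus(responses: Sequence[int]) -> float:
    """Single summated rating on a 0-100 scale (unidimensional scoring)."""
    return 2.5 * sum(item_contributions(responses))


def subscale_scores(responses: Sequence[int]) -> Tuple[float, float]:
    """Return (usability, learnability), each rescaled to 0-100."""
    c = item_contributions(responses)
    usability = 3.125 * sum(c[i - 1] for i in USABILITY_ITEMS)      # 8 items, max 32
    learnability = 12.5 * sum(c[i - 1] for i in LEARNABILITY_ITEMS)  # 2 items, max 8
    return usability, learnability


if __name__ == "__main__":
    answers = [4, 2, 4, 1, 4, 2, 4, 2, 4, 3]  # hypothetical respondent
    print(overall_sus(answers), subscale_scores(answers))
```

The multipliers 3.125 and 12.5 only rescale each subscale to the same 0 to 100 range as the overall score; they do not change the ordering of respondents within a subscale.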

Unlike exploratory factor analysis, confirmatory factor analysis (CFA) is a theory-driven approach that requires a priori specification of the number of latent variables (i.e., the factors), of the relations between observed and latent variables (i.e., the factor loadings), and of the correlations among latent variables (Fabrigar et al. 1999). Once the model’s parameters have been estimated, the hypothesized model is evaluated according to its ability to reproduce the sample data. These features make CFA the most accurate methodology currently available for comparing alternative factorial structures and deciding which is best.

Purpose

In the present study, we aim to compare three alternative factor models of the SUS items: the one-factor solution with an overall usability factor (overall SUS) reported by Bangor et al. (2008) (Fig. 1a); the two-factor solution reported by Lewis and Sauro (2009), with uncorrelated Usability and Learnability factors (Fig. 1b); and its less restrictive alternative, which assumes Usability and Learnability to be correlated factors (Fig. 1c).

Fig. 1 SUS models tested: one-factor (a), two uncorrelated factors (b), two correlated factors (c)

Methods

Procedure

One hundred and ninety-six Italian students of the University of Rome “La Sapienza” (28 males, 168 females; mean age = 21 years) were asked to navigate a website (http://www.serviziocivile.it) in three consecutive sections (all the students declared that they had never used the website before):

  1. In the first section, a 20-min pre-experimental training, the participants were asked to navigate the website freely in order to learn its features, graphic layouts, information structures and the lay of the interface.

  2. Afterwards, in the second section, a scenario-based navigation with no time limit, the participants were asked to navigate the website following four scenario targets.

  3. Finally, in the third section, the usability evaluation, the SUS-Italian version was administered to the participants (Table 1).

Table 1 Synoptical table of the English and Italian versions of the SUS

Statistical analyses

All models were estimated by the Maximum Likelihood Robust Method, as the data were not normally distributed (Mardia’s normalized coefficient = 10.72). This method provided the Satorra–Bentler scaled chi-square statistic (S–Bχ²), an adjusted measure of fit for non-normal data that is more accurate than the standard ML statistic (Satorra and Bentler 2001). When judged by the model’s χ² alone, virtually any factor model can be rejected if the sample size is large enough; therefore, many authors (McDonald and Ho 2002; Widaman and Thompson 2003) recommend supplementing the evaluation of the model’s fit with more “practical” indices. The Comparative Fit Index (CFI; Bentler 1990) was purposely designed to take sample size into account, as it compares the hypothesized model’s χ² with the null model’s χ². By convention (Hu and Bentler 2004), a CFI greater than 0.90 indicates an acceptable fit to the data, with values greater than 0.95 being strongly recommended. A second suggested index is the Root Mean Square Error of Approximation (RMSEA; Browne and Cudeck 1993). Like the CFI, the RMSEA is relatively insensitive to sample size, as it measures the difference between the reproduced covariance matrix and the population covariance matrix. Unlike the CFI, the RMSEA is a “badness of fit” index: a value of 0 indicates perfect fit, and the greater the RMSEA, the worse the model’s fit. By convention (Hu and Bentler 2004), an RMSEA less than 0.05 corresponds to a “good” fit and an RMSEA less than 0.08 corresponds to an “acceptable” fit.
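As an illustration of how the two indices relate to the χ² statistics, the sketch below uses the textbook formulas for the CFI and the RMSEA. It is an assumption that these match the exact variants computed by the software used in this study (some programs divide by N rather than N − 1 in the RMSEA, for example). The χ² values in the usage example are invented for demonstration only; the degrees of freedom (34 for a ten-item model with two correlated factors, 45 for the null model) follow from counting parameters.

```python
import math


def cfi(chi2_model: float, df_model: int, chi2_null: float, df_null: int) -> float:
    """Comparative Fit Index: 1 minus the ratio of the model's
    noncentrality (chi2 - df) to the null model's noncentrality."""
    d_model = max(chi2_model - df_model, 0.0)
    d_null = max(chi2_null - df_null, d_model)
    if d_null == 0.0:
        return 1.0  # neither model shows any misfit beyond its df
    return 1.0 - d_model / d_null


def rmsea(chi2_model: float, df_model: int, n: int) -> float:
    """Root Mean Square Error of Approximation (N - 1 convention)."""
    return math.sqrt(max(chi2_model - df_model, 0.0) / (df_model * (n - 1)))


if __name__ == "__main__":
    # Hypothetical chi-square values; N = 196 as in the sample described above.
    print(round(cfi(68.0, 34, 700.0, 45), 3))
    print(round(rmsea(68.0, 34, 196), 3))
```

Both functions truncate negative noncentrality at zero, so a model whose χ² falls below its degrees of freedom receives a CFI of 1 and an RMSEA of 0.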

Results

Table 2 shows that the S–Bχ² was statistically significant for all the models we tested, regardless of the number of factors and of whether the factors were correlated or not (Bentler 2004). The inspection of the CFI and RMSEA fit indexes indicated, however, that the less restrictive model assuming Usability and Learnability as correlated factors (Fig. 1c) resulted in a good fit (i.e., CFI > 0.95 and RMSEA < 0.06), whereas the unidimensional factor model (Fig. 1a) proposed by Bangor et al. (2008) resulted only in an acceptable fit (i.e., CFI > 0.90 and RMSEA < 0.08). By contrast, the two-factor model proposed by Lewis and Sauro (2009) with uncorrelated factors (Fig. 1b) did not meet any of the recommended fit criteria.

Table 2 Exact and close fit confirmatory factor analysis statistics/indices maximum likelihood estimation for the system usability scale

Since both the Bangor et al. (2008) and the Lewis and Sauro (2009) factor models are nested within the less restrictive and best-fitting model (i.e., the model with Usability and Learnability as correlated factors), we could formally compare the fit of each model proposed in the literature to the fit of the model in which it was nested. Nevertheless, given that we used the Satorra–Bentler scaled χ² for data that were not multivariate normal, we could not simply assess the χ² difference between two nested models. Rather, we assessed the scaled S–Bχ² difference according to the procedure devised by Satorra and Bentler (2001). The first contrast, which compared the Lewis and Sauro (2009) model (Fig. 1b) to the less restrictive two-factor model with correlated factors (Fig. 1c), was statistically significant (ΔS–Bχ² = 30.17; df = 1; p < 0.001). Likewise, the second contrast, which compared the unidimensional model (Bangor et al. 2008) (Fig. 1a) to the less restrictive two-factor model with correlated factors (Fig. 1c), was also statistically significant (ΔS–Bχ² = 28.54; df = 1; p < 0.001). Based on the inspection of absolute and relative fit indexes, as well as on the results of formal tests of χ² differences, we may conclude that the two-factor model with correlated factors outperformed both of the factor models proposed in the literature to account for the measurement model of the SUS.
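The scaled difference test can be sketched as follows. This is an illustrative reading of the Satorra and Bentler (2001) procedure, not the code used in the study: it assumes that, for each model, the uncorrected ML χ² (`t`), the degrees of freedom (`df`) and the scaling correction factor (`c`, the ratio of the uncorrected to the scaled χ²) are available from the estimation output. The function names are hypothetical, and the p-value helper covers only the one-degree-of-freedom case that applies to both contrasts reported above.

```python
import math
from typing import Tuple


def sb_scaled_difference(t0: float, df0: int, c0: float,
                         t1: float, df1: int, c1: float) -> Tuple[float, int]:
    """Scaled chi-square difference between a nested model (0, more
    restrictive) and a comparison model (1, less restrictive).

    t: uncorrected ML chi-square; c: scaling correction factor.
    Returns (scaled difference statistic, difference in df).
    """
    df_diff = df0 - df1
    if df_diff <= 0:
        raise ValueError("model 0 must have more degrees of freedom than model 1")
    # Scaling correction for the difference test.
    cd = (df0 * c0 - df1 * c1) / df_diff
    if cd <= 0:
        raise ValueError("non-positive difference scaling; the test is undefined here")
    return (t0 - t1) / cd, df_diff


def chi2_sf_1df(x: float) -> float:
    """Upper-tail probability of a chi-square variate with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


if __name__ == "__main__":
    # The two scaled differences reported in the text, both with df = 1.
    for stat in (30.17, 28.54):
        print(stat, chi2_sf_1df(stat) < 0.001)
```

Note that the difference between two scaled χ² values is not itself χ²-distributed, which is why the correction is computed from the unscaled statistics and their scaling factors.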

The inspection of the parameters estimated for the best-fitting model (Table 3) indicated that all the SUS items loaded significantly on the appropriate factor, with factor loadings ranging from |0.44| to |0.74| for Usability and greater than 0.70 for Learnability. Accordingly, the factor reliability assessed by the ω coefficient (Footnote 1) yielded fairly high values: 0.81 for Usability and 0.76 for Learnability. The correlation between Usability and Learnability was positive and significant (r = 0.70), showing that the greater the perceived Usability, the greater the perceived Learnability.

Table 3 Maximum likelihood standardized solution for the two-factor model of the system usability scale
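For readers who wish to reproduce a reliability estimate from a standardized solution, the sketch below computes an ω coefficient from factor loadings. It assumes the common formulation for a single factor with unit variance and uncorrelated errors, in which each item's unique variance is 1 minus its squared standardized loading; whether this matches the exact definition used by the authors is an assumption. The loadings in the usage example are hypothetical, not the values of Table 3.

```python
from typing import Sequence


def omega(loadings: Sequence[float]) -> float:
    """Omega reliability from standardized loadings on one factor.

    Loadings are taken in absolute value, which assumes that
    reverse-worded items have been keyed in the same direction.
    """
    lam = [abs(l) for l in loadings]
    if not lam or any(l > 1.0 for l in lam):
        raise ValueError("expected standardized loadings between -1 and 1")
    common = sum(lam) ** 2                    # variance due to the factor
    unique = sum(1.0 - l * l for l in lam)    # summed error variances
    return common / (common + unique)


if __name__ == "__main__":
    # Hypothetical two-item factor with equal loadings.
    print(round(omega([0.78, 0.78]), 2))
```

With only two items, as in the Learnability factor, ω depends heavily on both loadings being high, which is in line with the loadings above 0.70 reported for that factor.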

Conclusions

Although the SUS is one of the most widely used questionnaires for evaluating the usability of systems, recent contributions have provided inconsistent results regarding the factorial structure of its items, which in turn has important consequences for determining the most appropriate scoring system of this scale for practitioners and researchers. The traditional unidimensional structure (Brooke 1996; Kirakowski 1994; Bangor et al. 2008) has been challenged by the more recent view of Lewis and Sauro (2009), which assumes Learnability and Usability to be independent factors. Based on a relatively large sample of users’ evaluations of an existing website, we tested which of the two alternative models was the best for SUS ratings. Our data indicated that neither of the proposed models had a satisfactory fit to the data: the unidimensional model was too narrow to represent the contents of all SUS items, and the two-factor model with uncorrelated factors was too restrictive in its psychometric assumptions. We thus relaxed the hypothesis that Usability and Learnability are independent components of SUS ratings and tested a less restrictive model with correlated factors. This model not only yielded a good fit to the data, but was also significantly more appropriate to represent the structure of SUS ratings. Although the literature has reported higher reliability coefficients (e.g., >0.80) for the overall SUS scale, the reliability of the Learnability and Usability factors was in keeping with the psychometric standards required for short scales (Carmines and Zeller 1992). Thus, we propose that future usability studies may evaluate systems according to the scoring rule suggested by Lewis and Sauro (2009), which is consistent with the two-factor, best-fitting model retrieved in this study.

However, since we found a substantial correlation between the Usability and Learnability factors, future studies should clarify under which circumstances researchers may expect to obtain Usability scores dissociated from Learnability scores (e.g., systems with high Learnability but low Usability). In the present study, users evaluated a single system (i.e., the serviziocivile.it website), and this might have inflated the association between the two factors. Alternatively, our sample of users, which comprised college students, might be considered a sample with high computer skills compared to the general population, and this might also have inflated the factor correlation. Other studies of the SUS should therefore consider different combinations of systems and users to test the generality of the correlation between the two factors.