Abstract
The Experience Recollection Method (ERM), announced by the author at the HCII 2018 conference, is a research method for evaluating the contents and degrees of UX (User Experience). It is a memory-based measurement method in which informants are asked about the content and degree of episodes concerning the use of a targeted artifact (product, system or service), along with a rough time scale. Although the method requests episodes on artifact usage, it does not ask for detailed time information, in consideration of the vagueness of memory; hence it does not generate the visual representation of a UX curve/graph as do such methods as CORPUS, iScale, UX Curve and UX Graph.
This presentation is based on data obtained in FY 2017 from the same informants, once in September and once in January. We compared the two datasets and checked whether the information obtained by ERM has a certain degree of reliability, i.e. whether the episodes reported in the first survey were retained in the second survey and whether the rating-scale values were almost the same. Generally speaking, ERM was confirmed to have high reliability and to provide reliable information on UX.
1 Evaluation of UX
1.1 Previous Approach
The UX (User Experience) can be defined as a “person’s perceptions and responses resulting from the use and/or anticipated use of a product, system or service” (ISO 9241-210:2010) [1]. As an extension of this definition, Kurosu [2,3,4,5] proposed that the quality in design, including usability, is the basis for the quality in use, and that the latter is related to the UX. Thus, the UX should be evaluated or measured in relation to a wider range of quality characteristics, including usability, functionality, performance, reliability, safety and attractiveness.
From this perspective, the evaluation methods proposed to date can be screened down to a small set, even though a great many so-called “UX evaluation methods” have been proposed, as listed on the All About UX website [6].
UX evaluation methods can be classified into two categories: real-time UX evaluation and memory-based UX evaluation.
Real-time UX evaluation methods, including questionnaires, can obtain information in situ and just in time. However, because of their invasive nature (informants are requested to answer questions during their everyday life, which can be an obstruction), it is not recommended to repeat the survey for more than a few weeks. In other words, real-time UX evaluation methods can be applied repeatedly only over a limited time range.
In contrast, memory-based UX evaluation methods have no such temporal limitations and can be applied to UX over a long-term period. But they do have limitations originating from the nature of human memory. People forget many events, even important ones, and they may also edit or change the contents without any ill will. Hence validity and reliability are important concerns for memory-based methods.
Real-time evaluation methods include questionnaires, methods for evaluating emotion, and other methods. Questionnaires include SUS [7], SUPR-Q [8], the Product Reaction Cards [9] and AttrakDiff [10]. Emotion evaluation methods include 3E [11] and Emo2 [12], and other real-time methods include ESM [13] and diary methods such as DRM [14, 15] and TFD [16].
Memory-based methods include CORPUS [17], iScale [18], UX curve [19], UX graph [20] and ERM [5, 21].
1.2 ERM
ERM was proposed by Kurosu et al. [5, 21] based on reflection on the advantages and disadvantages of existing memory-based methods. As in previous methods, informants are asked about past events (episodes) and asked to rate them. But no curve or graph is drawn in ERM, on the grounds that memory is not precise enough to be represented as coordinates on a time scale. As can be seen in Fig. 1, informants are given only seven rough time zones: expectation, purchase, early use, use, recent use, present time and near future, which together cover all phases of experience with an artifact (product, system or service).
Each time zone is defined as follows:

- Expectation: estimation of UX before the purchase
- Purchase: evaluation of UX at the time of purchasing or obtaining the artifact
- Early use: evaluation of UX just after the purchase (around a few weeks to a few months)
- Use: evaluation of UX after the early use until the recent use. This may range from a few weeks to several years, depending on the time of purchase and the time of the survey
- Recent use: evaluation of UX just before the time of the survey (around a few weeks to a few months)
- Present time: evaluation of UX at the time of the survey
- Near future: expectation and/or estimation of UX after the time of the survey
Although ERM uses a letter-sized sheet of paper and the number of rows is limited, informants are allowed to write more than one episode in a row. Each episode is accompanied by a rating of the feeling from positive (+10) to negative (–10), i.e. a 21-point scale is used.
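The structure of an ERM sheet described above can be modeled as a simple data record. The following Python sketch is the author's illustration only (the names `Episode`, `sheet`, etc. are hypothetical, not part of ERM); it captures the seven time zones, the 21-point integer scale, and the rule that one time zone may hold several episodes:

```python
from dataclasses import dataclass

# The seven rough time zones of ERM, in chronological order.
TIME_ZONES = ["expectation", "purchase", "early use", "use",
              "recent use", "present time", "near future"]

@dataclass
class Episode:
    """One episode from an ERM sheet (hypothetical record layout)."""
    time_zone: str   # one of TIME_ZONES
    text: str        # free-form description of the experience
    rating: int      # -10 (negative) .. +10 (positive), 21-point scale

    def __post_init__(self):
        if self.time_zone not in TIME_ZONES:
            raise ValueError(f"unknown time zone: {self.time_zone}")
        if not (-10 <= self.rating <= 10):
            raise ValueError("rating must be an integer in [-10, +10]")

# A time zone may hold more than one episode, so a sheet maps
# each zone to a list of episodes.
sheet = {zone: [] for zone in TIME_ZONES}
sheet["early use"].append(Episode("early use", "screen felt responsive", 7))
sheet["early use"].append(Episode("early use", "battery drained fast", -4))
```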
1.3 Reliability of ERM
As with psychological tests, UX evaluation methods should possess a certain level of validity and reliability. Regarding validity, especially content validity, there should be no problem, because informants are asked about their own experience with an artifact they have been using. But the reliability has not yet been confirmed, and there seems to be no research investigating this issue. This is why this paper deals with the reliability of a memory-based UX evaluation method.
2 Verifying the Reliability of ERM
2.1 Method
The basic idea for verifying the reliability of ERM is to compare the results of two surveys of the same group of informants, i.e. the test-retest method. Luckily, the author had a class at the graduate school and asked the students to collaborate in the survey twice, once at the first lecture and again at the last lecture of the semester. The first survey was conducted on Sep. 25, 2017 and the second on Jan. 22, 2018, with 119 days between the two surveys. It would have been almost impossible for the informants to remember, at the second survey, what they had answered in the first.
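The stated interval between the two survey dates can be checked with simple date arithmetic, for instance:

```python
from datetime import date

# Dates of the two surveys reported in the paper.
first_survey = date(2017, 9, 25)
second_survey = date(2018, 1, 22)

print((second_survey - first_survey).days)  # 119
```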
Informants
There were 26 students registered in the class; the attendance at the lectures on Sep. 25, 2017 and Jan. 22, 2018 is shown in Table 1. Because of absences, a total of 23 students attended at least one of the lectures, and among them 17 attended both, yielding 17 datasets available for analysis. Unfortunately, it became clear during the analysis that 3 of them had purchased a new model, and they were excluded from the subsequent analysis. Finally, 14 datasets were used in the overall analysis. To our regret, all the students were male, perhaps because the class was offered in the engineering department.
Targeted Device
Because all of the students were using a smartphone (not a feature phone), it was chosen as the targeted device. Because it is a multi-purpose device that almost all users use daily, or rather many times a day, its users must have a variety of experiences ranging from positive to negative.
Procedure
An ERM sheet was delivered to each informant, a brief instruction was given, and 30 min were allowed for writing down their experiences. During the instruction, informants were told that:

- This survey asks you to write down your personal experience with your smartphone
- First, write down your university ID, sex, age, and a description of the smartphone
- There are seven time zones, including the expectation, purchase … near future (with the explanation of each time range)
- Write down what you experienced in the episode slot and, in the rating slot, the degree of satisfaction vs. dissatisfaction, or positive vs. negative feeling, from +10 through 0 to –10, depending on your subjective impression. Please use integers
- You may proceed from the expectation to the near future, but which time zone you fill in first is up to you, and you can go back to a time zone you have already filled
- Although the number of slots is limited, you can write two or more episodes and ratings in one slot if you need to
- If you don’t remember what you experienced in a time zone, you can skip it
Obtained Data
Handwritten ERM sheets were collected, input into Excel, and then translated into English. The Appendix shows all the raw data.
3 Results
3.1 Rating Results
Rating values correspond to the vertical position of the curve/graph in CORPUS, iScale and UX Curve. But unlike those methods, ERM separates only rough time zones, corresponding to the quasi-continuous horizontal position in those methods.
One point that requires caution when using the time zones for reliability verification is the meaning of each time zone in the first survey versus the second. As shown in the hypothetical data of Fig. 2, each time zone shifts bit by bit depending on when the surveys were conducted. For the hypothetical informant in this figure, the smartphone was, of course, purchased at a single point in time, but each subsequent time zone represents a slightly displaced physical time depending on when the survey was conducted. For example, the present time in the first survey was Sep. 2017, while the present time in the second survey was Jan. 2018. This displacement becomes larger as the span between the purchase and the survey becomes shorter.
Fortunately, the informants who purchased their smartphone in 2017 (informants C, F, H, I, L and M) showed similar ups and downs of the rating values in the first and second surveys, as can be seen in the following sections. This may mean that the time zones were rough rather than exact for the informants; hence the horizontal positions in the curves/graphs of CORPUS, iScale and UX Curve are not based on an exact, equal time unit and should be regarded as quasi-continuous.
3.2 Reliability Measure
Usually, reliability (ρ) is represented by the correlation coefficient (r). In this study, Kendall’s coefficient of concordance (W) was also calculated. These values were calculated based on the average rating for the 7 time zones. The distribution of r is shown in Fig. 3 and that of W in Fig. 4. These graphs show rather high reliabilities.
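The two measures can be computed as in the following pure-Python sketch, with illustrative ratings rather than the study's actual data: Pearson's r between the two surveys' ratings over the 7 time zones, and Kendall's W with the two surveys treated as two "raters" ranking the time zones (the tie correction is omitted for brevity):

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def _ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def kendalls_w(ratings_by_rater):
    """Kendall's W for m raters over n items (no tie correction)."""
    m, n = len(ratings_by_rater), len(ratings_by_rater[0])
    # Sum, over raters, of the rank each rater gives to each item.
    rank_sums = [sum(r) for r in zip(*map(_ranks, ratings_by_rater))]
    mean = m * (n + 1) / 2
    s = sum((rs - mean) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Illustrative ratings for the 7 time zones, one list per survey.
survey1 = [3, 5, 8, 6, 4, 5, 6]
survey2 = [2, 6, 7, 6, 3, 5, 7]
print(round(pearson_r(survey1, survey2), 3))
print(round(kendalls_w([survey1, survey2]), 3))
```

Both measures reach 1.0 for perfect agreement; W drops to 0 when the two surveys rank the time zones in opposite orders, whereas r goes to –1.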
3.3 Episode Results
Episodes were verbal in nature and will thus be analyzed one by one in the next section.
4 Analysis of Each Data
The ID of each informant was randomly assigned. Each episode is given an ID of the form <informant ID><episode number>-<year>, e.g. A4-2017. Please refer to the Appendix.
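This ID scheme can be parsed mechanically. The following small sketch, with a regular expression of the author's own devising (not part of the paper), splits an ID such as A4-2017 into its three parts:

```python
import re

# <informant ID (one letter A..N)><episode number>-<year>, e.g. "A4-2017"
EPISODE_ID = re.compile(r"^(?P<informant>[A-N])(?P<episode>\d+)-(?P<year>\d{4})$")

def parse_episode_id(eid):
    """Split an episode ID into (informant, episode number, year)."""
    m = EPISODE_ID.match(eid)
    if m is None:
        raise ValueError(f"malformed episode ID: {eid}")
    return m["informant"], int(m["episode"]), int(m["year"])

print(parse_episode_id("A4-2017"))  # ('A', 4, 2017)
```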
Informant A generally gave positive ratings, except for the size of the device, which was rated negatively (A12-2017 and A13-2018), and its weight, which was also rated negatively (A12-2017 and A8-2018) (Fig. 5).

Informant B generally gave positive ratings, especially in 2017. But he rated the future negatively (B10-2017, B10-2018), fearing that the device might not be able to cope with new applications (Fig. 6).

Informant C gave strange ratings to the same aspect of the device, namely that its specification is almost the same as his previous phone’s: one negative and one positive (C2-2017, C2-2018) (Fig. 7).

At the beginning of use, informant D had a negative impression of fingerprint traces on the touch panel (D2-2017, D4-2018), but his evaluation gradually became positive over the course of usage (Fig. 8).

Informant E showed a drastic change of ratings: quite high in the early days, then turning negative during usage, regarding the battery life (E9-2017, E9-2018) and the Wi-Fi connection (E10-2017, E10-2018) (Fig. 9).

Informant F wrote different episodes in 2017 and 2018. But there are some common episodes, such as the speed (F4-2017, F5-2018), the quality of photographs (F7-2017, F6-2018), and the convenience of the second screen (F11-2017, F8-2018) (Fig. 10).

Informant G wrote positively about the joy of accessing the internet (G1-2017, G1-2018) and of using net contents (G5-2017, G3&G4-2018) (Fig. 11).

Informant H wrote about his high expectation (H1-2017, H1-2018) and the screen quality (H2-2017, H2-2018). Generally, his ratings are higher in 2018 (Fig. 12).

Informant I gave no negative ratings. Positive evaluations concern the processing speed at the expectation (I1-2017, I1-2018) and early use (I3-2017, I3-2018) stages. Strangely, he rated the fast battery drain positively (I10-2017, I9-2018); he might have misunderstood the instruction (Fig. 13).

Informant J had different expectations: one negative, for the poor operability (J1-2017), and one positive, for the good performance (J1-2018). This informant did not give consistent episodes except for the present-time evaluation (J12-2017, J12-2018) (Fig. 14).

Informant K generally gave positive ratings, except recently for the fingerprint recognition (K9-2017, K9-2018) and an unexpected breakdown (K10-2017, K10-2018) (Fig. 15).

Informant L gave negative ratings only recently, for the lack of storage (L9-2017, L9-2018). Another negative evaluation was given to different issues: one to the heat (L10-2017) and another to the system freezing (L10-2018) (Fig. 16).

Informant M complained about the same gyro-sensor problem that occurred during the early use (M3-2017, M3-2018). He still pointed out that problem at the present time (M7-2017, M7-2018) (Fig. 17).

Informant N pointed out the usability of the plastic case cover (N2&N3&N4&N9&N10-2017, N3&N9-2018), generally a bit negatively (Fig. 18).
5 Conclusion
The reliability of ERM was tested on smartphones using the test-retest method. Two surveys of the same 14 informants, who had continued to use the same model, were conducted, one in Sep. 2017 and the other in Jan. 2018. Using ERM, episodes and subjective ratings were obtained for 7 time zones: expectation, purchase, early use, use, recent use, present time and near future. Two reliability measures (the correlation coefficient and the coefficient of concordance) were calculated, and relatively high reliability was confirmed.

Based on the content analysis for each informant, the same episodes were found around the same time zones and were rated in the same way. This also confirmed the high reliability of ERM.
References
ISO 9241-210:2010. Ergonomics of Human-System Interaction - Human-Centred Design for Interactive Systems (2010)
Kurosu, M.: Re-considering the concept of usability. In: Keynote Speech at APCHI2014 Conference (2014)
Kurosu, M.: Usability, quality in use and the model of quality characteristics. In: Kurosu, M. (ed.) HCI 2015. LNCS, vol. 9169, pp. 227–237. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-20901-2_21
Kurosu, M.: Theory of User Engineering. CRC Press, Boca Raton (2016)
Kurosu, M., Hashizume, A., Ueno, Y.: User experience evaluation by ERM: experience recollection method. In: Kurosu, M. (ed.) HCI 2018. LNCS, vol. 10901, pp. 138–147. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91238-7_12
All About UX. All UX Evaluation Methods. https://www.allaboutux.org/all-methods
Brooke, J.: SUS: A “quick and dirty” usability scale. In: Jordan, P.W., Thomas, B., Weerdmeester, B.A., McClelland, I.L. (eds.) Usability Evaluation in Industry. Taylor and Francis, London (1996)
Sauro, J.: SUPR-Q: a comprehensive measure of the quality of the website user experience. J. Usability Stud. 10(2), 68–86 (2015). http://uxpajournal.org/supr-q-a-comprehensive-measure-of-the-quality-of-the-website-user-experience/
Benedek, J., Miner, T.: Measuring desirability: new methods for evaluating desirability in a usability lab setting. In: Proceedings of Usability Professional Association (2002)
Hassenzahl, M., Burmester, M., Koller, F.: AttrakDiff: a questionnaire to measure perceived hedonic and pragmatic quality (in German). In: Ziegler, J., Szwillus, G. (eds.) Mensch & Computer. B.G. Teubner (2003)
Tahti, M., Arhippainen, L.: A proposal of collecting emotions and experiences. In: Interactive Experiences in HCI, vol. 2, pp. 195–198 (2004)
Laurans, G., Desmet, P.M.A.: Using self-confrontation to study user experience: a new approach to the dynamic measurement of emotions while interacting with products. In: Desmet, P.M.A., van Erp, J., Karlsson, M. (eds.) Design & Emotion Moves. Cambridge Scholars Publishing (2006)
Larson, R., Csikszentmihalyi, M.: The experience sampling method. New Dir. Methodol. Soc. Behav. Sci. 15, 41–56 (1983)
Kahneman, D., Krueger, A.B., Schkade, D.A., Schwarz, N., Stone, A.A.: A survey method for characterizing daily life experience: the day reconstruction method. Science 306(5702), 1776–1780 (2004)
Karapanos, E., Zimmerman, J., Forlizzi, J., Martens, J.-B.: User experience over time: an initial framework. In: CHI 2009 Proceedings, pp. 729–738. ACM (2009)
Kurosu, M., Hashizume, A.: TFD (Time Frame Diary)–a new method for obtaining ethnographic information. In: APCHI 2008 Proceedings (2008)
von Wilamowitz-Moellendorff, M., Hassenzahl, M., Platz, A.: Dynamics of user experience: how the perceived quality of mobile phones changes over time. In: UX WS NordiCHI 2006, pp. 74–78 (2006)
Karapanos, E., Martens, J.-B., Hassenzahl, M.: Reconstructing experiences with iScale. Int. J. Hum.-Comput. Stud. 70, 1–17 (2012)
Kujala, S., Roto, V., Vaananen-Vainio-Mattila, K., Karapanos, E., Sinnela, A.: UX curve: a method for evaluating long-term user experience. Interact. Comput. 23, 473–483 (2011)
Kurosu, M.: Is the satisfaction evaluation by UX graph, a cumulative one or recency-based one? (in Japanese). In: Japan Kansei Engineering Society Spring Conference Proceedings (2015)
Kurosu, M., Hashizume, A., Ueno, Y., Tomida, T., Suzuki, H.: UX graph and ERM as tools for measuring Kansei experience. In: Kurosu, M. (ed.) HCI 2016. LNCS, vol. 9731, pp. 331–339. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-39510-4_31
Appendix
The raw data (episodes and ratings) of ERM from Sep. 2017 and Jan. 2018 for all 14 informants are shown in the following tables.
© 2019 Springer Nature Switzerland AG
Kurosu, M., Hashizume, A. (2019). Can UX Over Time Be Reliably Evaluated? Verifying the Reliability of ERM. In: Kurosu, M. (ed.) Human-Computer Interaction: Perspectives on Design. HCII 2019. Lecture Notes in Computer Science, vol. 11566. Springer, Cham. https://doi.org/10.1007/978-3-030-22646-6_12