
1 Introduction

This chapter is concerned with serious games and the data collected about them. It examines the process of turning such data into information and, through data analytics, drawing conclusions from that information. In this way, collected data is synthesized into information and ultimately into evidence.

There has been significant growth in research and industry attention, and in public awareness, of the collection and analysis of the vast amounts of data that are available electronically. This growth has accompanied the popularization of “Big Data” and the associated knowledge and value to be generated through both automated and perceptual techniques for data analytics. At one end of the data analytics pipeline is the notion of visual analytics: the use of visualization techniques to represent multidimensional data so that people can use their perceptual skills and incomplete heuristic knowledge to find useful patterns in the data. We might term these patterns information, and we note that these approaches rely on human skills that do not necessarily translate well to computers. More automated approaches, often referred to as “data-mining”, also exist. While the intention is the same, that is, to identify useful patterns or information, the approach is complementary, relying on the strengths of computers to perform rapid, repetitive, error-free mathematical tasks that allow large amounts of data to be processed quickly.

Given that these two approaches to finding information in large data sets are not entirely distinct, good data analytics may well rely on combining the strengths of both. However, information does not imply evidence. The accuracy of the information depends very much on the quality and validity of the data and on the transformations that filter, abstract, and simplify the vast volumes of data to support analysis. If poor quality data is collected initially, then its progress through later stages of the analytics pipeline will be compromised and the validity of any identified patterns weakened.

Serious game analytics shares many of the same challenges as data analytics in other computer systems that focus on human activity. A typical challenge is how to collect data without influencing its generation and, more fundamentally, how to collect and validate data from human participants when a primary focus is on what people are thinking and doing.

This chapter explores data collection in serious games as the initial step in any serious gameplay analytics. We use a systematic review process to consider the metrics and measures reported across the human–computer interaction, gaming, simulation, and virtual reality literature. We identify how researchers gather data to try to answer fundamental questions about the efficacy of serious games and the design elements that might underlie that efficacy.

Data collection is interdisciplinary, and a comprehensive literature review spanning computer science, psychology, and education, for instance, is outside the scope of this chapter. The focus here is on the temporal aspect of data collection during serious game studies, namely how, and if, data is collected before, during, and after serious gameplay. The chapter uses a framework of traditional data collection methods to identify a core mapping to the serious game literature. The study is broad in that it covers diverse research from numerous disciplines, over a long time frame, that has used a wide range of methods and been driven by different motivations. This diverse body of research was first found by systematically selecting eight relevant literature reviews, published between 2009 and 2014, related to serious game research. To provide depth to the study, each of the study papers identified in these literature reviews (n = 299) is examined in terms of the way data is collected to assess the efficacy and usability of the games.

While the enthusiasm for serious games is unquestioned, the business case for serious games still requires more tangible evidence, both qualitative and quantitative. A first step toward better evidence is a close examination of the data collected from serious games. It is this data that will be processed by any serious game analytics, and that will ultimately demonstrate the worth of the source serious games. This chapter provides a historical review of data collection as a resource for researchers in serious games and human–computer interaction, and for anyone concerned about the collection and accuracy of gameplay data for future analytic purposes. In the discussion section of this chapter, we also reflect upon the question of evidence and how well it relates to the two key issues of efficacy and usability in games that are used for serious purposes.

2 Study Method

To perform our study, we designed a systematic process that could be repeated or amended to accommodate both changes in scope and alternative research questions, and extended to incorporate future literature (see Fig. 2.1).

Fig. 2.1 Overview of the process used in this study

2.1 Data Characterization

The process began with a data characterization activity where we identified the type of data we wished to collect about each individual game study (see 1.1 in Fig. 2.1). We focused on the temporal aspect of data collection during serious game studies, namely how, and if, data is collected before, during, and after serious gameplay (see Fig. 2.2). These stages are relatively standard across the range of methodological approaches used to evaluate games.

Fig. 2.2 Overview of the data characterization used in this study

We decided to categorize the data collected during each of these stages based on common data gathering techniques (see Fig. 2.2). These frequently used techniques are taken from a list provided by Rogers, Sharp, and Preece (2011) in their popular text on human–computer interaction, and include:

  • Interviews

  • Focus Groups

  • Questionnaires

  • Direct observation in the field

  • Direct observation in a controlled environment

  • Indirect observation

Although the first three techniques are self-explanatory, the last three require further clarification. Direct observation consists of observing actual user activity and typically involves the collection of qualitative data by capturing the details of what individuals or groups are doing with a system, for example, with observers taking notes of user behaviors. This can be conducted in the field, where users are interacting in the target environment for a system. Examples could include a normal teaching session in a classroom for an educational system or a walking tour for a mobile application (Rogers et al., 2011, p. 262). Thus, the system is being used in a real-life situation, that is, with high ecological validity (McMahan, Ragan, Leal, Beaton, & Bowman, 2011; Smith, Stibric, & Smithson, 2013). Direct observation in a controlled environment typically takes place in a laboratory, where conditions can be controlled and standardized between participants and sessions. This allows users to focus on a task without interruption. However, results from studies in such environments may not generalize, as the conditions are, by default, artificial.

In contrast to direct observation, where the users can see that they are actively being observed, indirect observation involves gathering data where users are not distracted by the collection mechanism. This could include collecting qualitative data, for example, from a user diary, or quantitative data from automated event logging. The latter is particularly attractive for serious games as event logs can be tailored to collect any pertinent information; for example, task sequences, task completion times, and/or percentage of tasks accomplished. Loh (2009) details a number of logging examples including basic game event logs, After-Action Reports as graphical game logs, and biofeedback data to capture physiological reactions. Such in-process data collection is by its nature objective and can provide substantial volumes of data for further analytic treatment.
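To make the notion of automated event logging more concrete, the following minimal sketch (our own illustration in Python; the event names, record structure, and derived metrics are hypothetical and are not drawn from Loh (2009) or from any reviewed study) shows how a game might record timestamped events and derive two of the metrics mentioned above, task completion times and the percentage of tasks accomplished.

```python
import time
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Minimal in-game event log for indirect observation (hypothetical)."""
    events: list = field(default_factory=list)

    def record(self, event_type, task_id=None):
        # Store a timestamped record of a gameplay event.
        self.events.append({"time": time.time(), "type": event_type, "task": task_id})

    def completion_times(self):
        # Pair each task's start and completion events to derive durations.
        starts = {e["task"]: e["time"] for e in self.events if e["type"] == "task_started"}
        return {e["task"]: e["time"] - starts[e["task"]]
                for e in self.events
                if e["type"] == "task_completed" and e["task"] in starts}

    def percent_completed(self, total_tasks):
        # Percentage of distinct tasks that reached a completion event.
        done = {e["task"] for e in self.events if e["type"] == "task_completed"}
        return 100.0 * len(done) / total_tasks


# Example: log a single task from start to finish.
log = EventLog()
log.record("task_started", task_id="navigate_to_ward")
log.record("task_completed", task_id="navigate_to_ward")
print(log.completion_times(), log.percent_completed(total_tasks=10))
```

In practice, such a log would typically be persisted to files or a telemetry service during gameplay so that the resulting in-process data can be fed into later analytic stages.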

While these techniques cover a good range of the mixed methods used in game research, we recognize that other categorizations could have been adopted. For example, an alternative and more detailed classification of 16 different data collection techniques used in game studies is provided by Mayer et al. (2014). While there is merit in a more complex categorization, we recognized the difficulty of collecting our own data; game studies from various disciplines do not have a standardized approach to describing data collection methods. Since we intended to review a large and broad range of studies, we sought to keep our data classification as simple as possible. Thus, focusing on specific techniques, for example, the use of telemetry or Information Trails (Loh, 2012) for indirect observation of gameplay, is outside the scope of the current review. However, we revisit issues surrounding the data collection process in the discussion section.

2.2 Identify Data Sources (Systematic Review)

We adopted a systematic approach to identifying existing reviews of serious game research across domains (see 1.2 in Fig. 2.1). A systematic review is developed to gather, evaluate, and analyze all the available literature relevant to a particular research question or area of interest, based on a well-defined process (Bearman et al., 2012; González, Rubio, González, & Velthuis, 2010; Kitchenham et al., 2009). The systematic review methodology is extensively used in the healthcare domain (Bearman et al., 2012) and has been widely adopted in other areas including business (González et al., 2010), education (Bearman et al., 2012), and software engineering (Kitchenham et al., 2009; Šmite, Wohlin, Gorschek, & Feldt, 2010).

A systematic review methodology requires the identification of all published works relevant to the requirements. The search strategy adopted covers key term searches in relevant scholarly databases. We included the Web of Science, Scopus, EBSCOhost, and Wiley Interscience bibliographic databases in the search. The search was conducted over article titles to restrict results to the most relevant studies, and included journal articles, book chapters, and review papers in the results. Search results were restricted to papers published between 2009 and 2014 inclusive.

The objective of our systematic review was to identify all review articles of studies using serious games. To conduct a review that met this objective, the search term needed to accommodate two key purposes. The first was to find published works relating to serious games. We expanded the term serious games to include references to studies of games for applied, learning, teaching, or educational purposes (Crookall, 2010). The second was to find review or meta-review articles only, as the basis for “drill-down” to individual studies. We therefore included the terms review, meta-review, or meta-analysis in the search term. Several preliminary searches were conducted to refine the individual and combined search terms and to develop a search string that located articles of interest without too many false positives. The resulting Boolean search string used for the systematic review was:

  • (gam* AND (serious OR edutainment OR "applied gam*" OR learn* OR "game-based learning" OR educat* OR teach*) AND (review OR meta-review OR meta-analysis))
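Purely to illustrate the semantics of this string, rather than the query syntax of any particular bibliographic database, the following sketch (our own, hypothetical code) applies the same three AND-ed clauses to article titles using regular expressions; each clause is an OR over the corresponding wildcard terms.

```python
import re

# Hypothetical title screen mirroring the Boolean search string above.
GAME = re.compile(r"\bgam\w*", re.IGNORECASE)
TOPIC = re.compile(
    r"\b(serious|edutainment|applied gam\w*|learn\w*|game-based learning|educat\w*|teach\w*)",
    re.IGNORECASE,
)
REVIEW = re.compile(r"\b(review|meta-review|meta-analysis)", re.IGNORECASE)


def title_matches(title: str) -> bool:
    # All three clauses must match somewhere in the title (an AND of ORs).
    return all(pattern.search(title) for pattern in (GAME, TOPIC, REVIEW))


print(title_matches("A systematic review of serious games for health"))  # True
print(title_matches("Serious games for rehabilitation"))                  # False: no review term
```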

The search initially produced a total of 126 potential papers, of which 73 were found to be unique. These papers were then manually evaluated by title, abstract, and if necessary, by full text, based on the following inclusion criteria:

  • Focused on the review of studies:

    • Using randomized controlled trials, experimental pretest/posttest control group designs, or quasi-experimental structures

    • Evaluating computer, console or mobile games

    • That were directed at achieving teaching and learning outcomes

  • From any country

  • Written in English

Papers not meeting the inclusion criteria were excluded from the systematic review. The review process is shown in Fig. 2.3 and identified ten papers for the analysis.

Fig. 2.3 Diagram of the selection process for the systematic review

The evaluation process detailed in Fig. 2.3 shows the inclusion of an additional paper that did not appear in the initial 126 papers. Wattanasoontorn, Boada, García, and Sbert’s (2013) comprehensive study of serious games for health was not located using the Boolean search string because the term review does not appear in the article title. It was, however, identified and noted in the preliminary searches that we used to refine the search terms. Wattanasoontorn et al. (2013) include 108 references in the broad health domain in their final review, making this a relevant and comprehensive piece of work for inclusion in our analysis. However, expanding the search string to ensure that this article was located resulted in an unwieldy number of irrelevant search results. More importantly, it resulted in a search term that did not meet the objective of the systematic review, which was to identify reviews or meta-reviews of serious games. Thus, the paper was simply added to the results of the systematic review process.

2.3 Data Collection and Analysis

During the data collection process, we examined in detail the ten high-level literature reviews identified in our systematic review. A description of each of these papers is provided in the next section of the chapter.

Two literature review papers were rejected on closer analysis during the data collection process. Hwang and Wu (2012) analyzed the research status and trends in digital game-based learning (DGBL) from 2001 to 2010. Specifically, they explored (1) whether the number of articles in this area is increasing or decreasing, (2) what the primary learning domains related to DGBL are, (3) whether there is a domain focus shift between the first 5 years (2001–2005) and the second 5 years (2006–2010), and (4) which are the major contributing countries in DGBL research. From an initial set of 4,548 papers, Hwang and Wu selected a total of 137 articles for review. However, their paper does not provide details of the specific 137 articles selected, and we were therefore unable to identify data collection methods from these papers. Thus, we have excluded this review from our analysis. The Blakely, Skirton, Cooper, Allum, and Nelmes (2009) systematic review of educational games in the health sciences was also removed. It provides an analysis of the use of games to support classroom learning in the health sciences based on a review of 16 papers. However, it was deemed to be an earlier subset of the later and more expansive review of serious games for health by Wattanasoontorn et al. (2013).

From the remaining eight literature review papers, we identified 299 referenced studies. Where possible, we then sourced each of the papers and recorded the data collection techniques used pre-game, during gameplay, and post-game for each study. In a few cases, papers could not be sourced or were not in English. These papers were excluded, as were all studies that were only reported in non-peer-reviewed locations such as websites. We also excluded references to demonstrations and papers that only included a critical analysis of literature. Finally, we removed duplicates so that each study was included only once in the analysis. This left a total of 188 referenced studies to be included in the final analysis.
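As an illustration of how such per-study coding might be structured, the following minimal sketch (our own; the type and field names are hypothetical and not part of the original study) records which data gathering techniques were observed in each temporal phase and tallies them per study.

```python
from dataclasses import dataclass, field
from enum import Enum


class Technique(Enum):
    INTERVIEW = "interview"
    FOCUS_GROUP = "focus group"
    QUESTIONNAIRE = "questionnaire"
    TEST = "test"
    DIRECT_OBS_FIELD = "direct observation in the field"
    DIRECT_OBS_CONTROLLED = "direct observation in a controlled environment"
    INDIRECT_OBS = "indirect observation"


@dataclass
class StudyRecord:
    """One coded study: which techniques were used in each temporal phase."""
    citation: str
    source_review: str
    pre_game: set = field(default_factory=set)
    during_gameplay: set = field(default_factory=set)
    post_game: set = field(default_factory=set)

    def technique_count(self) -> int:
        # Total number of distinct technique uses across the three phases.
        return len(self.pre_game) + len(self.during_gameplay) + len(self.post_game)


# Example coding of a hypothetical study.
record = StudyRecord(
    citation="Example, A. (2008)",
    source_review="Connolly et al. (2012)",
    pre_game={Technique.QUESTIONNAIRE, Technique.TEST},
    during_gameplay={Technique.INDIRECT_OBS},
    post_game={Technique.TEST},
)
print(record.technique_count())  # 4
```

A collection of such records is sufficient to reproduce the phase-level and technique-level counts reported in the results section below.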

3 Systematic Review Papers

The eight review papers identified through our systematic process and used for data collection covered both general and domain-specific areas. Four of the papers are reviews of the general serious games area and were not focused on any specific application. Of the remaining four, two focus on studies in the health domain, while the other two focus on medicine and the humanities, respectively. Each of these papers is described below, and the number of studies contributing to our review is identified.

Connolly, Boyle, MacArthur, Hainey, and Boyle (2012) examined the “literature on computer games and serious games in regard to the potential positive impacts of gaming on users aged 14 years or above, especially with respect to learning, skill enhancement and engagement” (p. 1). This review paper focused on identifying empirical evidence and on categorizing games and their impacts and outputs. The majority of reviewed papers come under the serious games for education and training classification. We have reviewed data collection in these papers (n = 70) across Connolly et al.’s categories of affective and motivational outcomes, behavioral change outcomes, knowledge acquisition/content understanding outcomes, motor skill outcomes, perceptual and cognitive skills outcomes, physiological arousal outcomes, and soft skill and social outcomes.

Wattanasoontorn et al. (2013) consider the use of serious games in the health domain. They provide a survey of serious games for health and define a new classification based on serious game, health, and player dimensions. For the serious game dimension, they classify by game purpose and game functionality; for health, by state of disease; and for the player, they consider two dimensions: player/non-player and professional/non-professional. We have used Wattanasoontorn et al.’s (2013) classification and comparison of health games summary (n = 91), which considers the following areas: detection (patient), treatment (patient), rehabilitation (patient), education (patient), health and wellness (non-patient), training for professionals (non-patient), and training for non-professionals (non-patient).

Anderson et al. (2010) describe the use of serious games for cultural heritage; specifically, the use of games to support history teaching and learning and for enhancing museum visits. Their state-of-the-art review includes both a set of case studies and an overview of the methods and techniques used in entertainment games that can potentially be deployed in cultural heritage contexts. Here, we have focused on the former and reviewed data collection as noted in the case studies (n = 5).

Girard, Ecalle, and Magnan (2013) review the results of experimental studies designed to examine the effectiveness of video games and serious games on players’ learning and engagement. They attempted to identify all the experimental studies from 2007 to 2011 that used serious games for training or learning, and assessed their results in terms of both effectiveness and acceptability. Girard et al. (2013) used a two-pass process for article inclusion/exclusion, where the stricter second pass, which considered only randomized controlled trial studies, resulted in only nine articles. Here, we have used the results from their first pass of the literature, which yielded 30 articles (n = 29; we excluded one article written in French) published in scientific journals or in proceedings of conferences and symposia across the fields of cognitive science, psychology, human–computer interaction, education, medicine, and engineering, where training had been performed using serious games or video games.

The systematic review of Graafland, Schraagen, and Schijven (2012) provides a comprehensive analysis of the use of serious games for medical training and surgical skills training. The authors focus on evaluating the validity testing evident in prior serious games research in the area and identify 25 articles through a systematic search process. Of these, 17 included games developed for specific educational purposes and 13 were commercial games evaluated for their usefulness in developing skills relevant to medical personnel. Of the 25 articles identified by Graafland et al. (2012), six were identified as having completed some validation process and none were found to have completed a full validation process. For the purpose of our study, we considered only the articles explicitly identified by Graafland et al. (2012) that appeared in the supplementary information tables (n = 20).

Papastergiou (2009) presents a review of published scientific literature on the use of computer and video games in Health Education (HE) and Physical Education (PE). The aim of the review is to identify the contribution of incorporating electronic games as educational tools into HE and PE programs, to provide a synthesis of empirical evidence on the educational effectiveness of electronic games in HE and PE, and to scope out future research opportunities in this area. Papastergiou (2009) notes that the empirical evidence to support the educational effectiveness of electronic games in HE and PE is limited, but that the findings presented in their review show a positive picture overall. We have reviewed data collection methods in the research articles featured in this review (n = 19).

Vandercruysse, Vandewaetere, and Clarebout (2012) conducted a systematic literature review in which the learning effects of educational games were studied in order to gain more insight into the conditions under which a game may be effective for learning. They noted that although some studies reported positive effects on learning and motivation, these effects were confounded by different learner variables and different context variables across the literature. Their review initially found 998 unique peer-reviewed articles. After restricting these to journal articles reporting (quasi-)experimental research, only 22 articles remained for review. It is these 22 articles that we have included in our data collection review.

Wilson et al. (2009) performed a literature review of 42 identified studies and examined relationships between key design components of games and representative learning outcomes expected from serious games for education. The key design components of fantasy, rules/goals, sensory stimuli, challenge, mystery, and control considered by Wilson et al. (2009) were identified as statistically significant for increasing the “game-like” feel of simulations (Garris & Ahlers, 2001) and key gaming features necessary for learning (Garris, Ahlers, & Driskell, 2002). These were examined in relation to both skill-based and affective learning outcomes. We included all 42 studies in our review.

4 Results

The total number of papers used in this study for data collection was 299. Table 2.1 provides a full list of the references examined, and the literature reviews from which those papers were sourced.

Table 2.1 List and source of all references examined in the data collection process

After examination of the 299 papers, and the filtering described in Sect. 2.3, we explored the data collection techniques described in 188 papers spanning 1981–2012. Eighty-four percent of the 188 papers were from the 10-year period of 2003–2012 (see Table 2.2), and 51 % of the papers were from the mid-region of this period, i.e., 2006–2009. However, this does not necessarily indicate a surge in serious game research; it is more likely a consequence of publication time frames. Although the literature reviews located by our search string were published up to 2014, the published research that they reported on extended only to 2012.

Table 2.2 Number of serious game papers for each year in the 10-year period from 2003 to 2012

In total, 510 data collection techniques were used across the 188 studies. Of these, 33 % of data collection occurred pre-game, 21 % during gameplay, and 46 % in the post-game evaluation phase (see Fig. 2.4). On average, 2.71 data collection methods were used per study (SD = 1.2) across the three phases of pre-game, during gameplay, and post-game.

Fig. 2.4 Number of data collection techniques used per phase of study

In terms of specific techniques for the pre-game phase (n = 169), 52 % of the studies used questionnaires, 42 % of the studies used some form of test, 4 % of the studies used an interview, 2 % of the studies used an indirect observation, while only a single study employed a focus group in the pre-game phase (see Fig. 2.5).

Fig. 2.5 Number of specific data collection techniques used per phase of study

For the post-game phase (n = 235), 46 % of the studies used questionnaires, 37 % of the studies used some form of test, and 13 % of the studies used an interview. A single study used an indirect observation, while 4 % of the studies employed a focus group in the post-game phase (see Fig. 2.5).

In the context of the specific techniques used during gameplay (n = 106), 46 % of the studies used some form of direct observation in a controlled environment, 9 % of the studies used some form of direct observation in the field, 30 % of the studies used an indirect observation method, around 8 % used a test, while two of the studies employed an interview during the gameplay phase of evaluation (see Fig. 2.5).
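As a simple arithmetic cross-check of the figures reported above (our own illustration, using only the counts given in this section), the per-phase totals of 169, 106, and 235 techniques sum to the 510 reported overall and reproduce the quoted phase percentages and the average of 2.71 techniques per study:

```python
# Consistency check of the reported phase totals (figures taken from the text).
total_techniques = 510
studies = 188
phase_counts = {"pre-game": 169, "during gameplay": 106, "post-game": 235}

assert sum(phase_counts.values()) == total_techniques  # 169 + 106 + 235 = 510
shares = {phase: round(100 * n / total_techniques) for phase, n in phase_counts.items()}
print(shares)                                  # {'pre-game': 33, 'during gameplay': 21, 'post-game': 46}
print(round(total_techniques / studies, 2))    # 2.71 techniques per study on average
```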

5 Discussion

5.1 Issues Highlighted Within Our Study Outcomes

The majority of the studies we reviewed used multiple data collection methods (80.3 %, Fig. 2.6). This matters because every technique has weaknesses: surveys and questionnaires are good at getting shallow data from a large number of people but are not good at getting deep, detailed data; participants may try to impress interviewers during interviews (Podsakoff, MacKenzie, Lee, & Podsakoff, 2003); and duration logging systems may not take into account participant thinking time (Lazar, Feng, & Hochheiser, 2010). The studies included in our review were dominated by the use of questionnaires and formal tests (see Fig. 2.5). Also, the majority of data collection occurred post-game, followed by pre-game collection, with the least captured during gameplay. Although these results may have been biased by our own sampling techniques, they may also highlight a need for integrating more focus groups and indirect observation techniques into game evaluations.

Fig. 2.6 Percentage of studies that used multiple data collection techniques

Even allowing for sampling errors, it was notable that objective techniques such as biometrics or psychometrics, as well as newer techniques such as path tracking or crowdsourcing, were largely absent. This reflects the traditional difficulty of capturing gameplay data when such data collection is not part of the game design, and it also highlights opportunities for new data capture approaches oriented towards data analytics. Loh (2012) notes that “… few [commercial] developers would actually be interested in ‘in-process’ data collection unless it somehow contributed to the usability of their games …” and goes on to consider alternative approaches for empirical data gathering via telemetry and psychophysiological measures.

Also, different data collection techniques have inherent biases (Podsakoff et al., 2003). Thus, it is important to consider multiple data collection methods. As Rogers et al. (2011) observe, it is “important not to focus on just one data gathering technique but to use them flexibly and in combination so as to avoid biases” (p. 223). The framework used by Mayer et al. (2014) provides a good examination of the breadth of approaches that can be used in data collection.

Most of the studies collected data to demonstrate the use of serious games as an intervention tool, for instance, to demonstrate the impact of a serious game in an educational setting. Thus, a minimum expectation could be a pretest and a posttest, and it would also be desirable to obtain some in-game data, e.g., a score or duration metric. As seen in Fig. 2.6, only 52.7 % of the studies reviewed used three or more data collection techniques. Exploring this further is outside the scope of this chapter, but it is an important area for future research if serious game evaluations and experimental designs are to be considered demonstrably robust.

Another feature highlighted in the data collection that occurred during gameplay was the scarcity of direct observation in the field (10 %) compared to observations made in a controlled environment such as a computer laboratory (54 %). This is understandable, as research by nature tends to occur in university environments, and controlled environments allow contextual variations associated with data collection to be managed in traditional experimental designs. Again, our own data sampling methods make it difficult to argue the significance of this finding, but it does suggest that experimental serious game research might need to be extended to include more situated case studies and perhaps participatory methods.

Reflecting further on the content of the various studies we encountered, as well as on some of the problems encountered in the data collection process, highlighted a number of generic issues in serious games research. These issues, which we discuss next, include:

  • What data is being collected?

  • When data is being collected?

  • Where data is being collected?

  • Who is involved in data collection?

  • Why data is being collected?

5.2 What Data Is Being Collected?

In our study, we found a tendency to collect certain types of data during the different phases. During the pre-game phase, this data tends to include demographic information such as gender, age, nationality, and culture. It was also common to gather data on previous experience and skills with computers, games, and related technology such as simulations and virtual reality. Less common was the collection of data on participants’ attitudes, their intrinsic and extrinsic motivation, learning and personality styles, and so on. In many cases, both pre- and posttests were used to generate skill or knowledge performance metrics directly related to the intended serious outcomes of the game.

During gameplay, measures tend to focus on issues of performance. Types of data include game metrics such as time to complete, number of errors, or level of progress. Less common were measures that examined player approaches to completing the game and measures of experience such as flow, immersion, presence, and the general affective state of the participant (see, for example, Jennett et al., 2008).

Loh (2011) considers gameplay measures using the analogy of black box and open box approaches in the context of game assessment metrics. Specifically, he defines ex situ data collection, where the game environment is a black box and data is collected without access to internal details. This could be the pre- and post-game collection metrics noted in this chapter (see Fig. 2.5) or psychophysiological measures collected during game sessions. The open box approach supports in situ data collection, for example, log files, game events, or user-generated action data, e.g., Information Trails (Loh, 2012). In contrast to psychophysiological measures, such in situ data would have no external noise, as the data collection occurs within a closed environment. This could be of significant interest for serious game analytic approaches as a way to triangulate data across collection sources, similar to the use of immersidata to collect and index user-player behaviors from gameplay logs and video clips (Marsh, Smith, Yang, & Shahabi, 2006). Also, as noted in the previous section, there have traditionally been difficulties in capturing and using in situ data, as it requires access to the internal processes of a serious game (e.g., to collect telemetry data), and it can be problematic to efficiently process the large volumes of data generated. Both topics are prominent in the other chapters of this book and are a focus of ongoing serious game analytics research.

During post-game evaluations, it was more common to obtain subjective feedback concerning game experience and issues surrounding fun and engagement. This phase was also when measures of player satisfaction with the game, such as clarity, realism, aesthetics, and ease of use, as well as perceived suitability, were usually made.

Other types of data that might be useful to collect within studies include the quality or experience of any facilitators involved, the general context of field studies such as the interaction with others and their roles in the study, and potential organizational impacts such as management structure and culture (Mayer et al., 2014).

Some of the variation in data collection is related to the intention of the studies and whether they are concerned with measuring the efficacy of serious outcomes or the usability and quality of the game itself. Many studies address both efficacy and usability issues, as the two are related. One issue that needs to be considered in relation to what data is collected is player profiling. Its importance is highlighted in one study that used the specialized “Raven’s Advanced Progressive Matrices” to examine the relationship between general cognitive ability and any measured knowledge outcomes from the game (Day, Arthur, & Gettman, 2001). The inference is that underlying individual traits such as cognitive ability might be a good indicator of player performance in learning tasks. This suggests other psychological tests that might assist in measuring player traits such as risk-taking, general personality traits, performance under stress, learning styles, teamwork, and other factors that might be relevant in some applications.

A benefit of adopting these traditional instruments is that they have been validated and are well understood, at least under laboratory conditions. Otherwise, we might also need to question the validity of questionnaires, surveys, and other measuring instruments currently being used in game studies (Boyle, Connolly, & Hainey, 2011; Slater, 2004).

5.3 When Data Is Being Collected?

Our own study identified a variety of data being collected in the pre-game, gameplay, and post-game phases of evaluation. Most of the data collection occurred post-game, while the least occurred during the gameplay phase. Arguably, there is an opportunity to increase the data collection occurring during gameplay to support a better understanding of how specific game elements relate to the intended serious outcomes.

There are also other aspects of timing that should be considered in data collection. This may partly relate to whether the study is intended to measure aspects of the process or purely outcomes (Bowers & Jentsch, 2001). Thus, the relevance of process evaluation versus game efficacy or usability measures may affect when data is collected.

In terms of learning applications, it may also be important to consider interactions with other forms of instruction that occur before, during, or after the game intervention (Van Eck, 2006). This might also apply to the application of serious games for health, where additional treatments may occur in conjunction with game use.

This highlights the issue of deciding when and how often to collect data for evaluating games. Although many studies used mixed methods, data was not necessarily collected over the life of the study. By contrast, in the study by Squire, Giovanetto, Devane, and Durga (2005), games were played over 5 weeks and data was collected over this entire time frame. The time frame of data collection may be influenced by the intent and domain of the study, for example, whether the research is concerned with the direct and immediate influence of playing the game or with its indirect or long-term impact. It is probably important to gain short-term, self-reported, subjective feedback from participants, for example, regarding satisfaction or self-perceived learning, as well as immediate changes to attitudes, skills, or knowledge. Medium- or longer-term data might be required to understand aspects of team or organizational change, especially those related to social issues.

We also found that the timing of outcome measures varied depending on domain. For example, in some learning applications there may be a greater tendency to measure longer term learning factors such as the time required to transfer or regain knowledge (Day et al., 2001; Dennis & Harris, 1998; Parker & Lepper, 1992). This implies testing skills, not just immediately after completing the game, but also at later intervals such as a few days, weeks, or months to measure the permanence of any immediate outcomes and issues of retention and reacquisition of knowledge.

5.4 Where Data Is Being Collected?

There was a tendency for evaluation to occur in controlled rather than field situations. Data was collected in a variety of contexts including primary schools, secondary schools, universities, and industry settings. One potential issue of study context is what else is happening during the study that might impact on outcomes and yet is not necessarily being reported (Van Eck, 2006).

The location of data collection also indirectly raises issues of cost. For example, onsite studies should be fast and efficient to ensure they do not unnecessarily impact on the time or resources of participating partners (Mayer et al., 2014). It also confirms the need for unobtrusive and perhaps covert data collection techniques (Mayer et al., 2014), not just to improve data validity, but to minimize impact on the standard workflow of participants involved in case studies, for example, the use of in situ methods (Loh, 2011). Stakeholders may also need to be persuaded that more extensive contextual data as well as extended longitudinal data gathering needs to occur beyond the obvious and minimal (Mayer et al., 2014).

5.5 Who Is Involved in Data Collection?

In our study, we identified a range of stakeholders involved in projects, including students, teachers, researchers, game developers, and industry partners. All of these stakeholders are candidates to be involved in evaluation. Such evaluations may need to bear in mind influences related to the motivations of stakeholders surrounding the process and its outcomes. For example, the game designer may be keen to measure the aesthetics, the software engineer the usability, the scientist the efficacy, and the manager the cost. All stakeholders may also be keen to find positive outcomes, whether the motivation is publication, ongoing employment, or other personal gain. Thus, it may be worthwhile to consider collecting data on the exact role of the various participants in the project and on any intrinsic and extrinsic motivations of the parties (Mayer et al., 2014).

Marks (2000) highlights the obvious sampling issues in one project where university students were used to evaluate a game intended to teach military staff. It was not clear that effects measured on such a population would transfer to the intended group. By contrast, in another study three different questionnaires were used for pupils (players), parents, and teachers (McFarlane, Sparrowhawk, & Heald, 2002).

While most studies focus on individuals playing games, there is also interest in evaluating the efficacy of learning team-based, rather than individual, skills. Marks (2000), in considering some of the pros and cons of using computer simulations for team research, highlights the need for measuring the longitudinal impact of skills related to teamwork. Data may need to be collected that considers team performance rather than individual performance where games are designed to teach teamwork (Bowers & Jentsch, 2001). While a number of evaluation models exist that focus on the learning of individuals, there has been less attention given to the data required to assess learning in teams or in larger collectives such as organizations and informal networks (Mayer et al., 2014).

5.6 Why Data Is Being Collected?

During our meta-review, we encountered the use of games across a wide variety of domains and, not surprisingly, we found a variety of expectations about the most appropriate methods of data collection and the types of data collected. For example, in one study participants were partly evaluated on the basis of an essay reflecting on their experience of using the game (Adams, 1998). This is in contrast to another learning study that directly measured changes to student knowledge using tests, as well as surveying the students and seeking feedback from external stakeholders such as parents and teachers (Crown, 2001).

While it is easy to understand the reasons for such differences, the variations make it harder to compare and contrast results from different game studies. The usefulness to the serious game community of adopting standardized testing approaches that allow for comparison has been highlighted previously (Blunt, 2007; Mayer et al., 2014).

Despite some good work in the area of relating game design features to serious outcomes (Wilson et al., 2009), most studies focus on collecting data to support the message of efficacy rather than data that helps explain why and how games are effective, or indeed how to apply design rules that lead to the required efficacy (Garris & Ahlers, 2001; Van Eck, 2006).

6 Conclusions

A complication of data collection for games is that not all games are created equal (Loh, 2009). Van Eck (2006) makes the key point that any taxonomy of games is as complex as learning taxonomies. While not all games are the same, the situation is complicated by the overlap of simulations, virtual reality, and partial gamification of traditional approaches. There is also wide variety in the types of games being used in studies. Some are small in scope and custom built by individuals while others are constructed in multi-discipline projects that involve discipline experts and professional game developers. Other studies simply make use of off-the-shelf games. This range of projects means that the data collection techniques need to be flexible.

In this study, we developed a review process for performing a meta-analysis of the data collection techniques used in serious game research. We found that while many studies used a variety of methods, these were not necessarily intended to triangulate findings. The number of data collection techniques also varied considerably, with a number of studies using only a single measure. Our study also highlighted a number of variations and consequently raised questions around what data is being collected, when it is being collected, where it is being collected, who is involved in its collection, and why it is being collected.

Our systematic review approach identified a number of significant literature reviews that allowed us to examine data collection processes across broad domain and temporal spaces. However, not all bibliographic sources were included in the search parameters and thus relevant literature has potentially been missed. Despite this, the results provide a representative sample of serious game research that allows us to draw valid conclusions about approaches and issues in data collection.

In summary, the data collected for serious game research is broad in scope, measuring both targeted performance skills and behavioral factors related to both process and outcomes. For example, the data may be designed to measure changes in knowledge, attitudes, skills, or behavior. The data collected can also be multi-level in scope, designed to measure fine-grained individual skills or large-scale organizational attitudes. Data is collected using a wide range of objective and subjective methods that may fall across a range of longitudinal scales. The data collection techniques currently used might align more with discipline traditions than with the intentions of the evaluations. Even though single studies often incorporate a variety of techniques, the data is not necessarily triangulated as might be expected in a true mixed methods approach. The review also found that the majority of data collection occurred post-game, then pre-game, and finally during gameplay. This perhaps reflects the traditional difficulties of capturing gameplay data and highlights opportunities for new data capture (i.e., in situ collection) and analysis approaches oriented towards data analytics. We suggest that more standardized and better-validated data collection techniques, allowing outcomes to be compared and contrasted between studies, would be beneficial to the broader serious games community and specifically to those interested in serious game analytics.