1 Introduction

Since the launch of touch-screen tablets, we have seen enormous growth in the industry of mobile applications (apps) (Kluver 2016a; Papadakis and Kalogiannakis 2017). A mobile app is a software application designed to run on mobile devices such as smartphones and tablet computers (Wikipedia 2016). Preschool educators have embraced the appeal of apps (Chiong 2012) and, as a result, apps are rapidly emerging as a new medium for providing educational content to preschool-age children (Chiong 2012; Lee and Cherner 2015; Papadakis et al. 2016a, 2016b; Shoukry et al. 2014). In Apple's and Google's app stores, educational apps for young children are among the categories of apps most frequently accessed or purchased by users (Avtar 2014; Apple 2016; Bouck et al. 2016; Cardenal and Lopez 2015; Hutchison et al. 2012).

Although there are now hundreds of media outlets that review apps - websites, blogs, podcasts, and even print media (Big Ideas Machine 2015) - the majority do not use sufficiently scientific criteria. Preschool teachers have only a limited number of reliable tools for evaluating the appropriateness of these apps for children's cognitive development (Emeeyou 2012; Kucirkova et al. 2014). There is a pressing need for an evaluation rubric that examines all aspects of educational apps (Lee and Cherner 2015) for preschool-age children.

Using recent field research as a theoretical framework, we present in this article the process of creating a rubric, as well as the rubric itself, which teachers can use to evaluate educational apps targeted at preschool-age children.

1.1 Self-proclaimed educational apps

Although software companies currently produce a plethora of educationally relevant apps, many of which are claimed to be suited to preschool learning (Richter 2015), for the majority this claim is misleading and there is no evidence of any learning value (Papadakis and Kalogiannakis 2017). Many educational apps for preschoolers serve simply to entertain children (Higgins et al. 2005; Papadakis et al. 2016b). It is easy for developers to promote apps as educational even when they may have no educational value for children at all (Guernsey et al. 2012; Kluver 2016a, 2016b; Watlington 2011). For example, most of the apps in Apple's store address very basic literacy skills, such as letters, phonics and word recognition (Kluver 2016a; Kucirkova 2014a, 2016a). Far fewer apps address more advanced early reading skills such as comprehension and grammar (Guernsey et al. 2012). Although many software developers know it is important to consult with educational experts, much software is still developed without consideration of key educational factors that may affect learning (Geisert and Futrell 1995).

The majority of apps in today's marketplace can be considered part of the "first wave" of the digital revolution (Hirsh-Pasek et al. 2015). Many of these educational apps (in the form of digital worksheets, games, and puzzles) have interactive yet repetitive game formats with "closed" content, that is, content that cannot be changed or extended by the user (Flewitt et al. 2014). These games are primarily gamified literacy and numeracy apps (Papadakis et al. 2016c), with content presented as a series of interactive tasks, the completion of which is recognized and rewarded with animated multimedia tokens of achievement (Lynch and Redpath 2012). Drill and practice may foster rote learning of facts, but it is not likely to promote deeper conceptual understanding (Hirsh-Pasek et al. 2015). Apps of poor or doubtful educational value are encountered both among free apps and among most full, paid-for versions (Falloon 2013).

Although educators have tried seeking information about apps of interest across app stores, expert review sites, and producers' websites, this kind of research is neither sufficient nor reliable (Vaala et al. 2015). The reason is that little information on the quality of apps is available beyond the star ratings published on retailers' web pages or digital stores (Stoyanov et al. 2015) and reviewer comments (Bouck et al. 2016). App reviews are subjective by nature and may come from unreliable sources (Bentrop 2014). Only half of the popular paid, free, and award-winning apps in Apple's store provide information about their development teams (Vaala et al. 2015).

Preschool educators need support in identifying quality apps or they risk wasting their time with inferior or inappropriate apps (Cooper 2012; Lee and Cherner 2015; Papadakis et al. 2016b). There is no concrete mechanism for them to determine whether an educational app is developmentally appropriate or not (Shuler 2009a), as studies have repeatedly shown that educators lack the knowledge to evaluate the appropriateness of educational apps for their students (Emeeyou 2012; Hirsh-Pasek et al. 2015; McManis and Parks 2011; Vincent 2012; Papadakis and Kalogiannakis 2017; PRWeb 2012; Watlington 2011; Zaranis et al. 2013). Despite the fact that educators have long been given guidance on which books can help children learn, no such help is on offer when it comes to apps (Kucirkova 2014b). Some educators are advanced and knowledgeable technology users themselves, but this does not mean that they necessarily understand the full implications of ICT products and services when used by young children, and this lack of understanding may, in turn, hinder their effectiveness when educating their students (Ebbeck et al. 2016).

1.2 A literature review of rubrics and frameworks for assessing and developing educational apps

Over the past few decades, there have been various attempts to evaluate the appropriateness of educational software targeted at young children (Lee and Cherner 2015). Yet rubrics that focus specifically on educational apps for preschoolers are extremely limited. One problem is that the evaluation criteria used in the majority of these rubrics are not clearly linked to previously conducted research, nor are the rubrics' evaluative dimensions clearly defined (Lee and Cherner 2015). For example, Reeves and Harmon's Systematic Evaluation of Computer-Based Education is widely recognized and has been adopted by many researchers as a foundation for rubrics developed over the last two decades (Reeves and Harmon 1994). However, Reeves and Harmon's model does not address some of the new functionalities that smart mobile technology offers, such as sharing capability, cross-platform integration, and the ability to save progress. Another example is the Developmental Software Scale (DSS) (Haugland 1999), a popular tool for evaluating developmentally appropriate computer software for preschool-age children. Although Haugland's DSS includes several evaluation criteria, such as the age appropriateness of an app and adjustable difficulty levels, these underline the importance of design without emphasizing personalized learning or curriculum content compatibility (Chau 2014).

McManis and Parks (2011) created the Early Childhood Educational Technology Evaluation Toolkit, integrating the technology framework principles for preschool education expressed in the joint position statement of the National Association for the Education of Young Children (NAEYC) and the Fred Rogers Center (NAEYC and FRC 2012). Their rating form included 20 questions in Likert-scale questionnaire format, in an attempt to investigate whether an application is educational, age appropriate, child-friendly, and pleasant to use. However, as Chau (2014) states, McManis and Parks' (2011) toolkit, as well as Haugland's rating scale (Haugland 1999), mostly deals with the evaluation of software use in a classroom environment and lacks the specialization required for evaluating a new form of technology such as the tablet. McManis and Parks' (2011) assessment tool also aims to evaluate the appropriateness of the content rather than design features, such as the use of visual and acoustic elements to enhance children's learning. Based on several studies, Chau (2014) created a rating scale, known as the Developmentally Appropriate App Design Evaluation Form, that provides a framework for developmentally appropriate design practices for children's educational apps. This framework comprises four basic design principles: interaction, visual, acoustic, and instructional design.

Shoukry et al. (2012) created an evaluation framework in an attempt to investigate the effectiveness and suitability of educational games. The assessment framework they designed consisted of 15 categories, such as screen design, navigation and application control, ease of use, application design, and content presentation, as well as security, accessibility, usability, and cost issues. Hirsh-Pasek et al. (2015) offer a way to define the potential educational impact of current and future apps belonging to the first and second "waves" of educational apps. Their aim is primarily to guide researchers, educators, and designers in evidence-based app development and, secondly, to create an evidence-based guide that sets a new standard for evaluating and selecting the most effective existing educational apps for children, compatible with the second wave of app development.

Rodríguez-Arancón et al. (2013) created a rating scale which gives equal weight to educational and technical criteria. In his assessment toolbox, Vincent (2012) places great importance on five criteria: relevance, customization, feedback, thinking skills, and participation and sharing. Walker (2010) is considered a pioneer in mobile application evaluation, as his rubric has been used as a template for subsequent scales by other researchers. Walker's rating scale is based on six criteria, including connection with the curriculum, authenticity, and user friendliness. As Walker states, his rubric, which was mainly created for the evaluation of apps for iPod-type devices, can determine whether an application is associated with a targeted skill or a curriculum concept (Walker 2010). Buckler (2012) created a rubric with six dimensions to evaluate apps for people with special needs, covering domains such as feedback, adjustability, ease of use, and cost (Lee and Cherner 2015). Another interesting app evaluation tool has been developed by the website YogiPlay, a mobile learning apps service designed specifically for children aged 3 to 8 (PRWeb 2012). The YogiPlay evaluation system is based on the following key evaluation axes: design quality, ease of use, learning activities, and content structure. There are also rating systems by Children's Technology Review, Common Sense Media, and a handful of parent-oriented app services; although these rating systems have not been scientifically evaluated, they are widely used in the field (Hirsh-Pasek et al. 2015). Stoyanov et al. (2015), after conducting a survey of the literature (taking into account explicit web or app quality rating criteria published between January 2000 and January 2013), formulated their assessment criteria for the development of the Mobile App Rating Scale (MARS). The MARS creators claim that it is a simple, objective, and reliable tool for classifying and assessing the quality of mobile apps. Additionally, it can be used as a checklist for the design and development of high-quality apps in terms of customization, interactivity, functionality, aesthetics, and so on.

Lee and Cherner (2015) designed one of the most comprehensive rubrics, with 24 evaluative dimensions tailored specifically to analyse the educational potential of instructional apps, drawing on previously published research. The 24 dimensions are categorized into three domains: (A) Instruction, (B) Design, and (C) Engagement. Although this rubric is designed to be as comprehensive and inclusive as possible, it has limitations. One of the main limitations, in our opinion, is that individuals must be able to label apps as skills-based, content-based, or function-based depending on their purpose, because instructional apps will score differently on the rubric depending on their design.

Most recently, Kucirkova (2016b) published a framework (iRPD) in an attempt to bring together teachers, researchers and designers in one space specifically focused on iPad apps. The iRPD framework aims to provide the community with thinking tools (five principles) in order to enrich traditional design-based research with the novel affordances of twenty-first century technologies. The five principles are triple collaboration, shared epistemology, interconnected social factors, awareness of app affordances, and a child-centred pedagogy (Kucirkova 2016b). In an attempt to make educators aware of the various tools available to support app evaluation, Bouck et al. (2016) present five different rubrics in their research paper (More and Travers 2013; Ok et al. 2015; Tammaro and Jerome 2012; Walker 2010; Weng and Taber-Doughty 2015). For educators considering apps explicitly for students with learning disabilities, the researchers propose the rubric by Ok et al. (2015).

2 Method

2.1 Draft rubric structure

The initial construction of the rubric (criteria, levels of quality gradation, scoring strategy) was based on the following axes:

  • Educational research. Early evidence indicates that children can learn from well-designed educational apps (as mentioned in the literature review). Children learn best when they are cognitively active and engaged, when learning experiences are meaningful and socially interactive, and when learning is guided by a specific goal (Hirsh-Pasek et al. 2015). Educational apps must provide unprecedented opportunities for children to create their own content and participate in rich and dynamic learning contexts (Kucirkova 2014c, 2015).

  • App Rating Scales. Within the past five years, various app assessment tools have been developed (as mentioned in the literature review). The majority of rubrics lack the specialization required for the evaluation of this new form of technology for pre-schoolers, or they do not emphasize all the features required for an educational app (see Table 1).

  • Evaluation by the researchers. The researchers evaluated apps that were available on Google Play at the time of the research, had Greek content, and targeted preschoolers.

Table 1 Relevant literature and existing app rating scales related to rubric dimensions

More specifically, the procedure followed by the researchers for the evaluation of the apps was as follows: the researchers first visited Google's online store (Google Play) and, based on the categorization of the apps and the use of keywords (such as preschool and/or game, education, children, etc.), identified the free apps that targeted preschool-age children. They randomly chose 20 of these educational apps and installed them on two different types of smart device (a smartphone and a tablet). Subsequently, the researchers played each application separately until they reached the end. As in other studies, this phase focused on the technical and design specifications of the apps, as well as their pedagogic goals, in a more general sense. No in-depth methodological analysis of any particular app was therefore intended at this stage (Rodríguez-Arancón et al. 2013).

The researchers noted that the majority of applications were in edutainment format (matching cards, jigsaw puzzles). Learning goals were pursued through drill-and-practice exercises. Some apps took the form of interactive electronic books ("Read to Me" type). Questions were presented to children mainly in multiple-choice or closed formats, and children could learn by trial and error. All apps drew on low-level thinking skills and did nothing more than promote rote learning instead of meaningful learning. They also tended to evaluate knowledge rather than introduce new concepts. None of them encouraged exploration, experimentation, problem solving, or creative thinking. The apps targeted literacy (70%) (alphabet learning: letter and sound recognition) and mathematics (30%) (numbers and basic operations).

As regards the app designs, the researchers found that the majority were not created according to best design practices (Sesame Workshop 2012). In brief, no app included an on-screen character; multimedia elements (audio, image) were of low quality; there was no provision for "palm rest"; app interfaces were poorly designed; interactive hotspots were used ineffectively; the apps could not be configured; and there was no levelling (e.g. at least three levels: easy, medium and difficult). In several apps, the researchers found in-app purchases, subscriptions, and advertising. No app monitored preschoolers' progress or offered a portfolio system, content sharing, or synchronous play.

2.2 Rubric creation process

2.2.1 First stage

Researchers worldwide have set various standards for the construction of rubrics, i.e. scales with classified criteria (Allen and Tanner 2006; Moskal 2000; Roblyer and Wiencke 2003). More specifically, a rubric must:

  • Include an appropriate and optimal number of criteria. When the number of criteria is very large, the rubric becomes dysfunctional, whereas when the number of criteria is too small, the rubric does not provide sufficient information regarding the evaluated object.

  • Have operational criteria and performance-level descriptions. The rating scale should range from one as the worst performance to four as the best performance.

  • Contain performance-level descriptions that are as informative as possible.

Considering the relevant literature and based on the findings of the app evaluations, the researchers decided to create a rubric entitled “Rubric for the EValuation of Educational Apps for preschool Children” (REVEAC) in order to evaluate the quality of educational apps for preschoolers in the following four key areas: educational content, design, functionality and technical characteristics.

After defining the concepts to be measured, the aim in this first stage was to create a list of items and determine the format of measurement. A few key considerations about creating this rubric need to be clarified. First, the evaluative dimensions together comprise the entire rubric, and the dimensions discussed in the following sections refer back to it. Second, each evaluative dimension was designed to follow a consistent format: a prompt that focuses the dimension on a central question, and four indicator descriptors that describe the ways in which an app's content, functionality or design may respond to that prompt. Table 2 illustrates the first row of the rubric. We proceeded by first filling in the cell corresponding to the maximum rating, i.e. 4 (exemplary), in which all the sub-criteria were fulfilled, and then gradually decreasing the number of fulfilled sub-criteria as we moved down the scale, until the minimum rating, i.e. 1 (unsatisfactory/poor), was reached, where none of the sub-criteria was fulfilled.

Table 2 Learning provision sub-criterion in the educational app evaluation rubric
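To make this format concrete, the following minimal sketch (in Python) shows one way a single REVEAC dimension, with its prompt and four indicator descriptors, could be represented. The dimension name is taken from Table 2, but the prompt and descriptor wording below are hypothetical placeholders, not the actual rubric text.

```python
# Illustrative sketch of one rubric dimension: a prompt plus four indicator
# descriptors (4 = exemplary ... 1 = unsatisfactory/poor). The wording of the
# prompt and descriptors is hypothetical; see Table 2 and Appendix Table 5
# for the actual rubric content.
from dataclasses import dataclass, field

@dataclass
class RubricDimension:
    section: str                  # e.g. "Educational content"
    name: str                     # e.g. "Learning provision"
    prompt: str                   # the central question the dimension asks
    descriptors: dict = field(default_factory=dict)  # rating (1-4) -> description

learning_provision = RubricDimension(
    section="Educational content",
    name="Learning provision",
    prompt="Does the app promote meaningful learning rather than rote drill?",
    descriptors={
        4: "Exemplary: all sub-criteria are fulfilled",
        3: "Good: most sub-criteria are fulfilled",
        2: "Fair: only some sub-criteria are fulfilled",
        1: "Unsatisfactory/poor: none of the sub-criteria is fulfilled",
    },
)

print(learning_provision.prompt)
print(learning_provision.descriptors[4])
```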

After the first draft of the app evaluation rubric was created, the scale was distributed to a sample consisting of preschool teachers and undergraduate students in the department of preschool education. Prior to this, the authors received Institutional Review Board (IRB) approval. The researchers wanted to check whether the statements of the scale created interpretation problems for people who were not familiar with the use of educational apps for preschoolers. After semi-structured interviews and first-stage data analysis, the structure of some sentences, as well as some terms or expressions originally used in the scale, were modified, merged, or removed in line with the sample's observations. For example, the preschool teachers proposed that when selecting educational software, the content and features should be appropriate not only for the targeted student's chronological age, but also for the student's developmental age, and suggested merging two subsectors into one (Age and Learning package into Knowledge package appropriateness). In summary, nine subsectors were merged or removed (Educational content sector: Age and learning level, Critical Thinking, Multiple Learning Styles Support, Evaluation Existence; Design sector: Scenery, Multimedia Elements Usage; Functionality sector: Interactivity; Technical Characteristics sector: Update mode, Compatibility). Two subsectors (Electronic Transactions, Advertisements) were moved from the Functionality sector and merged into the Technical Characteristics sector (see Fig. 1).

Fig. 1 The axes and sub-axes of the rubric before (a) and after (b) the completion of the first phase

Based on this procedure, the draft rubric was redesigned so as to facilitate the app evaluation process. The rubric’s sub-axes were as follows:

  • The educational content section consists of seven subsectors: knowledge package appropriateness, learning provision, levelling, motivation/engagement, error correction/feedback provision, progress monitoring/sharing, and bias free.

  • The design section consists of four subsectors: graphics, sound, layout/scenery and app/menu design.

  • The functionality section consists of four subsectors: child friendliness, autonomy, instructions and customization.

  • The technical characteristics section consists of three subsectors: performance and reliability, advertising / electronic transactions and social interactions.

The REVEAC is scored by calculating the scores of the subscales and an overall app total score.
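As a rough illustration of this scoring scheme, the sketch below (Python) computes subscale scores and a total for one app. It assumes, purely for illustration, that each subscale score is the mean of its 1-4 item ratings and that the total is the mean of the four subscale scores; the exact aggregation is defined by the rubric itself rather than by this sketch.

```python
# A minimal scoring sketch, assuming (for illustration only) that each subscale
# score is the mean of its 1-4 item ratings and that the REVEAC total is the
# mean of the four subscale scores.
from statistics import mean

SUBSCALES = {
    "Educational content": [
        "knowledge package appropriateness", "learning provision", "levelling",
        "motivation/engagement", "error correction/feedback provision",
        "progress monitoring/sharing", "bias free",
    ],
    "Design": ["graphics", "sound", "layout/scenery", "app/menu design"],
    "Functionality": ["child friendliness", "autonomy", "instructions", "customization"],
    "Technical characteristics": [
        "performance and reliability", "advertising/electronic transactions",
        "social interactions",
    ],
}

def score_app(item_ratings: dict) -> dict:
    """item_ratings maps every sub-criterion name to a 1-4 rating for one app."""
    scores = {section: mean(item_ratings[item] for item in items)
              for section, items in SUBSCALES.items()}
    scores["REVEAC total"] = mean(scores.values())
    return scores

# Example: an app rated 3 on every sub-criterion scores 3.0 overall.
print(score_app({item: 3 for items in SUBSCALES.values() for item in items}))
```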

2.2.2 Second stage

A rubric should be shown to have sufficient validity and reliability to establish its usefulness in clarifying expected performance (Roblyer and Wiencke 2003). According to Taggart et al. (1999), rubric content validity can be improved by involving experts in its development. After the rubric was drafted, two experts (experienced instructional technology professionals) were asked to evaluate five different apps using the evaluation tool created by the researchers in the first stage, so that inter-rater reliability could be assessed. The experts were provided with a form on which to complete ratings for each app and then answered a few short-answer questions about the rubric. Their feedback was used to improve the clarity and comprehensiveness of the rubric elements (Roblyer and Wiencke 2003).

Modifications were then made to the rubric based upon the experts' suggestions (wording and relevance). The apps that were evaluated were selected randomly by the researchers. They were listed in Google's digital store at the time of the research (April 2016), were available in the Greek language, targeted preschool-age children, and were free; they included the following: Play & Learn – Kindergarten, Kids ABC Numbers & Colors free, Greek Kindergarten Lite, Learn the numbers, Infant Tasks.

After the revised rubric was finalized, it was presented to the experts again to ensure that they understood each of the rubric’s dimensions and indicators. After reviewing the revised rubric, all the experts confirmed that its clarity had increased and that they understood each of the rubric’s dimensions and indicators. The inter-rater reliability between the two raters was analysed using Spearman’s correlations (see Table 3). Data were analysed with IBM SPSS Statistics version 23.

Table 3 Mean, Standard Deviation, Inter-Rater Reliability and Convergent Validity

The scale had a high level of internal consistency (Cronbach alpha = .79) and the average inter-rater reliability was rs(5) = .72, p < .01. Progress monitoring/sharing, rs(5) = .91, p < .01, Error correction /feedback provision, rs(22) = .92, p < .01, and Levelling, rs(5) = .82, p < .01, had the strongest correlations. All other correlations between individual items ranged from .58 to .78 and were large (Shrout and Fleiss 1979).
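For readers who wish to reproduce these statistics outside SPSS, the sketch below shows how the two reported quantities, per-item Spearman correlations between the two raters and Cronbach's alpha, can be computed in Python; the rating matrices are invented example data, not the study's data.

```python
# Reliability sketch with invented example data (the study's analysis was run
# in IBM SPSS Statistics 23): per-item Spearman inter-rater correlations and
# Cronbach's alpha for internal consistency.
import numpy as np
from scipy.stats import spearmanr

# rows = the five apps, columns = rubric items, values = 1-4 ratings
rater_a = np.array([[4, 3, 2, 4], [3, 3, 2, 3], [2, 1, 1, 2], [4, 4, 3, 4], [1, 2, 1, 1]])
rater_b = np.array([[4, 3, 3, 4], [3, 2, 2, 3], [2, 2, 1, 2], [4, 4, 4, 4], [1, 1, 1, 2]])

# Inter-rater reliability: Spearman correlation per item across the five apps
for item in range(rater_a.shape[1]):
    rs, p = spearmanr(rater_a[:, item], rater_b[:, item])
    print(f"item {item}: rs = {rs:.2f}, p = {p:.3f}")

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (observations x items) rating matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

print(f"Cronbach alpha = {cronbach_alpha(rater_a):.2f}")
```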

We analysed convergent validity using Pearson correlations. Each of the scores assigned on the rubric, as well as an overall average score, was compared to the scores given to the apps by the researchers. The researchers rated the apps using the rubric created by Lee and Cherner (2015), which is unrelated to the rubric presented here. They chose this rubric because they found it to be the most comprehensive rubric tailored specifically to analysing the educational potential of instructional apps. Results show that, overall, the rubric correlated with these scores with a mean of r(5) = .52, p < .01. Individual items varied, but Content appropriateness, r(5) = .74, p < .01, Motivation, r(5) = .63, p < .01, Error correction/feedback provision, r(5) = .71, p < .01, Progress monitoring/sharing, r(5) = .65, p < .01, Bias free, r(5) = .67, p < .01, Layout/scenery, r(5) = .61, p < .01, Child-friendliness, r(5) = .65, p < .01, and Social interactions, r(5) = .67, p < .01, yielded significant correlations. All other individual item correlations were not significant.
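A corresponding convergent-validity check can be sketched in a few lines: the Pearson correlation between the scores the apps receive on a REVEAC dimension and the scores the same apps receive on the Lee and Cherner (2015) rubric. The numbers below are hypothetical.

```python
# Convergent-validity sketch with hypothetical values: Pearson correlation
# between REVEAC scores and scores from the Lee and Cherner (2015) rubric.
from scipy.stats import pearsonr

reveac_scores = [3.2, 2.1, 3.8, 1.5, 2.9]        # five apps scored on a REVEAC dimension
lee_cherner_scores = [3.0, 2.4, 3.6, 1.8, 2.7]   # the same apps on the comparison rubric

r, p = pearsonr(reveac_scores, lee_cherner_scores)
print(f"r = {r:.2f}, p = {p:.3f}")
```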

2.2.3 Third stage

For this stage, the researchers chose a new sample of five apps (Listen, find, learn; Place in order of size; Find the different thing; Write the word; Days, months, seasons), using the same criteria as in the second stage, and installed them on the same type of mobile device (a 10-inch tablet running the latest Android OS at the time, Lollipop). The researchers then recruited ten raters, five of whom were preschool educators and five undergraduate students from the department of preschool education. Participation in the rubric evaluation was voluntary. The completed evaluations were anonymous, with no personal information or identifying factors.

Independent ratings on the overall REVEAC total score of the 5 apps demonstrated an excellent level of inter-rater reliability (2-way mixed ICC = .81, 95% CI 0.79–0.87) (Landers 2015). The REVEAC total score had excellent internal consistency (Cronbach alpha = .85). Internal consistencies of the REVEAC subscales were also very high (Cronbach alpha = .80–.89, median .85), and their inter-rater reliabilities were fair to excellent (ICC = .50–.80, median .65). Detailed item and subscale statistics are presented in Table 4.

Table 4 Results of the analysis of the REVEAC items
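The intraclass correlation reported above can be reproduced along the following lines. The sketch assumes the third-party pingouin package and uses invented ratings in long format (app, rater, total score), with two raters shown for brevity rather than the ten used in the study.

```python
# ICC sketch with invented ratings, assuming the pingouin package is installed
# (pip install pingouin). ICC3k is the two-way mixed, average-measures,
# consistency model; the study's analysis was run in IBM SPSS Statistics 23.
import pandas as pd
import pingouin as pg

ratings = pd.DataFrame({
    "app":   ["A", "A", "B", "B", "C", "C", "D", "D", "E", "E"],
    "rater": ["r1", "r2"] * 5,
    "total": [3.1, 3.3, 2.0, 2.2, 3.8, 3.6, 1.4, 1.7, 2.6, 2.9],
})

icc = pg.intraclass_corr(data=ratings, targets="app", raters="rater", ratings="total")
print(icc.loc[icc["Type"] == "ICC3k", ["ICC", "CI95%"]])
```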

Correlations were also computed between the rubric rating scores and the sample's experience in using mobile devices and educational apps, to determine whether the sample's evaluation results were related to their experience as users of mobile devices and/or educational apps. Similarly, correlations were computed between the evaluation results and the sample's preference for mobile devices, gender, etc. No significant correlations were found for any of these variables. Although the number of apps evaluated so far is still too small to measure the raters' agreement statistically, the results shown in Table 4 do indicate consistency between the evaluators and therefore allow us to be optimistic as to the usability of the rubric. The REVEAC is presented in Appendix Table 5.

Table 5 Rubric for the EValuation of Educational Apps for preschool Children (REVEAC)

3 Discussion and conclusions

We cannot insulate children from technology, but we need to ensure that they are not harmed in any way by it (Ebbeck et al. 2016). As Parette et al. (2010, p. 2) state, "with regard to specific technology applications available to young children, the issue is not, however, whether technology should be considered and used in education settings but how and whether it makes a difference in children's learning and development".

3.1 Implications of this study for preschool teachers

While electronic toys and video games were introduced to children as early as 1972, the release of the iPad in 2010 precipitated a dramatic new shift toward digital activities by young children (Huber et al. 2016). Applications on touchscreen devices have the potential to revolutionize early childhood education and to help build a stronger base for lifelong learning in the twenty-first century (Papadakis and Kalogiannakis 2010). However, this momentum needs to be exploited systematically, since it is not certain that merely putting digital devices and apps in the hands of young children will assist them in the various aspects of their development. The literature review shows that only fragmented efforts have been made to create principles applicable to the construction of mobile educational apps for preschool-age children. There is an immediate need for apps that go beyond the "skill and drill" examples that currently dominate a great part of the preschool mobile educational market. Only if educational researchers collaborate closely with app producers and educators can this become a source of inspiration for improved app designs and a novel way of implementing research insights in practice (Kucirkova 2016b).

The aim of this study was to develop an educational preschool app evaluation instrument (REVEAC) that incorporates research-based evidence and practitioner expertise. We constructed this assessment rubric under the guidelines of previous papers on rubric development and standardization, but also reflecting the fact that apps are very much dependent on technology and should therefore not be evaluated from an exclusively pedagogical perspective. We believe that the rubric we propose in this paper can serve as a comprehensive tool, helping preschool teachers select high-quality mobile educational apps. It can also be a useful tool in the hands of software developers and designers, providing a flexible framework and offering guidelines for designing developmentally appropriate educational applications for preschool-age children.

3.2 Limitations of the study and future research directions

Although this rubric was designed to be as comprehensive as possible, following the latest developments in the pedagogical and technological fields, there are several limitations to its use. One limitation of the study concerns inter-rater reliability, as the raters coded each app only once. Another limitation is the relatively small sample and the type of apps (only free, only in the Greek language) analysed for convergent validity, owing to the restrictions we had set for the selection of the sample (apps). Both experts asked that an N/A (not applicable) option be included on some rubric subscales, as some apps by design are not assessable on all dimensions. On the other hand, an N/A option could be problematic, as users would find it difficult to determine when a lack of information within an app should be rated as N/A rather than as a flaw. In a revised version of the rubric we will have to consider including an N/A option on some rubric subscales. The REVEAC was piloted on Android rather than iOS apps. Although the scale has been partially applied to multiple iOS apps and no compatibility issues were encountered, future research should explore the reliability of the REVEAC with iOS apps, both free and paid. A further interesting analysis would be the correlation between the iTunes or Google Play star ratings and the total rubric score.

As Lee and Cherner (2015) state, creating a single rubric to evaluate all varieties of educational apps is not possible. Considering this, it is necessary to create comprehensive rubrics for the evaluation of different types of educational apps. On the other hand, we believe that the rubric presented here may already be too comprehensive for some educators, and this limitation will probably affect how the rubric can be used, especially outside the formal school environment. With this in mind, it is necessary to create two rubrics adapted for different audiences: parents and preschool teachers. In the future, we intend to create a rubric primarily aimed at parents of preschool-age children.