In the past few years, audience response technology (ART) has been widely adopted on college campuses. This type of instructional technology, also referred to as “audience feedback” or “clicker” technology, as well as by a variety of brand names (e.g., the Classroom Performance System), has become especially popular among instructors of large lecture classes. Current ART packages with coordinated hardware and software allow instructors to ask varied types of questions, obtain immediate responses from students via their response devices (“clickers” or “remote controls”), and display the pattern of answers in a tabular or graphic format that preserves individual anonymity. Virtually any class size can be accommodated.

Proponents of ART have asserted that the technology improves student engagement and learning (Ward 2003). Such claims, coupled with the desire to improve teaching effectiveness, have prompted many college instructors to incorporate ART into their classes, especially for large lectures in which student involvement and learning outcomes can be less than ideal. To date, studies of ART suggest that students do respond favorably to the technology. However, the number of studies is still relatively small. Further, these studies have limitations that recommend caution in generalizing their findings to undergraduate students in large lecture classes. Such limitations include studies of atypical students and classes, evaluations conducted after limited use of the technology, and assessment on a limited number of evaluative dimensions (e.g., Rice and Bunz 2006; Fitch 2004; Latessa and Mouw 2005). Thus, it is important that student experiences with ART receive more empirical evaluation.

The current study was designed to respond to the limitations of prior research on ART by conducting a multi-dimensional evaluation of the technology by undergraduate students who used it in large lecture university courses over the duration of a semester. Importantly, the study not only provides a broad-scale evaluation (with the help of a questionnaire designed for this purpose) and examines change in student response over the course of the semester, but also examines the extent to which perceptions of ART are influenced by student diversity. In the following sections, we review existing research on ART, discuss the major limitations of this research, and explain the focus of the present study.

Evaluations of ART

Across studies in which student evaluations of ART have been obtained, the findings are most consistent and positive with respect to classroom engagement: several studies indicate that students view ART as a positive influence on their attention, interest, or involvement during class (Rice and Bunz 2006; Fitch 2004; Latessa and Mouw 2005; Nicol and Boyle 2003). Global assessments of the technology, such as liking, effectiveness, or desire to continue use, also tend to be positive (Blackman et al. 2002; Fitch 2004; Nicol and Boyle 2003; Stuart et al. 2004), and students generally find ART easy to use (Copas 2003; Copas and Del Valle 2004).

Prior research indicates that students also perceive learning benefit from ART use (e.g., comprehending or remembering course material), but average responses to items about learning are typically not as positive as those for classroom engagement (Blackman et al. 2002; Guthrie and Carlin 2004; Latessa and Mouw 2005). Further, several studies have found that students do not see ART technology as having much influence on their preparation outside of class (Blackman et al. 2002; Fitch 2004; Nicol and Boyle 2003). Three quasi-experimental studies have examined the influence of ART on exam or course grades. Although two studies found that grades were higher for students in classes that used ART versus those that did not (Poulis et al. 1998; Schackow et al. 2004), another found no significant difference (Blackman et al. 2002).

Overall, existing research indicates that students like ART, find it easy to use, and see it as promoting classroom engagement, but are less convinced that it benefits their learning. Further, direct evidence of learning benefit is mixed. This research provides an important foundation for understanding student evaluation of ART. However, the existing studies also exhibit several methodological limitations that suggest the need for further research, especially for supporting generalizations about undergraduate response to the technology in large lecture classes.

The first limitation of existing studies is their focus on students, classes, and ART packages that differ notably from the “typical” case of an undergraduate using a current ART device in a large lecture context. These include medical students and health care professionals (Latessa and Mouw 2005; Schackow et al. 2004), military cadets (Blackman et al. 2002), graduate students (Rice and Bunz 2006; Copas 2003), and management-level IBM employees (Horowitz 1988), all of whom might be expected to have a different perspective on education, instructional techniques, and technology from that of the average undergraduate. In addition, a number of studies have involved students who used ART in classes with fewer than 50 students (Blackman et al. 2002; Rice and Bunz 2006; Fitch 2004; Latessa and Mouw 2005; Schackow et al. 2004). Further, when ART has been evaluated by undergraduates in large lecture classes, these classes have often been in science, engineering, or mathematics, where students’ comfort and competence with technology or expectations for instruction may differ from those in other contexts (Boyle and Nicol 2003; Poulis et al. 1998). Finally, some studies have been conducted with students using ART packages that are considerably more primitive than those currently available (Herr 1994; Horowitz 1988). For example, the students studied by Poulis et al. used a single-button (yes/no) system. In short, much of the existing literature does not speak directly to the perceptions of the undergraduate student using more current versions of ART in a large lecture classroom, especially in courses outside the STEM (Science, Technology, Engineering, or Mathematics) disciplines.

A second set of limitations concerns measurement and analysis. To date, each individual study has obtained evaluations of ART with regard to a small set of criteria (often, only two or three). The greatest attention has been given to aspects of classroom engagement, such as interest, attention, or fun, but there has been considerable variability in which aspects of engagement have been examined as well as in the items used to measure specific perceptions. A similar lack of standardization exists for other dimensions of assessment, such as learning or global evaluation (i.e., overall liking or effectiveness). Further, with one recent exception (Rice and Bunz 2006), existing research has employed single-item measures (e.g., one item on “fun,” one item on “learning”), without apparent attention to the reliability and validity issues that arise with the use of single items (DeVellis 2003). These factors make it difficult to compare or synthesize findings across studies. It is also the case that prior studies have not always provided sufficient statistical information from which to interpret the results; these problems include reporting averages without standard deviations by which to judge variability (Fitch 2004; Poulis et al. 1998), or no statistics at all (Blackman et al. 2002). Overall, a more systematic and rigorous approach to measurement is needed.

A third limitation in the existing evaluation literature is reliance on one-time evaluations. Although several authors have noted the possibility of a “novelty factor” or “Hawthorne effect” on student evaluations of ART (Poulis et al. 1998; Stuart et al. 2004), studies have not examined whether students evaluate the technology less or more positively as they use it over a period of time. In fact, some studies have reported evaluations from students who had very limited opportunity to employ the technology (Copas 2003; Horowitz 1988; Latessa and Mouw 2005). Because undergraduates will often be required to use the technology for an entire semester or quarter, it is important to know whether their perceptions of ART undergo change, and whether any such changes have instructional implications.

These methodological limitations indicate the need for a systematic, multi-dimensional assessment of ART, conducted over the course of a semester with undergraduate students who are not necessarily in the STEM disciplines. The major goal of the present study was to provide such an assessment.

ART technology and student diversity

A secondary goal of the present study was to examine the influence of student diversity on evaluations of ART. Although there is considerable diversity among the students represented in prior studies, only limited attention has been given to examining whether this diversity affects how students respond to ART (Rice and Bunz 2006; Jackson and Trees 2007). Obviously, studies with small and homogeneous student populations are not well-equipped to examine whether factors like gender or year in school affect student response to ART. However, even studies with larger, more diverse samples drawn from multiple classes have generally neglected this issue (Boyle and Nicol 2003).

There is good reason to believe that student characteristics could influence evaluations of ART. Rice and Bunz (2006) found that ART was perceived as more fun and easier to use by graduate students who viewed themselves as more competent at computer-mediated communication. Thus, some student characteristics may affect comfort or competence with technology, and thereby affect evaluations of ART. For example, research suggests that women have somewhat more negative attitudes toward computers and are less comfortable using them (Whitley 1997). Correspondingly, women might evaluate ART less positively. Similarly, students whose majors require greater use of technology might report more positive perceptions. Student characteristics could also influence evaluations of ART by affecting instructional expectations. In a large-sample study with undergraduates in communication, physics, and astronomy classes, Jackson and Trees (2007) found that students who were younger (freshmen and sophomores), or who had less experience with lecture classes, perceived greater learning as a consequence of ART technology. They suggested that older, more experienced students expected a more passive learning environment and that instructors’ use of ART violated those expectations.

For confident generalization about student response to ART, it is important to ascertain the extent to which variability in evaluation is predicted by student characteristics. Further, from the practical standpoint of the instructor using ART in the classroom, it may be important to know how specific groups or types of students are likely to respond. Thus, a secondary goal of the present study was to examine several demographic characteristics to assess whether these variables influenced perceptions of ART.

Dimensions of ART evaluation

Our goal of providing a systematic, multi-dimensional evaluation of ART necessitated careful selection of dimensions on which to obtain evaluations. Accordingly, we selected 15 dimensions, based on input from multiple sources. These included anecdotal or marketing claims about the benefits of ART (e.g., Ward 2003), qualitative data on ART strengths and weaknesses as reported by individual students and instructors (e.g., Fitch 2004), scale items used for prior evaluation studies (e.g., Rice and Bunz 2006), and concerns or interests that arose based on the authors’ experiences using ART in their classes. Three of these dimensions were chosen to address student perceptions of ART influence on their engagement during class: attendance (a precondition of engagement), attention during class, and sense of participation in the class. Two dimensions were selected to address the impact of ART on students’ ability to appraise their standing in the course: self-assessment, conceptualized as a student’s ability to gauge how well he or she is doing in the course, and exam preview, or the student’s ability to predict what will be expected on exams, quizzes, or other assignments. Three dimensions, preparation outside of class, motivation to learn course material, and learning of course content, were selected for their focus on learning processes and outcomes. Three dimensions were selected to focus on the experience of using the technology: ease of use, fun, and privacy (i.e., students’ perceptions that their answers are not available to other students). Three dimensions were included for global evaluation: liking, belief that using the ART was a good use of class time, and desire for future use. Finally, based on the authors’ experience of various student complaints, we included a dimension focused on students’ perceptions that using ART resulted in a negative grade impact.

Method

Participants

Participants were students at a large Midwestern university and were eligible for participation by virtue of using an ART system (the Classroom Performance System from eInstruction) in one of three large lecture classes during Spring 2005. In each of these classes, use of the CPS system to answer questions during class was a required element of the course. Students purchased CPS response devices (“pads”) from university bookstores at a cost of approximately $10. The pads had been updated just prior to that semester, so instead of sending signals via infrared (which had required “line of sight” between the device and the signal receiver), they operated on a radio frequency and students did not need to point their pads in the direction of the receiver. These pads did not have an LCD screen to display answers (an upgrade that became available from CPS in Fall 2006), but were otherwise similar to current technology; at the present time, both types of pads are still being used on the authors’ campus.

On entering their classrooms, students keyed in a 2-digit code that caused the receiver to recognize their presence and pass all subsequent input to the CPS software operated by the instructor. The software allowed instructors considerable flexibility with regard to the types of questions they asked—virtually any type of multiple choice, true/false, or numeric answer question was possible, including questions with an accompanying graph, figure, or picture. In addition, questions could be created in advance and displayed via the CPS software, or created “on the spot” and communicated orally.

The three classes that participated in the study were a subset of the 15 classes (with a total of approximately 3,000 students) that used CPS on the university campus in Spring 2005. Although CPS (and possibly other ART systems) had been used in some courses prior to that semester, Spring 2005 was the first semester that CPS was made available campus-wide, coinciding with the availability of the radio frequency pads and receivers. A “call” for participation in the present study was issued to all of the instructors who indicated the intention to use CPS that semester, and instructors of three classes with enrollments exceeding 200 agreed to make participation in the study a required part of their courses. These courses were Communication 102 (COM 102), Introduction to Communication Theory; Forestry and Natural Resources 103 (FNR 103), Introduction to Environmental Conservation; and Organizational Leadership and Supervision 274 (OLS 274), Applied Leadership: Functions, Structures, and Operations of Organizations. COM 102 surveys social scientific theory in communication, FNR 103 addresses a range of issues in natural resource conservation, and OLS 274 is an introduction to business. The instructors of COM 102 and FNR 103 are co-authors on this paper, and a third co-author directly supervised the lecturer who taught OLS 274.

There was no attempt to standardize, manipulate, or systematically measure how the course instructors used CPS in their classes, but there were points of considerable similarity. All three instructors were using CPS for the second or third time, so they were familiar with the technology. (Although their prior use had been with the infrared version of CPS, the change to radio frequency required little adaptation by instructors.) Further, all three instructors used ART to present students with comprehension questions, often based on the immediately preceding lecture material, but also on material from assigned readings or from prior classes, and to provide review prior to quizzes or exams. Typically, students would answer 3–5 questions per class period. In addition, in all three classes, students earned course credit for using CPS, and credit was awarded regardless of whether the answer was correct. Thus, course credit for CPS amounted to a grade for attendance.

Not surprisingly, there were also some differences between the classes. The COM 102 and FNR 103 instructors varied the placement of the ART questions within their lectures, whereas the OLS 274 instructor was more systematic, asking 1 or 2 questions at the beginning, middle, and end of each lecture. The FNR 103 instructor made use of some opinion questions (i.e., surveying students’ attitudes on a topic relevant to the lecture); the other instructors did not. The amount of course credit for answering ART questions was 10% in OLS 274, 6.6% in FNR 103, and 5% in COM 102. Participation in the present study was also required, and constituted a small portion of students’ grades (the largest amount of credit was 3% of the final grade, in COM 102). Requiring student participation in the study was consistent with the focus on evaluation of instructional technology, and was approved (along with all other study procedures) by the university’s Institutional Review Board.

At the beginning of the semester, there were approximately 220 students enrolled in COM 102, 410 enrolled in FNR 103, and 692 enrolled in OLS 274 (estimates are based on enrollment caps and early rosters; enrollment fluctuates considerably in the first two weeks of class). According to data provided by the registrar, a total of 1,192 students were enrolled in the three courses at the end of the semester: 219 in COM 102, 425 in FNR 103, and 548 in OLS 274. The OLS 274 instructor reported that the large decline in enrollment was typical; students routinely drop when they realize that the class is more demanding than they expected. Of the 1,192 students who finished the semester in these classes, 237 did not participate in any of the three surveys administered over the course of the semester (22 in COM 102, 27 in FNR 103, and 188 in OLS 274), leaving a total of 955 who responded to one or more surveys. Failure to participate in one or more surveys probably resulted from the relatively small amount of course credit awarded for survey participation. Ninety participants were eliminated due to extensive missing data, and 11 were excluded due to indications that their data were unreliable. Participants’ data were deemed unreliable if they gave two or more non-matching responses on demographic items across the three surveys (e.g., reporting “male” in one survey and “female” in another, and reporting ages that differed by more than one year). We assumed that one demographic discrepancy between surveys could be unintended error (and simply excluded those participants from analyses on those specific variables), but felt that two or more discrepancies indicated either sloppiness or dishonesty.
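To make this exclusion rule concrete, the following Python sketch flags participants whose demographic reports across the three surveys contain two or more discrepancies. The column names (gender_t1, age_t1, etc.) are hypothetical; this is an illustration of the rule, not the procedure actually used in the study.

```python
# Illustrative sketch of the "two or more demographic discrepancies" exclusion
# rule described above. Column names are hypothetical (one column per item
# per survey wave).
import pandas as pd

def count_discrepancies(row) -> int:
    """Count demographic inconsistencies across the three surveys for one participant."""
    n = 0
    # Gender discrepancy: any non-matching (non-missing) reports across waves.
    genders = {g for g in (row["gender_t1"], row["gender_t2"], row["gender_t3"]) if pd.notna(g)}
    if len(genders) > 1:
        n += 1
    # Age discrepancy: reported ages differing by more than one year.
    ages = [a for a in (row["age_t1"], row["age_t2"], row["age_t3"]) if pd.notna(a)]
    if ages and (max(ages) - min(ages)) > 1:
        n += 1
    return n

def flag_unreliable(df: pd.DataFrame) -> pd.Series:
    """True for participants with two or more demographic discrepancies."""
    return df.apply(count_discrepancies, axis=1) >= 2

# Usage: reliable_df = df[~flag_unreliable(df)]
```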

Thus, the total number of participants was 854 (71.6% of those who were enrolled at the end of the semester): 174 in COM 102 (79.4% of the class), 393 in FNR 103 (92.4% of the class), and 278 in OLS 274 (50.7% of the class). However, only about half of the participants (n = 444; 37.2%) completed all three surveys, and of these, only 390 provided complete data on gender, ethnicity, and year in school (variables subsequently used in analyses). This “listwise sample” consisted of 102 from COM 102 (52.9% of the class), 207 from FNR 103 (48.7% of the class), and 81 from OLS 274 (14.7% of the class).

Demographics

Demographic information for both the total and listwise samples is provided in Table 1. The demographic characteristics of the two samples were very similar, with minor variations (e.g., somewhat higher percentages of women, freshmen, and European-Americans in the listwise sample than in the total sample). Both samples were relatively evenly divided between men and women. There were considerably more participants who were freshmen and sophomores than juniors or seniors, so the juniors and seniors were combined into the category of “upperclassmen” for purposes of analysis. Similarly, because 85.9% of the listwise sample indicated European-American/White ethnicity, the remaining participants were combined into a single “minority” category for purposes of analysis.

Table 1 Participant demographic characteristics

Participants were pursuing majors in many of the university’s 11 colleges and schools, with the largest numbers in liberal arts, consumer and family sciences, technology, agriculture, and management. We had intended to examine the influence of students’ majors (schools) on their evaluations of ART. However, in our data, school was substantially confounded with the course, especially in COM 102 and OLS 274. The vast majority (92.2% in the total sample) of COM 102 students were liberal arts majors. Further, 49.6% of OLS 274 students in the total sample were technology majors, and technology majors were virtually unrepresented in the other two classes. FNR 103 had the most even distribution across schools; in the full sample, 22.6% were from liberal arts, 23.1% from consumer and family sciences, 21.6% from agriculture, and 19.5% from management. Accordingly, we elected to control for course by including it in our primary analyses, and to conduct an exploratory analysis on the influence of school within the FNR 103 sample only.

Participants were asked about prior use of ART systems in college courses, or in primary or secondary schooling. In the total sample, 85.9% had never used any audience response technology. Another 8.3% were unreliable in their reporting of prior use across two or more surveys; this inconsistency may have been due to the wording of the question, which failed to emphasize that students should not count their current courses. Because so few students (5.7%) had any prior experience with ART technology, we were not able to examine the influence of prior use in our analyses.

Procedures

At three points in the semester, students in the three courses were directed to a website from which they could access and complete the survey online. The link to the website was made available via secured “courseware” (WebCT) to reduce the likelihood of anyone outside the relevant classes gaining access to the website or participating in the study. Instructional technology staff administered the website and provided instructors with the names of students who had participated in each survey so that appropriate credit could be awarded. However, the data provided by the students were not given to the authors for analysis until after student grades were recorded and instructional technology staff had stripped the data of names and other identifying information.

The first survey was made available for 12 days beginning in the 7th week of the semester and ending in the middle of the 8th week. Because the instructors began using CPS in the 3rd or 4th week of the semester, students had been using the technology for 3–4 weeks at the time of the first survey. The second survey was also made available for 12 days beginning in the 12th week of the semester and ending in the middle of the 13th week, and the third survey was made available for 6 days in the 16th week of the semester (the week before finals; the semester was 17 weeks including Spring Break in the 10th week).

Audience Response Technology Questionnaire

To measure student perceptions of ART (CPS), we developed a measure entitled the Audience Response Technology Questionnaire (ART-Q). Prior research has typically relied on single-item measures and obtained evaluations of ART on a small number of dimensions. We sought to improve the quality of assessment by providing a multi-dimensional scale with multiple items for each dimension.

Content and face validity

During item development, we were especially concerned with content and face validity (DeVellis 2003). As previously noted, our 15 dimensions for assessment were chosen based on a thorough review of the literature on the evaluation of ART, as well as the authors’ substantial experience with the technology in their classrooms. We then created a set of three 5-point Likert-style items (1 = strongly disagree to 5 = strongly agree) for each of the 15 dimensions. These 45 items are listed in Table 2. Some of the items for the dimensions of fun, liking, and ease of use were adapted from scale items created by Rice and Bunz (2006); others were informed by single-item measures employed in other studies (e.g., Poulis et al. 1998; Schackow et al. 2004; Stuart et al. 2004). Thus, as recommended for higher content validity (DeVellis 2003), our scale items were developed with input from experts on ART and its evaluation. Although face validity is not necessarily required for a valid scale, we were interested in student evaluation of ART and wanted this interest to be transparent to the students in the study. Consequently, ART-Q items were repeatedly edited not only for content relevance but also for clarity and simplicity.

Table 2 ART-Q—items, factor loadings (Time 1 survey), and scale reliabilities

Reliability

Item reliability for the 15 scales was assessed for all three surveys (Times 1, 2, and 3), and was good to excellent (Cronbach’s α = .70 and above) for 10 of the 15 scales at all three times. However, five of the scales (for the dimensions of class time, future use, privacy, participation, and fun) contained one item that significantly reduced scale reliability for one or more of the three surveys (typically, all three). In 4 of the 5 cases, the problematic item was negatively rather than positively worded (i.e., it had to be reverse scored). As DeVellis (2003) notes, negatively worded items often produce lower reliability. Dropping these problematic items resulted in two-item scales with reliabilities ranging from .64 to .85. Reliabilities for each of the 15 scales at Time 1 are reported in Table 2; reliabilities at Times 2 and 3 were similar, and are available from the first author. Indices for each of the 15 dimensions were subsequently created from the means of the appropriate 2 or 3 items. The means and standard deviations for these indices are reported in Table 3.
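As an illustration of these computations, the following Python sketch shows how Cronbach’s α can be calculated for a set of items and how a dimension index can be formed from the mean of the retained items. The item names are hypothetical placeholders; this is not the analysis code used for the results reported here.

```python
# Sketch of scale reliability (Cronbach's alpha) and index construction.
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a set of item columns (rows = respondents)."""
    items = items.dropna()
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the sum score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def build_index(df: pd.DataFrame, item_cols: list[str]) -> pd.Series:
    """Dimension index = mean of the retained (2 or 3) items, as described above."""
    return df[item_cols].mean(axis=1)

# Hypothetical column names for the three "attendance" items at Time 1.
attendance_items = ["attend_1_t1", "attend_2_t1", "attend_3_t1"]

# Usage (df is the Time 1 survey data):
# alpha = cronbach_alpha(df[attendance_items])
# df["attendance_index"] = build_index(df, attendance_items)
```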

Table 3 Means and standard deviations of factor-derived dimensions of evaluation

Factor analysis

Although we identified 15 dimensions on which to obtain student evaluations, we also recognized the probability that some dimensions of evaluation, while conceptually distinct, would elicit similar responses from students. We also recognized the practical benefit of determining whether dimensions might be collapsed and subsequently assessed with a smaller number of items. Accordingly, we conducted a series of exploratory factor analyses (principal axis with oblique rotation) on the 40 items that remained after dropping the items that were unreliable with their respective subscales. These analyses, conducted separately on the data from each of the three surveys, produced highly similar six-factor solutions using a .50/.30 criterion. (Using this criterion, an item was considered to load on a factor if its loading was at least .50 on that factor, and not more than .30 on any other factor. A few items that loaded between .45 and .49 on one factor and not more than .30 on any other factors were also retained because they had a good conceptual fit with the factor and did not reduce the reliability of the factor-derived scale.) The factor loadings for the Time 1 survey are reported at the bottom of Table 2; details of the Time 2 and Time 3 loadings are available from the first author. These factor analyses and subsequent reliability analyses informed the creation of scales for six factor-derived variables: Appraisal/Learning, Enjoyment, Preparation/Motivation, Attendance, Negative Grade, and Ease of Use.
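The item-retention criterion can be made concrete with the following Python sketch, which obtains oblique-rotation loadings with the third-party factor_analyzer package and then applies the .50/.30 rule. The package, its arguments, and the item names are illustrative assumptions rather than a record of the software used; items loading between .45 and .49 that were retained on conceptual grounds, as noted above, would require a manual override of this rule.

```python
# Sketch of an exploratory factor analysis with oblique rotation and the
# .50/.30 item-retention criterion described above.
import pandas as pd
from factor_analyzer import FactorAnalyzer  # third-party package (assumed available)

def efa_loadings(items: pd.DataFrame, n_factors: int = 6) -> pd.DataFrame:
    """Principal-axis EFA with an oblique (oblimin) rotation; returns item loadings."""
    fa = FactorAnalyzer(n_factors=n_factors, rotation="oblimin", method="principal")
    fa.fit(items.dropna())
    return pd.DataFrame(fa.loadings_, index=items.columns)

def retained_items(loadings: pd.DataFrame, primary: float = 0.50, cross: float = 0.30) -> dict:
    """Assign each item to a factor if it loads >= `primary` on that factor
    and no more than `cross` (in absolute value) on every other factor."""
    assignments = {factor: [] for factor in loadings.columns}
    for item, row in loadings.abs().iterrows():
        top_factor = row.idxmax()
        other_loadings = row.drop(top_factor)
        if row[top_factor] >= primary and (other_loadings <= cross).all():
            assignments[top_factor].append(item)
    return assignments

# Usage: loadings = efa_loadings(df[item_columns]); retained = retained_items(loadings)
```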

All nine items from the original learning, exam preview, and self-assessment dimensions loaded on the first factor, which was labeled Appraisal/Learning. This factor has a coherent focus on the student’s perception that CPS is an aid to performance in the class—does the technology help with learning, knowing what will be on the exams, and/or figuring out whether you are doing well or poorly? Although we originally conceived of appraisal as distinct from learning, it is not difficult to understand why students might see them as closely related. The items loading on the second factor came from the original fun, liking, and future use scales, thus combining items that we conceptualized as representing classroom engagement (fun) with items from global evaluation scales (liking, future use). It is not difficult to see why the response patterns for these items were similar; someone who finds something fun to use will probably also like it and want to use it in the future. Accordingly, we gave this factor the label Enjoyment. Items loading on the third factor included all three original items for preparation and one motivation item; thus, we labeled it Preparation/Motivation. Items loading on the fourth, fifth, and sixth factors corresponded exactly to the original negative grade, ease of use, and attendance items, so those factors were labeled with the original names of the dimensions: Negative Grade, Ease of Use, and Attendance.

Overall, the six factors do a fairly good job of representing the original 15 dimensions, incorporating items from 9 of the 10 dimensions that were represented in the factor analysis with three items, and 2 of the 5 dimensions that were represented with two items. Reliabilities for each of the six factor-derived scales (for each survey) ranged from acceptable to excellent (.64–.93) and are reported in Table 2. Accordingly, we elected to create factor-derived variables from the means of the items meeting the .50/.30 criterion for each factor and to conduct our analyses on these six variables rather than the original 15. Of the original dimensions not represented in the six factors, the most conceptually distinctive was that of privacy. The mean for this variable in the total sample (M = 3.94, SD = 0.77) suggests that students are satisfied with the privacy of their CPS responses. However, researchers with a particular interest in student perceptions of privacy while using CPS should consider creating additional items to assess this concern.

Construct validity

We conceptualized the study as focusing on student evaluation of ART rather than student outcomes due to ART use; thus, we were more concerned that students could understand the questions they were being asked about ART than that their responses be predictive of objective criteria (criterion validity). Nonetheless, we do have some post hoc evidence of construct validity for three of the six factor-derived scales. Evidence for construct validity is provided when the measure of one construct behaves in theoretically predicted ways with a measure of another construct. To provide this evidence, we drew on a set of variables that were measured as part of the overall data collection, but were not directly relevant to the focus of this article: students’ subjective learning (perceptions of learning in the course as a whole, measured with items such as “How well do you think you have comprehended the content of this course?”), evaluation of the course, and final grades. Detailed descriptions of these measures and analyses involving these variables are reported elsewhere (MacGeorge et al. in press).

With regard to Appraisal/Learning, we hypothesized that students who perceived learning more in the course as a whole would view ART as making a more positive contribution to their learning. This relationship was supported by all three surveys, with correlations between students’ Appraisal/Learning evaluation and their subjective learning ranging from r = .12 at Time 1 (n = 749, p < .002) to r = .40 at Time 3 (n = 623, p < .001). For Enjoyment, we expected that students who were more positive toward the course overall would also view ART as more enjoyable. This, too, was supported by all three surveys, with correlations between Enjoyment and course evaluation ranging from r = .29 at Time 1 (p < .001) to r = .37 at Time 3 (p < .001). With regard to Negative Grade, we anticipated that students who received lower grades in the course would view ART as having a stronger negative impact on their grades. There was no significant correlation between Negative Grade and grade in the course in the first survey (when eventual course grades might have been less certain in students’ minds), but the expected relationship was supported by small but significant correlations at Time 2 (r = −.13, p < .001) and Time 3 (r = −.10, p < .02). We did not obtain data that could be used to assess construct or criterion validity for Ease of Use, Preparation/Motivation, or Attendance. In future research, the construct and criterion validity of all six scales might be further tested. For example, Ease of Use could be compared against records of student problems getting the ART technology to work. However, we also contend that there is adequate evidence of content, face, and construct validity for this (current) use of our ART-Q.
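These validity checks amount to bivariate correlations between each factor-derived scale and an external criterion, computed for respondents with data on both variables. A minimal Python sketch follows; the variable names are hypothetical and the code is illustrative rather than the software used for the reported analyses.

```python
# Sketch of the construct-validity correlations described above.
import pandas as pd
from scipy.stats import pearsonr

def validity_correlation(df: pd.DataFrame, scale: str, criterion: str):
    """Pearson r (and p) between a factor-derived scale and an external criterion,
    using only respondents with non-missing data on both variables."""
    pair = df[[scale, criterion]].dropna()
    r, p = pearsonr(pair[scale], pair[criterion])
    return r, p, len(pair)

# Usage, e.g., for a Time 3 survey data frame named t3:
# r, p, n = validity_correlation(t3, "appraisal_learning", "subjective_learning")
```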

Results

Descriptive statistics

Means and standard deviations for the six factor-derived dimensions of evaluation are reported in Table 3 for the Time 1, Time 2, and Time 3 surveys. (Means and standard deviations for the original 15 dimensions are available from the first author.) Grand means (averaging across time) are also reported for the entire sample (n = 854) and the listwise sample (n = 390). Inspection reveals little difference between the full sample and listwise grand means. Each of the means was tested for statistically significant difference from the scale midpoint of 3; most were significantly different (p < .001, two-tailed). Ease of Use received the highest mean evaluations, followed by Attendance, Appraisal/Learning, and Enjoyment (all with similar means). Preparation/Motivation received mean evaluations somewhat lower than 3.0 (though not always significantly so), and Negative Grade was both significantly and strongly below 3.0 (again indicating that students did not view use of ART as harmful to their grades).
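Each midpoint comparison is a two-tailed one-sample t-test of a scale mean against the scale midpoint of 3. The brief Python sketch below illustrates the test; column names are hypothetical, and the code is a sketch rather than the analysis software used here.

```python
# Sketch of the one-sample midpoint tests described above.
import pandas as pd
from scipy.stats import ttest_1samp

def midpoint_test(df: pd.DataFrame, scale: str, midpoint: float = 3.0):
    """Two-tailed one-sample t-test of a scale mean against the scale midpoint."""
    values = df[scale].dropna()
    t, p = ttest_1samp(values, popmean=midpoint)
    return values.mean(), values.std(ddof=1), t, p

# Usage (t1 is the Time 1 survey data):
# for scale in ["ease_of_use", "attendance", "appraisal_learning",
#               "enjoyment", "preparation_motivation", "negative_grade"]:
#     print(scale, midpoint_test(t1, scale))
```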

Effects of student characteristics, time, course, and school

To examine the influence of gender, ethnicity, and year in school on evaluations of ART, and to assess whether student evaluation changed over the course of a semester’s use, we conducted a mixed-model MANOVA with gender, ethnicity (minority/non-minority), and year in school (freshman, sophomore, or upperclassman) as between-groups factors and time as a repeated measure (Times 1, 2, and 3); course was also included as a between-groups factor to control for its effects. The dependent variables were the six factor-derived dimensions. This analysis was conducted on the listwise sample (n = 390). The MANOVA was performed with sphericity assumed and with the Greenhouse-Geisser and Huynh-Feldt corrections. The Greenhouse-Geisser epsilons ranged from .93 to .98 for the six dependent variables, and all six of the Huynh-Feldt epsilons were 1.00. Consistent with these high epsilons, none of the significance outcomes was affected by either correction. Accordingly, the following F and p values are reported with sphericity assumed.
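The omnibus analysis was a multivariate mixed-model design with several between-groups factors. As a simplified, univariate illustration of the general approach, the Python sketch below (using the third-party pingouin package and hypothetical long-format column names) estimates a time-by-course mixed ANOVA for a single factor-derived scale; pingouin’s mixed_anova accommodates only one between-groups factor, so this is a sketch, not the analysis actually reported.

```python
# Simplified sketch of a time (within) x course (between) mixed ANOVA for one
# factor-derived scale, assuming long-format data (one row per participant
# per survey wave).
import pandas as pd
import pingouin as pg  # third-party statistics package (assumed available)

def time_by_course_anova(long_df: pd.DataFrame, scale: str) -> pd.DataFrame:
    """Mixed (within x between) ANOVA for one scale.
    long_df is assumed to have columns: participant, course, time, and the scale score."""
    return pg.mixed_anova(
        data=long_df,
        dv=scale,             # e.g., "enjoyment"
        within="time",        # Times 1, 2, 3
        subject="participant",
        between="course",     # COM 102, FNR 103, OLS 274
    )

# Usage: print(time_by_course_anova(long_df, "enjoyment"))
# The returned table includes F statistics, uncorrected p values, and effect sizes.
```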

Fig. 1 Mean evaluations of ART at Times 1–3

The MANOVA revealed significant multivariate within-group effects for time, Wilks λ = .95, F(12, 1414) = 3.39, p < .001, η² = .03, the interaction between time and course, Wilks λ = .95, F(24, 2840) = 1.69, p < .05, η² = .01, and the four-way interaction between time, gender, ethnicity, and course, Wilks λ = .94, F(24, 2840) = 1.89, p < .01, η² = .02. Univariate analyses indicated that the effect of time was significant for Ease of Use, F(2, 712) = 4.11, p < .05, η² = .01, Enjoyment, F(2, 712) = 3.25, p < .05, η² = .01, and Negative Grade, F(2, 712) = 9.23, p < .001, η² = .03 (see Fig. 1). Paired-samples t-tests indicated that Ease of Use at Time 3 (M = 3.84) was significantly lower than at Time 1 (M = 3.95), t(389) = 2.48, p < .01, but there was no significant difference between Ease of Use at Time 1 and Time 2 (M = 3.88), t(389) = 1.59, p = .11, or between Time 2 and Time 3, t(389) = 1.07, p = .29. Enjoyment at Time 2 (M = 3.20) was significantly lower than at either Time 1 (M = 3.38), t(389) = 5.42, p < .001, or Time 3 (M = 3.34), t(389) = 3.97, p < .001, but Time 1 and Time 3 did not differ from each other, t(389) = 1.12, p = .27. Negative Grade increased over time: Time 2 (M = 2.26) was significantly higher than Time 1 (M = 2.16), t(389) = 2.79, p < .01, and Time 3 (M = 2.47) was significantly higher than Time 2, t(389) = 4.95, p < .001.
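These follow-up comparisons are paired-samples t-tests between survey waves on the listwise sample. A short Python sketch is given below; the wide-format column names (e.g., ease_t1, ease_t3) are hypothetical.

```python
# Sketch of the paired comparisons between survey waves described above.
import pandas as pd
from scipy.stats import ttest_rel

def paired_comparison(wide_df: pd.DataFrame, col_a: str, col_b: str):
    """Paired-samples t-test between two time points of the same scale."""
    pair = wide_df[[col_a, col_b]].dropna()
    t, p = ttest_rel(pair[col_a], pair[col_b])
    return pair[col_a].mean(), pair[col_b].mean(), t, p

# Usage (listwise is the wide-format listwise sample):
# m1, m3, t, p = paired_comparison(listwise, "ease_t1", "ease_t3")
```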

The interaction between time and course was significant for Preparation, F(4, 712) = 2.63, p < .05, η² = .02, and marginally significant for Appraisal/Learning, F(4, 712) = 2.32, p < .06, η² = .02. Because the course differences were not of primary interest in this study, we analyzed the comparisons descriptively rather than inferentially as above. These interactions indicated that Preparation increased somewhat over time in FNR 103 (Ms = 3.33, 3.49, and 3.47 for Times 1, 2, and 3, respectively) but remained essentially the same in COM 102 (Ms = 2.77, 2.77, and 2.75) and OLS 274 (Ms = 3.12, 3.18, and 3.17), and that Appraisal/Learning decreased somewhat over time in COM 102 (Ms = 3.58, 3.45, and 3.34) but increased slightly in FNR 103 (Ms = 3.33, 3.40, and 3.47) and remained essentially the same in OLS 274 (Ms = 3.11, 3.11, and 3.16). The interaction between time, gender, ethnicity, and course was significant only for Preparation, F(4, 712) = 3.15, p < .05, η² = .02. This interaction was due to a decline in Preparation at Time 2 (relative to Times 1 and 3) for non-minority females in OLS 274; because this interaction does not seem to have a theoretical or post hoc interpretation, it will not be discussed further.

To explore the possible influence of school (i.e., major) on evaluations of ART, we conducted a mixed-model MANOVA with participants who were students in FNR 103. As previously discussed, this was the only class in our study for which there was a relatively even distribution of students across multiple schools. School (Liberal Arts, Consumer and Family Sciences, Agriculture, or Management) was the between-groups factor and time was a repeated measure (Times 1, 2, and 3). The total sample size for this analysis was 206. Neither the effect of school, Wilks λ = .93, F(18, 558) = .86, p = .64, nor the interaction between time and school, Wilks λ = .84, F(36, 565) = .94, p = .58, was statistically significant.

Discussion

The central purpose of the present study was to provide a multi-dimensional evaluation of ART, conducted over the course of a semester with undergraduates using the technology in large lecture classes. The development of the ART-Q was important to our research effort. Scales were developed to measure each of the 15 original dimensions, and these scales proved reliable across all three surveys. Subsequent researchers may wish to develop items to replace the five that were dropped, especially if one or more of the original scales prove important in the context of their research questions. Factor analyses reduced the 40-item set to a manageable six factors, each of which was interpretable and produced a reliable scale across all three surveys. Conceptually, the six factors provide representation for most of the original 15 dimensions. Further, there is preliminary evidence of construct validity (as well as face and content validity). The ART-Q constitutes a notable methodological improvement over prior measures that used single items to assess a small number of dimensions, and should be useful for future research involving evaluations of ART.

Much of prior ART evaluation has been conducted with “atypical” students and classes. Importantly, the findings from the current study help to extend knowledge about ART benefits (and possible shortcomings) to traditional undergraduates in large lecture classes, including those that are not in STEM disciplines. Consistent with prior studies, students in the current project agreed quite strongly that ART is easy to use (Copas 2003; Copas and Del Valle 2004). There was a more modest level of agreement that using ART is enjoyable, and that it results in greater knowledge about instructor expectations, student performance, or course material (i.e., Appraisal/Learning). These findings dovetail with prior assessments of ART influence on classroom engagement and learning (e.g., Nicol and Boyle 2003), as does the finding that students did not view ART as influencing preparation for class (e.g., Fitch 2004). Prior studies have not provided much quantitative assessment of ART influence on attendance or students’ privacy concerns. The current findings suggest a modest influence on attendance and that students perceive their answers to be confidential. Further, prior studies did not directly consider whether ART use was perceived as hurting students’ grades. The present study found little of this negative perception.

One innovation in the current study was the use of a multi-survey design, examining whether evaluations of ART by the same set of students changed over the course of a semester. The current study found very little fluctuation in student perception; most dimensions did not vary significantly from the 7th week of the semester (when the students had been using ART for approximately 3 weeks) to the 16th week of the semester. However, students did see the use of ART as having more negative impact on their grades at the end of the semester than at the beginning. Most likely, this reflects some students’ greater certainty at Time 3 about receiving a low grade in the course, including how points awarded for ART use might be contributing to those grades. However, it should be noted that the effect size was modest (η² = .03). Students also perceived ART as somewhat more difficult to use toward the end of the semester. This finding may reflect less positive perceptions of their courses as final grades approached, or greater awareness of their own and others’ occasional difficulties with dead batteries or forgotten clickers. Overall, the current study suggests that student responses to ART are not driven by the novelty of the technology; instead, the perceived benefits persist over time.

Few previous studies had considered how student diversity influenced evaluation of ART. In the current study, we were able to examine the influence of gender, ethnicity, and year in school. Despite considerable power to detect effects (given the large sample size and repeated measures design), the only significant effect was a theoretically uninterpretable four-way interaction involving time and course as well as gender and ethnicity. Thus, the present study suggests that these demographic characteristics are not significant influences on student perceptions of ART. Although we were not able to conduct a full-scale assessment of the influence of students’ majors on ART evaluations, we did examine this question within the FNR 103 sample. Although liberal arts, consumer and family sciences, agriculture, and management majors might be reasonably expected to differ in some of their attitudes toward technology or expectations about instruction, there were no differences in ART evaluation. From a practical perspective, these findings suggest that ART may be equally beneficial to a wide range of students. However, further research may want to readdress this issue, perhaps examining other aspects of student diversity such as academic aptitude (MacGeorge et al. in press).

There were some indications of variation in ART evaluation across the three courses, but these can only be interpreted post hoc. For example, it appears that the FNR 103 instructor was especially successful at using ART to influence students’ preparation and learning because these evaluations increased over the course of the semester rather than remaining the same or declining as they did in the other two courses. This outcome may reflect the instructor’s considerable experience in the classroom and with that particular class; he was the most senior of the three instructors and was teaching FNR 103 for the 11th time that semester (as opposed to the 1st and 5th time for COM 102 and OLS 274, respectively). However, it may also reflect specific strategies for using ART or discussing its use with students, neither of which were assessed in the current study.

Thus, one important limitation of the current study is that the student perceptions we assessed cannot be tied directly to instructors’ ways of using ART. The present study, together with prior work, provides a good foundation for the claim that ART can provide a modest level of benefit to students on several dimensions. However, subsequent work should consider whether student benefit can be increased (or is lost) if ART is used in specific ways. To take one example, students in the current study did not view ART as promoting preparation for class. This may be unsurprising, since none of the instructors made students’ grades contingent on the correctness of ART answers, and questions were often on the immediately preceding lecture material (rather than reading that had to be accomplished outside of class). In addition, instructors’ use of ART questions as a way of providing review for quizzes and exams may actually have substituted for some student preparation outside of class.

Thus, future research should consider how instructors’ methods of using ART affect student perceptions and behaviors. For instance, to what extent would students choose to improve their preparation if ART was used in a “higher-stakes” way? Would this also increase their perception of learning? Would it have any negative effects, such as reducing liking for the technology? More broadly, how do instructors’ choices about the number, type, or placement of questions, follow-up to students’ answers, and other aspects of ART use affect students? To provide instructors with information about best practices in using ART, these issues need to be systematically addressed in subsequent studies.

Finally, future research should include objective measures of ART influence on students’ learning process and outcomes; only a small number of these studies have been conducted to date (Blackman et al. 2002; Poulis et al. 1998; Schackow et al. 2004). Clearly, student perceptions such as those measured in the present study are an important component of evaluating the technology. However, there remains a difference between showing that students believe they know more as a consequence of using a technology, and demonstrating that they actually do know more. Further, things that students like about a technology may fail to benefit them, or even be detrimental to learning (Mayer and Moreno 2002). To address these issues with regard to ART, experimental or quasi-experimental research needs to be undertaken.

Pragmatic implications

Limitations notwithstanding, we believe the current study supports several conclusions with pragmatic implications for instructors. First, our research indicates that students find ART easy to use, and perceive it as enjoyable, beneficial to learning, and encouraging of class attendance. Overall, instructors can probably expect their students to respond to ART in a moderately positive way, especially when they use the technology in ways that are similar to the courses described in the present study. Second, students with varied characteristics—gender, ethnicity, year in school, and major—seem to react similarly to ART, suggesting that instructors need not be overly concerned about these characteristics when deciding whether to implement ART. Third, as indicated by variation in evaluation across the three courses (as well as common sense), instructors should expect that student responses to the technology will be affected by the way in which the technology is integrated into the curriculum.