
1 Introduction

Programming involves the process of generating a solution to a problem; thus, one of the main learning outcomes of a programming course is to develop a student’s ability to solve problems [31]. It is therefore important for educators to be responsive to “the problem-solving skills students bring to programming, and to those required by programming”, because students are influenced by the strategies that are facilitated [33]. Soloway et al. showed that students’ sensitivity to strategies while learning to program has a significant effect on their performance [33]. However, first-year students have a limited skill set and limited ability to read code [22]. Therefore, besides choosing the most appropriate programming approach, environment, and tools, educators should consider conveying and teaching problem-solving strategies (e.g. hill climbing, trial and error, divide and conquer, top down, and bottom up) that students can apply while learning to code [2]. In addition, Felder argues that students “should be given the freedom to devise their own methods of solving problems rather than being forced to adopt the teacher’s strategy” (p. 679) [16]. However, not all strategies are equally good, so students need feedback from educators in order to learn and improve. Moreover, the strategies that students employ to solve coding problems cannot be observed directly and must be inferred. Therefore, this study analyzes the programming assignments of 600 students from an introductory Java university course. We investigate the programming behavior of freshmen while they learn how to program, using data generated as they solve their programming assignments. This allows us to ascertain the strategies students employ during coding activities and to understand how efficient these strategies are, so that educators can offer actionable feedback that nurtures good programming habits and strategies [4]. Enhancing students’ learning experience with carefully designed coding exercises and support in assessing the required knowledge should assist freshmen when faced with the difficulties of syntax and semantics, as well as with understanding error messages and control flow.

To capture students’ programming behavior and identify their strategies, the authors extended the Eclipse programming tool with a plug-in for data collection. The goal of this study is to identify the programming strategies of successful students. This will allow educators to provide meaningful, personalized feedback that promotes reflection and supports students in improving the way they program. Consequently, the study addresses the following research questions:

RQ1: What programming strategies do freshmen employ to succeed in their assignments?

RQ2: Which actions can predict students’ programming behavior and support educators in early detection of difficulties and misconceptions?

2 Related Work

Previous research has shown a multitude of individual factors influencing academic achievement at various educational levels (e.g. primary, secondary, university). Some of these factors include self-efficacy [14, 35], personality traits (e.g. conscientiousness) [3, 28], cognitive ability [6], prior knowledge and experience [14, 35], and motivational and strategic (e.g. learning strategies) aspects [30].

Conscientiousness has been shown to be the personality trait that is most influential on academic achievement [3, 8, 13, 28]. Moreover, it is the dimension most closely linked to the will to achieve [13]. Another key predictor of student learning and academic performance is self-regulated learning (SRL) [11, 12, 23, 27]. SRL leads to deep cognitive engagement with the learning resources [11], which in turn transitions extrinsically motivated behavior into behavior driven by intrinsic motivation [12]. This path from deep cognitive engagement to high levels of intrinsic motivation has been found to correlate with student learning and academic achievement [40]. Another behavioral factor correlated with student learning (e.g. mastering the content) and academic achievement is performance approach [14] or deep strategy [30]. Deep learning strategies (where the student’s focus is on attaining understanding of the content rather than merely obtaining a higher grade) result in mastering the content [14], which may lead to higher examination success [30]. Past studies have contrasted deep and surface strategies and concluded that they are positively and negatively correlated with academic achievement, respectively [7]. Finally, previous research has shown that intellectual (cognitive/mental) ability influences academic performance. Intellectual ability can be measured in different ways, such as IQ [1], general mental ability (American College Test scores) [35], and logical reasoning [9]. Although several different factors can influence student academic achievement, when it comes to programming, problem-solving ability demonstrates the most significant correlation with student performance in solving coding tasks [21]. In this contribution we focus on the behaviour of the students rather than the above-mentioned constructs; the studies cited here serve only to give the reader a brief summary of the factors that affect academic achievement.

In computer science education, student assessment still abides by traditional outcome-based assessment [10]. However, programming is more than just the capability to generate code; it is a problem-solving skill. Past research has shown that this assumption has been neglected, leading to a gap in students’ ability to apply core programming concepts to real-world problems [32, 37]. To address this issue, educators must be able to guide students in determining a correct strategy and in identifying the appropriate time to abandon an inefficient approach [17]. Thus, researchers need to collect more authentic data and explore the processes by which students arrive at their final solutions [34]. This idea has become reality with the increase in popularity and usage of automated code testing and assessment in computer science education. Existing systems aid educators in assessing various features of coding assignments and scale the assessment up for large courses [15]. For instance, Jadud introduced the idea of researching students’ compilation behaviour (i.e. “the programming behaviour students engage in while repeatedly editing and compiling their programs”) to better understand how students progress through a programming task, so that appropriate interventions can be applied [19]. Following this idea, Blikstein et al. utilized code snapshots to uncover differences between novices’ and experts’ programming strategies [4]. Expanding on these past studies, we extended the Eclipse tool to collect data portraying students’ programming behaviour, with the goal of exploring students’ strategies when solving coding tasks and their success in doing so.

Feedback is one of the most powerful variables influencing learning [18]. However, feedback is of little use if it only conveys a message of right or wrong; it must be meaningful and actionable in order to help the learning process. Traditionally, in computer science education, students receive the basic level of feedback presented by the compiler [29]. Compiler messages are not always helpful, as they do not allow students to understand why they fail in solving the coding task. In most cases, coding tasks can be solved in multiple ways and admit multiple valid solutions. To complete programming tasks, students apply strategies that build on their previous knowledge [20]. This has led researchers to categorize students based on their programming behavior and the strategies they employ. Perkins et al. classify novice programmers as “stoppers” and “movers” based on the strategy they choose when facing a problem [25]. Turkle and Papert proposed two categories, “tinkerers” and “planners” [36], while Bruce et al. identified five: “followers”, “coders”, “understanders”, “problem solvers”, and “participators” [5]. Turkle and Papert’s idea was not only about categorizing novice programmers, but also about conveying epistemological pluralism, which highlights that students can take separate approaches to the same problem and exhibit different behavior (e.g. “tinkerer” or “planner”) while achieving similar results. Consequently, educators have recognized the importance of students’ learning process when learning how to program, and have developed tools and systems to support this process [24, 29, 39]. This study contributes to a data-driven development of personalized feedback in programming by using the writing and testing behavioral indicators of students as they attempt to solve coding exercises. Our aim in this contribution is to keep the behavioral indicators as semantic-less as possible to attain greater generalizability and reproducibility of results.

3 Methodology

3.1 Research Objectives

The context of this research is a compulsory course in object-oriented programming (OOP), taught in Java and offered to second-semester CS majors (600 students). As an introductory OOP course, it attracts students with substantial variation in motivation and skills. The course is the basis for later software development courses; thus, it is important to identify struggling students early, provide appropriate feedback, and help them develop good strategies for solving programming problems. Hence, the goal of the research is twofold: (1) identify programming strategies that lead to success in solving coding exercises; and (2) find ways to quickly detect student difficulties and misconceptions.

3.2 Assignment Structure

The course has 10 assignments, each rewarding 100 points when completed successfully. A student needs 750 points to qualify for the exam. Seven of the assignments (1-3, 5-6, and 8-9) are composed of smaller coding exercises with specific requirements indicating what to code. This allows us to use unit tests for automatic grading, as well as to collect rich data on student progression. Students are also encouraged to test by writing and launching their own testing code. Due to the open nature of the remaining assignments (4, 7, and 10), they have been excluded from this part of the study. The size (number of Java classes and methods) and difficulty level of the exercises vary; thus, students are granted a certain degree of freedom in selecting exercises based on their (self-assessed) skill level. Statistics indicate that exercise choices are evenly spread and that students spend approximately the same amount of time on the exercises each week.

3.3 Data Collection

We focus our data collection on the last four of the unit-tested assignments, as the first three were too basic for students to develop a concrete strategy. For each of these exercises we provided Eclipse with detailed instructions about which files and activities to track. In particular, we collected the following data: (1) snapshots of files when they are saved, together with compiler errors and warnings; (2) student programs that are launched, typically for testing their own code; (3) unit tests that are run, with information as to whether they pass or fail; and (4) the use of certain commands and panels, typically those used for debugging.

All data is time-stamped, and most of it is limited to the files relevant to a specific exercise, for both practical and privacy reasons. A special “Exercise panel” shows which data has been collected, allowing students to track their progress and review their process. The data is anonymized prior to its use in our research, but retains identifiers corresponding to exam results so that it can be correlated at a later stage.
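To make the structure of the collected data concrete, the following sketch shows one possible shape of a logged event. The JSON-lines format and all field names are illustrative assumptions on our part; they are not the plug-in's actual schema.

```python
import json
from datetime import datetime, timezone

# Hypothetical example of one logged event per line (JSON lines).
# Field names are illustrative; the actual Eclipse plug-in schema may differ.
event = {
    "student_id": "anon-4711",      # anonymized identifier
    "assignment": 5,
    "exercise": "Partner",          # hypothetical exercise name
    "event_type": "unit_test_run",  # save | program_launch | unit_test_run | command
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "tests_passed": 3,
    "tests_total": 8,
    "errors": 1,                    # compiler errors at the time of the event
    "warnings": 2,
    "lines_of_code": 112,
}
print(json.dumps(event))
```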

3.4 Measurements

To analyze the behavior and predict the outcome of each assignment, we captured the following measures (a computational sketch follows the list):

1. Number of test runs: the total number of times a student ran the unit tests to check their code, counted for each exercise in every assignment.

2. Improvement in unit test success: each time a student ran the unit tests, a specific number of tests passed and/or failed. The score obtained is the number of passed tests divided by the total number of tests; we compute the improvement (or lack thereof) in this score between two consecutive test runs.

 
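As a minimal illustration, the following sketch computes both measures from a chronological list of (passed, total) pairs per exercise; the input layout and function name are our own assumptions.

```python
from typing import List, Tuple

def test_run_measures(runs: List[Tuple[int, int]]) -> dict:
    """Compute the number of test runs and the score improvements
    between consecutive runs for one exercise.

    `runs` is a chronological list of (tests_passed, tests_total) pairs,
    one entry per unit test run (an assumed input format).
    """
    scores = [passed / total for passed, total in runs]
    improvements = [b - a for a, b in zip(scores, scores[1:])]
    return {
        "number_of_test_runs": len(runs),
        "scores": scores,
        "improvements": improvements,  # may be negative if the score drops
    }

# Example: a student passes 2/8, then 5/8, then 8/8 tests.
print(test_run_measures([(2, 8), (5, 8), (8, 8)]))
```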

To predict and analyze a student’s programming behavior in terms of the above-mentioned measures, the authors also computed the following variables from the student’s time series of unit test runs (a computational sketch follows the list):

1. Time difference launch: the average time between two consecutive launches of the student’s own test code, before the student runs another unit test.

2. Time difference edit: the average time between two consecutive logs of saving the file(s).

3. Size difference: the difference in the number of lines of code between two consecutive unit test runs, i.e. code growth.

4. Improvement in errors: the reduction in the number of errors and warnings between two consecutive unit test runs.

5. First test run score: the unit test success score of the first time a student ran a unit test, for each exercise in every assignment.

 
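The sketch below illustrates, under assumptions about the event layout (field names and a simplified notion of the launch and edit time differences), how these variables could be derived from a chronological event log of one exercise.

```python
from statistics import mean

def explanatory_variables(events):
    """Compute the explanatory variables for one exercise from a
    chronological list of event dicts with the assumed keys:
    'type' ('launch', 'save' or 'test'), 'time' (seconds),
    'loc' (lines of code), 'errors', 'score' (passed/total).
    """
    def avg_gap(kind):
        times = [e["time"] for e in events if e["type"] == kind]
        gaps = [b - a for a, b in zip(times, times[1:])]
        return mean(gaps) if gaps else 0.0

    tests = [e for e in events if e["type"] == "test"]
    return {
        "time_difference_launch": avg_gap("launch"),
        "time_difference_edit": avg_gap("save"),
        "size_difference": [b["loc"] - a["loc"] for a, b in zip(tests, tests[1:])],
        "improvement_in_errors": [a["errors"] - b["errors"] for a, b in zip(tests, tests[1:])],
        "first_test_run_score": tests[0]["score"] if tests else None,
    }

# Small illustrative log: edit, launch own test code, run unit tests, repeat.
events = [
    {"type": "save",   "time": 0,   "loc": 40, "errors": 2, "score": 0.0},
    {"type": "launch", "time": 60,  "loc": 40, "errors": 2, "score": 0.0},
    {"type": "test",   "time": 120, "loc": 40, "errors": 2, "score": 0.25},
    {"type": "save",   "time": 300, "loc": 55, "errors": 1, "score": 0.25},
    {"type": "launch", "time": 360, "loc": 55, "errors": 1, "score": 0.25},
    {"type": "test",   "time": 420, "loc": 55, "errors": 1, "score": 0.75},
]
print(explanatory_variables(events))
```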

4 Results

In this section, we present the prediction results followed by the behavioral analysis based on student categorization using an explanatory model.

Prediction Results. To predict the dependent variables, (1) improvement in unit test success and (2) the number of test runs, we used four independent (predictor) variables, (1) time difference launch, (2) time difference edit, (3) size difference, and (4) improvement in errors, fitting a Generalized Additive Model (GAM). We divided the data set into an 80% training and a 20% testing set and performed 5-fold cross-validation for both training and testing. Considering the improvement in unit test success, Table 1 shows that the overall prediction error using the combined data of the four assignments is 0.11, and the average prediction error using data from each assignment separately is 0.18 (SD \(=\) 0.03). Considering the number of test runs, the same table shows that the overall prediction error is 0.18 and the average prediction error is 0.24 (SD \(=\) 0.04). Table 2 shows the coefficients of the explanatory variables.

Table 1. Prediction results for the final score in a given assignment and the total number of test runs using data from individual assignments and the complete data sets.
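The following sketch outlines the modeling pipeline described above (80/20 split, 5-fold cross-validation, a GAM over the four predictors). The pygam library and the synthetic placeholder data are our own choices; the paper does not name the software used.

```python
import numpy as np
from pygam import LinearGAM, s
from sklearn.model_selection import train_test_split, KFold

# X columns (assumed order): time_difference_launch, time_difference_edit,
# size_difference, improvement_in_errors. y: improvement in unit test success.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))  # placeholder data, not the study's data set
y = 0.3 * X[:, 0] + 0.2 * X[:, 2] + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation on the training portion.
cv_errors = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_train):
    gam = LinearGAM(s(0) + s(1) + s(2) + s(3)).fit(X_train[train_idx], y_train[train_idx])
    pred = gam.predict(X_train[val_idx])
    cv_errors.append(np.sqrt(np.mean((pred - y_train[val_idx]) ** 2)))
print("5-fold CV RMSE:", np.mean(cv_errors))

# Final fit evaluated on the held-out 20% test set.
final_gam = LinearGAM(s(0) + s(1) + s(2) + s(3)).fit(X_train, y_train)
print("held-out RMSE:", np.sqrt(np.mean((final_gam.predict(X_test) - y_test) ** 2)))
```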

Relative to the number of test runs per individual assignment, we explore how early we can predict. Figure 1 shows a Root Mean Square Error (RMSE) of 0.10 from as early as the fourth test run. Most RMSE values lie between 0.12 and 0.16, with the lowest value observed at the \(4^{th}\) test run. This can be seen as a “proof of concept” for the hypothesis regarding early prediction of the total number of test runs.

Fig. 1. RMSE values for predicting the total number of test runs using the data up to a given test run ID.
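A sketch of the early-prediction experiment, assuming the four explanatory variables can be recomputed from only the first k test runs of each student; for brevity it reports in-sample RMSE, whereas a held-out split would be used in practice.

```python
import numpy as np
from pygam import LinearGAM, s

def rmse_by_cutoff(features_by_run, target, max_run_id=25):
    """For each test-run cutoff k, fit a GAM on the four explanatory
    variables aggregated from the first k test runs only and report the
    RMSE for predicting the target (e.g. total number of test runs).

    `features_by_run[k]` is assumed to be an (n_students, 4) array of the
    explanatory variables computed from runs 1..k; `target` is an
    (n_students,) array. This data layout is our assumption, and a
    held-out split should replace the in-sample RMSE used here.
    """
    results = {}
    for k in range(2, max_run_id + 1):
        X = features_by_run[k]
        gam = LinearGAM(s(0) + s(1) + s(2) + s(3)).fit(X, target)
        results[k] = float(np.sqrt(np.mean((gam.predict(X) - target) ** 2)))
    return results
```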

Table 2. Linear model for score improvement and total number of test runs, with all exercises combined in one data set; bold t-values are significant (p < 0.01). Unbiased risk estimate for score improvement \(=\) 0.01 and for number of attempts \(=\) 0.03.

Explanatory Models. Table 2 shows the linear model fitted over the complete data set for the improvement in unit test success. We observe that the time difference launch and the difference in size are positively correlated with the improvement in unit test success. These results support the assumption that students who made larger and less frequent changes to their code showed greater improvement in unit test success. Furthermore, Table 2 also shows the linear model fitted over the complete data set for the number of test runs. Here we observe that the time difference launch and the difference in code size are negatively correlated with the number of test runs. These results support the assumption that students who made larger and less frequent changes to their code had fewer test runs. The average marginal effects are shown in Table 3.

Table 3. Average marginal effects for the models shown in Table 2
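As an illustration of how such an explanatory linear model and its t-values could be obtained, the following sketch uses statsmodels with synthetic placeholder data; the software choice and the data are assumptions, not the paper's actual setup.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder data frame with the explanatory variables and one outcome.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "time_difference_launch": rng.normal(size=400),
    "time_difference_edit": rng.normal(size=400),
    "size_difference": rng.normal(size=400),
    "improvement_in_errors": rng.normal(size=400),
})
df["score_improvement"] = (0.3 * df["time_difference_launch"]
                           + 0.2 * df["size_difference"]
                           + rng.normal(scale=0.1, size=400))

model = smf.ols(
    "score_improvement ~ time_difference_launch + time_difference_edit"
    " + size_difference + improvement_in_errors",
    data=df,
).fit()
print(model.summary())  # coefficients with t-values and p-values
```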

4.1 Categorization

In order to explain the coding behavior of the students in more detail, we categorized the student population into three categories (i.e. intellects, thinkers, and probers) based on the total number of unit test runs by each student. Table 4 presents the number of students belonging to each category for every assignment, and Fig. 3 shows the changes in category between two consecutive assignments. Our assumptions for the three categories are listed below; a sketch of one possible categorization scheme follows. We would like to point out that the pragmatic sense of the category labels might differ from our interpretation in this paper:

 

1. Intellects: run tests less frequently, as they are skilled and confident.

2. Thinkers: run tests more frequently, to receive early feedback regarding progress.

3. Probers: run tests most frequently, as they experience difficulty.

 

We would like to point out that the categories are assigned per assignment and can differ from student to student, and even for the same student from one assignment to the next.
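One possible way to operationalize this categorization is sketched below; splitting the per-assignment distribution of total test runs at its tertiles is our assumption, as the paper does not state the exact cut points.

```python
import pandas as pd

def categorize(test_run_counts: pd.Series) -> pd.Series:
    """Assign 'intellects', 'thinkers' or 'probers' per student for one
    assignment, based on the total number of unit test runs.

    Splitting at the tertiles is an assumption; the paper does not
    specify the exact thresholds used.
    """
    return pd.qcut(test_run_counts, q=3,
                   labels=["intellects", "thinkers", "probers"])

# Example: total test runs for six (hypothetical) students in one assignment.
counts = pd.Series({"s1": 12, "s2": 45, "s3": 8, "s4": 30, "s5": 70, "s6": 22})
print(categorize(counts))
```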

Table 4. Number of students in the different categories for the separate assignments.

The Difference from the Perspective of the Three Categories. We present the differences between the three categories with respect to the explanatory and dependent variables (Table 6); a sketch of this type of analysis follows the list. These results hold for individual assignments as well (barring a few exceptions), as shown in Table 5.

1. Significant difference in the time between two student program launches (F [2,383] \(=\) 70.27, p \(=\) .00001): post-hoc pairwise comparisons show that intellects have a higher time difference than thinkers, and thinkers have a higher time difference than probers.

2. Significant difference in the change in code between two tests (F [2,383] \(=\) 198.85, p \(=\) .00001): post-hoc pairwise comparisons show that intellects have greater code change than thinkers, and thinkers have greater code change than probers.

3. Significant difference in the average improvement in success (F [2,383] \(=\) 121.51, p \(=\) .00001): post-hoc pairwise comparisons show that intellects have greater success improvements than thinkers, while thinkers have greater success improvements than probers.

4. Significant difference in the average change in the number of errors and warnings (F [2,383] \(=\) 5.79, p \(=\) .01): post-hoc pairwise comparisons depict that intellects reduce more errors than thinkers, while thinkers and probers show no significant difference in reducing the number of errors in the code.

5. Significant difference in average success in the first test run (F [2,383] \(=\) 16.60, p \(=\) .001): post-hoc pairwise comparisons show that intellects score higher in the first attempt than thinkers, while thinkers and probers show no significant difference in first test run scores.
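The following sketch reproduces the type of analysis reported above on synthetic placeholder data: a one-way ANOVA followed by post-hoc pairwise comparisons. Tukey's HSD is used here as an assumption, since the paper does not state which post-hoc test was applied.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Placeholder per-student averages of one measure (e.g. time between launches),
# grouped by category; group sizes chosen to match df = [2, 383] (N = 386).
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "category": ["intellects"] * 130 + ["thinkers"] * 130 + ["probers"] * 126,
    "time_between_launches": np.concatenate([
        rng.normal(300, 60, 130),  # intellects: longer waits (illustrative)
        rng.normal(180, 60, 130),  # thinkers
        rng.normal(90, 60, 126),   # probers
    ]),
})

groups = [g["time_between_launches"].values for _, g in df.groupby("category")]
print(f_oneway(*groups))                                               # overall F-test
print(pairwise_tukeyhsd(df["time_between_launches"], df["category"]))  # post-hoc
```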

Table 5. ANOVA results for difference measures for the three categories.
Table 6. Linear model for improvement with all the exercises combined in three data sets, one each for intellects, thinkers, and probers; bold t-values are significant (p < 0.01).

Figure 2 shows the explanatory variables for the three categories as a function of the number of test runs. Upon inspection of the left panels of Fig. 2, it is evident that there is a clear difference between the intellects (shown in red) and the remaining two categories in the time between two student program launches (test runs 5–10) and in the average improvement (test runs 15–25). The other differences are not as pronounced.

Fig. 2. Different measures for the three categories for each test run ID. (Color figure online)

Fig. 3. Students changing their strategies across the different assignments. a51 : 233 indicates that in assignment a5, there were 233 students in category 1. Category labels: 1 \(=\) intellects; 2 \(=\) thinkers; 3 \(=\) probers.

From the explanatory models for each category (Table 6), we observe that the behavior of the students in each category is subtly different from that of the other two. The intellects have two positively significant coefficients: the wait between two student program launches and the change in code size. This indicates that intellects take their time to alter the code and remove errors and bugs. The thinkers have only one positively significant coefficient: the wait between two student program launches. That means the thinkers take time to test, but nothing clear can be said about the other parameters. The probers have the change in code as a negative and significant coefficient, meaning that they make smaller code changes between two successive unit test runs.

Finally, it is to be expected that students belong to more than one category while attempting to solve the programming assignments. Figure 3 shows students moving between the categories intellects, thinkers, and probers across the different assignments. For example, the intellects are a larger group (163) than the thinkers (131) for assignment 5 (a5); for the next assignment (a6) we see that, similar to a5, the largest category is intellects, followed by similar numbers of thinkers and probers. Also, a large majority of intellects did not change category, while most thinkers and probers either stayed in the same category or interchanged categories.

5 Conclusion and Discussion

In this study we analyzed the programming patterns of 600 students in an introductory university course on object-oriented programming, using an Eclipse plug-in to collect data. The results of the analyses supported our two assumptions: (1) there are different programming strategies that lead students to success when attempting to solve coding exercises, and (2) we can identify low performers early. Using semantic-less measures of students’ coding and debugging behavior (e.g. time difference launch, time difference edit) and one code-based measure (i.e. growth in size), we managed to predict improvement in unit test success very early (by the fourth attempt) at a low granularity level of one student on one assignment. Our focus on semantic-less measures leads to better reproducibility and generalizability of the results, because we cannot, at least with the current state of the art, know without explicitly asking students whether they are experiencing difficulty with coding constructs (e.g. loops, recursion) or with the domain (e.g. Fibonacci numbers). Moreover, our study adds to the growing body of research utilizing low granularity data, in contrast to previous studies that provided predictive models which either looked at students at the level of the whole class or focused only on code-based variables [4, 26, 38]. In addition, none of the previous studies attempted early prediction.

Furthermore, we also presented a behavioral analysis of students practicing different programming strategies. Intellects as a group are characterized by the highest first test run score; the highest improvement in unit test success; the lowest total number of test runs among the three categories; the longest wait time between two student program launches; and the largest changes in the code between two unit tests. Thinkers are characterized by a low first unit test score; a short wait time between two successive student program launches; a smaller change in code size than the intellects but larger than the probers; and unit test success that is higher than the probers’ but lower than the intellects’. Finally, probers are characterized by a low first unit test score; the shortest wait time between two successive student program launches; the smallest code size change between two successive tests; and the least improvement in unit test success. The key difference between thinkers and probers is the amount of modification they make to the code in a similar duration of time. The thinkers appear to follow a strategy of fixing errors and bugs in the code, while the probers appear to employ a trial and error approach. This is also evident from Fig. 2 (bottom-left), where we can see that for a large number of attempts the probers show slow growth (close to 0.25, that is, 4 unit test runs to pass one unit test); whereas, after a certain number of test runs, students from the remaining two categories require only one or two test runs to pass one unit test. This exponential improvement is demonstrated earlier by the intellects than by the thinkers, indicating that intellects initially make fewer mistakes and hence require fewer test runs to pass the complete set of unit tests. However, thinkers show more regulated and informed testing behaviour than probers, and this might be a plausible explanation for why probers require more test runs to pass all of the unit tests. Indeed, from past studies we know that weaker students have less understanding of what is tested by each test, which makes them more likely to use a trial and error approach [25].

Finally, the prediction results presented in this study could support educators in providing motivational feedback that acts as an incentive for students to test their code a few more times before giving up. For example, we can predict the number of test runs a student would carry out at an early stage, and we can also predict their projected improvement in unit test success at each test run. Given the current TestRunID and unit test score of the student, we could provide him/her with a target number of test runs at his/her given pace of improvement, which might motivate the student to change strategy (from probing to thinking) or to continue testing the code (if he/she is relatively close to the target number of test runs).
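A simple heuristic for such a target is sketched below, under our own assumption that improvement continues at the student's current average pace; the paper does not prescribe a particular formula.

```python
import math

def target_test_runs(current_run_id: int, current_score: float,
                     avg_improvement_per_run: float) -> int:
    """Estimate how many unit test runs a student needs in total to pass
    all tests (score = 1.0), assuming the current average improvement
    per run continues. This linear extrapolation is a simplification.
    """
    if current_score >= 1.0:
        return current_run_id
    if avg_improvement_per_run <= 0:
        raise ValueError("no positive improvement trend to extrapolate from")
    remaining = (1.0 - current_score) / avg_improvement_per_run
    return current_run_id + math.ceil(remaining)

# Example: at the 4th run a student passes 50% of the tests and gains ~10% per run.
print(target_test_runs(4, 0.5, 0.1))  # -> 9
```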

Limitations and Future Work. Our approach carries a few limitations that we plan to overcome in future studies. First, this is a “black box” approach, because we do not examine the code itself; instead, we look at behavioral patterns during coding. In future work, we plan to analyze the mistakes made by the students and relate them to the corresponding strategic category. Second, we did not consider any semantic features computed from the code; incorporating code metrics into the analysis could improve the prediction results. Finally, we do not gather or utilize data about the students themselves (e.g. conscientiousness, SRL, exam performance) or their motivation during the course, which prevents us from providing personalized feedback at this stage. We therefore plan to incorporate this information in future studies in order to provide feedback that is not only timely and actionable, but personalized and adaptive as well.