
1 Introduction

Agrawal et al.’s study [1] suggests that grouping strong students with weaker students could improve student achievement overall. Lui et al.’s study [8] shows that in programming courses, identifying weak students and teaching them separately with a different method could lead to higher examination performance and improved confidence in programming. In contrast, companies hiring programmers have been shown to exclude exceptional candidates when students’ grade point averages were used to pre-screen or filter applicants [5]. The ability to distinguish experienced programmers from novices could allow companies to gain insight into candidates quickly and accurately, improving their hiring process.

Code stylometry is the study of code authorship and related attributes through the identification of features in code. Much research has been conducted in this field, with at least 57 publications on the topic between 1957 and 2020 [7]. However, most previous research focuses on attributing the author of a piece of code; very few studies classify attributes of the code author, such as coding experience. The main aim of this paper is to explore the use of code stylometry with machine learning to classify coding experience. Our three main research questions are:

  • RQ1. Can programmers’ experience levels be classified using features from their code?

  • RQ2. What code features improve the performance of a machine learning model to predict programmer experience?

  • RQ3. Which modelling techniques achieve the best classification performance for programmer experience?

A prediction model was developed using data from novice and expert programmers. Data was crawled from HackerRank (Footnote 1), a competitive programming and programming interview preparation site. We experimented with different configurations for fetching the data and different sets of extracted features to optimize model performance. Our final model achieves 71.3% accuracy in classifying expert and novice programmers.

2 Related Work

Halstead [6] proposed a set of metrics that measure the complexity of a piece of code, which is heavily affected by the algorithms used in the code. Halstead’s metrics have been reported to correlate strongly with student performance in computer programming courses [4]. Oman and Cook’s work [9] on code authorship attribution used a set of 16 metrics. These metrics identified features in code comments and formatting and were inspired by techniques used in English literature authorship analysis. By applying cluster analysis, Oman and Cook [9] were able to accurately categorize most code by author. Spafford and Weeber [10] analyzed code left in a computer system by a hacker in order to establish his or her identity. The features they proposed for source code or binary files include:

  • Language (The programming language used)

  • Formatting (e.g. number of statements per line)

  • Special features (such as code written for specific compilers)

  • Comments (Consistency, frequency and length)

  • Variable names (e.g. naming styles of local temporary variables)

  • Language features (e.g. use of for loops versus while loops)

  • Scoping (Global versus local identifiers, scope of helper functions)

  • Errors

  • Other metrics (Halstead metrics, McCabe metrics, etc.)

Burrows and Tahaghoghi [3] explored the use of n-grams for code authorship analysis. Using 1,640 files from 100 authors, they achieved 67.01% accuracy. Wisse and Veenman [11] combined 10 traditional structural, style and layout features with 4 different n-gram metrics to reach 91.4% accuracy with 10 authors and 85% accuracy with 45 authors. Bhattathiripad [2] proposes a different type of metric, programming blunders, for evaluating code. Bhattathiripad points out that most explored code features focus on the general coding style or algorithmic choices of the whole piece of code, and instead explores the use of programming blunders to identify code authorship. Examples of programming blunders include unused variables, unused imported libraries, or a section of code that never gets executed.

3 Methodology

3.1 Dataset

We chose HackerRank as the data source for this project because its code samples are single-file scripts and it is possible to attribute an experience level to each author. HackerRank is a company that provides programmer hiring solutions to technology companies through its coding test system. Its website offers coding test practice questions, and over 7 million programmers used the website in 2020. All submitted code is publicly visible, along with the programmer’s profile, and some programmers also provide a link to their LinkedIn profile. We used programmers’ LinkedIn profiles to gain insight into their levels of experience.

In this paper, we define a novice programmer as one who has received programming training but has no work experience, including part-time work or internships. We define an expert programmer as someone who has had full-time programming-related work experience. With this definition, developers with marginal experience, lying between novice and professional, are excluded, which we believe helps develop a more accurate model. This binary rather than continuous classification was chosen due to the limited size of the data set; in reality, a programmer’s expertise is not binary. The classifications were made by manually analyzing information on each programmer’s LinkedIn profile. All experiments used the same 199 code samples (one per author), of which 91 were classified as novice and 108 as expert.

3.2 Features

Initially, we selected 15 features (base feature set) to extract from the data set. These features are similar to those used in previous studies [9, 10] and include: 1) number of lines of code 2) ratio of empty lines to all lines 3) average length of lines 4–7) number and average length of line comments and block comments respectively 8–9) number and average length of the names of variables 10–15) number of if, for, do, while, switch and cast statements. These features are simple to extract and have been previously shown, by Oman and Cook [9], to reliably attribute code to authors.
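As an illustration, the sketch below extracts a subset of the base feature set from a C-style (e.g. Java) code sample using simple regular expressions. The function name, regex patterns and feature keys are our own illustrative choices, not the study’s actual extraction code.

    import re

    def extract_base_features(source: str) -> dict:
        """A minimal, illustrative extractor for part of the base feature set."""
        lines = source.splitlines()
        line_comments = re.findall(r"//(.*)", source)
        block_comments = re.findall(r"/\*(.*?)\*/", source, flags=re.DOTALL)

        def avg(values):
            return sum(values) / len(values) if values else 0.0

        return {
            "n_lines": len(lines),
            "empty_line_ratio": (sum(1 for ln in lines if not ln.strip()) / len(lines)
                                 if lines else 0.0),
            "avg_line_length": avg([len(ln) for ln in lines]),
            "n_line_comments": len(line_comments),
            "avg_line_comment_length": avg([len(c) for c in line_comments]),
            "n_block_comments": len(block_comments),
            "avg_block_comment_length": avg([len(c) for c in block_comments]),
            # Keyword-based statement counts (if, for, do, while, switch)
            **{f"n_{kw}": len(re.findall(rf"\b{kw}\b", source))
               for kw in ("if", "for", "do", "while", "switch")},
        }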

3.3 Machine Learning Models

Ten machine learning algorithms, shown in Table 1, were selected to be trained and tested in all experiments. A wide range of algorithm types was used, including linear models, neural networks and decision trees. This was done to identify the best model in each experiment, as well as which types of model generally perform better on the data set.

The models are built using the 10-fold cross-validation technique, and predictive performance is measured using accuracy, F1 and Matthews Correlation Coefficient (MCC).
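A minimal sketch of this evaluation setup, assuming scikit-learn, is shown below; the placeholder feature matrix stands in for the extracted features, and the hyperparameters are not taken from the study.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_validate

    # Placeholder data standing in for the 199 feature vectors (0 = novice, 1 = expert).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(199, 15))
    y = rng.integers(0, 2, size=199)

    scoring = {"accuracy": "accuracy", "f1": "f1", "mcc": "matthews_corrcoef"}
    scores = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=10, scoring=scoring)
    for name in scoring:
        print(name, scores[f"test_{name}"].mean())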

Table 1. Models used in experiments

4 Results

This section presents the key experiments; a full list of experiments is provided in the Appendix. We describe how we iteratively evolved the models through a series of experiments to improve predictive performance. All experiments were conducted on the same dataset described in the previous sections.

A replication package is available in a repository on GitHub (Footnote 2).

4.1 Exp0: First Experiment

This experiment was performed to establish a baseline against which subsequent experiments are compared. Using all 199 unmodified code samples and the initial set of features listed in the previous section, the average accuracy across all 10 models was 0.605, and the top 5 models averaged 0.643.

Exp1: Base Model, Code Cleaning. Following experiment 0, we examined the code samples and the values of the extracted features. We noticed that each code sample contains a section, referred to as general code, that is identical in every file, such as the code that prints the function result to standard output for evaluation; this code is provided by HackerRank in its code template. We repeated experiment 0 with all general code removed. This resulted in an average accuracy of 0.627 (a 3.67% increase) and 0.662 in the top 5 models (a 2.96% increase). The modified files with general code removed are used in all subsequent experiments.
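The cleaning step could be implemented along the following lines; this is a simple line-matching heuristic of our own, since the paper does not specify the exact removal procedure.

    def strip_general_code(sample: str, template: str) -> str:
        """Remove lines of a submission that also appear verbatim in the
        HackerRank template ("general code"). Illustrative heuristic only."""
        template_lines = {ln.strip() for ln in template.splitlines() if ln.strip()}
        kept = [ln for ln in sample.splitlines() if ln.strip() not in template_lines]
        return "\n".join(kept)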

Exp2: Halstead Metrics. This set of experiments compares model accuracy with and without Halstead metrics (volume, difficulty and effort). In experiment 2.1, only the Halstead metric values were extracted from the code samples. The models had an average accuracy of 0.548, with the top 5 averaging 0.570. In experiment 2.2, we combined the Halstead metrics with the set of features used in experiments 0 and 1. This increased the average model accuracy to 0.601, and 0.639 for the top 5 models, but these values were still lower than those from the previous experiments.
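For reference, the three Halstead metrics used here can be computed from operator and operand counts as sketched below; how tokens are classified as operators versus operands is language-specific, and this function is only illustrative.

    import math

    def halstead_metrics(operators, operands):
        """Volume, difficulty and effort from lists of operator/operand tokens."""
        n1, n2 = len(set(operators)), len(set(operands))   # distinct operators/operands
        N1, N2 = len(operators), len(operands)             # total operators/operands
        vocabulary, length = n1 + n2, N1 + N2
        volume = length * math.log2(vocabulary) if vocabulary else 0.0
        difficulty = (n1 / 2) * (N2 / n2) if n2 else 0.0
        effort = difficulty * volume
        return {"volume": volume, "difficulty": difficulty, "effort": effort}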

Exp3: Cyclomatic Complexity. In this experiment, we added cyclomatic complexity to the features extracted from the code samples. The models had an average accuracy of 0.630, with the top 5 averaging 0.670. The inclusion of cyclomatic complexity resulted in a slight increase in accuracy.
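A rough, keyword-counting approximation of cyclomatic complexity is sketched below; a faithful implementation would build a control-flow graph, and the study’s actual extraction method may differ.

    import re

    def cyclomatic_complexity(source: str) -> int:
        """Approximate McCabe complexity: one plus the number of branching constructs."""
        branch_pattern = r"\b(?:if|for|while|case|catch)\b|\?|&&|\|\|"
        return 1 + len(re.findall(branch_pattern, source))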

Exp4: Style Features. Two new style-related features were included in this experiment: cond_space_ratio and bracket_line_ratio. cond_space_ratio is the ratio of cases where a space is absent before the opening bracket of a condition (e.g. "if(") to cases where the space is present (e.g. "if ("). bracket_line_ratio is the ratio of cases where the opening curly bracket is placed on the current line (e.g. ") {") to cases where it is placed on a new line (e.g. ")<line break>{"). The average accuracy of all models was 0.623, with the top 5 reaching 0.669. The top 5 models with style features performed slightly better than the top 5 models from the base experiment.
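The two ratios could be computed with simple regular expressions, as in the sketch below; the exact patterns and the handling of zero denominators in the study may differ.

    import re

    def style_ratios(source: str) -> dict:
        """Illustrative versions of cond_space_ratio and bracket_line_ratio."""
        no_space = len(re.findall(r"\b(?:if|for|while|switch)\(", source))
        with_space = len(re.findall(r"\b(?:if|for|while|switch)[ \t]+\(", source))
        same_line = len(re.findall(r"\)[ \t]*\{", source))           # brace on the same line
        new_line = len(re.findall(r"\)[ \t]*\r?\n[ \t]*\{", source))  # brace on the next line
        return {
            "cond_space_ratio": no_space / with_space if with_space else float(no_space),
            "bracket_line_ratio": same_line / new_line if new_line else float(same_line),
        }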

Exp5: Normalization. In this set of experiments, normalization techniques were applied to the dataset. In experiment 5.1, a min-max scaler was applied, resulting in average accuracies of 0.646 across all models and 0.667 for the top 5. A standard scaler was applied in experiment 5.2, resulting in an average accuracy of 0.650 across all models and 0.683 for the top 5. Applying the standard scaler resulted in improvements of 3.67% and 6.89% in the average accuracy for all models and the top 5 models, respectively.
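Assuming scikit-learn, the two normalization variants look as follows on a toy feature matrix; within cross-validation the scaler should be fitted on the training folds only, e.g. via a pipeline as in the final-model sketch below.

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X = np.array([[12.0, 0.10, 34.5],
                  [80.0, 0.25, 41.0],
                  [45.0, 0.05, 28.2]])               # toy feature rows

    X_minmax = MinMaxScaler().fit_transform(X)       # Exp5.1: each feature scaled to [0, 1]
    X_standard = StandardScaler().fit_transform(X)   # Exp5.2: zero mean, unit variance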

Exp6: Final Model. Drawing on the results of all previous experiments, a final experiment was conducted. It used the same dataset as the previous experiments, with general code removed. Halstead metrics were not included as features, while cyclomatic complexity and the style features were; a standard scaler was also applied to the dataset. With this configuration, the final set of models achieved an average accuracy of 0.653, with the top 5 averaging 0.678. The best model, logistic regression, reached an accuracy of 0.713. The models’ accuracy, F1 and MCC values are shown in Fig. 1. The coefficients of the best performing model are shown in Fig. 2.
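A minimal sketch of the final configuration, assuming scikit-learn, is given below. The placeholder feature matrix stands in for the base features plus cyclomatic complexity and the two style features (Halstead metrics omitted), and the logistic regression hyperparameters are not taken from the paper.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_validate
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Placeholder data: one row per cleaned code sample, 0 = novice, 1 = expert.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(199, 18))
    y = rng.integers(0, 2, size=199)

    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_validate(model, X, y, cv=10,
                            scoring={"accuracy": "accuracy", "f1": "f1",
                                     "mcc": "matthews_corrcoef"})
    print({k: round(v.mean(), 3) for k, v in scores.items() if k.startswith("test_")})
    # The fitted coefficients (cf. Fig. 2) can be read from
    # model.fit(X, y)[-1].coef_ after training on the full data.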

Fig. 1. Cross-validation scores of all models in experiment 6

Fig. 2. Coefficients found in the final logistic regression model

5 Ethical Concerns

Given that the application is designed to give corporations the ability to determine a person’s programming experience by analyzing their code, several ethical concerns arise.

The pre-selection of job applicants is the biggest of these concerns. While such pre-selection methods are not always accurate, and can cause talented candidates to be overlooked or unsuitable candidates to be hired, it can be argued that they save time for both the company and the candidates, making the application process more efficient.

The central question is whether this pre-selection is fair. Because the model was trained on a dataset, biases present in the dataset can be carried over to the final model during training. These biases could stem from the data source itself, or from human-induced or systematic bias during data collection. Such biases could damage the reliability of the model, as well as its fairness, as they could advantage or disadvantage a group of people whose code exhibits features picked up by the model but unrelated to their development experience.

The model developed in this project is not ready for real-world deployment. Not only does it have low accuracy, it has also not been studied thoroughly enough to establish whether it contains biases. Until these concerns are investigated and addressed, such a system should not be deployed.

6 Conclusion

6.1 Review of the Research Questions

RQ1. Can Programmers’ Experience Levels Be Classified Using Features from Their Code? A logistic regression model trained on 199 code samples achieved 71.3% accuracy in distinguishing expert from novice programmers, suggesting that experience levels can, to a degree, be classified from code features.

RQ2. What Code Features Improve the Performance of a Machine Learning Model to Predict Programmer Experience? Based on the final model’s coefficient values, the top 5 features influencing the classification are: the average length of block comments, the ratio of opening brackets without a preceding space to those with one, the ratio of empty lines to all lines, the average line length, and the average comment length.

RQ3. Which Modelling Techniques Achieve the Best Classification Performance for Programmer Experience? Comparing the results of all experiments, logistic regression performed best in the final experiment, as well as in most of the others.

6.2 Further Work

A More General and Larger Data Set. The current data set is deliberately specific and focused, which helped narrow the scope of this paper. For real-world use, however, the data set would need to be more general, so that different types of code could be analyzed. In addition, a larger data set could further improve the model’s accuracy.

Experiments on Features and Models. Most of the features used in this paper are syntactic, chosen because they are easy to extract. Further experiments could be carried out on more complex features, such as n-grams and abstract syntax tree features, and on different combinations of those features, to assess their effect on model accuracy. Furthermore, while the set of models selected for this study is rather diverse, further work could experiment with different configurations of those models, or test new models, in an attempt to achieve better results. More work could also be done on the final model: the correlations found between features and the classification could be studied further, and the model could be tested in extreme cases and against possible biases.