
1 Introduction

Agrawal et al.’s study [1] suggests that grouping strong students with weaker students could improve student achievement overall. Lui et al.’s study [8] shows that in programming courses, identifying weak students and teaching them separately with a different method could lead to higher examination performance and improved confidence in programming. In contrast, companies hiring programmers have been shown to exclude exceptional candidates when students’ grade point averages were used to pre-screen or filter applicants [5]. The ability to distinguish experienced programmers from novices could allow companies to gain insight into candidates quickly and accurately, improving their hiring process.

Code stylometry is the study of code authorship and related attributes through the identification of features in code. Much research has been conducted in this field, with at least 57 publications on the topic between 1957 and 2020 [7]. However, most previous research focuses on attributing the author of a piece of code; very few studies classify attributes of the code author, such as coding experience. The main aim of this paper is to explore the use of code stylometry with machine learning to classify coding experience. Our three main research questions are:

  • RQ1. Can programmers’ experience levels be classified using features from their code?

  • RQ2. What code features improve the performance of a machine learning model to predict programmer experience?

  • RQ3. Which modelling techniques achieve the best classification performance for programmer experience?

A prediction model was developed using data from novice and expert programmers. Data was crawled from HackerRank (Footnote 1), a competitive programming and programming interview preparation site. We experimented with different configurations for fetching the data and different sets of extracted features to optimize model performance. Our final model achieves 71.3% accuracy in classifying expert and novice programmers.

2 Related Work

Halstead [6] proposed a set of metrics that measure the complexity of a piece of code, which is heavily affected by the algorithms used in the code. Halstead’s metrics have been reported to correlate strongly with student performance in computer programming courses [4]. Oman and Cook’s work [9] on code authorship attribution used a set of 16 metrics. These metrics identified features in code comments and formatting and were inspired by techniques used in English literature authorship analysis. By applying cluster analysis, Oman and Cook [9] were able to accurately categorize most code by author. Spafford and Weeber [10] analyzed code left in a computer system by a hacker in order to establish his or her identity. The features they proposed for source code or binary files include:

  • Language (The programming language used)

  • Formatting (e.g. number of statements per line)

  • Special features (such as code written for specific compilers)

  • Comments (Consistency, frequency and length)

  • Variable names (e.g. naming styles of local temporary variables)

  • Language features (e.g. use of for loops versus while loops)

  • Scoping (Global versus local identifiers, scope of helper functions)

  • Errors

  • Other metrics (Halstead metrics, McCabe metrics, etc.)

Burrows and Tahaghoghi [3] explored the use of n-grams for code authorship analysis. Using 1,640 files from 100 authors, they achieved 67.01% accuracy. Wisse and Veenman [11] combined 10 traditional structural, style and layout features with 4 different n-gram metrics to reach 91.4% accuracy with 10 authors and 85% accuracy with 45 authors. Bhattathiripad [2] proposes a different type of metric, programming blunders, for evaluating code. Bhattathiripad points out that most explored code features focus on the general coding style or algorithmic choices of the whole piece of code, and instead explores the use of programming blunders to identify code authorship. Examples of programming blunders include unused variables, unused imported libraries, or a section of code that never gets executed.

3 Methodology

3.1 Dataset

We chose HackerRank as the data source for this project because its code samples are single-file scripts and it is possible to attribute an experience level to each author. HackerRank is a company that provides programmer hiring solutions to technology companies through its coding test system. Its website offers coding test practice questions, and over 7 million programmers used the website in 2020. All submitted code is publicly visible, along with the programmer’s profile, and some programmers also provide a link to their LinkedIn profile. We used programmers’ LinkedIn profiles to gain insight into their levels of experience.

In this paper, we define a novice programmer as one who has received programming training but has no work experience, including part-time work or internships. We define an expert programmer as someone who has had full-time programming-related work experience. With this definition, developers with marginal experience, lying between novice and professional, are excluded, which we believe helps develop a more accurate model. This binary rather than continuous classification was chosen due to the limited size of the data set; in reality, a programmer’s expertise is not binary. The classifications were made by manually analyzing information on each programmer’s LinkedIn profile. All experiments used the same 199 code samples (one per author), of which 91 were classified as novice and 108 as expert.

3.2 Features

Initially, we selected 15 features (base feature set) to extract from the data set. These features are similar to those used in previous studies [9, 10] and include: 1) number of lines of code 2) ratio of empty lines to all lines 3) average length of lines 4–7) number and average length of line comments and block comments respectively 8–9) number and average length of the names of variables 10–15) number of if, for, do, while, switch and cast statements. These features are simple to extract and have been previously shown, by Oman and Cook [9], to reliably attribute code to authors.
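As an illustration, the sketch below extracts a subset of the base feature set from a C-style (e.g. Java) code sample using simple regular expressions. The function name, regex patterns and feature keys are our own illustrative choices, not the study’s actual extraction code.

    import re

    def extract_base_features(source: str) -> dict:
        """A minimal, illustrative extractor for part of the base feature set."""
        lines = source.splitlines()
        line_comments = re.findall(r"//(.*)", source)
        block_comments = re.findall(r"/\*(.*?)\*/", source, flags=re.DOTALL)

        def avg(values):
            return sum(values) / len(values) if values else 0.0

        return {
            "n_lines": len(lines),
            "empty_line_ratio": (sum(1 for ln in lines if not ln.strip()) / len(lines)
                                 if lines else 0.0),
            "avg_line_length": avg([len(ln) for ln in lines]),
            "n_line_comments": len(line_comments),
            "avg_line_comment_length": avg([len(c) for c in line_comments]),
            "n_block_comments": len(block_comments),
            "avg_block_comment_length": avg([len(c) for c in block_comments]),
            # Keyword-based statement counts (if, for, do, while, switch)
            **{f"n_{kw}": len(re.findall(rf"\b{kw}\b", source))
               for kw in ("if", "for", "do", "while", "switch")},
        }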

3.3 Machine Learning Models

Ten machine learning algorithms, shown in Table 1, were selected to be trained and tested in all experiments. A wide range of algorithm types was used, including linear models, neural networks and decision trees. This was done to identify the best model in each experiment, as well as which types of model generally perform better on the data set.

The models are built using the 10-fold cross-validation technique, and predictive performance is measured using accuracy, F1 and Matthews Correlation Coefficient (MCC).
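A minimal sketch of this evaluation setup, assuming scikit-learn, is shown below; the placeholder feature matrix stands in for the extracted features, and the hyperparameters are not taken from the study.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_validate

    # Placeholder data standing in for the 199 feature vectors (0 = novice, 1 = expert).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(199, 15))
    y = rng.integers(0, 2, size=199)

    scoring = {"accuracy": "accuracy", "f1": "f1", "mcc": "matthews_corrcoef"}
    scores = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=10, scoring=scoring)
    for name in scoring:
        print(name, scores[f"test_{name}"].mean())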

Table 1. Models used in experiments

4 Results

This section presents the key experiments; a full list of experiments is provided in the Appendix. We describe how we iteratively evolved the models through a series of experiments to improve predictive performance. All experiments were conducted on the same dataset described in the previous sections.

A replication package is available in a repository on GitHub (Footnote 2).

4.1 Exp0: First Experiment

This experiment was performed to establish a baseline against which subsequent experiments are compared. Using all 199 unmodified code samples and the initial set of features listed in the previous section, the average accuracy across all 10 models was 0.605, and the top 5 models averaged 0.643.

Exp1: Base Model, Code Cleaning. Following experiment 0, we examined the code samples and the values of the extracted features. We noticed that each code sample contains a section, referred to as general code, that is identical in every file, such as the code that prints the function result to standard output for evaluation; this code is provided by HackerRank in its code template. We repeated experiment 0 with all general code removed. This resulted in an average accuracy of 0.627 (a 3.67% increase) and 0.662 in the top 5 models (a 2.96% increase). The modified files with general code removed are used in all subsequent experiments.
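The cleaning step could be implemented along the following lines; this is a simple line-matching heuristic of our own, since the paper does not specify the exact removal procedure.

    def strip_general_code(sample: str, template: str) -> str:
        """Remove lines of a submission that also appear verbatim in the
        HackerRank template ("general code"). Illustrative heuristic only."""
        template_lines = {ln.strip() for ln in template.splitlines() if ln.strip()}
        kept = [ln for ln in sample.splitlines() if ln.strip() not in template_lines]
        return "\n".join(kept)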

Exp2: Halstead Metrics. This set of experiments compares model accuracy with and without Halstead metrics (volume, difficulty and effort). In experiment 2.1, only the Halstead metric values were extracted from the code samples. The models had an average accuracy of 0.548, with the top 5 averaging 0.570. In experiment 2.2, we combined the Halstead metrics with the set of features used in experiments 0 and 1. This increased the average model accuracy to 0.601, and 0.639 for the top 5 models, but these values were still lower than those from the previous experiments.
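For reference, the three Halstead metrics used here can be computed from operator and operand counts as sketched below; how tokens are classified as operators versus operands is language-specific, and this function is only illustrative.

    import math

    def halstead_metrics(operators, operands):
        """Volume, difficulty and effort from lists of operator/operand tokens."""
        n1, n2 = len(set(operators)), len(set(operands))   # distinct operators/operands
        N1, N2 = len(operators), len(operands)             # total operators/operands
        vocabulary, length = n1 + n2, N1 + N2
        volume = length * math.log2(vocabulary) if vocabulary else 0.0
        difficulty = (n1 / 2) * (N2 / n2) if n2 else 0.0
        effort = difficulty * volume
        return {"volume": volume, "difficulty": difficulty, "effort": effort}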

Exp3: Cyclomatic Complexity. In this experiment, we added cyclomatic complexity to the features extracted from the code samples. The models had an average accuracy of 0.630, with the top 5 averaging 0.670. The inclusion of cyclomatic complexity resulted in a slight increase in accuracy.
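A rough, keyword-counting approximation of cyclomatic complexity is sketched below; a faithful implementation would build a control-flow graph, and the study’s actual extraction method may differ.

    import re

    def cyclomatic_complexity(source: str) -> int:
        """Approximate McCabe complexity: one plus the number of branching constructs."""
        branch_pattern = r"\b(?:if|for|while|case|catch)\b|\?|&&|\|\|"
        return 1 + len(re.findall(branch_pattern, source))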

Exp4: Style Features. Two new style-related features were included in this experiment: cond_space_ratio and bracket_line_ratio. cond_space_ratio is the ratio of cases where a space is absent before the opening bracket of a condition (e.g. "if(") to cases where the space is present (e.g. "if ("). bracket_line_ratio is the ratio of cases where the opening curly bracket is placed on the current line (e.g. ") {") to cases where it is placed on a new line (e.g. ")<line break>{"). The average accuracy of all models was 0.623, with the top 5 reaching 0.669. The top 5 models with style features performed slightly better than the top 5 models from the base experiment.
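The two ratios could be computed with simple regular expressions, as in the sketch below; the exact patterns and the handling of zero denominators in the study may differ.

    import re

    def style_ratios(source: str) -> dict:
        """Illustrative versions of cond_space_ratio and bracket_line_ratio."""
        no_space = len(re.findall(r"\b(?:if|for|while|switch)\(", source))
        with_space = len(re.findall(r"\b(?:if|for|while|switch)[ \t]+\(", source))
        same_line = len(re.findall(r"\)[ \t]*\{", source))           # brace on the same line
        new_line = len(re.findall(r"\)[ \t]*\r?\n[ \t]*\{", source))  # brace on the next line
        return {
            "cond_space_ratio": no_space / with_space if with_space else float(no_space),
            "bracket_line_ratio": same_line / new_line if new_line else float(same_line),
        }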

Exp5: Normalization. In this set of experiments, normalization techniques were applied to the dataset. In experiment 5.1, a min-max scaler was applied, resulting in average accuracies of 0.646 across all models and 0.667 for the top 5. A standard scaler was applied in experiment 5.2, resulting in an average accuracy of 0.650 across all models and 0.683 for the top 5. Applying the standard scaler resulted in improvements of 3.67% and 6.89% in the average accuracy for all models and the top 5 models, respectively.
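Assuming scikit-learn, the two normalization variants look as follows on a toy feature matrix; within cross-validation the scaler should be fitted on the training folds only, e.g. via a pipeline as in the final-model sketch below.

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X = np.array([[12.0, 0.10, 34.5],
                  [80.0, 0.25, 41.0],
                  [45.0, 0.05, 28.2]])               # toy feature rows

    X_minmax = MinMaxScaler().fit_transform(X)       # Exp5.1: each feature scaled to [0, 1]
    X_standard = StandardScaler().fit_transform(X)   # Exp5.2: zero mean, unit variance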

Exp6: Final Model. Drawing on the results of all previous experiments, a final experiment was conducted. It used the same dataset as the previous experiments, with general code removed. Halstead metrics were not included as features, while cyclomatic complexity and the style features were; a standard scaler was also applied to the dataset. With this configuration, the final set of models achieved an average accuracy of 0.653, with the top 5 averaging 0.678. The best model, logistic regression, reached an accuracy of 0.713. The models’ accuracy, F1 and MCC values are shown in Fig. 1. The coefficients of the best performing model are shown in Fig. 2.
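A minimal sketch of the final configuration, assuming scikit-learn, is given below. The placeholder feature matrix stands in for the base features plus cyclomatic complexity and the two style features (Halstead metrics omitted), and the logistic regression hyperparameters are not taken from the paper.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_validate
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Placeholder data: one row per cleaned code sample, 0 = novice, 1 = expert.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(199, 18))
    y = rng.integers(0, 2, size=199)

    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_validate(model, X, y, cv=10,
                            scoring={"accuracy": "accuracy", "f1": "f1",
                                     "mcc": "matthews_corrcoef"})
    print({k: round(v.mean(), 3) for k, v in scores.items() if k.startswith("test_")})
    # The fitted coefficients (cf. Fig. 2) can be read from
    # model.fit(X, y)[-1].coef_ after training on the full data.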

Fig. 1. Cross-validation scores of all models in experiment 6

Fig. 2. Coefficients found in the final logistic regression model

5 Ethical Concerns

Given that the application is designed to give corporations the ability to determine a person’s programming experience by analyzing their code, several ethical concerns arise.

The pre-selection of job applicants is the biggest of these concerns. While such pre-selection methods are not always accurate, and can cause talented candidates to be overlooked or unsuitable candidates to be hired, it can be argued that they save time for both the company and the candidates, making the application process more efficient.

The central question is whether this pre-selection is fair. Because the model was trained on a dataset, biases present in the dataset can be carried over to the final model during training. These biases could stem from the data source itself, or from human-induced or systematic bias during data collection. Such biases could damage the reliability of the model, as well as its fairness, as they could advantage or disadvantage a group of people whose code exhibits features picked up by the model but unrelated to their development experience.

The model developed in this project is not ready for real-world deployment. Not only does it have low accuracy, it has also not been studied thoroughly enough to establish whether it contains biases. Until these concerns are investigated and addressed, such a system should not be deployed.

6 Conclusion

6.1 Review of the Research Questions

RQ1. Can Programmers’ Experience Levels Be Classified Using Features from Their Code? A logistic regression model trained on 199 code samples achieved 71.3% accuracy in distinguishing expert from novice programmers, suggesting that experience levels can, to a degree, be classified from code features.

RQ2. What Code Features Improve the Performance of a Machine Learning Model to Predict Programmer Experience? Based on the final model’s coefficient values, the top 5 features influencing the classification are: the average length of block comments, the ratio of opening brackets without a preceding space to those with one, the ratio of empty lines to all lines, the average line length, and the average comment length.

RQ3. Which Modelling Techniques Achieve the Best Classification Performance for Programmer Experience? Comparing the results of all experiments, logistic regression performed best in the final experiment, as well as in most of the others.

6.2 Further Work

A More General and Larger Data Set. The current data set is deliberately specific and focused, which helped narrow the scope of this paper. For real-world use, however, the data set would need to be more general, so that different types of code could be analyzed. In addition, a larger data set could further improve the model’s accuracy.

Experiments on Features and Models. Most of the features used in this paper are syntactic, chosen because they are easy to extract. Further experiments could be carried out on more complex features, such as n-grams and abstract syntax tree features, and on different combinations of those features, to assess their effect on model accuracy. Furthermore, while the set of models selected for this study is rather diverse, further work could experiment with different configurations of those models, or test new models, in an attempt to achieve better results. More work could also be done on the final model: the correlations found between features and the classification could be studied further, and the model could be tested in extreme cases and against possible biases.