
1 Introduction

Efficient code is in high demand. Many factors determine the efficiency of a program, and execution time is one of them. The execution time of a program is the time it takes to run to completion without error. Importantly, the actual execution time is machine-dependent: it depends on factors such as parallelism, processor utilisation and CPU overhead. Instead of measuring the actual time taken for a program to complete, time complexity labels are used. A time complexity label is not the exact running time of a code snippet but a quantification of it as a function of the input length.

The process of expressing execution time as a function of the input size is called asymptotic analysis. Several asymptotic notations are used to express time complexity: Big-O, Omega and Theta. Big-O gives an upper bound on the execution time, Omega gives a lower bound, and Theta gives both an upper and a lower bound, which is why it is also known as the tight bound. In our work, we use the Big-O notation to express time complexity.
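For reference, the three notations are commonly defined as follows. Here $f(n)$ denotes the running time on an input of length $n$ and $g(n)$ is the comparison function, both non-negative; these are the standard textbook definitions and the symbols are not taken from this work.

```latex
\begin{align*}
f(n) \in O(g(n))      &\iff \exists\, c > 0,\ n_0 \ \text{such that}\ f(n) \le c\, g(n) \ \text{for all}\ n \ge n_0 \\
f(n) \in \Omega(g(n)) &\iff \exists\, c > 0,\ n_0 \ \text{such that}\ f(n) \ge c\, g(n) \ \text{for all}\ n \ge n_0 \\
f(n) \in \Theta(g(n)) &\iff f(n) \in O(g(n)) \ \text{and}\ f(n) \in \Omega(g(n))
\end{align*}
```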

Finding the time complexity of a piece of code has many applications. It can be used on online coding platforms to help users improve their solutions and eventually reach the optimal one. Knowing an algorithm’s time complexity can also help developers write better code. Most of the existing solutions online do not support static analysis. Instead, the user is required to design test cases of widely varying input sizes, and the corresponding Big-O notation is then estimated from the total execution time of the algorithm on those inputs. This process is time-consuming and tedious. Examples include Big-O Calculator, a library that supports Big-O calculation on the Coderbyte platform, and “Big-O”, a Python package restricted to Python programs.

The few solutions that do involve static analysis have significant limitations. For instance, [1] uses text parsing and works only for programs with loops. Standard procedures such as the Master theorem and the recursion tree method work only for recursion-based algorithms; they must be applied manually and are time-consuming. Hence, an automated solution is required.

We aim to solve the problem at hand using machine learning (ML) and deep learning (DL) techniques. The dataset is collected from various sources available on the internet. As a part of the approach to solving the problem, the code is converted into an Abstract Syntax Tree (AST) to get graph embeddings rather than directly obtaining the word embeddings from the code.

The nodes from the AST are extracted, and a directed graph is built. This graph is then converted into a graph embedding using Graph2Vec. We have converted the problem at hand into a graph classification problem; hence, we use classification-based ML and DL algorithms.

2 Related Work

Sikka et al. [2] presented an approach in which ASTs are used to extract hand-engineered features and code embeddings from a code snippet. The authors built a new dataset called CoRCoD, consisting of 933 Java codes, and evaluated various ML algorithms; an SVM trained on 1024-dimensional graph2vec code embeddings gave the highest accuracy amongst all the models. Only Java codes were used. Agenis-Nevers et al. [3] presented GuessCompx, an R package which empirically estimates both the space and time complexity of a given algorithm. The package takes multiple test samples of increasing size and fits the best complexity label based on running time. This process is tedious, as the user must design samples of various sizes. Hutter et al. [4] proposed using regression-based machine learning techniques to predict the actual runtime of a program. All the runtimes were recorded in the same environment (CPU configuration, etc.). This setup is restrictive because the same environment must be reproduced for the model results to be reliable.

Haridas et al. [5] worked on the representation of C/C++ programs as graphs. They formulated a method that utilised a neural tensor network to combine results from the GNN and time capsules. This fusion improved the accuracy of the model, which predicted the similarity of a given software code to a set of codes. Although the accuracy of the similarity value was high, the trade-off was the runtime. Gao et al. [6] worked on predicting the runtime performance of DL models and neural architecture search algorithms using GNNs, and proposed a tool called DNNPerf. Their model accepted a DL model file, a model configuration specification and a runtime specification as input, and reported the runtime performance values as the output. For models that relied on proprietary components (NVIDIA, CUDA, etc.), the internal implementation details were hidden, which made it difficult to arrive at the exact runtime performance metric.

Chen et al. [7] worked on efficiently capturing code semantics with an API-based AST. Zhang et al. [8] proposed a way to fit huge ASTs into memory by splitting them block-wise and processing the blocks with neural networks. Lin et al. [9] worked on the block-wise splitting of code into different sections and then processing them. A combination of these techniques [8, 9] can be adopted in our study if the generated ASTs become too large to fit into memory. A tool called “J-CEL” [10] attempts to show Big-O notations as graphs for a visual representation of Java codes. Code clone detection has been performed using RNNs that take in embeddings using Siamese networks and LSTMs [11], and using flow-augmented ASTs [12], for comparing the similarity of codes. These works were useful in our research for generating ASTs in C and Java, respectively. Our study was originally intended to be based on GNNs, which we now plan as future work; in this direction, Feng et al. [13] used GNNs to predict vulnerabilities in functions of programs. Reza et al. [14] worked on predicting code complexity in terms of lines of code, depth of inheritance tree, object coupling, etc., using ML techniques. This research was studied in detail due to its similarity with our study. Guzman et al. [15] worked on a test bench of 30 Python programs that were tested for time complexity correctness using Big Theta time complexity approximation.

3 Data Collection and Labelling

The dataset contains a large collection of code samples taken from various sites such as GeeksforGeeks, Tutorialspoint, etc. These websites contain sample codes for various data structures and algorithms, only some of which are labelled with their runtime complexity. If a code sample was not labelled with its runtime complexity, it was analysed by two people before a label was assigned.

“AProVE” [16] is a tool implemented to calculate the runtime complexities of Java codes and Jar files. This tool was initially explored to check whether it could label our codes. Due to its limited functionality and the prerequisite that inputs be given in the form of certain parameters, the codes scraped for our dataset could not be labelled by this tool.

Further research was conducted into the Termination and Complexity Calculation competition to obtain tools to label our dataset. Since most of the tools were not open source and had a lot of dependencies, we chose to manually label our dataset.

The dataset contains a total of 10 runtime complexity labels, with the codes divided by the language they are written in (C, Java, or Python). The dataset currently contains 769 codes, of which 41.22% are Java, 42.78% are Python and 15.99% are C codes. The codes cover data structures such as Arrays, Stacks, Queues, Linked Lists, Trees and Graphs. Each code sample is a single file which includes the source website of the code, the code itself and the contributor’s name wherever it was mentioned. This helped to prevent duplication of codes while assigning them to their respective time complexity labels.

The 10 different complexity classes represented as Big-O notations used in this study are O(1), O(N), O(N^2), O(N^3), O(log(N)), O(N log(N)), O(N * d), O(2^n), O(N!) (N factorial) and O(sqrt(N)) (square root of N) for C, Java and Python as shown in Fig. 1. The entire workflow of the study is shown in Fig. 2.

Fig. 1
An organizational chart of the dataset. It is categorized as C, JAVA, and PYTHON. Each category is listed further.

Number of code snippets for each label, programming language-wise

Fig. 2
A flow diagram flows as follows. Identify the programming language, convert the program to the corresponding AST, extract nodes and construct a directed graph, convert the graph to a graph embedding, pass the embeddings to the trained model, and obtain the Big-O notation of the algorithm.

Workflow of the proposed approach

4 Proposed Approach

4.1 Programming Language Identification

For the study, three high-level programming languages were used: C, Java and Python. The language of a program can be identified using either GitHub’s language identifier or a Python package called “Guesslang.” The language can also be identified from the file extension. It is important to classify the codes by programming language because the packages for pre-processing and AST generation are different for each language.
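The extension-based option can be sketched in a few lines. This is a minimal illustration only: the mapping and the function name `identify_language` are hypothetical, and the extensions are simply the conventional ones for the three languages.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping: conventional extensions for the three languages
# in the study, not a list taken from the paper.
EXTENSION_TO_LANGUAGE = {".c": "C", ".java": "Java", ".py": "Python"}


def identify_language(filename: str) -> Optional[str]:
    """Return "C", "Java" or "Python" from the file extension, else None."""
    return EXTENSION_TO_LANGUAGE.get(Path(filename).suffix.lower())


language = identify_language("bubble_sort.py")
```

A file without a recognised extension yields `None`, in which case a content-based identifier such as the ones named above would be the fallback.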

4.2 AST Generation

An Abstract Syntax Tree is a tree representation of the syntactic structure of a program. It does not represent every detail of the code; it contains only content-related information. After the language was identified, the corresponding Python package was used to obtain the AST representation. ASTs for C codes were obtained using “pycparser,” a Python package which accepts C code in the form of a file or a string and generates the corresponding AST. The C code was first preprocessed with the fake headers from the “pycparser” repository included, and the preprocessed code was then used to generate the AST. The typedef nodes and their children were removed from the tree as they do not contribute to determining the time complexity of the code snippet. Similarly, “javalang” was used to construct the AST for Java codes, and Python’s built-in ast module was used for Python codes. Each of these AST representations was traversed, and all the nodes were obtained. While obtaining the nodes, each one was indexed using the index of its current parent together with a random number, so that each node has a unique index. This is a requirement for constructing graphs in the next step of the solution. Figures 3, 4 and 5 show the AST representations produced by the Python libraries for all three programming languages for a simple program that displays “Hello World.” Each node contains a unique number as the index and the value of the node.
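For the Python case, the traversal and indexing step might look like the sketch below. It uses only the standard ast module. The function name `extract_nodes` and the label format are assumptions made for illustration, and a sequential counter stands in for the parent-plus-random-number indexing described above.

```python
import ast
from itertools import count


def extract_nodes(source: str):
    """Parse Python source and return (nodes, edges).

    nodes maps a unique integer index to a label (the AST class name, plus
    the value for names and constants); edges lists (parent, child) pairs.
    """
    tree = ast.parse(source)
    next_index = count()  # assumption: sequential ids instead of random ones
    nodes, edges = {}, []

    def label(node: ast.AST) -> str:
        name = type(node).__name__
        if isinstance(node, ast.Name):
            return f"{name}:{node.id}"
        if isinstance(node, ast.Constant):
            return f"{name}:{node.value!r}"
        return name

    def visit(node: ast.AST, parent) -> None:
        index = next(next_index)
        nodes[index] = label(node)
        if parent is not None:
            edges.append((parent, index))
        for child in ast.iter_child_nodes(node):
            visit(child, index)

    visit(tree, None)
    return nodes, edges


nodes, edges = extract_nodes('print("Hello World")')
```

Because every node except the root has exactly one parent, the edge list always has one entry fewer than the node dictionary.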

Fig. 3
2 parts. 1, Nine lines of C program codes with comment lines print the text Hello World. 2, presents a corresponding tree diagram to indicate A S T representation.

A simple C program to print Hello World and its corresponding AST (without typedef nodes and their children)

Fig. 4
2 parts. 1, Five lines of Java program codes print the text Hello World. 2, presents a corresponding tree diagram to indicate A S T representation.

A simple Java program to print Hello World and its corresponding AST

Fig. 5
2 parts. 1, a line of Python program code with a comment line prints the text Hello World. 2, presents a corresponding tree diagram to indicate A S T representation.

A simple Python program to print Hello World and its corresponding AST

4.3 Construction of Directed Graph and Graph Embeddings

The extracted nodes and their features were then used to construct a directed graph using the “networkx” library. Each node has a unique index and a value. For some nodes the value is the class name alone, and for others it is the class name together with the value held by the node. These values were assigned as a node attribute named “feature.” The graphs were then passed as input to the graph2vec algorithm to obtain a graph embedding, which is a NumPy array of shape (128,) for each graph. The algorithm was applied using the “karateclub” [17] library in Python, which contains a Python implementation of graph2vec [18]. Narayanan et al. [18] state that graph2vec produces embeddings that are task agnostic, which made this algorithm more favourable than the others. The embeddings are then used to train and test the classification models.
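To give an intuition for what graph2vec consumes, the sketch below performs Weisfeiler-Lehman relabelling on a toy graph whose nodes carry a “feature” value. graph2vec treats labels of this kind as the "words" of a graph "document" before learning an embedding. This is a standard-library illustration of the relabelling step only, not the karateclub implementation; the function name `wl_features`, the toy graph and the choice to ignore edge direction when collecting neighbours are assumptions.

```python
import hashlib

# Toy directed graph: index -> "feature" value, plus (parent, child) edges.
features = {0: "Module", 1: "Expr", 2: "Call", 3: "Name:print", 4: "Constant"}
edges = [(0, 1), (1, 2), (2, 3), (2, 4)]


def wl_features(features, edges, iterations=2):
    """Return the Weisfeiler-Lehman labels ("words") for one graph."""
    neighbours = {node: [] for node in features}
    for parent, child in edges:
        # Assumption: direction is ignored when collecting neighbours.
        neighbours[parent].append(child)
        neighbours[child].append(parent)

    labels = dict(features)
    words = list(labels.values())  # iteration 0: the raw node features
    for _ in range(iterations):
        new_labels = {}
        for node in labels:
            neighbourhood = sorted(labels[n] for n in neighbours[node])
            signature = "_".join([labels[node]] + neighbourhood)
            new_labels[node] = hashlib.sha256(signature.encode()).hexdigest()
        labels = new_labels
        words.extend(labels.values())
    return words


words = wl_features(features, edges)
```

Each iteration widens the neighbourhood summarised by a label, so two graphs that share many labels share many local substructures.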

4.4 Data Pre-processing

The pre-processing steps, from AST generation through graph construction to embedding, resulted in a set of vector embeddings stored and processed as NumPy arrays, which form the independent variable “X.” The code names, complexity and the corresponding time complexity label were also stored. From this, the time complexity label was extracted and made the dependent variable “Y.” This data was then fed as input to the classification models. As an initial test of the data processed so far, before any analysis, “X” and “Y” were passed to a Random Forest multi-class classifier and the confusion matrix was plotted, resulting in an accuracy of 60%. Analysis of the dataset was then performed, visualising its skewness, shown in Fig. 6, which is due to an imbalance in the number of code samples in each complexity class. This imbalance was addressed with two different techniques, resampling and SMOTE. Combining “X” and “Y,” we obtained a data frame that has 128 columns for the embeddings and an extra column for the complexity class label. This Pandas data frame was used to train and test the model.

Fig. 6
A horizontal bar graph of complexity versus count. All values are estimated. Data are as follows. O of N, 375. O of N 2, 170. O of N log N, 100. O of log N, 35. O of N 3, 30. O of N d, 25. O of 2 n, 10. O of 1, 9. O of N factorial, 5. O of square root of N, 4.

Visual representation of skewness of the dataset

Resampling either replicates samples in the minority classes until they match the majority class, known as upsampling, or reduces the number of samples in the larger classes to a common lower threshold, known as downsampling. SMOTE (Synthetic Minority Over-sampling Technique), by contrast, does not duplicate samples: it generates synthetic samples for a minority class by interpolating between an existing minority sample and one of its nearest neighbours in feature space, covering for the imbalance between classes. Since the dataset is imbalanced, as shown in Fig. 6, it had to be balanced before proceeding further. After experimenting with a combination of these techniques, the upsampling version of resampling was used in this study.
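Upsampling by random replication can be sketched as follows. This is a minimal standard-library illustration, not the code used in the study; the function name `upsample` and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def upsample(samples, labels, seed=0):
    """Replicate minority-class samples until every class matches the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    target = max(len(items) for items in by_class.values())
    balanced = []
    for y, items in by_class.items():
        # Draw extra samples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((x, y) for x in items + extra)
    rng.shuffle(balanced)
    return balanced


# Toy data: five samples of one class and two of another.
samples = list(range(7))
labels = ["O(N)"] * 5 + ["O(1)"] * 2
balanced = upsample(samples, labels)
counts = Counter(y for _, y in balanced)
```

Every original sample is kept, and only the smaller classes receive duplicates, so the class counts end up equal to the size of the largest class.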

The balanced dataset, shown in Fig. 7, was used to train different ML models: Random Forest, AdaBoost, XGBoost, KNN, Logistic Regression and Naive Bayes. Several combinations of experiments were performed to analyse the performance of the models. A feature importance analysis was also performed. This technique identifies the features in “X” that contribute most towards the results. For each code in the dataset, the 128 features are the values of its embedding vector, and the features contribute differently towards the model prediction. Using “SHAP” [19], a Python library, together with the feature importances reported by the Random Forest model, the top 20 features were obtained, as shown in Fig. 9. The models were then retrained on the top features, and the results were recomputed.

Fig. 7
A table of 4 columns and 10 rows presents the dataset distribution. The column headers are the experiment, target class, before resampling, and after resampling.

Dataset distribution for Expt-1

As part of the pre-processing stage, another approach was built and used. Here, all classes with fewer than 10 samples were removed, which resulted in a total of 6 classes for each language. The dataset was first divided into subsets by language. Within each subset, the classes were resampled to match the class with the maximum frequency. This gave 149 samples per class for the Python subset, 165 samples per class for the Java subset and 61 samples per class for the C subset, as shown in Fig. 8. These subsets were then combined to create the final dataset, which was used for model building and training.

Fig. 8
A table of 5 columns and 18 rows presents the dataset distribution. The column headers are the experiment, language, target class, before resampling, and after resampling.

Dataset distribution for Expt-2

Fig. 9
A horizontally stacked bar graph of feature type versus mean S H A P value for 10 classes. The graph follows an increasing trend. The highest values are as follows. Classes 0 and 6, 58. Class 1, 93. 6. Classes 3, 2, 4, 5, 7, and 8, 9. Class 9, 9 and 58.

Sorting the embedding columns (features) by importance

In another approach, the programming language was combined with the time complexity to form the target label. This approach was followed because the node labels are different for each language. Before proceeding with this approach, classes with fewer than 10 samples were removed, as these labels were not available for all languages. This resulted in a total of 18 classes, as shown in Fig. 10. SMOTE [20] was used to resample the dataset. This resulted in 165 samples for each class, which is a total of 2970 samples. This dataset was then used to train the classification models stated earlier.
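The core step of SMOTE, generating a synthetic point between a minority sample and one of its nearest neighbours, can be sketched as below. This is a simplified standard-library illustration, not the implementation used in the study; the function name `smote_sample`, the toy vectors and the value of `k` are assumptions.

```python
import math
import random


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic vector from a list of minority-class vectors."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # The k nearest neighbours of the base point within the same class.
    others = [p for p in minority if p is not base]
    nearest = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(nearest)
    gap = rng.random()  # uniform in [0, 1)
    # Interpolate: the new point lies on the segment base -> neighbour.
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rng = random.Random(42)
synthetic = [smote_sample(minority, rng=rng) for _ in range(3)]

# Size of the balanced dataset in this approach: 18 classes x 165 samples.
total_samples = 18 * 165
```

Because each synthetic point is an interpolation between two real points of the same class, it always stays inside the region spanned by that class.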

Fig. 10
A table of 4 columns and 18 rows presents the dataset distribution for experiment 3. The column headers are the experiment, target class, before resampling, and after resampling.

Dataset distribution for Expt-3

4.5 Model Analysis and Building

4.5.1 Bi-LSTM

Predicting the present output often requires reference to data seen much earlier in the sequence. Plain RNNs handle such long-term dependencies poorly, as there is no control over which parts of the data are remembered and which are forgotten when making future predictions. To overcome this problem, we chose to use a bi-directional LSTM. With a regular LSTM, the input flows in one direction only, either backwards or forwards. In a bi-directional LSTM, the input flows in both directions, which allows the model to consider both past and future information. Bi-LSTMs are usually employed for sequence-to-sequence tasks.

In this study, we reshaped the graph embeddings from 2D to 3D, as the Bi-LSTM expects 3D input. The input shape for each embedding was 128 × 1. The first Bi-LSTM layer was allotted 64 memory units, so that the forward (64) and backward (64) units sum to 128, matching the input length. Since there are 10 classes to be predicted, we used the softmax activation function in the output layer. Because this is a multi-class classification problem with integer-encoded class labels, the sparse log loss (sparse_categorical_crossentropy in Keras) was used. The Adam optimisation algorithm was used to find the weights, and the accuracy metric was calculated and reported for each epoch. Figure 11 shows the Bi-LSTM architectures for the models in the different experiments.
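The reshaping step and the loss can be illustrated without a deep learning framework. The sketch below uses plain Python lists in place of NumPy arrays, and the helper names are hypothetical. It shows how a (samples, 128) batch becomes (samples, 128, 1), and how softmax probabilities combine with an integer class index in the sparse categorical cross-entropy for a single sample.

```python
import math


def reshape_to_3d(embeddings):
    """(samples, 128) -> (samples, 128, 1): each value becomes one timestep."""
    return [[[value] for value in row] for row in embeddings]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(label, probabilities):
    """Loss for one sample whose target is an integer class index."""
    return -math.log(probabilities[label])


batch = reshape_to_3d([[0.1] * 128, [0.2] * 128])
probs = softmax([2.0, 1.0, 0.1])
loss_correct = sparse_categorical_crossentropy(0, probs)
loss_wrong = sparse_categorical_crossentropy(2, probs)
```

The loss is small when the probability assigned to the true class index is high, and it grows as that probability falls.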

Fig. 11
6 models present 6 flow diagrams of Bi-LSTM. Each model has 4 steps and flows through the input layer, 2 LSTM layers, and a dense softmax layer with bidirectional input and output.

Bi-LSTM architecture for Models 1–6

4.5.2 Random Forest Classifier

Random Forest is a supervised ensemble learning model. Ensemble learning enhances performance by combining numerous classifiers to solve a complex problem, and Random Forest uses the bagging method to combine decision trees. The trees are diverse because each tree is built from its own bootstrap sample of the data and considers only a random subset of the features. Since the trees are independent of one another, they can be built in parallel, making full use of the CPU. In addition, because each tree is trained on a bootstrap sample, roughly one third of the data is unseen by any given tree, and these out-of-bag samples can be used to estimate performance without a separate held-out set.

4.5.3 K-Nearest Neighbour

K-nearest neighbour is a widely used supervised machine learning model. It assumes that an unseen data point is related to its neighbours and places it in the class that is most common amongst its nearest existing points. Any unknown data point can therefore be classified into one of the existing categories, using a distance measure to estimate the similarity between the test sample and the training samples. Euclidean distance is the most widely used distance measure for finding the nearest neighbours of a query point, and we used it as the metric when training the KNN model for the classification task.
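The decision rule can be sketched in a few lines of standard-library Python. This is an illustration on made-up two-dimensional points rather than the 128-dimensional embeddings; the function name `knn_predict` and the value of `k` are assumptions.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label amongst the k nearest training points."""
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


train_x = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
train_y = ["O(N)", "O(N)", "O(N)", "O(N^2)", "O(N^2)", "O(N^2)"]
prediction = knn_predict(train_x, train_y, [0.15, 0.15])
```

The query point sits inside the first cluster, so its three nearest neighbours all carry the label of that cluster.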

4.5.4 XGBoost

XGBoost stands for Extreme Gradient Boosting. It is an implementation of gradient-boosted decision trees, an ensemble learner. The key idea of the algorithm is that each predictor corrects its predecessor’s error. The XGBoost implementation exposes a variety of hyperparameters, and tuning them can improve results depending on the task at hand. In our research work, we used the default hyperparameters for XGBoost as provided by the Scikit-Learn module, since the model performed well during both training and testing.

4.5.5 AdaBoost

AdaBoost is an abbreviation for adaptive boosting, one of the most popular ensemble learners, and is mostly used with decision trees. The algorithm builds a learner with equal weights assigned to all the data points initially, and after each iteration assigns higher weights to the samples that were misclassified, so that the next model gives them more importance. This process is continued until a sufficiently low error is reached or the maximum number of learners has been built. In our research work, we used the default hyperparameters for AdaBoost as provided by the Scikit-Learn module, since the model performed well during both training and testing.
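One round of the reweighting can be sketched as follows. The sketch uses the classic binary AdaBoost update for simplicity; multi-class variants such as SAMME use a modified formula for the learner weight. The function name `adaboost_reweight` and the toy weights are hypothetical.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One binary AdaBoost round: return (alpha, normalised new weights).

    weights must sum to 1 and the weighted error must lie strictly
    between 0 and 1.
    """
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    alpha = 0.5 * math.log((1.0 - error) / error)  # the learner's vote weight
    updated = [
        w * math.exp(alpha if wrong else -alpha)  # raise wrong, lower right
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Five samples with equal weights; the third one was misclassified.
weights = [0.2] * 5
alpha, new_weights = adaboost_reweight(
    weights, [False, False, True, False, False]
)
```

After the update, the misclassified sample carries half of the total weight, so the next learner is pushed to classify it correctly.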

4.5.6 Logistic Regression

Standard logistic regression is restricted to two-class classification tasks. Some extensions, such as one-vs-rest, can be utilised for multi-class classification, but they require the classification problem to be first turned into several binary classification problems. Multinomial logistic regression is an extension of logistic regression that natively supports multi-class classification: the loss function is changed to cross-entropy loss and the predicted probability distribution to a multinomial probability distribution.
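In the usual formulation, the predicted class probabilities and the cross-entropy loss take the form below. The symbols are introduced here for illustration and are not taken from this work: $x_i$ is the embedding of sample $i$, $y_i$ its class, $K$ the number of classes, and $w_k$, $b_k$ the weights and bias for class $k$.

```latex
\begin{align*}
P(y = k \mid x) &= \frac{\exp\left(w_k^{\top} x + b_k\right)}{\sum_{j=1}^{K} \exp\left(w_j^{\top} x + b_j\right)}, \qquad k = 1, \dots, K \\
L &= -\sum_{i=1}^{N} \log P(y = y_i \mid x_i)
\end{align*}
```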

4.5.7 Naive Bayes Classifier

Naive Bayes is a probabilistic machine learning model used for classification tasks, based on Bayes' theorem. It is a supervised learning algorithm that is commonly used in text classification, which involves high-dimensional training data. It is a probabilistic classifier, which means it predicts on the basis of the probability of an object belonging to each class. Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods, which makes them suitable for building models that make quick predictions.
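The "naive" part of the name refers to the assumption that the features are conditionally independent given the class, which leads to the decision rule below. The symbols are introduced here for illustration: $C_k$ is class $k$ and $x_1, \dots, x_d$ are the features of one sample, which in this setting would be the $d = 128$ embedding values.

```latex
\begin{align*}
P(C_k \mid x_1, \dots, x_d) &\propto P(C_k) \prod_{i=1}^{d} P(x_i \mid C_k) \\
\hat{y} &= \arg\max_{k} \; P(C_k) \prod_{i=1}^{d} P(x_i \mid C_k)
\end{align*}
```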

5 Metrics Used

  1. Accuracy: Accuracy is one of the most important evaluation metrics when it comes to performance evaluation. It tells us how well the trained model has performed against the test data and is more useful when all the target classes have the same gravity. It is defined as the ratio of the count of correct predictions to the count of all the predictions in the dataset.

  2. Precision: Precision measures how reliable the model is when it classifies a test record as positive. It is computed as the ratio of the count of true positives to the count of all the records predicted as positive.

  3. Recall: Recall quantifies the ability of the model to find positive specimens. It is computed as the ratio of the count of true positives to the count of all actual positives; the higher the recall, the more positive specimens are detected. This metric focuses only on how the positive records are classified and is independent of the negative specimens in the dataset.

  4. F-measure: In some cases, neither precision nor recall alone gives the required insight into a model's performance, and that is where the F-measure comes in handy. It gives a single score, the harmonic mean of precision and recall, that balances both.

  5. Kappa Statistics: The Kappa score is used to evaluate the performance of model classification. It measures the degree of agreement between two raters, here the predicted and the true labels, corrected for the agreement expected by chance, and is popularly referred to as inter-rater reliability.

  6. AUC Score: AUC stands for area under the curve, where the curve is the ROC curve with the True Positive Rate (TPR) on the Y-axis and the False Positive Rate (FPR) on the X-axis. The area is computed by numerical integration, for example with Simpson’s rule. The higher the AUC score, the better the classifier performs.
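The first five metrics can be computed directly from the true and predicted labels, as in the sketch below. It is a standard-library illustration, with precision, recall and F1 computed for a single class treated as positive; how the per-class scores were averaged in this study is not assumed. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def classification_metrics(y_true, y_pred, positive):
    """Accuracy and Cohen's kappa overall; precision, recall, F1 for one class."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    accuracy = sum(t == p for t, p in pairs) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 1.0

    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "kappa": kappa}


y_true = ["O(N)", "O(N)", "O(N)", "O(1)", "O(1)", "O(1)"]
y_pred = ["O(N)", "O(N)", "O(1)", "O(1)", "O(1)", "O(N)"]
metrics = classification_metrics(y_true, y_pred, positive="O(N)")
```

In this toy example four of the six predictions are correct, while half of the agreement would be expected by chance, which is why the kappa score is noticeably lower than the accuracy.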

6 Experimental Results and Analysis

Several combinations of techniques were used to arrive at different results and evaluation metrics. Initially, running the Random Forest model on the imbalanced dataset in Fig. 6 gave an accuracy of 59.45%. For the same dataset, the complexity classes with fewer than 10 samples, namely O(2^n), O(1), O(N!) (N factorial) and O(sqrt(N)) (square root of N), were removed and the same model was retrained, resulting in an accuracy of 58.89%.

Since the dataset was imbalanced, it had to be balanced. Different experiments were performed using the upsampling form of resampling. Taking only the two majority classes, O(N) and O(N^2), as a binary classification problem, an accuracy of 88.82% was achieved. Taking the four majority classes, O(N), O(N^2), O(N log(N)) and O(log(N)), with the same configuration, an accuracy of 89.86% was achieved. Removing the classes with fewer than 10 samples, an accuracy of 94.49% was achieved. Taking all 10 complexity classes with the same configuration, an accuracy of 95.73% was achieved. Balancing the dataset with SMOTE and taking the two classes O(N) and O(N^2) as a binary classification problem, an accuracy of 97.34% was achieved. Balancing the dataset by upsampling, performing feature selection and retraining the Random Forest model, an accuracy of 96.16% was achieved.

To obtain the best results, a set of ML models was trained: Random Forest, AdaBoost, XGBoost, KNN, Logistic Regression and Naive Bayes. The metrics computed for these models are Accuracy, Precision, Recall, F1 score, Cohen Kappa Score and ROC AUC Score, as shown under Experiment-1 in Fig. 12; a visual representation of the results is given in Fig. 13a. The Random Forest classifier was used first, and it gave an accuracy of over 90%, which was considered good for the kind of data that was available. This is because Random Forest reduces the overfitting problem of decision trees, which reduces the variance and in turn improves the accuracy. Hyperparameter tuning was performed on the Random Forest classifier to improve the evaluation metrics. The tuned parameters included the number of decision trees and the maximum number of features used in the classifier. Grid Search, a hyperparameter tuning strategy, was used to fine-tune the model and find the best parameters. This strategy improved the accuracy to 91.83%. We attribute the good accuracy of Random Forest to the additional randomness it adds to the model while growing trees: when splitting a node, it searches for the best feature amongst a random subset of features instead of searching amongst all features.

The pre-processed data were then tried on the boosting algorithms AdaBoost and XGBoost, which gave accuracies of 86% and 92%, respectively. Boosting converts weak learners into a strong learner by combining a number of learners, and can thereby improve model predictions. We used the default hyperparameters for AdaBoost as provided by the Scikit-Learn module, since the model performed well during both training and testing with different variations of data. The data for all the experiments are shown in Fig. 12.

Fig. 12
A table of 8 columns and 21 rows presents the metrics and results of the ML models. The column headers are the experiment, classifiers, accuracy, precision, recall, F1 score, Kappa score, and AUC ROC score.

ML model metrics and results

Fig. 13
3 grouped bar graphs of values versus classifiers, for 3 experiments. X G boost has the highest values in all graphs for accuracy, precision, recall, F 1 score, Kappa score, and A U C R O C score.

a Bar chart of performance metrics of Experiment-1, b Bar chart of performance metrics of Experiment-2, c Bar chart of performance metrics of Experiment-3

The KNN algorithm gave an accuracy of 78% on the same data. We used the built-in implementation from Scikit-Learn to train our KNN model. We split the input and output data into train and test sets to train the model and to measure its accuracy on the test data. The Euclidean distance was used as the distance metric. KNN assumes that data points close to one another belong to similar classes. One reason why KNN had lower accuracy than the other algorithms is its difficulty with high-dimensional data, which complicates the distance calculation. Another reason could be feature scaling, as the data in all dimensions need to be scaled properly.

Logistic Regression and Naive Bayes were the next models used to analyse the behaviour of the data. These classifiers gave accuracies of 83% and 35%, respectively. In logistic regression, the random state parameter was set to 0 and the multi-class option was set to multinomial for better performance. One reason for the poor performance of Naive Bayes might be the bad binning of continuous variables with multinomial Naive Bayes. Figures 14, 15 and 16 are the confusion matrices for the different experiments.

The same models were tried for the approach in which the programming language was combined with the time complexity as the output label. Experiment 3 in Fig. 12 and the visual representation of the results in Fig. 13c show the performance metrics obtained by the ML algorithms for this approach. The accuracy achieved was around 93% using Random Forest and around 94% using XGBoost. Figure 16 shows the confusion matrix obtained for XGBoost.

The study was carried out on six different versions of the Bi-LSTM model. In the first model version, resampling was done on the dataset, which made the runtime complexity labels equal in number. A total of 10 complexity labels were obtained, which were then trained on a Bi-LSTM model with 64 memory units for a total of 25 epochs. The loss and accuracy found are shown in Figs. 17 and 18a.

Fig. 14
A 10 cross 10 confusion matrix of true label versus predicted label. The main diagonal row is colored in a dark shade with higher values. Other cells are colored in lighter shades with lower values. A shaded scale of values ranges from 0.0 to 1.0.

Confusion matrix for XGBoost exclusive of programming language in Experiment-1

Fig. 15
A 6 × 6 confusion matrix of true label versus predicted label. The main diagonal is colored in a dark shade with higher values. Other cells are colored in lighter shades with lower values. A shaded scale of values ranges from 0.0 to 1.0.

Confusion matrix for XGBoost inclusive of programming language and time complexity

Fig. 16
An 18 × 18 confusion matrix of true label versus predicted label. The main diagonal is colored in a dark shade with higher values. Other cells are colored in lighter shades with lower values. A shaded scale of values ranges from 0.0 to 1.0.

Confusion matrix for XGBoost when both the programming language and the time complexity are combined into a single label
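The confusion matrices in Figs. 14, 15 and 16 use a scale from 0.0 to 1.0, which suggests that the counts were normalised. The following minimal sketch shows one common way to do this, dividing each row by the number of true samples of that class. Row-wise normalisation is an assumption, and the function name, class names and toy predictions are hypothetical.

```python
def normalised_confusion_matrix(true_labels, predicted_labels, classes):
    """Return a row-normalised confusion matrix (rows: true label, columns: predicted label)."""
    index = {label: i for i, label in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for truth, prediction in zip(true_labels, predicted_labels):
        counts[index[truth]][index[prediction]] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        # A class without any true samples keeps an all-zero row.
        matrix.append([cell / total if total else 0.0 for cell in row])
    return matrix


# Toy example with three hypothetical complexity classes.
classes = ["O(1)", "O(n)", "O(n^2)"]
y_true = ["O(1)", "O(1)", "O(n)", "O(n)", "O(n)", "O(n)", "O(n^2)", "O(n^2)"]
y_pred = ["O(1)", "O(n)", "O(n)", "O(n)", "O(n)", "O(n^2)", "O(n^2)", "O(n^2)"]
matrix = normalised_confusion_matrix(y_true, y_pred, classes)
```

With this normalisation, each diagonal entry is the recall of the corresponding class, which is why a strong classifier produces a dark main diagonal.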

Fig. 17
A table of 3 columns and 6 rows presents 6 Bi-LSTM models. The column headers are model, loss, and accuracy.

Accuracy and loss table for six Bi-LSTM models

Fig. 18
2 double line graphs of values versus epoch. The lines labeled loss and accuracy decrease and increase in both graphs with fewer fluctuations in A and more in B, respectively.

a Loss versus accuracy plot for Bi-LSTM model 1, b loss versus accuracy plot for Bi-LSTM model 2

In the second model version, the embeddings were combined with the respective language used, and the runtime labels and embeddings were removed for labels that had fewer than three code samples. We then performed resampling on the dataset, and a total of six complexity labels were obtained. These were trained on a Bi-LSTM model with 65 memory units for a total of 50 epochs. The results are shown in Fig. 18b.
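The removal of sparsely populated labels described for the second model can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the study: the function name, the `min_count` parameter and the toy data are hypothetical, and only the threshold of three code samples comes from the text.

```python
from collections import Counter


def drop_rare_labels(samples, labels, min_count=3):
    """Keep only the samples whose label occurs at least min_count times."""
    counts = Counter(labels)
    kept = [(s, l) for s, l in zip(samples, labels) if counts[l] >= min_count]
    kept_samples = [s for s, _ in kept]
    kept_labels = [l for _, l in kept]
    return kept_samples, kept_labels


# Toy data: "O(log n)" has only two samples and is therefore removed.
samples = ["e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8", "e9"]
labels = ["O(n)", "O(n)", "O(n)", "O(log n)", "O(log n)",
          "O(1)", "O(1)", "O(1)", "O(1)"]
kept_samples, kept_labels = drop_rare_labels(samples, labels)
```

Filtering before resampling avoids inflating a class from one or two examples, which would otherwise give the model many copies of the same program.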

In the third model version, the steps of the second model were repeated. The only difference was that the Bi-LSTM was trained with 129 memory units. Training was more time-consuming than for the previous model because of the higher number of memory units. The results obtained are shown in Fig. 19a.

Fig. 19
2 double line graphs of values versus epoch. The lines labeled loss and accuracy decrease and increase in both graphs with more fluctuations in A and fewer in B, respectively.

a Loss versus accuracy plot for Bi-LSTM model 3, b loss versus accuracy plot for Bi-LSTM model 4

In the fourth model version, the runtime labels were combined with the language used, which yielded a total of 18 runtime complexity labels. These were trained on a Bi-LSTM with 64 memory units for a total of 50 epochs. The results are shown in Fig. 19b and in the table in Fig. 17. The fifth model followed the same steps as model four, the difference being that the Bi-LSTM used 128 memory units. The results obtained are shown in Fig. 20a.
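Combining the language with the runtime label, as in models four and five, can be sketched as a simple string encoding. Three languages and six complexity classes give 18 combined labels, which is consistent with the count reported above, although the text does not list the classes. The class names, the separator and the helper names below are hypothetical.

```python
from itertools import product

LANGUAGES = ["C", "Java", "Python"]
# Placeholder class names: the actual six classes are not listed in the text.
COMPLEXITIES = ["O(1)", "O(log n)", "O(n)", "O(n log n)", "O(n^2)", "O(2^n)"]


def combine_label(language, complexity):
    """Merge a language and a complexity class into one output label."""
    return f"{language}|{complexity}"


def split_label(combined):
    """Recover the language and the complexity class from a combined label."""
    language, complexity = combined.split("|", 1)
    return language, complexity


# Enumerate every combined label and assign each an integer class id.
all_labels = [combine_label(l, c) for l, c in product(LANGUAGES, COMPLEXITIES)]
label_to_id = {label: i for i, label in enumerate(all_labels)}
```

A single combined label lets one classifier predict both properties at once, at the cost of fewer training samples per class than with the complexity label alone.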

Fig. 20
2 double line graphs of values versus epoch. The lines labeled loss and accuracy decrease and increase in both graphs with more fluctuations in A and fewer in B, respectively.

a Loss versus accuracy plot for Bi-LSTM model 5, b loss versus accuracy plot for Bi-LSTM model 6

In the sixth model version, the runtime complexities that had fewer than three code samples were removed, and then steps similar to model one were followed. The results obtained are shown in Fig. 20b. In conclusion, training the model with a higher number of memory units reduced the loss and increased the accuracy. From the results, model 5 performed the best, with an accuracy of 92.86%.

6.1 Assumptions

The current approach does not check the syntactic correctness of the program; all programs are assumed to be error-free. The solution presented by us does not cover programs that rely on Python packages such as sklearn and pandas; it supports the prediction of algorithmic time complexity. Another assumption is that each program completes running in finite time, and only such programs were used in the study.
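Since syntactic correctness is assumed rather than verified, a user of the approach may want to filter inputs beforehand. The following minimal sketch shows such a pre-check for Python sources using the standard `ast` module. It is not part of the approach described here, it covers Python only (C and Java would need their own parsers), and the function and file names are hypothetical.

```python
import ast


def is_valid_python(source):
    """Return True if the source text parses as Python, False on a syntax error."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


# Two toy programs: the second one is missing the colon after the for header.
programs = {
    "ok.py": "for i in range(10):\n    print(i)\n",
    "broken.py": "for i in range(10)\n    print(i)\n",
}
valid = {name: is_valid_python(code) for name, code in programs.items()}
```

A check of this kind only confirms that the program parses. It says nothing about whether the program terminates, so the finite-runtime assumption still has to hold separately.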

7 Limitations and Future Work

The current project is restricted to three languages: C, Python and Java. We would like to extend it to C++ and other languages commonly used for writing algorithms. The study can also be extended to include some less commonly used packages in these languages and to identify their runtime complexity. As part of future work, we intend to extend the dataset with more data samples and to try Graph Neural Networks for classification, since the problem also falls under graph classification. Currently, a frontend is built using the Streamlit Python package, where the user can drop in a zip file of code files and the backend computes their runtime complexity. This could also be implemented as a tool or a web browser extension for easy computation of the runtime complexity of code in various other high-level languages.
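The zip-based workflow of the frontend can be illustrated with a minimal sketch of the first backend step, namely unpacking the archive and grouping the source files by language. The actual backend is not described in the text, so the function name and the extension mapping are assumptions, and the prediction step itself is left out.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Assumed mapping from file extension to the three supported languages.
EXTENSION_TO_LANGUAGE = {".c": "C", ".java": "Java", ".py": "Python"}


def group_sources_by_language(zip_bytes):
    """Read a zip archive from bytes and group supported source files by language."""
    grouped = {language: [] for language in EXTENSION_TO_LANGUAGE.values()}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            suffix = PurePosixPath(info.filename).suffix.lower()
            language = EXTENSION_TO_LANGUAGE.get(suffix)
            if language is not None:
                source = archive.read(info).decode("utf-8", errors="replace")
                grouped[language].append((info.filename, source))
    return grouped


# Build a small in-memory archive to stand in for an uploaded file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sort.py", "print(sorted([3, 1, 2]))\n")
    archive.writestr("src/Main.java", "class Main {}\n")
    archive.writestr("README.txt", "not source code\n")
grouped = group_sources_by_language(buffer.getvalue())
```

Files with unsupported extensions are skipped silently in this sketch. A real tool would probably report them to the user instead.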

8 Conclusion

Predicting code complexity can help improve code quality. Doing this manually can be tedious, hence we use static analysis and machine learning to make it easier. The current approach addresses the problem for three languages: C, Java and Python. We also use ASTs and graph embeddings, rather than just word embeddings from programs. We find that Random Forest, accompanied by GridSearchCV for hyperparameter tuning, outperforms all the other models. We would also like to try various other algorithms to achieve more accurate results. We hope that our research helps developers and learners who strive to write better code.