1 Introduction

Computer programming courses suffer from high failure and dropout rates, and students often find programming difficult. Educators therefore have to make an extra effort to identify weak students and devise a proper strategy to improve their programming performance. This study investigates certain factors that are essential to the programming skills of students. Hitherto, major studies have used demographics [2], educational background [4], prior programming experience [6] and the mathematical ability of the students in the examination [3, 7] to predict programming skill.

In computer programming courses, students are given various programming assignments that are expected to be completed within a stipulated time period. Evaluating these assignments helps educators to identify the programming skills of the students. Hence the objective of the proposed work is to explore the factors that can be used to determine the programming skills of the students. In order to predict programming behavior, correlation and regression methods are used to uncover hidden relationships among the attributes of the dataset. The following factors are considered for the analysis.

  • Marks obtained in the qualifying examination, especially the marks scored in mathematics [3, 7].

  • Programming behavior of the student on the given assignments, analysed through debugging capability.

This paper is organized as follows. The background work is described in Sect. 2 and the proposed approach in Sect. 3. The results are analysed and discussed in Sect. 4 and, finally, the conclusions and future scope are presented in Sect. 5.

2 Related Works

Over the past decades, several researchers have studied the academic performance of students by analysing the factors that affect their programming behavior, and various data mining techniques can be used to predict students' programming performance. In [1], the relationship between gender and the marks obtained by the students in the final examination is analysed using the chi-square test, and no significant relationship between the two is observed. The authors used a deep learning methodology to predict the grades of the students. Another study [2] deals with predicting students' performance in the final examination using linear regression and a multilayer perceptron in the WEKA tool, and compares the two on the basis of the difference in their mean absolute error values. Based on 58 participants from Taiwan University, the study in [3] predicts academic performance from students' final grades, with the help of multiple linear regression and principal component analysis, in order to improve learning performance.

Sujatha [4] predicts student performance with the help of regression algorithms and identifies at-risk students who need more attention in programming, based on features such as higher secondary school background, medium of study, syllabus covered, and marks scored in mathematics and English. Based on data points collected from different undergraduate courses, a new set of multivariate linear regression models is used in [5] to predict the final exam score in an Engineering Dynamics course. Data mining technology helps to assess the learner's performance and to implement various new trends and technologies for analysing the data [6]. Another paper [7] found statistically that there is no correlation between performance in computer subjects in high school and performance in the first-year programming course.

3 Proposed Approach

The objective of the investigation is to analyse the programming skills of the students by considering factors such as the marks obtained in the qualifying examination and debugging skill. Weaker students can thus be identified in advance, and the educator can help them to improve their programming skills. In the present scenario, while analysing the factors affecting the programming skills of the student, we consider:

  • The marks secured in the qualifying examination, especially the marks obtained in mathematics [3, 7].

  • The programming behavior of the student, analysed through the student's debugging capability.

The dataset for this study was prepared by collecting the details of first-year computer science students. The details of 108 students were collected, of whom 35 were male and 73 were female. The chosen students came from different educational backgrounds. The programming questions and their difficulty levels (Easy (E), Medium (M) and Hard (H)) were prepared by the educators. For the dataset preparation, programming assignments were allotted to the students in the programming language of the first-year curriculum. The teacher distributed the programming assignments to all students without distinction and finally collected the debugging outputs of each student.

The programming outputs were analysed based on the number of errors in each compilation, the number of errors in the penultimate compilation, the number of compilation attempts, etc. Various types of errors, such as syntax errors and semantic errors, were also identified during the analysis. While debugging a program, students may attempt multiple compilations to arrive at the final output. The maximum number of errors and the number of errors in the penultimate compilation are recorded during the compilation process. In this way we obtain the debugging capability of each student at three levels (E, M and H), with the numerical values normalized to the range from 0 to 1. The compilation attempts and the errors that occurred during the debugging stage are used to quantify the debugging capability, which is calculated as

$$ DC_{std1} = 1 - \left( \frac{PCE_{\max} - PCE_{\min}}{CA} \right) $$
(1)

where

  • \(DC_{std1}\) = debugging capability of an individual student

  • \(PCE_{\max}\) = maximum number of errors that occurred in any compilation while debugging a single program

  • \(PCE_{\min}\) = number of errors that occurred in the penultimate compilation while debugging a single program

  • \(CA\) = total number of compilation attempts made by the student to reach the programming output.
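As a minimal sketch, Eq. (1) can be evaluated per program as follows. The function and parameter names are illustrative, and the min-max rescaling to [0, 1] is an assumption: the text states that the values are normalized to this range but does not say how.

```python
def debugging_capability(pce_max, pce_penultimate, attempts):
    """Eq. (1): DC = 1 - (PCE_max - PCE_min) / CA for a single program."""
    if attempts <= 0:
        raise ValueError("the number of compilation attempts must be positive")
    return 1.0 - (pce_max - pce_penultimate) / attempts


def min_max_normalize(values):
    """Rescale values to [0, 1].

    Assumption: min-max scaling; the normalization actually used is not specified.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map them to 0.0 rather than divide by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical student: at most 6 errors in one compilation, 1 error in the
# penultimate compilation, 10 compilation attempts in total.
dc = debugging_capability(pce_max=6, pce_penultimate=1, attempts=10)
print(dc)
```

Note that the raw value of Eq. (1) can fall below 0 when the drop in errors exceeds the number of attempts, which is one reason a separate normalization step is needed.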

3.1 Methodology

As discussed in the introduction, the primary goal of this study is to investigate the factors that influence the programming skills of the students by considering the marks obtained in the qualifying examinations and the programming behavior of the students. Better programming practices not only solve the programming problem but also expand the students' coding creativity. Pearson correlation and regression methods are adopted for analysing the factors. 70% of the data were used as the training set and the remaining 30% as the test set.
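A 70/30 split of the student records could be drawn as in the sketch below. The random shuffle, the fixed seed and the rounding of the training-set size are assumptions; the text does not describe how the split was made.

```python
import random


def train_test_split(records, train_fraction=0.7, seed=0):
    """Shuffle the records and split them into a training and a test set.

    Assumption: a seeded random shuffle; the paper only states the 70/30 ratio.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder student identifiers 0..107 stand in for the 108 student records.
train, test = train_test_split(range(108))
print(len(train), len(test))
```

With 108 records and rounding, this gives 76 training and 32 test records.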

Factors Influencing the Programming Skill

Here we perform a correlation analysis of the marks obtained by the students in various examinations and analyse the programming behavior using linear regression.

In order to determine the dependence between the various potential factors, we first correlate the final marks and the mathematics marks secured in the qualifying examination with the final marks obtained in the examination of the current programming course. The criterion variable used in this research is the marks obtained in the final examination of the programming course. The Pearson correlation coefficient is the statistical correlation measure used in this study, and it is calculated as

$$ r = \frac{n\sum xy - \left( \sum x \right)\left( \sum y \right)}{\sqrt{\left[ n\sum x^{2} - \left( \sum x \right)^{2} \right]\left[ n\sum y^{2} - \left( \sum y \right)^{2} \right]}} $$
(2)

In the above formula, n is the number of students, and x and y denote the marks obtained in the qualifying examination and in the final examination, respectively. By definition, the correlation coefficient takes values in the interval from −1 to +1. The correlation values between the above attributes are shown in Table 1, and Fig. 1 shows the correlation between the marks obtained in the qualifying examination and the marks secured in the final examination. This indicates that the programming behavior of a student is only weakly correlated with the marks secured in the qualifying examination.

Fig. 1. Correlation between the qualifying marks and final marks
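Eq. (2) can be computed directly from the raw sums, as in the following sketch. The function name and the sample marks are illustrative only and are not taken from the study's dataset.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient computed from raw sums, as in Eq. (2)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("correlation is undefined when x or y is constant")
    return numerator / denominator


# Made-up marks for illustration: x = qualifying marks, y = final marks.
qualifying = [55, 62, 70, 81, 90]
final = [48, 66, 59, 77, 85]
print(round(pearson_r(qualifying, final), 3))
```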

Next, to analyse the programming behavior of the students, we correlate debugging skill with the final marks obtained in the examination of the current programming course. The debugging capability is used for predicting the programming behavior of the students, and its quantified value is computed using Eq. (1), given in Sect. 3. Linear regression, which indicates the strength of the effect of multiple independent variables on a dependent variable, is used for analysing the programming behavior. The fitted linear regression model is:

$$ {\text{Y}} = 0.405 - 0.011*DC_{E} + 0.465*DC_{M} + 0.11*DC_{H} $$
(3)

where \(DC_{E}\), \(DC_{M}\) and \(DC_{H}\) denote the debugging capability of each student at the different levels (Easy, Medium, Hard), computed as shown in Eq. (1). The response (dependent) variable used in this study is the final marks secured in the examination of the current programming course. Figure 2 shows the strong positive correlation between the marks obtained in the final examination and the debugging capability of the students at the different levels (Easy, Medium and Hard). This indicates that the debugging skill of the student is a stronger determinant of programming skill than the marks secured in the examination.

Fig. 2. Correlation between the debugging capability at each level and the final marks
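The fitted model in Eq. (3) can be applied to a student's three debugging-capability values as sketched below. The coefficients are those reported in Eq. (3); the function name and the example inputs are hypothetical, and the scale of the predicted mark is assumed to match whatever scale the final marks were recorded on, which the text does not state.

```python
# Coefficients reported in Eq. (3).
INTERCEPT = 0.405
COEF_EASY = -0.011
COEF_MEDIUM = 0.465
COEF_HARD = 0.11


def predict_final_mark(dc_easy, dc_medium, dc_hard):
    """Evaluate Eq. (3) for one student's debugging capability at levels E, M and H."""
    return (INTERCEPT
            + COEF_EASY * dc_easy
            + COEF_MEDIUM * dc_medium
            + COEF_HARD * dc_hard)


# Hypothetical student with normalized debugging capabilities in [0, 1].
print(predict_final_mark(dc_easy=0.9, dc_medium=0.7, dc_hard=0.4))
```

In this model the medium-level coefficient is by far the largest, so the medium-difficulty debugging capability contributes most to the predicted mark.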

Furthermore, the study needs an evaluation metric in order to compare the predictions with the actual values. Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), two common prediction error measures for continuous variables, are used for this purpose. They measure the difference between the values predicted by the regression model and the values actually observed. MAE and RMSE are calculated using the following formulas:

$$ \text{MAE} = \frac{1}{n}\sum_{j = 1}^{n} \left| y_{j} - y_{j}^{\prime} \right| $$
(4)
$$ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{j = 1}^{n} \left( y_{j} - y_{j}^{\prime} \right)^{2}} $$
(5)
Table 1. Results obtained by the correlation analysis

Here \(y_{j}\) and \(y_{j}^{\prime}\) are the observed and predicted values discussed above, respectively, and n is the number of observations. The \(R^{2}\) and MSE measures evaluate the goodness of fit of a regression model. The accuracy of prediction is measured by the standard error of the estimate. Comparing Fig. 2 with Fig. 1, the values in Fig. 2 lie much closer to the regression line, which minimizes the sum of squared errors. Since the sum of squared deviations of the predictions is smaller, the predictions in Fig. 2 are more accurate than those in Fig. 1. The \(R^{2}\), mean absolute error, mean squared error and root mean squared error values are shown in Table 2.
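The error measures in Eqs. (4) and (5), together with the coefficient of determination, can be computed as in this sketch. The function names and the sample values are illustrative, and the \(R^{2}\) definition used here (one minus the ratio of residual to total sum of squares) is the standard one, assumed rather than quoted from the text.

```python
import math


def mae(actual, predicted):
    """Mean absolute error, Eq. (4)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error, Eq. (5)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (standard definition, assumed)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the actual values are constant")
    return 1.0 - ss_res / ss_tot


# Made-up observed and predicted marks for illustration.
observed = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
print(mae(observed, predicted), rmse(observed, predicted), r_squared(observed, predicted))
```

Because RMSE squares each residual before averaging, it penalizes large individual errors more heavily than MAE does.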

4 Results and Discussion

In this section, we discuss the results on the factors influencing the programming skill of the students, obtained using correlation and regression. From Table 1, it is observed that the marks obtained in the qualifying examination (mathematics marks and final marks, with coefficients 0.21 and 0.35) are positively correlated with the final marks obtained in the current programming course. It is also observed that there is a strong positive correlation between the debugging capability of the student (0.83, 0.92 and 0.89) and the final marks. For the regression model, the \(R^{2}\) value is 0.81, and the MAE and RMSE values are very small (Table 2), which indicates good prediction. Fig. 2 shows that the regression line minimizes the sum of squared errors, so the prediction is more accurate. From the above analysis it can be concluded that programming behavior is more strongly correlated with the final marks in the examination than the marks obtained in the qualifying examination are. This will help educators to learn more about the weak students and to identify better techniques for imparting programming skills to those students.

Table 2. The results of linear regression model

5 Conclusion and Future Study

Apart from predicting programming skills, the presented work is an exploratory analysis of the factors that affect the programming behavior of students. It is observed that the debugging skill of a student is highly correlated with the final marks obtained in the examination of the current programming course. It is therefore concluded that the debugging skill of the student is a stronger determinant of programming skill than the marks secured in the examinations. As future work, building on the programming behavior observed in the current setting, we plan to carry out a similar study covering a larger number of potential success factors to improve the performance of students in programming.