1 Introduction

At present, the great diversity of programming languages often makes it difficult to select the language that best fits the needs of a given development project. Selecting a programming language involves multiple factors of analysis, many of which depend on the generated source code and the time taken to produce it.

Following the systematic review method [1], documentary research was conducted on studies that discuss the metrics that can be computed on the code written in a given programming language [2,3,4,5]. When metrics are used to decide whether one language is a better choice than another, it must not be overlooked that such metrics focus only on the resulting code and do not take into account the characteristics of the programmer who wrote it. As a result, a language may be underrated or overrated because of the skills, or lack of skills, of the programmers using it. This poses a problem when designing an experiment to determine which language to use, since the group of programmers who use language A may have a higher level of knowledge than the group who use language B, or vice versa, which would affect the results of the experiment. One possible solution would be to have a group of programmers who can solve the same task in each of the languages being evaluated.

Beyond the difficulty in gathering this group of individuals with the necessary knowledge in each language to be evaluated, this solution poses some additional difficulties which arise when designing experiments.

Both Juristo and Moreno [6] and Wohlin et al. [7] argue that experiments in the field of software engineering, as in the social sciences, are strongly influenced by the characteristics of the individuals involved. Juristo and Moreno [6] enumerate some issues related to social factors and to the specific characteristics of software development that must be taken into consideration when designing experiments.

  • Learning effect: if an individual solves the same problem with different programming languages, they are likely to learn more about the problem each time, so the final result will be better than the first simply because the individual knows the problem better, not because the programming language is better.

  • Boredom effect: individuals get bored or tired of the experiment and put in less effort and interest as time passes.

  • Enthusiasm effect: individuals who use an old programming language may not be motivated to do a good job, while those who use a new programming language are.

  • Experience effect: when performing an experiment that involves programmers, it is to be expected that there will be different levels of both knowledge and skill about the programming language used.

  • Unconscious formalization: it happens when the same individual uses two or more programming languages with different levels of definition or formality.

  • Setting effect: the emotional state of participating individuals is closely related to their performance.

A set of actions to consider so as to control the abovementioned effects is described below:

  • Learning effect: do not use the same group of individuals to work on a development using more than one programming language.

  • Boredom effect: motivate the individuals who perform the experiment in the same way regardless of the group they belong to.

  • Enthusiasm effect: do not inform the individuals about the hypotheses or objectives of the experiment.

  • Experience effect: to control this effect, experimental pairs of programmers who are undistinguishable in terms of their knowledge and skills regarding the programming language will be formed.

  • Unconscious formalization: when forming the experimental pair, each individual's level of knowledge of the programming language employed should be considered.

  • Setting effect: it must be taken into consideration that all the steps of the experiment should be carried out under the same conditions.

In the field of software engineering, experiments often have to be performed with a small group of people who apply different treatments to the objects of study. Since comparing experimental units within homogeneous pairs not only increases the accuracy of the analysis but also makes it possible to control most of the undesired effects [6], this work is based on the hypothesis that it is possible to form homogeneous experimental pairs of programmers and that, therefore, two subjects with the same characteristics are undistinguishable in terms of their abilities as programmers. This hypothesis leads to the following research questions: is it possible to form experimental pairs that are undistinguishable in terms of abilities and skills as programmers? If so, do their members require the same amount of time to solve the same task with the same programming language?

In Sect. 2, a protocol for the formation of experimental pairs of programmers is proposed. In Sect. 3, the protocol is validated through a pilot test, and in Sect. 4, conclusions and future lines of research are presented.

2 Proposal for the Formation of Experimental Pair Programmers

To decide how to form homogeneous pairs of programmers, the authors adhere to Campbell [8], who argues that many factors may indirectly affect the performance of an individual, but only three are direct determinants of performance: knowledge, skill and motivation. For this reason, a protocol is designed with the aim of forming experimental pairs of programmers who are homogeneous in terms of both knowledge and skills, and it is assumed that the participants to be characterized do not show significant differences in motivation.

The first step consists in identifying the methods used to categorize programmers in other experimental research studies (Sect. 2.1). Then, the guidelines considered for the design of the categorization instruments used in the experiment are described (Sect. 2.2). Lastly, Sect. 2.3 presents a mechanism to form homogeneous experimental pairs of programmers based on the characterization made.

2.1 Programmers’ Experience

As in most human activities, individual performance in software development varies considerably from one person to another, and mechanisms are needed so that such variations do not affect the results of a study. Feigenspan et al. [9] carried out a literature survey that analyzed 161 publications and found nine ways in which researchers handled programmers' experience. These authors define experience as the amount of knowledge acquired regarding the development of programs.

  • Years: in forty-seven works, the number of years a programmer had been programming in general or in a company or in a certain language was used to determine their experience in the programming field.

  • Education: In nineteen of the articles reviewed, participants’ education was used to indicate their experience, which included information about the level of education obtained (pre-university, undergraduate, graduate, etc.) or the grades obtained in their course of studies.

  • Self-estimation: In twelve works, participants were asked to estimate their own experience.

  • Specific survey: In nine works, the authors applied a survey to evaluate programming experience.

  • Size: the size of the programs written by the participants was used as an indicator in six articles.

  • Exam: In three works, an exam on programming was administered to evaluate the experience of the participants.

  • Supervisor: In two works, in which professional programmers acted as participants, a supervisor was in charge of estimating their experience.

  • Not specified: authors often argue that programming experience was estimated but they did not specify how. That was the case in thirty-nine works.

  • Not controlled: programming experience was not mentioned at all in forty-five works, which compromises the validity of the corresponding experiments.

2.2 Characterization of Programmers

The characterization method must not rely on each individual's perception of their own skill as a programmer: less competent people tend to overestimate their skills because they lack the knowledge needed to recognize their own limitations, while better-prepared people tend to underestimate their achievements and competences [12]. The characterization aims to establish a set of programmer abilities that makes it possible to find pairs of programmers who may be considered homogeneous, that is, to ensure that two individuals with the same characteristics are undistinguishable in terms of their abilities as programmers. To this end, a broad set of programmer skills is analyzed, since a characterization based on only a few criteria may result in serious errors. Some characteristics of programmers are unrelated to the programming language, while others depend on it. For this reason, guidelines are set for developing two characterization instruments: one independent of the programming language (CILP, from its Spanish name) and one dependent on the programming language (CDLP, from its Spanish name).

Many measures exist to capture attributes of software processes and products. Traditionally they have been defined by relying on experts' proficiency, which has frequently led to a certain degree of inaccuracy in the definitions, properties and assumptions of the measurements, making the measurements difficult to use, their interpretation dangerous and the results of many validation studies contradictory [10]. For the development of the characterization instruments, the general procedure for the design of a measurement instrument proposed by Sampiere [11] and the method for the definition of valid measurements proposed by Genero et al. [10] were taken into account and adapted to the needs of this work.

Owing to the space constraints of this publication, it is not possible to detail each of the steps followed to define the instruments or the content of each of their dimensions.

Table 1 presents the content domains of the variable (dimensions), the indicators for each dimension and the nomenclature proposed for the dimensions of the instrument used for the characterization that is independent of the programming language (CILP).

Table 1. Variable, dimensions, nomenclature for each dimension and their indicators.

Table 2 shows the content domains of the variable (dimensions), the indicators for each dimension and the nomenclature proposed for the dimensions of the instrument used for the characterization that is dependent on the programming language (CDLP).

Table 2. Variable, dimensions, nomenclature for each dimension and their indicators.

Regarding the decision on the type and format of the instrument and the context of its administration, the mixed procedure for data collection will be used, consisting of two questionnaires (CILP and CDLP) and an interview.

In order to minimize characterization errors, once the participant has answered the questionnaire, an individual interview will be conducted in which the interviewer will ask some questions so that the participant can justify their answers. If the participant provides a correct justification, the interviewer will consider it valid.

The context of administration will be a room with one computer for each programmer since the first phase, the one related to the questionnaire, is self-administered. Therefore, this can be done individually or simultaneously with a group of people. Then, the interview is conducted individually.

2.3 Mechanism to Form Experimental Pairs

It is important to design a mechanism to ensure that the experimental pairs are formed by homogeneous subjects, which means that, according to their characterization, they should be undistinguishable or have negligible differences.

Criterion for the use of variables.

Multiple variables emerge from the characterization procedure, some of them related to characteristics which are independent of the programming language and others related to characteristics dependent on the programmer’s performance in a certain programming language.

Normalization.

The normalization process consists in converting the values of the independent variables so that they are expressed in the range [0, 10], regardless of their original scale. This step ensures that none of the variables included in the distance calculation is weighted more heavily than the others.
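
The normalization step can be illustrated with a minimal sketch. It assumes a linear rescaling from a known original scale; the text does not state which transformation was used, and the function name `normalize` and its parameters are hypothetical.

```python
def normalize(value, scale_min, scale_max, target_max=10.0):
    """Linearly rescale `value` from [scale_min, scale_max] to [0, target_max].

    Assumption: a linear (min-max) mapping; the paper only states that
    every variable ends up in the range [0, 10].
    """
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    if not scale_min <= value <= scale_max:
        raise ValueError("value lies outside its original scale")
    return (value - scale_min) / (scale_max - scale_min) * target_max


# Example: a score of 3 on a 1-to-5 scale sits in the middle of the range.
print(normalize(3, 1, 5))
```

With this mapping, a variable originally scored out of 100 and another scored from 1 to 5 contribute on the same footing to the distance calculation.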

Penalty for time spent.

Each of the variables measured is accompanied by the time spent by the participant to complete the exercises related to that variable. In order to form experimental pairs that are undistinguishable not only in terms of knowledge but also in terms of the time required to solve a task, a score penalty is applied according to the time spent on it: the score obtained is reduced by 10% for every 5 min spent.
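
A minimal sketch of the penalty follows. The text leaves two details open, so the sketch assumes that the penalty is charged per completed 5-minute block and that each block removes 10% of the original score (linear rather than compounding), with the score never dropping below zero. The function name `apply_time_penalty` and its parameters are hypothetical.

```python
def apply_time_penalty(score, minutes_spent, block_minutes=5, penalty_per_block=0.10):
    """Reduce `score` by 10% of its value for each completed 5-minute block.

    Assumptions (not stated in the paper): only completed blocks count,
    the reduction is linear, and the result is floored at zero.
    """
    if minutes_spent < 0:
        raise ValueError("minutes_spent must be non-negative")
    completed_blocks = int(minutes_spent // block_minutes)
    factor = max(0.0, 1.0 - penalty_per_block * completed_blocks)
    return score * factor


# Example: 12 minutes contain two completed blocks, so 20% is deducted.
print(apply_time_penalty(8.0, 12))
```

Under a compounding reading, the factor would instead be `(1 - penalty_per_block) ** completed_blocks`; which variant matches the protocol is not specified in the text.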

Distance calculation.

In a scenario where multiple variables need to be evaluated, all of them quantitative and whose values belong to the interval [0, 10] after the normalization process, the criterion used to determine the distance between two subjects must be defined. Since an n-dimensional space is being considered, the Euclidean distance calculation will be applied (Fig. 1).

Fig. 1. Euclidean distance [14]
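
For reference, the standard Euclidean distance between two participants described by n normalized variables can be written as follows. The symbols p, q and d are notation assumed here and are not necessarily those used in Fig. 1.

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} \left( p_i - q_i \right)^2},
\qquad p_i, q_i \in [0, 10]
```

Here p_i and q_i denote the normalized scores of the two participants on the i-th variable.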

The distance between two participants is calculated only when the absolute value of the difference between variables of the same type does not exceed a threshold. This restriction makes it possible to establish the maximum distance tolerated while ensuring that an acceptable overall distance does not hide a significant difference in a single variable.
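
The thresholded distance can be sketched as below. The default threshold of 5.0 is an assumption that mirrors the fifty-percent cut-off mentioned in the validation case for the [0, 10] scale; the function name `pair_distance` and the use of `None` to mark a ruled-out pair are hypothetical choices.

```python
import math


def pair_distance(profile_a, profile_b, threshold=5.0):
    """Euclidean distance between two normalized profiles.

    Returns None when any single variable differs by more than
    `threshold`, which marks the pair as ruled out (assumed convention).
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same number of variables")
    diffs = [abs(a - b) for a, b in zip(profile_a, profile_b)]
    if any(diff > threshold for diff in diffs):
        return None
    return math.sqrt(sum(diff * diff for diff in diffs))


# Two hypothetical participants with two variables each.
print(pair_distance([0.0, 0.0], [3.0, 4.0]))  # differences 3 and 4
print(pair_distance([1.0, 9.0], [7.0, 9.0]))  # first variable differs by 6
```

Checking the per-variable differences before computing the distance keeps a pair with one large gap from being accepted just because its other variables agree closely.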

Algorithm for the selection of experimental pairs.

The algorithm follows a sequential process in which it makes one decision at each step: it selects the pair with the minimum distance among the options available. In the next step, the algorithm faces an identical problem with fewer options than in the previous step and applies the same selection function to make the following decision [13].
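
The selection process described above can be sketched as a greedy loop. The input format (a dictionary from participant pairs to distances, with ruled-out pairs simply omitted), the function name `form_pairs` and the tie-breaking by participant identifier are assumptions made for illustration.

```python
def form_pairs(distances):
    """Greedily form pairs, always taking the closest pair still available.

    `distances` maps (participant_a, participant_b) tuples to a distance.
    Sorting once and skipping already-paired participants is equivalent to
    repeatedly choosing the minimum among the remaining options.
    """
    paired = set()
    pairs = []
    candidates = sorted(distances.items(), key=lambda item: (item[1], item[0]))
    for (a, b), distance in candidates:
        if a in paired or b in paired:
            continue  # one of the two participants was already selected
        pairs.append((a, b, distance))
        paired.update((a, b))
    return pairs


# Hypothetical distances between four participants.
example = {
    ("P1", "P2"): 1.0,
    ("P1", "P3"): 0.5,
    ("P2", "P4"): 2.0,
    ("P3", "P4"): 0.7,
}
print(form_pairs(example))
```

In this example, P1 and P3 are paired first, which leaves only P2 and P4 for the second pair. A greedy choice of this kind does not guarantee the smallest total distance over all pairs: here the alternative pairing (P1, P2) and (P3, P4) has a lower total.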

3 Case for the Validation of the Protocol

A series of actions were taken to demonstrate the level of initial reliability and validity of the measurement instruments. The characteristics included in the formation of experimental pairs are: Comprehension of a Specification (CILP3), Comprehension of a Pseudocode (CILP4), Algorithmic Ability (CILP5), Theoretical Knowledge (CDLP2) and Comprehension of the Source Code (CDLP3).

To perform the initial pilot test, we recruited people of legal age who declared that they knew the C programming language, forming a group of 14 programmers. After each participant was characterized, they were asked to solve a task of medium complexity. Then, the protocol for the formation of experimental pairs of programmers was applied in order to determine whether the members of each experimental pair showed any significant differences in solving the task. If the experimental pairs formed by applying the protocol do not present significant differences when solving the same task using the same language, both the instruments and the protocol for the formation of experimental pairs can be deemed to have reached a stable version.

The results obtained from the characterization process are shown in Table 3. Table 4 presents the corresponding distance matrix. The intersections painted in black represent the subjects who should be ruled out since they have shown a distance over fifty percent in at least one of their dimensions. The experimental pairs obtained after applying the algorithm are highlighted in gray. Finally, the time differences between the programmers in each experimental pair are presented in Table 5.

Table 3. Normalized results of the characterization.
Table 4. Distance matrix
Table 5. Differences in time spent by each member of the experimental pair

With regard to the time spent to perform the characterization, the average time for the characterization that was independent of the programming language was 28 min; for the characterization dependent on the programming language, the average was 19 min; and for the interview, an average of 6 min was used for each participant. In Table 5, it can be observed that there are no significant differences between the members of the experimental pairs of programmers in relation to the time spent on solving the same task.

4 Conclusions and Future Work

With the aim of proposing a protocol for the formation of experimental pairs of programmers, a document analysis was conducted on the benefits brought by this type of experiment to the software engineering sector. Two instruments were designed to characterize programmers. Using the data obtained from these characterization instruments, a procedure for the formation of experimental pairs was defined.

Finally, a validation case was implemented to verify whether the members of the experimental pairs obtained showed any differences in solving the same programming task.

It can be concluded that in terms of the formation of experimental pairs of programmers, the protocol worked satisfactorily and it is thus considered to have an acceptable level of reliability, validity and objectivity since it was consistent in the results provided.

The future lines of work identified are the need to: (1) apply the protocol to other programming languages and (2) use the protocol to form experimental pairs of subjects using different programming languages in order to determine whether a given programming language affects computing productivity.