1 Introduction

The Object Constraint Language (OCL) [1] was defined to add semantics to UML models beyond those already enforced by the UML metamodel itself. OCL can support various modeling activities, such as giving precise meaning to state invariants and transition guards in UML state machines for automated test data generation [27]. Moreover, OCL can be used to specify constraints at various MOF levels (M3, M2, and M1) for different purposes, such as querying a subset of model elements from a model [1, 8], evaluating/validating [8–13] a specified constraint (e.g., for automated test oracles), and solving constraints [4, 14–29] (e.g., for automated test data generation).

While working with several industrial partners on diverse model-based engineering projects using UML, its extensions, and OCL, we found that using OCL with models can support various model-based engineering activities, including automated model-based testing, consistency checking, and configuration in the context of product line engineering [5, 6, 20, 30]. Moreover, OCL is widely used as the language for writing constraints in many commercial modeling tools such as IBM Rational Software Architect (RSA) and Magic Draw [31]. However, successfully applying OCL in practice comes at a cost: additional effort (in terms of training and tool support) is required to specify constraints in OCL, and because OCL is a declarative language, its formalism hinders its applicability in practice. To convince our industrial partners (who are not familiar with OCL) to invest in training and tool support, especially when there are alternatives such as Java with which practitioners are more familiar, it is important to provide evidence from a rigorous empirical study showing which constraint specification language is better and why.

Java is a commonly used programming language with which people in industry are usually familiar, and it may be used to specify constraints on UML models (e.g., [32, 33]). However, as it was not defined for the purpose of specifying constraints on UML models, it is less straightforward than OCL for tasks such as traversing model elements in UML models. Tools exist for both OCL and Java [9, 12, 30, 34, 35] for evaluating, querying, solving, and parsing formally specified constraints. For example, the commercial tool IBM RSA [33] and the open source tool Papyrus [32] both allow users to specify constraints in either Java or OCL and validate them against UML models.

Since there is no evidence in the literature suggesting that OCL is better than Java or vice versa in terms of specifying constraints on UML models, we conducted a controlled experiment to investigate this. Our main goal is to collect a body of evidence based on which we can recommend Java or OCL to our industrial partners for writing constraints on UML models in their respective applications. Moreover, we aim to build/gather preliminary evidence about OCL/Java for specifying constraints that is missing in the current literature (Sect. 5) that can be used by researchers and other practitioners to select a language for specifying constraints to solve their respective problems.

The controlled experiment was conducted with 29 fully trained graduate students taking a graduate course in ‘Empirical Software Engineering’ at Beihang University, Beijing, China. The course was given by the authors of the paper. Two case studies were used in the experiment. Quality measures (e.g., Completeness, Conformance, and Redundancy) were defined to evaluate the constraints specified by the experiment participants. Results show that the participants working with OCL and Java performed equally well. Additionally, we observed that the participants using OCL performed consistently well for constraints of all complexity levels, which was not the case for the participants using Java on the same constraints. Thus, we suggest using OCL for specifying constraints in industrial applications of model-based engineering, since constraints in such applications are complex, as we observed in our own industrial applications.

To further investigate the constraint specifications in Java and OCL using tools, we performed additional analyses. We selected 100 constraints with 100 % conformance and entered them into the tools to identify errors. Results show that more errors were identified in the Java specifications than in the OCL ones, which further leads to the conclusion that tools are needed to produce fully correct constraint specifications that can be used for supporting automation.

The rest of the paper is organized as follows. Section 2 provides details on the experiment planning. Section 3 reports and discusses experimental results. Section 4 points to possible threats to validity in our experiment. Section 5 reports the related work and we conclude the paper in Sect. 6.

2 Experiment planning

This section discusses the planning of the experiment according to the definition and reporting template defined by Wohlin et al. [36]. Section 2.1 provides experiment definition and hypotheses formulation. Section 2.2 provides details on the participants and training for the experiment. Section 2.3 provides details on the material that we used for the experiment. Section 2.4 provides metrics that we used to assess the quality of specified constraints. Last, Sect. 2.5 discusses the design of the experiment and its execution.

2.1 Experiment definition and hypotheses formulation

The objective of our experiment is to compare OCL and Java with respect to their applicability for specifying constraints on UML class diagrams. Applicability is assessed according to two criteria: the quality of the constraint specifications and the participants’ subjective opinions on the applicability of the two languages. We measure the quality of OCL and Java constraint specifications from three complementary points of view: Completeness, Conformance, and Redundancy. The subjective opinions (Applicability and ConfidenceLevel) were collected through two four-point Likert scale questions in the questionnaires: (1) to what extent the constraint is easy to specify and (2) to what extent the participant feels confident applying a language (Java or OCL) to specify constraints. The independent variable of interest is Method (OCL vs. Java). One further factor is taken into account in the statistical analyses: Constraint complexity. The five dependent variables and their measurement are discussed in detail in Sect. 2.4.

Based on the above variables, we can formulate the following null hypothesis \((\hbox {H}_{0})\) to be tested for each dependent variable: there is no significant difference between OCL and Java in terms of the five dependent variables. None of the expected differences between OCL and Java can a priori be certain to be in a specific direction. This therefore leads to the definition of two-tailed hypotheses \((\hbox {H}_{1})\) and it is stated as: OCL results in different quality of constraints or different responses to the two questions in the questionnaire when compared to Java. Hypotheses are provided in Table 1.

Table 1 Hypotheses

2.2 Participants and training

The controlled experiment was conducted at Beihang University, Beijing, China. The participants in the experiment were 29 graduate students taking a short-term but intensive graduate course in ‘Empirical Software Engineering’ at the Department of Computer Science and Engineering. The course was given by the authors of the paper. The students in this degree already hold a Bachelor in Computer Science and have already been exposed to the UML and OCL notations and have used Java for multiple course projects.

The participants were trained by the authors of this paper. Two three-hour sessions, as part of the course curriculum, were given on the following topics: (1) a recap of UML class diagrams, since the participants were already familiar with this topic before the training, (2) an introduction to OCL, and (3) a recap of Java. Each topic was accompanied by several examples and interactive class assignments. The participants were given a questionnaire with eight questions before the experiment sessions to collect information about their knowledge of and experience with UML class diagrams, OCL, and Java. The questionnaire is provided in “Appendix ” for reference. The questionnaire responses were combined (by giving equal weight to each question) to obtain a single value for each participant, indicating his/her general background in UML, OCL, and Java. These values were then used as the basis for grouping the participants into two blocks, thereby ensuring better homogeneity across the two groups involved in the experiment. The experiment was part of a series of compulsory laboratory exercises in the course curriculum.
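The scoring and blocking steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the answer scale (1 to 4), the median split into a lower and an upper block, and the helper names `background_score` and `form_groups` are hypothetical and not taken from the experiment material.

```python
import random


def background_score(answers):
    """Combine questionnaire answers into one value, weighting each question equally."""
    return sum(answers) / len(answers)


def form_groups(scores, seed=0):
    """Form two groups by blocking on the background score, then randomizing.

    Assumption: the two blocks are the lower and upper half of the ranked scores.
    """
    rng = random.Random(seed)
    ranked = sorted(scores, key=scores.get)
    half = len(ranked) // 2
    blocks = [ranked[:half], ranked[half:]]
    group1, group2 = [], []
    for block in blocks:
        members = list(block)
        rng.shuffle(members)          # randomization within the block
        mid = len(members) // 2
        group1.extend(members[:mid])  # each group gets half of every block
        group2.extend(members[mid:])
    return group1, group2


# Hypothetical data: 32 enrolled participants, eight questions each.
data_rng = random.Random(1)
answers = {f"P{i:02d}": [data_rng.randint(1, 4) for _ in range(8)] for i in range(32)}
scores = {pid: background_score(a) for pid, a in answers.items()}
group1, group2 = form_groups(scores)
```

Drawing half of each block into each group keeps the proportion of participants from each block similar across the two groups, which is the purpose of blocking here.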

2.3 Materials

We used two systems in the experiment: Banking System and Video Conferencing System (VCS) [5]. Banking System is an extended version of the Banking System from the OCL 2.2 specification [1]. We chose this system as one of the case studies because its context is relatively easy to understand. We selected VCS as the second case study because it is a real industrial system developed by Cisco Systems, Norway. This case study is part of a project aiming at supporting automated, model-based testing of a core subsystem of a VCS called Saturn [5].

For Banking System, a bank has several employees and customers. Each customer can have at most two accounts in a bank: one saving account and one current account. A customer must be employed in a company or own a company in order to have a bank account. An employee of a bank can also be its customer, having at most two accounts in the bank. For the VCS case study, the core functionality to be modeled manages the sending and receiving of multimedia streams. Audio and video signals are sent through separate channels, and presentations can also be transmitted in parallel with audio and video. One conference participant can send a presentation to all others, in parallel with the ongoing video call. This core functionality was used in the experiment.

In the answer sheet provided to the participants for each system and each method, we provided (1) a brief description of the system, (2) the class diagram on which constraints should be specified, (3) the description of each attribute of the classes in the class diagrams, and (4) a list of constraints that the participants should specify using either OCL or Java. Hence, we designed four answer sheets for the four combinations of the two methods and the two case study systems. The content of the answer sheets designed for Banking System and VCS is provided in “Appendices and ”, respectively.

2.3.1 Complexity metrics

To enable the participants to tackle increasingly more complex constraints to smooth the learning curve, we ordered the constraints according to their complexity, which is measured by applying these four metrics sequentially:

  1. Maximum number of traversals in all the clauses of a constraint \((n_{traversals})\)

  2. Number of required attribute types \((n_{types})\)

  3. Order of the complexity of the attribute types \((o_{typeComplexity})\), and

  4. Number of clauses \((n_{clausesRequired})\).

We use Banking System as the example to explain the above metrics. The class diagram of the system is provided in Fig. 1, and three constraints specified in English, along with their values for the complexity metrics, are provided in Table 2.

Fig. 1 Class diagram of Banking System

Table 2 Values of complexity metrics and specifications in OCL and Java of selected constraints

\(n_{traversals}\) is defined as the number of steps from the context class to the farthest class on whose primitive attributes a constraint is specified. For example, Constraint A presented in Table 2 has one traversal, from Bank (the context class) to Customer. Constraint B contains two traversals, from Bank (the context class), via Account, to Customer. Constraint C is the most complex of the three, as it has three traversals, from the context class Bank, via Customer and Account, to Saving or Current.

Four types of primitive attributes appear in the given constraints: Boolean, Enumeration, Integer, and String, listed here from the simplest to the most complex in terms of specifying constraints. For example, as shown in Table 2, Constraint A involves one attribute, isEmployed: Boolean, owned by class Customer. Constraint B involves one integer attribute, accountNumber: Integer (of class Account), and one operation returning an integer (size() in OCL, length in Java) to check that each account is linked to exactly one customer. Constraint C needs one integer operation (size() in OCL, length in Java) to check that a customer can only have a saving account when he/she has a current account, i.e., account->size() = 2. It also needs one Boolean operation (oclIsTypeOf() in OCL, instanceof in Java) to check the type of account. Checking the reverse again requires one Boolean and one Integer.

Number of clauses \((n_{clausesRequired})\) is defined as the total number of clauses required in a constraint specification. We first ordered the constraints according to the maximum number of traversals \((n_{traversals})\) and then, when two or more constraints have the same maximum number of traversals, according to the number of required attribute types \((n_{types})\). If \(n_{types}\) is also equal for two or more constraints, we further compare \(o_{typeComplexity}\) and then \(n_{clausesRequired}\).
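The sequential application of the four metrics amounts to a lexicographic sort. The following sketch illustrates this. Only the traversal counts of Constraints A, B, and C and the two clauses of Constraint B come from the text; all other metric values, the hypothetical Constraint D, and the encoding of \(o_{typeComplexity}\) as the rank of the most complex type involved (Boolean = 1, Enumeration = 2, Integer = 3, String = 4) are assumptions made for illustration.

```python
from typing import NamedTuple


class Complexity(NamedTuple):
    """Complexity metrics in the order in which they are applied."""
    n_traversals: int
    n_types: int
    o_type_complexity: int   # assumed encoding: rank of the most complex type
    n_clauses_required: int


# Illustrative values; "D" is a made-up constraint that shows the tie-breaking.
constraints = {
    "C": Complexity(3, 2, 3, 2),
    "B": Complexity(2, 1, 3, 2),
    "D": Complexity(2, 1, 1, 1),
    "A": Complexity(1, 1, 1, 1),
}

# Tuples compare element by element, so sorting on the tuple applies the
# metrics sequentially: ties on one metric are broken by the next one.
ordered = sorted(constraints, key=constraints.get)
print(ordered)
```

Here B and D tie on \(n_{traversals}\) and \(n_{types}\), so \(o_{typeComplexity}\) decides their relative order.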

2.3.2 Post-questionnaire

A post-questionnaire was distributed after the students finished the tasks of specifying constraints in each round. The objective of the questionnaire is to obtain the participants’ subjective opinions on the applicability of Java and OCL and on their confidence in applying the requested method (Applicability and ConfidenceLevel). As shown in Table 3, the questionnaire has four statements, each rated on a 4-point Likert scale. The first statement is asked once for each system. The other three statements are asked for each constraint of each system. The third statement requires the participants to rate each constraint according to the extent to which they perceive it to be easy to apply (Applicability). The last statement was used to obtain the participants’ subjective opinions on their confidence after each constraint was applied (ConfidenceLevel). Notice that the same questionnaire is used for collecting information for both methods. The questionnaire is presented to the students as a table, with instructions, as shown in “Appendix ”.

Table 3 Questionnaire design

2.4 Dependent variables and measurement

As previously discussed, in total, we defined five dependent variables. Their measurement is described below. To illustrate each dependent variable, we use Banking System as the running example.

2.4.1 Quality of constraints

We measure the quality of a constraint from three aspects: Completeness, Conformance, and Redundancy. These quality metrics are used for measuring both OCL and Java constraints. In this section, we discuss how these three aspects are measured using a set of metrics. Several OCL and Java constraints specified by the students during the experiment are provided in “Appendix ” for reference.

\(Completeness\,(Completeness_{Constraint})\): This metric measures the percentage of the expected clauses that are present in a constraint specification, with the formula below:

$$\begin{aligned} 1-\frac{Number\,of\,missing\,clauses}{n_{Clauses} }, \end{aligned}$$

where \(n_{Clauses}\) is the total number of clauses expected from a constraint specification. For example, if a specification of Constraint B (Table 2) only describes clause “all accounts in the bank must have unique account numbers,” then the completeness of this specification is \(1-(1/2) = 50\) % since the constraint contains the other clause: “each account must be linked to exactly one customer.”
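As a small sanity check of the formula, the Constraint B example above can be reproduced in a few lines. The function name `completeness` and the input validation are illustrative choices, not part of the original measurement procedure.

```python
def completeness(n_clauses, n_missing):
    """Completeness of a constraint specification: 1 - missing / expected clauses."""
    if n_clauses <= 0:
        raise ValueError("n_clauses must be positive")
    if not 0 <= n_missing <= n_clauses:
        raise ValueError("n_missing must lie between 0 and n_clauses")
    return 1 - n_missing / n_clauses


# Constraint B expects two clauses; the example specification misses one.
print(completeness(n_clauses=2, n_missing=1))  # 0.5
```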

\(Conformance\, (Conformance_{Constraint})\): This metric measures the conformance of a constraint specification, using the formula below:

$$\begin{aligned} \frac{\hbox {Completeness}_{\mathrm{Traversal}} +\hbox {Conformance}_{\mathrm{Iteration}} +\hbox {Conformance}_{\mathrm{Condition}} }{{x}}\times \hbox {Completeness}_{\mathrm{Constraint}} . \end{aligned} \tag{1}$$

\(Completeness_{Traversal}\), \(Conformance_{Iteration}\) and \(Conformance_{Condition}\) are the three aspects for measuring the conformance of a constraint, with equal weights. Traversals in a clause of the constraint are the steps required for traveling from the context model element (e.g., a class) to the destination model element, as we discussed in Sect. 2.3.1. We therefore define the metric below to calculate the overall completeness of the traversals of a constraint:

$$\begin{aligned}&Completeness_{Traversal} \\&\quad =\frac{\sum _{i=1}^{n_{Clauses} } \left( {1-\frac{Number\,of\,missing\,traversals\,in\,clause\,i}{Total\,number\,of\,traversals\,required\,for\,clause\,i}} \right) }{n_{Clauses} }. \end{aligned}$$

For example, if a specification of Constraint B (Table 2) only describes clause “each account must be linked to exactly one customer,” and the specification only describes the traversal from Bank to Account, then the completeness of traversals for this particular specification is calculated as: \(\tfrac{\left( {1-\tfrac{1}{2}} \right) +0}{2}=25\,\%\).
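The traversal example can be checked with the sketch below. Representing a clause that is not specified at all as `None` (contributing 0 to the sum) is an assumption made for illustration; it matches the 0 term in the calculation above.

```python
def completeness_traversal(clauses):
    """Average traversal completeness over all expected clauses.

    clauses: one entry per expected clause, either a (missing, required) pair
    of traversal counts or None when the clause is absent from the specification.
    """
    total = 0.0
    for clause in clauses:
        if clause is None:
            continue  # an absent clause contributes 0
        missing, required = clause
        total += 1 - missing / required
    return total / len(clauses)


# Constraint B: one clause misses one of its two traversals, the other clause is absent.
print(completeness_traversal([(1, 2), None]))  # 0.25
```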

\(Conformance_{Iteration}\) indicates whether manipulating (or iterating over) a collection of objects, which is frequently needed for constraints specified on class diagrams due to the multiplicities on associations, is correct. The following metric is used to compute its values:

$$\begin{aligned} Conformance_{Iteration} =\frac{\sum _{i=1}^{n_{Clauses} } \left( {Conformance_{Iteration} i} \right) }{n_{Clauses} }, \end{aligned}$$

in which \(Conformance_{Iteration}i\) is the conformance of iteration for clause \(i\), defined as follows: if the iteration is totally wrong, a value of 0 is assigned; if it is partially correct, 0.5 is assigned; if it is fully correct, 1 is assigned.

\(Conformance_{Condition}\) takes into account the conformance of the condition required for each clause in a constraint specification. As for conformance of iteration, if it is fully correct, partially correct, or fully wrong, in terms of corresponding to the provided input, 1, 0.5, and 0 are assigned, respectively. Examples to calculate conformance based on students’ solutions are provided in “Appendix ”.

In Formula (1), \(x\) is determined by whether iterations are required to specify a constraint. In our experiments, five constraints of VCS do not have iterations. Therefore, for these cases, \(x\) equals 2; otherwise 3. The average conformance of the three aspects times the completeness gives us an overall conformance of the constraints, as shown in Formula (1).
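Formula (1), including the choice of \(x\), can be sketched as follows. The function and parameter names are illustrative, passing `None` for the iteration aspect is simply one way to model constraints without iterations, and the input values in the usage lines are made up.

```python
def conformance(completeness_traversal, conformance_condition,
                completeness_constraint, conformance_iteration=None):
    """Overall conformance of a constraint specification, following Formula (1).

    Pass conformance_iteration=None for a constraint that requires no iteration;
    the average is then taken over two aspects (x = 2) instead of three (x = 3).
    """
    aspects = [completeness_traversal, conformance_condition]
    if conformance_iteration is not None:
        aspects.append(conformance_iteration)
    x = len(aspects)
    return sum(aspects) / x * completeness_constraint


# Hypothetical scores for a constraint with iterations (x = 3).
with_iteration = conformance(0.25, 0.5, 0.5, conformance_iteration=0.5)

# Hypothetical scores for a constraint without iterations (x = 2).
without_iteration = conformance(1.0, 0.5, 1.0)
print(with_iteration, without_iteration)
```

Multiplying by the completeness of the constraint means that a specification cannot reach full conformance when expected clauses are missing, however well the specified clauses are written.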

\(Redundancy\,(Redundancy_{Constraint})\): This metric measures the proportion of extra clauses, which are considered redundant, among all the clauses of a constraint specification:

$$\begin{aligned} \frac{Number\,of\,extra\,clauses}{Number\,of\,extra\,clauses+n_{Clauses} -Number\,of\,missing\,clauses}. \end{aligned}$$
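The redundancy formula can be sketched in the same way. The function name and the convention of returning 0 for an empty specification (where the denominator would be zero) are assumptions, as the text does not discuss that case.

```python
def redundancy(n_extra, n_clauses, n_missing):
    """Share of extra clauses among all clauses actually written.

    The denominator is the number of written clauses: the extra clauses plus
    the expected clauses that are not missing.
    """
    written = n_extra + n_clauses - n_missing
    if written == 0:
        return 0.0  # assumed convention for an empty specification
    return n_extra / written


# Hypothetical case: both expected clauses are present plus one extra clause.
print(redundancy(n_extra=1, n_clauses=2, n_missing=0))
```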

2.4.2 Applicability and confidence level

Applicability and ConfidenceLevel are two subjective measures used to assess the two languages (i.e., OCL and Java), and they are based on the responses to two 4-point Likert Scale questions of the post-questionnaire, from 1 (Completely disagree) to 4 (Completely agree).

2.4.3 Complexity of constraints

We measure the complexity of constraints as an ordinal variable with three levels: Low, Medium, and High. As discussed in Sect. 2.3.1, a set of criteria (i.e., \(n_{traversals}\), \(n_{types}\), \(o_{typeComplexity}\), and \(n_{clausesRequired}\)) was used to order the 10 constraints of each case study system. To define the complexity of constraints across the two case study systems, we first ordered the 20 constraints (10 for Banking and 10 for VCS) using the same set of criteria. This order was then divided into three groups, which correspond to the three levels of our constraint complexity measurement. The division is based on the objective of balancing the number of constraints falling into each level. As a result, for VCS, three, four and three constraints fall into the three categories (Low, Medium, and High), respectively. For Banking, four, two and four constraints are classified into the Low, Medium, and High categories, respectively. In total, eight, six and six constraints are at Low, Medium, and High levels for all the 20 constraints of the two systems. For example, according to this mechanism, the complexity of Constraint A, Constraint B and Constraint C presented in Table 2 is classified as Low, Medium, and High, respectively.

2.5 Experiment design and execution

The design of our experiment is summarized in Table 4. We used a within-subjects design since we have two systems and two languages (Java and OCL). During the training sessions (Sect. 2.2), each subject was equally trained in the two languages: OCL and Java. Based on the results of a questionnaire (Sect. 2.2), the experiment groups were formed through randomization and blocking to obtain two comparable groups of 16 students each (Group 1 and Group 2) with similar proportions of students from each block. In the first round, Group 1 was asked to specify constraints of Banking using OCL, whereas Group 2 was asked to use Java instead.

Table 4 Experiment design

Such a within-subjects design offers two main advantages. First, with it, we can reduce the error variance due to individual differences in human performance, which is quite common in software engineering tasks. This is due to the fact that the same group of students is exposed to both languages across the different systems. Second, within-subjects designs provide more statistical power as compared to a between-subjects design [36] as it leads to more observations for each treatment. Potential threats from within-subjects designs are “carryover” effects. To address this, for each system, each group was given a different treatment in such a way that ordering effects were counterbalanced: languages, i.e., OCL and Java occurred once in a different order across the two groups. For example, as shown in Table 4, in round 1, Group 1 was asked to specify constraints for Banking in OCL, whereas Group 2 to specify constraints for Banking in Java. Note that in the experiment, in the first round, Banking was used, and in the second round, VCS was used. The purpose is to enable the participants to tackle increasingly more complex models and constraints. With a within-subjects design, a matched pair analysis can be applied by comparing the performance of subjects with themselves across treatments (Sect. 3.2).

After the training sessions (Sect. 2.2), 32 participants were enrolled in the experiment and divided into two groups of 16 participants each. During the experiment, all 16 participants of Group 1 and 13 of the 16 participants of Group 2 took part and completed the tasks. Each participant was asked to specify 10 constraints using either OCL or Java for each system. Therefore, we obtained 580 data points in total; their decomposition is provided in Table 4 for reference.
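The total follows directly from the participation figures above: the number of participants who completed the tasks, times the 10 constraints per system, times the two systems.

$$\begin{aligned} (16+13)\times 10\times 2 = 29\times 20 = 580. \end{aligned}$$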

At the beginning of the experiment, an answer sheet containing a brief description of the system, the class diagram on which the constraints were to be specified, and the instructions for the tasks to perform was distributed to the participants. The participants were given 15 minutes to read the answer sheet and had an opportunity to raise questions about it. Then, the authors of the paper explained the system and its class diagram to all the participants. After these preparation activities, the first constraint was shown to the students via a classroom projector screen. At the same time, a 10-minute timer was started. This process was repeated 10 times until all the constraints were specified. Notice that before the experiment was conducted, two students were asked to specify the constraints, and the average time spent on specifying one constraint was around 10 minutes. After that, the post-questionnaire was distributed. Fixing the time for task execution tends to yield more differences in task effectiveness, but the results then cannot be used to study time differences across treatments [36].

The students used pens during the experiment to record the results on the provided answer sheets (“Appendices and ”), which were collected after each task. We understand that it would be closer to reality to use OCL and Java tools in the experiment. However, we considered that selecting which tools to use would pose an internal threat to validity, as various OCL and Java tools exist in the market, and the choice among them, even in practice, depends heavily on the application context and has no single answer. It is also worth noting that the scope of this experiment is to evaluate how well OCL and Java can be used to specify constraints, not to evaluate particular tools.

The authors of the paper carefully checked the collected answer sheets and evaluated the derived constraints based on the defined quality metrics (Sect. 2.4.1). The data were encoded into a JMP [37] data file to perform the statistical analysis.

3 Results and discussion

In this section, we present results and discussions to test the hypothesis formulated in Sect. 2.1. We first provide descriptive statistics of the dependent variables in Sect. 3.1. In Sect. 3.2, we report the results of the univariate analysis that we conducted to test the significant difference of OCL and Java in terms of the five dependent variables by Method and by Complexity of constraint, respectively. To further analyze whether the objective quality measures and the two subjective measures correlate to each other, we conducted correlation analysis (Sect. 3.3).

3.1 Descriptive statistics

The descriptive statistics for all the dependent variables are provided in Tables 5, 6, and 7. The overall observation is that, regardless of which method was used, the participants performed well in terms of the three quality metrics: high Completeness (91 and 89 % for OCL and Java, respectively), reasonable Conformance (71 and 73 % for OCL and Java, respectively), and low Redundancy (2 and 4 % for OCL and Java, respectively). Another observation is that the mean values of the three quality metrics differ little between OCL and Java (rows 5, 7 and 9 of Tables 5, 6). To check whether these differences are significant, we performed statistical tests, which are discussed in Sect. 3.2.

Table 5 Descriptive statistics for Completeness, Conformance, and Redundancy—OCL
Table 6 Descriptive statistics for Completeness, Conformance, and Redundancy—Java
Table 7 Descriptive statistics for Applicability and ConfidenceLevel

Looking at the mean values for each method across the three constraint complexity levels, the participants using OCL performed consistently well across all the levels (Table 5). For Java, Completeness differs by 6 % between Medium and High and by 10 % between Medium and Low, and Conformance differs by 13 % between Medium and High and by 12 % between Medium and Low (Table 6). This result indicates that the participants performed differently when they were specifying constraints of different complexity levels using Java. Further statistical analysis was conducted to check the significance, and results are reported in Sect. 3.2.

Regarding the two subjective, Likert scale measures, the participants subjectively thought constraints with higher complexity were more difficult to specify and they had less confidence. This observation applies to both Java and OCL. For example, for OCL, constraints with Low, Medium and High complexity, received 51, 31, and 25 % Applicability as shown in Table 7. Further statistical analysis was conducted to check the significance and results are reported in Sect. 3.2.

3.2 Univariate analysis

3.2.1 Dependent variables by method

Because the distributions of all the continuous dependent variables depart strongly from normality, as the results of the Shapiro–Wilk W test [38] showed, we performed the nonparametric, Matched Pair, Wilcoxon rank sum test [38]; results are reported in Table 8. Each row reports one dependent variable for each group of participants or for the two groups together (1+2). Columns show the mean differences, degrees of freedom (DF), and the corresponding probability for the Wilcoxon test.

Table 8 Two-tailed matched pair Wilcoxon test with the significance level \(\alpha = 0.05\) (dependent variables by method)

Group 1 used OCL to specify constraints for Banking in the first round and Java to specify constraints for VCS in the second round, as shown in Table 4, and the matched pairs were formed based on the data collected for these two tasks. A pair in our context is the same student specifying a constraint at a given level of complexity using OCL in the first round and specifying a constraint of equivalent complexity using Java in the second round. The same strategy was followed for matching the results of applying OCL and Java by Group 2.
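To make the pairing concrete, the sketch below computes the rank sums of a matched-pair comparison in plain Python. It assumes the signed-rank form of the Wilcoxon test, which is the usual choice for matched pairs, and it stops at the test statistic without deriving a p value. The scores are made up and the function name is hypothetical; the actual analysis was performed in JMP.

```python
def signed_rank_sums(ocl_scores, java_scores):
    """Return (W+, W-): rank sums of positive and negative paired differences.

    Zero differences are dropped and tied absolute differences receive the
    average of the ranks they span.
    """
    diffs = [a - b for a, b in zip(ocl_scores, java_scores) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Made-up Completeness scores of five students: OCL round vs. Java round.
ocl = [1.0, 0.5, 1.0, 0.75, 1.0]
java = [0.5, 0.5, 0.75, 1.0, 0.0]
w_plus, w_minus = signed_rank_sums(ocl, java)
print(w_plus, w_minus)  # 8.5 1.5
```

A large imbalance between the two rank sums suggests that one language tends to score higher within the same student, which is what the matched-pair design is meant to expose.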

The participants in Group 1 performed significantly better when using OCL than when using Java in terms of the quality metrics Completeness and Redundancy, as shown in Table 8 (the values highlighted in bold in Column 5). No significant difference was observed between the two methods regarding Conformance, although the participants performed slightly better with Java than with OCL (notice the negative value in Row 4 and Column 3: \(-\)0.025). Regarding the participants’ subjective opinions on the applicability of the methods and their confidence in applying them, no significant difference was observed between the two methods. For Group 2, no significant difference was observed between the two methods for any of the five dependent variables. When the results from the two groups are combined, OCL yielded better performance than Java in terms of Redundancy, implying that the participants working with Java introduced significantly more redundant clauses than the participants working with OCL.

3.2.2 Dependent variables by Complexity

As discussed in Sect. 2.4, we classified all the 20 constraints of the two systems into three categories: Low, Medium, and High. For continuous data, to test whether the dependent variables are significantly different given different levels of complexity, we performed the Kruskal–Wallis one-way analysis of variance test [38]. It is a nonparametric equivalent of the one-way ANOVA test, since the distributions of all the continuous dependent variables strongly depart from normality as the results of the Shapiro–Wilk W test showed. To compare each pair of complexity levels, we performed the Wilcoxon Signed-Rank test [38]. Results are provided in Table 9. For ordinal data, the Pearson Chi-square test [38] was performed and results are also reported in Table 9.
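For readers unfamiliar with the Kruskal–Wallis test, the following sketch computes its H statistic from pooled ranks. It is a simplified illustration: it omits the tie correction and the p value that statistical packages report, the function name is hypothetical, and the scores are made up.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of samples."""
    pooled = sorted((value, g) for g, values in enumerate(groups) for value in values)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied values so they share an average rank.
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            rank_sums[pooled[k][1]] += average_rank
        i = j + 1
    weighted = sum(r * r / len(values) for r, values in zip(rank_sums, groups))
    return 12 / (n * (n + 1)) * weighted - 3 * (n + 1)


# Made-up Conformance scores at the three complexity levels.
low = [0.9, 1.0, 0.8]
medium = [0.5, 0.6, 0.7]
high = [0.75, 0.85, 0.95]
h = kruskal_wallis_h([low, medium, high])
print(round(h, 3))  # 5.6
```

H grows as the rank sums of the groups drift apart; it is compared against a chi-square distribution with (number of groups − 1) degrees of freedom to obtain a p value.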

Table 9 Wilcoxon test with the significance level \(\alpha = 0.05\) (dependent variables by Complexity)

As shown in Table 9 (Rows 5, 6, 10 and 11, and Column 3), for Applicability and ConfidenceLevel, significant differences were observed between any two Complexity levels. When we further looked into the details, we observed that for more complex constraints, the participants had significantly less confidence and thought the given method was significantly more difficult to apply. This observation is consistent for both OCL and Java.

Regarding the three quality measures with continuous data, results of the Wilcoxon pair tests show that there is no significant difference for any constraint complexity level pair for any measure of OCL, as shown in Table 9 (Rows 3–5 and Columns 4–6). This implies that OCL consistently performed well (with 91 % Completeness, 71 % Conformance, and 2 % Redundancy on average) for all constraints at the different levels of complexity.

For Java, as shown in Table 9, a significant difference was observed for Completeness between the pair Medium-Low in favor of Low (notice that ‘(\(-\))’ attached to the p values in the table indicates the direction of the differences). As shown in Fig. 2, constraints with Medium complexity specified using Java obtained lower Completeness than the ones with High complexity, though no significant difference was identified (see Row 8, Column 4 of Table 9). As shown in Row 9 and Columns 4 and 5 of Table 9, constraints with Medium complexity specified by the participants using Java have significantly lower Conformance than constraints with the other two levels of complexity (on average 13 or 12 % lower than the constraints classified as High or Low complexity, respectively, as shown in Table 6). This result indicates that the complexity of constraints has an impact on the quality of the specified constraints when Java is used. Recall that in our experiment, the ten constraints for each system were ordered from the simplest to the most complex. The participants therefore proceeded from the simplest constraints (Low) to the more complex ones (Medium and High). For constraints with Low complexity, the participants performed well. As the complexity increased, their performance gradually decreased. However, after finishing roughly two-thirds of the constraints, they had gained experience in applying Java for specifying constraints, which eventually led them to perform well on the constraints with High complexity.

Fig. 2 Mean diamonds graph for Completeness (a) and Conformance (b) by Complexity of Java

3.3 Correlation analysis

It is also interesting to know whether there are correlations between the quality measures (i.e., Completeness, Conformance, and Redundancy) and the measures capturing the participants’ subjective opinions on the applicability of the two methods: Applicability and ConfidenceLevel. To this end, we conducted the nonparametric Spearman’s \(\rho \) test [38]. Results are reported in Table 10. Spearman’s \(\rho \) measures the strength of the monotonic dependence between two variables. The value of \(\rho \) ranges from \(-\)1 to +1. A value of 0 means that there is no monotonic dependence between the two variables. A positive value means that the value of one variable tends to increase as the value of the other increases. A negative value means that the value of one variable tends to decrease as the value of the other increases. In addition to \(\rho \), a \(p\) value is reported to show the significance of the relationship.
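
As an illustration of how \(\rho \) is obtained, the minimal Java sketch below ranks both variables (ties receive the average of their positions) and then computes the Pearson correlation of the two rank vectors. The class name, the method names, and the sample data are hypothetical and are not taken from the experiment, and the sketch does not compute the \(p\) value.

```java
import java.util.Arrays;

// Illustrative sketch only: Spearman's rho as the Pearson correlation of ranks.
// Names and data are hypothetical.
class SpearmanSketch {

    // Assigns 1-based ranks; tied values share the average of their positions.
    static double[] averageRanks(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));

        double[] ranks = new double[n];
        int i = 0;
        while (i < n) {
            int j = i;
            while (j + 1 < n && values[order[j + 1]] == values[order[i]]) j++;
            double shared = (i + j) / 2.0 + 1.0; // mean of positions i..j, 1-based
            for (int k = i; k <= j; k++) ranks[order[k]] = shared;
            i = j + 1;
        }
        return ranks;
    }

    // Returns rho in [-1, +1]; the result is NaN if either input is constant.
    static double rho(double[] x, double[] y) {
        double[] rx = averageRanks(x);
        double[] ry = averageRanks(y);
        int n = rx.length;
        double mean = (n + 1) / 2.0; // the mean of n ranks is always (n + 1) / 2
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = rx[i] - mean;
            double dy = ry[i] - mean;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        // Hypothetical ConfidenceLevel scores and Completeness values.
        double[] confidence = {1, 2, 3, 4, 5};
        double[] completeness = {0.5, 0.6, 0.7, 0.8, 0.7};
        System.out.println("rho = " + rho(confidence, completeness));
    }
}
```

For the hypothetical data in `main`, the sketch yields a value of about 0.82, that is, a strong positive monotonic relationship.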

Table 10 Correlation analysis among dependent variables with the significance level \(\alpha = 0.05\)

From Table 10, Rows 1–2 and Column 4, one can observe that Completeness and Conformance are significantly correlated with ConfidenceLevel and Applicability, as the p values are less than 0.05. This result indicates that the objective measures of the quality of constraints and the subjective opinions of the participants on the two methods are consistent. In other words, a participant who was more confident in applying a method to specify a constraint, and who thought the method was easier to apply, specified constraints of higher quality. A significant correlation was also identified between ConfidenceLevel and Applicability, which implies that participants who were confident also thought the methods were easy to apply.

3.4 Additional analysis

We also conducted additional analyses to understand how far the students’ derived constraint specifications are from fully correct specifications, in the sense of being ready to be processed by tools for supporting automation. From the constraint specifications derived by the students, we selected 25 specifications that achieved 100 % conformance for each system and each method. In total, 100 constraint specifications were selected. The OCL constraints were checked with IBM RSA, and the Java specifications were checked with the Eclipse IDE for Java development.

Out of the 25 selected OCL constraints for the Banking System, 12 contained errors that need to be fixed before the constraints can be used for supporting automation. For VCS, 15 OCL constraints contained errors. We report the identified error types and the number of errors in Table 11. One can see that most of the errors are due to syntactically incorrect references to enumeration literals. In OCL, an enumeration literal is referred to as “Enumeration Name::Enumeration Literal.” Some students mistakenly referred to enumeration literals in one of the following ways: (1) referring to the enumeration literal as in Java, i.e., “Enumeration Name.Enumeration Literal”; (2) referring only to the name of the enumeration literal, i.e., “Enumeration Literal.”
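
The difference between the two notations can be seen in the following minimal Java sketch. The enumeration AccountStatus, the class Account, and the constraint itself are hypothetical and are not taken from the Banking System model used in the experiment. The corresponding OCL invariant is given as a comment.

```java
// Illustrative sketch only: all names below are hypothetical.
class EnumLiteralSketch {

    enum AccountStatus { ACTIVE, SUSPENDED, CLOSED }

    static class Account {
        final AccountStatus status;
        final double balance;

        Account(AccountStatus status, double balance) {
            this.status = status;
            this.balance = balance;
        }
    }

    // Java qualifies the literal with a dot: AccountStatus.CLOSED.
    //
    // The same constraint in OCL qualifies the literal with "::":
    //   context Account
    //   inv: self.status = AccountStatus::CLOSED implies self.balance = 0
    static boolean closedAccountIsEmpty(Account a) {
        return a.status != AccountStatus.CLOSED || a.balance == 0.0;
    }

    public static void main(String[] args) {
        Account closed = new Account(AccountStatus.CLOSED, 0.0);
        System.out.println(closedAccountIsEmpty(closed));
    }
}
```

Writing the OCL invariant with the Java-style dot, or with the bare literal name, produces exactly the two kinds of errors described above.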

Table 11 Errors observed when tools were used (OCL)

For the 25 selected Java specifications for the Banking System, ten contained one or more errors. For VCS, nine out of 25 contained one or more errors; three of them were caused by the incorrect design of the class diagram. As can be seen in Fig. 5, the cardinalities from Saturn to the SIP and \(H323\) classes are exactly one. In the correct design, they should have been “\(0..1\).” In addition, for VCS we found seven instances where the students referred to objects in a syntactically wrong way (J1 in Table 12). For example, an object \(s\) of type Saturn was given to the students as the starting point for traversal, but in these seven instances, the students did not start the traversal from \(s\). We report the identified error types and the number of errors in Table 12. In total, 15 errors were identified in the 10 constraints for Banking and 20 errors were identified in the 9 constraints for VCS.

Table 12 Errors observed when tools were used (Java)

Based on the total number of errors observed, it seems that constraints specified in Java contain more errors than constraints specified in OCL. However, to confirm this, another controlled experiment is needed in which students specify constraints directly in Java and OCL tools. We plan to conduct such an experiment in the future. Another observation is that tools are needed for specifying fully correct OCL and Java constraints. We can also learn from the results of the experiment that the error types reported in Tables 11 and 12 are easy to fix with tool support.

3.5 Overall discussion

Based on the results presented in Sects. 3.1, 3.2 and 3.3, we can observe that the participants performed equally well when specifying constraints using OCL and Java, as shown in Table 5. The mean Completeness is 91 % for OCL and 89 % for Java, whereas the mean Conformance is 73 % for OCL and 71 % for Java. Moreover, the specified constraints have low mean Redundancy, i.e., 1 % for OCL and 4 % for Java (Table 5).

Based on the results presented in Sect. 3.2, we observed that the participants who worked with OCL performed consistently well when specifying constraints of varying complexity; however, this was not the case for Java. Based on these results, it is apparent that, overall, it does not make much difference whether OCL or Java is used for specifying constraints on UML models. However, since the performance of the participants working with OCL is not affected by the complexity of constraints, we recommend using OCL for specifying constraints on UML models. Even when one has to specify complex constraints, as is the case in most industrial applications, we expect better performance with OCL than with Java.

Notice that the purpose of the experiment is to compare OCL and Java in terms of specifying constraints on UML models at design time; we were not interested in studying the runtime details of Java and OCL. Moreover, some OCL evaluators such as Dresden OCL [10] and Eclipse OCL [8] translate OCL constraints into Java for evaluation at the backend, and thus the runtime issues of Java apply to OCL as well.

It is important to point out that the motivation of the work is to test the capability of human subjects in terms of specifying constraints using OCL and Java. We do not aim to use, in the context of this controlled experiment, manually derived specifications for any particular purpose (e.g., generating test data). Therefore, evaluating constraints is out of the scope of this experiment. In addition, evaluating an OCL constraint and executing a Java program require OCL evaluators and Java compilers, respectively, and thus are out of the scope of this controlled experiment.

In this experiment, we only measure the semantic conformance of derived OCL and Java constraint specifications against the provided English specifications and class diagrams. Since we focus on comparing OCL and Java in terms of specifying constraints by mentally understanding UML class diagrams and constraints written in English, measuring the syntactic conformance of these constraint specifications is not within the scope of this experiment. The reason is that a tool can easily check the syntactic conformance of OCL and Java constraint specifications, but it cannot ensure their semantic conformance against requirements. Validating their semantic conformance has to be manual, which leads to the definition of the complexity metrics as discussed in Sect. 2.3.1.

4 Threats to validity

Below, we discuss the threats to validity of our controlled experiment based on the concepts discussed in [36]. Conclusion validity threats are concerned with factors that can influence the conclusions that can be drawn from the results of the experiment. As with most controlled experiments in software engineering, our main conclusion validity threat is related to the sample size on which we base our analysis. To deal with this, our experiment design required modeling 10 constraints per system (20 in total for two systems) to maximize the number of observations within time constraints. The other concern is that the quality of a constraint specification can be interpreted in various ways, depending on one’s subjective opinion. However, we made an effort to minimize subjective judgments by proposing a set of objective metrics to measure the quality of constraints. By doing so, subjective perceptions can be reduced to a minimum and the comparison of constraints derived by different participants becomes possible.

Internal validity threats exist when the outcome is influenced by confounding factors and is not necessarily due to the application of the treatment being studied. Through our experiment design, we have tried to minimize the chances of other factors being confounded with our primary independent variable: the use of OCL and Java. We used a within-subjects design and matched pairs analysis, since the strength of this design is that the variation due to differences in participants is eliminated as each participant acts as his or her own control. We avoided any biased assignment of participants to groups by randomization and blocking based on questionnaire results. The experiment participants were provided with constraints written in English as the input, which are inherently ambiguous. This might have had an impact on the quality of the derived OCL and Java constraints. However, it is worth noticing that the participants using either OCL or Java for the same system were provided with the same set of constraints in English. Therefore, we do not expect this threat to have any impact on the comparison of OCL and Java. Another concern is the students’ proficiency in English. The authors explained the constraints in English to the participants during the experiment, which reduces the potential impact of the participants’ English proficiency on their performance. We also limited the possible impact of this factor by using the within-subjects design and matched pairs analysis.

Regarding construct validity, the main threat is that we were not able to investigate all features of OCL (such as specialized operations including oclInState) in this experiment due to the nature of our case studies. This will require replications with different systems, which we plan to conduct in the future.

The main threat to external validity is typical of controlled experiments in artificial settings: Are the participants representative of software professionals? Many practitioners have very little knowledge of OCL in general and hence would require training anyway. Note also that we chose a group of experienced graduate students with an advanced educational background (Sect. 2.2). In addition, some studies [39–41] have compared the performance, for various tasks, of trained software engineering students with that of professional developers. The differences were not statistically significant when the students were compared to junior and intermediate developers, thus suggesting that there is no evidence against using students trained for the tasks at hand as participants in place of professionals.

5 Related work

OCL is a standard language that is widely accepted for writing constraints on UML models. OCL is based on first-order logic and set theory and provides various constructs (e.g., collection operations) to define constraints in a concise form. The language allows modelers to write constraints at various levels of abstraction and for various types of models. For example, it can be used to write class and state invariants, guards in state machines, constraints in sequence diagrams, and pre- and post-conditions of operations. Several of our industrial case studies have shown the benefits that it can bring to solving various industrial problems, such as supporting automated model-based test case generation and automated model-based consistency checking and configuration in the context of product line engineering [5, 6, 15, 18–20, 26, 28, 30]. OCL is also used as the language for writing constraints on models in many commercial MBT tools such as CertifyIt [13] and QTronic [42].
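
To illustrate the collection operations mentioned above, the following minimal Java sketch places two OCL invariants, given as comments, next to one possible Java formulation. The classes Customer and Account, their attributes, and both constraints are hypothetical and are not taken from the case studies. The Java code is only one of several ways to express these checks and is not what any OCL tool generates.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: the model and the constraints are hypothetical.
class CollectionConstraintSketch {

    static class Account {
        final double balance;

        Account(double balance) {
            this.balance = balance;
        }
    }

    static class Customer {
        final List<Account> accounts;

        Customer(List<Account> accounts) {
            this.accounts = accounts;
        }
    }

    // OCL: context Customer
    //      inv: self.accounts->forAll(a | a.balance >= 0)
    static boolean allBalancesNonNegative(Customer c) {
        return c.accounts.stream().allMatch(a -> a.balance >= 0);
    }

    // OCL: context Customer
    //      inv: self.accounts->select(a | a.balance > 1000)->size() <= 3
    static boolean atMostThreeLargeAccounts(Customer c) {
        return c.accounts.stream().filter(a -> a.balance > 1000).count() <= 3;
    }

    public static void main(String[] args) {
        Customer c = new Customer(Arrays.asList(new Account(50.0), new Account(2500.0)));
        System.out.println(allBalancesNonNegative(c) && atMostThreeLargeAccounts(c));
    }
}
```

In OCL, the navigation `self.accounts` and the iterator are part of the language, whereas in Java the same traversal relies on library calls and on the way the association is represented in code.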

Java is a programming language that has been very widely used and supported by a lot of tools. However, the modeling community does not often notice that Java can also be used as a constraint language to specify constraints in UML models. Considering Java is widely known and practiced by software developers and has a large number of tools available in the market, gradually it becomes an option, in addition to OCL, to specify constraints on UML models. This observation is supported by the fact that some market-leading UML modeling tools such as IBM RSA [33] and Papyrus [32] support using both OCL and Java to specify constraints. It is, however, rarely reported in the literature, with scientific evidence, which of these two languages (i.e., OCL or Java) is better in terms of specifying constraints on UML class diagrams.

In the rest of the section, we discuss several representative tools that provide support for using OCL and/or Java for specifying constraints (Sect. 5.1), followed by the related work reporting controlled experiments empirically evaluating the impact of OCL in UML-based maintenance and the understandability of OCL (Sect. 5.2).

5.1 Tools that implement OCL and/or Java for specifying constraints

In UML models, constraints can be specified using different types of languages such as natural language, programming languages (e.g., Java and C++), and OCL. Some existing open source and commercial tools such as IBM RSA [33] and Papyrus [32] provide modeling environments for users to specify constraints in UML models with various languages. However, OCL and Java are two commonly implemented constraint specification languages and some modeling tools also support automated validation of constraints specified in OCL and/or Java. In the following section, we briefly discuss some widely used modeling tools that have OCL and/or Java implemented as their constraint specification languages.

IBM RSA [33] allows one to specify constraints using either OCL or Java. One argument for supporting both languages is that “Java might be easier to use to express complex constraints, and offer great flexibility” and “OCL is more consistent with how OMG defines UML constraints” [43]. Notice that this argument is not supported by any scientific evidence. The empirical study we conducted aims precisely to test which language is easier to use and which one can handle complex constraints better. The results reported in Sect. 3 reveal that OCL performs significantly better than Java in terms of handling complex constraints, which is not consistent with the argument provided in [43]. The open source modeling tool Papyrus [32] allows users to specify constraints using OCL, Java, natural language, C, and C++. However, as mentioned in [44], for Papyrus to validate specified constraints automatically, they must be written in OCL or Java.

OneModelica [45] is an IDE designed for the Modelica modeling language. In [46], a study was reported that compares OCL and Java in the context of OneModelica for Modelica code validation. OCL and Java were compared regarding two aspects: readability of constraints and execution performance. The first aspect is closely related to the objective of the controlled experiment reported in this paper. The authors of [46] concluded, via a subjective comparison of language concepts, that the readability of OCL constraints is “very good” as compared to Java. However, more software developers can understand Java, and its tool support is an important benefit over OCL. This conclusion conforms to what we observed in the controlled experiment. However, it is important to notice that our controlled experiment provides evidence in a scientific way, and its results were analyzed and evaluated with more objective metrics (e.g., conformance, completeness) instead of a very subjective evaluation of readability.

MagicDraw [31], Enterprise Architect [47], and ArgoUML [48] are three other UML modeling tools that provide the capability of specifying and validating OCL constraints. None of them, however, supports specifying constraints using Java.

5.2 Controlled experiments

Briand et al. [49, 50] conducted a controlled experiment to evaluate the impact of OCL in UML-based maintenance, from the perspective of using OCL for model comprehension and maintainability. The motivation was to assess the benefits (precision) that OCL brings when applied in UML-based development, considering the additional effort required and the extra formality introduced. Results show that an initial learning curve is required to gain significant benefits when using OCL in combination with UML diagrams. In contrast, in our experiment we evaluate the applicability of OCL in combination with UML by comparing it with Java, which can be used for the same purpose. Our motivation is to collect evidence and provide arguments in a scientific way as to which of these two languages is better, so that we can recommend it to our industrial partners.

Correa et al. reported a controlled experiment in [51] to evaluate the impact of bad OCL expressions and their refactoring on the understandability of OCL specifications. Results show that most refactorings significantly improve the understandability of OCL specifications. We did not find any other work relating to the empirical evaluation of OCL or Java.

Harald Störrle reported in [52] the results of a series of controlled experiments to evaluate the usability of the OCL Query API (OQAPI), which was designed for the purpose of improving the usability of OCL for supporting querying. Experiment results show that OQAPI is easy to use in terms of facilitating user querying using OCL.

Based on the related work discussed above, we can conclude that the controlled experiment reported in this paper is one of the first experiments that were exclusively designed to compare OCL and Java for specifying constraints on UML models. The results of the experiment provide some evidence that can be used by practitioners and academics to choose a language for specifying constraints for their specific problems.

6 Conclusion

The Object Constraint Language (OCL) has been widely used along with UML models for various purposes such as supporting model-based testing and configuration of products in a product line. Over the last several years, we have been working on various industrial projects on model-based engineering (MBE) that involved using OCL. One of the major challenges that we faced is the limited evidence in the literature about the applicability of OCL as compared to, e.g., Java. Such evidence is important to convince industrial partners to use OCL.

To collect some evidence about the use of OCL, we reported a controlled experiment that was conducted to evaluate the “applicability” of OCL by comparing it with one of the most commonly used programming languages in terms of applying them to specify constraints on UML class diagrams. We looked at applicability from two aspects: the quality of specified constraints in terms of completeness, conformance, and redundancy, and subjective opinions of participants on the applicability and their confidence of applying the two languages.

Experiment results showed that OCL and Java are equally good: Completeness and Conformance of the specified constraints were high, and there were very few redundant clauses in the specified constraints. We also observed that the applicability of OCL is not impacted by the complexity of constraints. This observation gives us confidence that OCL scales well when it is used for specifying complex constraints, which are commonly seen in industrial settings. However, this is not the case for Java, whose performance is influenced by the complexity of constraints.

Moreover, we performed additional analyses in which we took 100 constraints in Java and OCL that have 100 % conformance. These constraints were fed into OCL and Java tools to identify additional errors in their specification. Results show that the constraints specified in Java contain more errors than the ones in OCL. These results suggest that tools are needed for specifying fully correct OCL and Java constraints.

Based on the results of the experiment, we recommend using OCL for specifying constraints on UML models for addressing large-scale industry problems, especially for industrial contexts where Java is not used as the development language.