1 Introduction

Modeling is at the core of many disciplines, but it is especially important in engineering because it facilitates the communication and construction of complex systems from smaller parts (Thomas 2004). Models help us understand a complex problem and its potential solutions through abstraction. This is why software systems, which are often among the most complex of all engineering systems, can greatly benefit from using models and modeling techniques (Selic 2003). This idea is now receiving even more emphasis, since the software industry is moving towards Model-Driven Development (MDD) processes (Atkinson and Kühne 2003), in which software is developed at a higher level of abstraction than source code, based on models and model transformations. The MDD paradigm therefore focuses the effort of development on the design of models, rather than on coding. Correspondingly, the focus of software quality assurance is shifting from system implementation towards system modeling.

To be useful and effective, an engineering model must possess the following five key quality characteristics to a sufficient degree (Selic 2003): abstraction, understandability, accuracy, predictiveness and inexpensiveness.

In this paper, we focus on understandability because it is recognized as one of the main factors influencing maintainability, and it is well recognized that a large part of the effort invested in the development of any software product is devoted to maintenance (Pigoski 1997). More specifically, we focus on the understandability of UML statechart diagrams, since UML has become the de facto standard for modeling software systems; added to this is the fact that UML statechart diagrams have become an important technique for describing the dynamic aspects of a software system (Denger and Ciolkowski 2003). UML statechart diagrams are also considered to be one of the most important UML diagrams and they should be used by practitioners as a starting point for training newcomers to UML (Bolloju and Leung 2006).

The main goal of this line of research, which we have pursued over the last 5 years, was to investigate which constructs influenced the understandability of UML statechart diagrams, since a UML statechart diagram must be understood before any desired change to it can be identified, designed, or implemented. In the quest to reach this objective, we carried out a controlled experiment and a replication of it (Cruz-Lemus et al. 2005b). We found that activities, guards, simple states and transitions were the UML constructs that most influenced the understandability of UML statechart diagrams, but that the effect of composite states was not clear. Considering these results as preliminary, we decided to continue investigating composite states.

Composite states allow modelers to structure UML statecharts in a hierarchical fashion. A composite state represents the abstraction of an entire UML statechart diagram into which the composite state can be refined. As such, composite states are an important construct of the UML statechart diagrams metamodel (OMG 2003) and they are believed to be a fundamental modeling abstraction mechanism to help modelers master the complexity of a software system. From a theoretical point of view, UML statechart diagrams with composite states extend finite state machines to facilitate the description of highly complex behaviors (Hu and Shatz 2006) by dividing the system into smaller, less complex parts thereby making this system easier to understand. This in turn leads to a model that is easier to develop and modify.

Taking as a starting point the common use of hierarchical structures in modeling techniques, we thus hypothesized that abstracting a UML statechart diagram composed of highly related simple states and transitions into a composite state could help improve the understandability of a UML statechart diagram. Empirical support needs to be provided to show if this belief is actually true and, if so, under what conditions.

As related works show (see Section 2), references on empirical studies related to dynamic modeling in general and UML statechart diagrams in particular are few and far between. To our knowledge, the influence of composite states on the understandability of UML statechart diagrams has not been studied in the literature previously, despite the importance of the topic. This fact motivated us to gather empirical evidence for our hypothesis.

In this work, we present a family of three empirical studies consisting of five controlled experiments, whose design and execution were gradually modified and improved to alleviate some threats to the validity of the different component studies. We used relatively small statechart diagrams (10 to 25 states) as experimental materials and the experimental subjects were undergraduate and graduate students of Computer Science at several universities, along with a number of professionals with an average of 2 years’ experience in UML modeling.

The data analysis carried out in each individual experiment did not allow us to obtain conclusive results. This led us to carry out a meta-analysis study. Meta-analysis has been recognized as an appropriate way to aggregate or integrate the findings of empirical studies in order to build a solid body of knowledge on a topic based on empirical evidence (Lipsey and Wilson 2001; Miller 2000; Pickard 2004). Moreover, the need for meta-analysis is gaining relevance in empirical research, as is demonstrated by the fact that it is a recurrent topic in various forums related to Empirical Software Engineering. In other areas, such as psychology or medicine, a single study is extremely unlikely to be definitive. Dozens and even hundreds of studies on the same topic may follow. In Empirical Software Engineering, it is unusual for a large number of studies concerning the same topic to take place, but it is necessary to cross the borders of individual studies to extract conclusions of a more general kind from families of experiments, with or without significant results.

Since we have not evaluated industrial systems with a large range of different size and complexity, we cannot generalize our findings to every usage of composite states in UML statechart diagrams. Nevertheless, our common family of experiments seems to indicate that the use of composite states is not always beneficial.

The paper is organized as follows. Section 2 presents related work. Section 3 provides a roadmap of the family of experiments that we have performed. Section 4 introduces the Cognitive Theory of Multimedia Learning (CTML) (Mayer 2001), which we have used as a background in some of our experiments. Sections 5, 6 and 7 then explain in detail the experimental process used to carry out each of the studies that are part of the family of experiments. Section 8 summarizes the threats to the validity of the family of empirical studies. Section 9 presents the results of the meta-analysis performed with the data. Section 10 gives the main conclusions drawn from this family of experiments and outlines the future work that is planned.

2 Related Work

In this section, we situate our empirical study in relation to some other work found in the relevant literature.

Comprehension has been widely studied. In the literature, we can find works that have studied the comprehension of programs (Woodfield et al. 1981), complete models (Agarwal et al. 1999) or specific diagrams such as UML class diagrams (Purchase et al. 2001, 2002; Yusuf et al. 2007), UML collaboration diagrams (Glezer et al. 2005; Purchase et al. 2001, 2002) and UML sequence diagrams (Glezer et al. 2005; Xie et al. 2007). We can even find examples of pieces of work which study how the use of different artifacts, e.g. stereotypes, affects the way in which models are understood (Genero et al. 2008; Ricca et al. 2007; Staron et al. 2006).

As we have commented previously, understandability is considered to be a main factor influencing maintainability (Briand et al. 2001; Fenton and Pfleeger 1997; Harrison et al. 2000) and we can also find other works taking up this issue (Arisholm and Sjøberg 2004; Genero et al. 2007).

In some of these studies we have found that experience is a factor to be taken into account when measuring comprehension (Arisholm and Sjøberg 2004; Bolloju and Leung 2006; Ricca et al. 2007; Yusuf et al. 2007).

We found the following two papers dealing with empirical studies on the comprehension of UML diagrams which model dynamic aspects of an OO system:

  • Otero and Dolado (2004) evaluate the comprehension of the dynamic modeling in UML designs by using two experiments in which they compare the comprehension of UML sequence, collaboration, and statechart diagrams. They conclude that sequence diagrams are the most appropriate for comprehension of management information applications, collaboration diagrams are those best suited to real-time non-reactive systems, and statechart diagrams are the most appropriate for real-time reactive systems.

  • Otero and Dolado (2005) present two controlled experiments for evaluating the semantic comprehension of two standard languages, UML versus OPEN Modeling Language (Firesmith et al. 1998), from the perspective of dynamic modeling. The results reveal that the specification of dynamic behavior using OPEN Modeling Language is faster to comprehend and easier to interpret than when using the UML language, regardless of the type of dynamic diagram.

As we commented in the introduction, the main goal of our line of research over the last 5 years has been to investigate which constructs influenced the understandability of UML statechart diagrams, so the most closely related work is that done by ourselves prior to this. We had carried out a controlled experiment and a replication of it (Cruz-Lemus et al. 2005b) in which we found that some of the UML statechart diagram constructs (activities, guards, simple states and transitions) were the ones that most influenced the understandability of UML statechart diagrams. To perform that experiment, a group of teachers and students from the University of Castilla-La Mancha (Spain) performed a series of comprehension tasks on 20 different UML statechart diagrams which covered a broad range of values for the proposed metrics. In this study, composite states did not seem to affect the understandability of UML statechart diagrams.

In addition, in Cruz-Lemus et al. (2005c) we presented an experiment and its replication whose purpose was to find out the optimal nesting level of composite states within UML statechart diagrams. A total of 38 Computer Science students from the University of Murcia (Spain) answered a set of comprehension questions related to the same system, but modeled using 0, 1, and 2 nesting levels in composite states, i.e., without composite states, with one composite state and with composite states within composite states. We concluded that a flat nesting level makes the diagrams more easily understandable.

This review of the literature reveals that the use of composite states and their impact on the comprehension of UML statechart diagrams have not been investigated in depth, despite the need for empirical study of UML diagram comprehension and the large number of recently published works.

Even though in (Cruz-Lemus et al. 2005b) we found that composite states seem not to affect the comprehension of UML statechart diagrams, we considered this a bit suspicious, so we decided to investigate this finding in greater depth. Starting from the common use of hierarchical structures in modeling techniques, we decided to hypothesize that abstracting a UML statechart diagram composed of highly related simple states and transitions into a composite state could help improve the understandability of a UML statechart diagram. This hypothesis was what led us to carry out the research that we are presenting in the current study.

3 The Family of Experiments

An experiment may be a part of a common family of studies, rather than being an isolated event (Basili et al. 1999). Common families of experiments allow researchers to answer questions that are beyond the scope of individual experiments and let them generalize findings across studies, thus providing evidence for confirming or rejecting specific hypotheses. In addition, common families of studies can contribute to devising important and relevant hypotheses that may not be suggested by individual experiments. A common family of experiments is not necessarily composed only of identical replications of the same study. Materials, hypotheses, and specific tasks assigned to the subjects may be refined across experiments, based on the knowledge obtained after each experiment.

Figure 1 shows the chronology of the family of experiments we have carried out in our study on the understandability of UML statechart diagrams.

Fig. 1 Chronology of the family of experiments

The first experiment and its replication (E1 and R1) took place in two universities in Spain in 2005. The materials and tasks to be performed were quite simple and the background knowledge of the undergraduate students used as subjects was not advanced. These studies provided some initial results that were later strengthened with the other experiments of the family.

The second experiment and its replication (E2 and R2) took place in two universities, one in Spain and the other in Italy, in 2006. The Italian students’ background was similar to that of those in the previous study (E1 and R1), but the Spanish subjects were PhD students and had more experience in modeling. In addition, the materials and tasks assigned to the subjects were improved, especially with the use of the CTML (Mayer 2001) for assessing the complete set of variables of the experimental design. We describe this theory in more detail in Section 4.

In these studies, we used students as experimental subjects. The tasks to be performed did not require high levels of industrial experience, so we believed that this experiment could be considered appropriate, as suggested in the literature (Basili et al. 1999; Höst et al. 2000). Working with students also implies a set of advantages, such as the fact that the prior knowledge of the students is rather homogeneous, there is the possible availability of a large number of subjects (Verelst 2004), and there exists the chance to test experimental design and initial hypotheses (Sjoberg et al. 2005). An additional advantage of using novices as subjects in experiments on understandability is that the cognitive complexity of the objects under study is not hidden by the experience of the subjects.

The main difference between the first four studies (E1, R1, E2, and R2) and the third experiment (E3) lies in the fact that we had professionals as experimental subjects in E3. Another feature that made that experiment distinct was that the materials and tasks were further renewed and improved.

In studies E1 and R1, we used the variable understandability effectiveness, defined as the ability to understand the presented material correctly. In studies E2, R2 and E3, we added two new variables related to the CTML: retention and transfer. We explain these variables in Section 4.

These three variables were measured by using three separate tests based on questionnaires. The values of understandability effectiveness (UEffec), transfer (UTrans), and retention (UReten) were computed as the number of correct answers for each specific test divided by the number of questions.

The time needed to complete a test was also measured, but we chose not to use it because, from our own experience and following the advice of several experts, we have concluded that time is not a good indicator of understandability on its own. It provides information only about how quickly the tasks have been performed, but not about how well.

As for the design of the experiments, we used the guidelines provided in several works (Juristo and Moreno 2001; Kitchenham et al. 2002; Wohlin et al. 2000). Taking into account the kind of experimental designs used and the treatment of the studies, an appropriate statistical method for obtaining the results is an ANOVA (Kirk 1995; Winer et al. 1991). We set a statistical significance threshold α = 0.05 in all of our studies, so we rejected the null hypotheses of our studies if the statistical tests we used provided a statistical significance (p-value) that was not higher than 0.05. We also studied the power of the statistical test when non-statistically significant results were obtained. We used SPSS (SPSS 2003) to perform all the statistical analyses.

We examine all the threats to validity of the experiments in Section 8.

4 The Cognitive Theory of Multimedia Learning (CTML)

Models in general and conceptual models in particular include both graphics and text. Mayer (2001) proposed a definition of “multimedia” that covers descriptions including both “words” and “pictures”. Conceptual models can be considered multimedia messages, since they include both words and graphic elements (Gemino and Wand 2005).

We have used CTML (Mayer 2001) to explain how individuals viewing explanative material develop an understanding of multimedia content being presented to them. One of the main strengths of this theory lies in the experimental studies that have been based on it to compare text-only presentations with graphics/text presentations in several fields (Craig et al. 2002; Gemino and Wand 2003; Mayer 1989; Mayer and Anderson 1991; Mayer 2001; Tabbers 2004).

There are a number of reasons for choosing CTML as a means of measuring how subjects understand the materials that are being presented (Gemino and Wand 2005). Firstly, CTML focuses on words and graphics, which are the elements used by UML. Secondly, CTML provides principles for the design of effective multimedia presentations that can be empirically tested. Thirdly, CTML has evolved over a decade of work, in which experimental instruments and methods have been developed (Mayer 1989, 2001).

CTML suggests that a learner is not an “empty vessel” waiting to be filled with domain information, but an active processor with limited cognitive capacity who attempts to integrate presented material with previous knowledge. This implies that individuals might differ in how they understand the same model, depending on prior knowledge and the attention they give to various parts of the model.

Mayer (2001) suggests that three outcomes are possible when presenting explanative material: (1) no learning, (2) fragmented learning, and (3) meaningful learning. These outcomes are primarily based on concepts that can be measured by two variables that Mayer labels retention and transfer.

Retention is defined as the comprehension of material being presented. Transfer is the ability to use knowledge gained from the material to solve related problems not directly answerable from it. No learning occurs where retention and transfer are low. Fragmented learning occurs where retention is high but transfer is low. This result indicates that material has been received but has not been integrated well with prior knowledge. It suggests that memorization has occurred, rather than meaningful learning. Finally, meaningful learning occurs when both retention and transfer are high. High transfer indicates that information has been integrated into long-term knowledge and a high level of understanding of the presented material has been achieved.
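The mapping from retention and transfer to the three outcomes can be sketched as follows. The text does not say where "low" ends and "high" begins, so the 0.5 cut-off on scores in the range 0 to 1 is purely an illustrative assumption, as are the function name and the "unclassified" label for the combination that the text does not cover.

```python
def learning_outcome(retention, transfer, threshold=0.5):
    """Classify a pair of scores into one of the CTML learning outcomes.

    Assumption: scores lie between 0 and 1, and a score counts as "high"
    when it reaches `threshold`. The threshold is hypothetical.
    """
    high_retention = retention >= threshold
    high_transfer = transfer >= threshold
    if high_retention and high_transfer:
        return "meaningful learning"
    if high_retention:
        return "fragmented learning"  # material received but not integrated
    if not high_transfer:
        return "no learning"
    # Low retention with high transfer is not described in the text.
    return "unclassified"


print(learning_outcome(0.9, 0.8))
print(learning_outcome(0.9, 0.2))
print(learning_outcome(0.1, 0.2))
```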

5 First Experiment and Replication (E1 and R1)

In this section, we outline the main characteristics and results of the first experiment (E1) and its replication (R1). More details about this study can be found in (Cruz-Lemus et al. 2005a).

All the subjects received a short training session before the experiment, in which the instructor commented on the main constructs of UML statechart diagrams and showed two examples of the experimental tasks to be performed. These examples, as well as those performed in the rest of the experiments and replications, were neutral with regard to the independent variable (whether using composite states or not), as one example contained composite states and the other did not.

We split the subjects randomly into two groups, which we here call Group A and Group B. Two different domains were used, one involving the functioning of an ATM (Automated Teller Machine) and the other a phone call. For each domain, two conceptually identical diagrams were used, but while one of the diagrams included composite state(s), the other did not.

In the first part of the experiment, we used the ATM domain, in which the subjects in Group A received a diagram without composite states, while the subjects in Group B received a diagram with composite states. In the second part of the experiment, we used the phone call domain. Subjects in Group A received a diagram with composite states, while the subjects in Group B received a diagram without composite states. The experiment design is summarized in Table 1.

Table 1 E1 and R1 design

This process of assigning subjects to the 4 different treatments, obtained by combining the independent variables (Domain and Composite States), corresponds to a 2x2 factorial design with confounded interaction (Winer et al. 1991), because within a domain, the variable Composite States changes together with the group of subjects. In this way, the learning effect is alleviated.
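The assignment of treatments to groups described above can be sketched as a small lookup table plus a random split. This is only an illustration: the names `DESIGN` and `assign_groups` are hypothetical, and splitting the subjects into two groups of (nearly) equal size is an assumption, since the text only states that the split was random.

```python
import random

# (domain, uses composite states) in the order each group saw the diagrams.
DESIGN = {
    "A": [("ATM", False), ("phone call", True)],
    "B": [("ATM", True), ("phone call", False)],
}


def assign_groups(subjects, seed=None):
    """Randomly split the subjects into Group A and Group B."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2  # assumption: groups of (nearly) equal size
    return {"A": shuffled[:half], "B": shuffled[half:]}


groups = assign_groups([f"subject-{i}" for i in range(10)], seed=1)
for name, members in groups.items():
    print(name, DESIGN[name], members)
```

Every group sees both domains and both levels of the Composite States variable, but each domain is paired with a given level in only one group, which is why the interaction is confounded with the group.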

Half of the subjects of each group received the diagram without composite states first and the diagram with composite states second, while the other half received them in inverse order, to avoid possible learning effects.

Each diagram had a test enclosed, with 6 questions. The questions for each domain were the same, regardless of the particular use of composite states. The questions inquired about navigation between states and the effects that it produced.

To increase the motivation and interest on the part of the subjects, the instructor explained to the students that the exercises in the experiment would be similar to those that they would find in their exam at the end of the term. The goal of this experiment and the research question were not disclosed in this experiment (Carter et al. 2003) nor in any of the following studies, however.

In this study, we measured UEffec, the understandability effectiveness (defined in Section 3) of a set of UML statechart diagrams.

We explain the main differences between experiment E1 and its replication R1 in the following subsections, and then present the main conclusions of both studies.

5.1 First Experiment (E1)

The subjects in this experiment were in the fourth year of Computer Science and had received a complete Software Engineering course in which they had studied modeling techniques, including UML.

The other main features of this experiment are outlined in Table 2.

Table 2 E1 main features

Table 3 shows the descriptive statistics of the data.

Table 3 E1 mean and standard deviation (in parentheses) for the understandability effectiveness

Table 3 shows that the subjects obtained better results for UEffec when working with those diagrams that did not use composite states. After removing the outlier values found, we also performed an ANOVA test, which is the most appropriate test for exploring the results of a 2x2 factorial design with interaction confounded (Kirk 1995; Winer et al. 1991). The ANOVA results are shown in Table 4.

Table 4 E1 ANOVA results

In Table 4 and the rest of the tables in the document related to ANOVA, we show the results of Fisher’s F test, where Source column describes the independent variables, df refers to the degrees of freedom, F is the value of the test statistic, p-value is the statistical significance obtained, and Observed Power is the estimated power of the test based on α = 0.05.

We cannot draw any strong conclusion, as at the α = 0.05 level we cannot reject H0, i.e., the hypothesis that there is no effect from the use of composite states. The observed power of the test is low, probably because of a small effect size, so we would be assuming a 0.756 (or 1−0.244) estimated probability of Type II error in our assertions. Even though the results are not conclusive, they seem to indicate that there is no appreciable impact of the use of composite states on the understandability effectiveness of UML statechart diagrams.
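The estimated Type II error probability quoted above follows directly from the observed power. Written out, with β as our own shorthand for that probability (a symbol the text itself does not use):

```latex
\[
\beta \;=\; 1 - \mathrm{power} \;=\; 1 - 0.244 \;=\; 0.756
\]
```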

5.2 First Experiment Replication (R1)

The subjects in this experiment were in the second year of Computer Science and were not very familiar with modeling or with UML. They were taking their first course of Software Engineering at the time of the experiment.

The main differences with respect to E1 are detailed in Table 5.

Table 5 R1 main differences with E1

Due to limitations of physical space in the classrooms where R1 took place, the subjects were divided into two groups of 92 and 86 subjects respectively and they performed the experiment at different times. To be specific, the second group began and finished 1 h later than the first. Nevertheless, there was no interaction between the subjects of the different groups.

The skills of the subjects using UML for modeling, especially UML statechart diagrams, were much lower in R1 than in E1, as most of them had only a few months of experience, and they had not worked with some UML metamodel constructs (e.g. composite states) yet. That being the case, the only knowledge they had about composite states was acquired during the introductory session before the experiment.

We used the same techniques and performed the same analysis as in E1. The results obtained are summarized in Table 6.

Table 6 R1 mean and standard deviation (in parentheses) for the understandability effectiveness

In this case, most of the subjects that had received diagrams without composite states or the diagram that modeled the ATM with composite states answered all the questions correctly.

Table 6 again shows that the subjects obtained better overall results for the UEffec when working with those diagrams that did not use composite states. We removed the outlier values and performed an ANOVA test, whose results are in Table 7.

Table 7 R1 ANOVA results

We can observe that there is no significant effect from the domain or from the use of composite states and, in this case, the power is still very low. If we accepted the null hypothesis, we would be assuming a 0.687 estimated probability of Type II error. Once again then, the results are not conclusive although they do seem to indicate that there is no appreciable impact of the use of composite states on the understandability effectiveness of UML statechart diagrams.

5.3 E1 and R1 Conclusions

The main goal of E1 and R1 was to study the effect that the use of composite states had on the understandability effectiveness of UML statechart diagrams. Considering the results obtained, we cannot conclude anything definitively, as these were not statistically significant and the values of test power were low. We did observe, nonetheless, that the use of composite states does not seem to significantly improve the understandability effectiveness of UML statechart diagrams.

6 Second Experiment and Replication (E2 and R2)

Given the results of E1 and R1, we reviewed the experimental process that had been carried out in that study, focusing especially on the design and the materials that had been given to the subjects. After reading about the experimental approaches used in other works (Bodart et al. 2001; Gemino and Wand 2005), we sought to enrich the type of tasks that the subjects had been required to carry out in our empirical studies, to reflect fully the understanding that the subjects had of the diagrams. We hence decided to carry out another experiment, after reviewing the design and the materials that would be given to the subjects. As we have commented above, the new approach that we decided to use was based on CTML (Mayer 2001) (see Section 4).

We carried out a controlled experiment (E2) and a replication (R2) in which, in addition to UEffec, we took into account the two variables presented by CTML: UTrans and UReten.

In E2 and R2, all subjects received a short training session before the experiment, in which the main constructs of UML statechart diagrams were explained. Several of the subjects had not used UML statechart diagrams for a while. Some examples, similar to the tasks to be performed in the experiment, were also explained by the instructor of the experiment, so that the subjects had a clear idea of how to do the experimental tasks.

As in E1 and R1, we used four different diagrams. They modeled two different domains (an ATM and an alarm clock). We chose these two domains because it was our opinion that there should be a non-negligible difference in the degree of familiarity of the subjects with each domain. In particular, we believed that the alarm clock domain was more complex than the ATM, at least for the dynamic behavior modeled in the diagrams. One possible problem with E1 and R1 may have been that the diagrams were quite easy, so the use of composite states would not actually make any appreciable difference.

In E2 and R2, the design and the type of statistical study are identical to those in E1 and R1 (Section 5). For each domain, we used two different diagrams with an identical semantic content, one with composite states and the other without. Each subject received two diagrams, one with and another without composite states, each related to a different domain. Thus, we obtained two different groups, as shown in Table 8.

Table 8 E2 and R2 group design

Each subject had to perform three questionnaire-based tests, each about a different variable we studied:

  • Test 1 contained 7 questions which were exactly the same within each domain, independent of the usage of composite states. The questions inquired about navigation between states, variable values, etc. The subjects were allowed to check the diagram to answer the question. This is a kind of task we had already used in previous studies (Cruz-Lemus et al. 2005a; Cruz-Lemus et al. 2005b). With this test, we studied the UEffec variable.

  • Test 2 consisted of a questionnaire with 5 questions in which the subjects were asked about how the model worked, i.e., some questions that were more specific than in the previous test. In this case, the subjects were not allowed to look at the diagrams to answer the questions, as these had been removed previously. This task allowed us to measure the UTrans variable.

  • Test 3 consisted of a ‘fill-in-the-blanks’ task. The subjects received a text in which the requirements of the model were commented on, but there were a number of missing words. The subjects had to fill in these blanks without using the diagrams, which had not been given back to them. With this task we studied the UReten variable.

These two types of tests were similar to some others used in similar studies (Gemino and Wand 2005; Khatri et al. 2006) which deal with model comprehension using the CTML.

E2 and R2 started with a 25 min introductory session in which the instructor explained the main constructs of a UML statechart diagram. We then showed two examples in a shortened version, along with the correct answer to each question.

Throughout this time, the subjects were allowed to ask the instructor about any doubts they might have; they could also make any remarks they wished to.

We randomly assigned the subjects to two groups. Then, each subject received a diagram, depending on the group he/she belonged to, along with the corresponding sheet for Test 1. From that moment on, the subjects had 20 min to look at the diagram, try to understand how the model worked, and answer the questions.

When that task was completed, these materials were collected and each subject received the sheets with Tests 2 and 3 for the diagram that he/she had been studying. They had 20 min to work on both tests.

The materials were collected once more and the above process was repeated, i.e., the subjects first received a diagram and a Test 1 sheet, and later Tests 2 and 3. In this second round, each subject received a diagram from a different domain than in the first one (ATM / clock) and with a different usage of composite states (with / without).

The process is represented in diagram form in Fig. 2.

Fig. 2 E2 and R2 procedure overview

In the following subsections, we present the specific details for E2 and R2.

6.1 Second Experiment (E2)

The main features of E2 can be found in Table 9.

Table 9 E2 main features

After collecting the data, we first carried out an analysis of the descriptive statistics. In Table 10, we present the means and the standard deviations across the different groups.

Table 10 E2 Means and standard deviations (in parentheses) across groups

The values for UEffec are lower than in the previous studies (E1 and R1). A possible explanation for this is that, in this case, the materials had been modified and there were more questions. Besides that, the questions were more difficult to answer.

There is a clear trend for the UEffec variable: the subjects obtained better results in those diagrams modeled without using composite states, regardless of the domain of the diagram. Nevertheless, for the UTrans variable we find the opposite situation; here, better results are obtained for those diagrams modeled using composite states. In the third case, for the UReten variable, the results were different, depending on the domain.

We detected outlier values in the different tests and decided to exclude them from the data analysis. We then proceeded to test the previously described statistical hypotheses through an ANOVA test, which is the most appropriate test for exploring the results of a 2x2 factorial design with interaction confounded (Kirk 1995; Winer et al. 1991). The results are shown in Table 11.

Table 11 E2 ANOVA results

In all cases (UEffec, UTrans and UReten), the variables are not significantly affected by the domain or the use of composite states. The test powers are low, so the possibility of producing an error by accepting the null-hypotheses is high. The results are therefore not conclusive.

In this case, the results obtained for the three variables agree with those obtained in E1 and R1 for UEffec.

6.2 Second Experiment Replication (R2)

The main differences with respect to E2 are detailed in Table 12.

Table 12 R2 main features

Table 13 presents the means and the standard deviations across the different groups used in R2.

Table 13 R2 Means and standard deviations (in parentheses) across groups

In this study, the only variable that shows a trend in its results, independent of the domain of the diagrams, is the UTrans variable. The results obtained by the subjects were better for those diagrams modeled without using composite states. In the other two cases, the subjects obtained different results, depending on the domain and the use or not of composite states.

Comparing these results with those obtained in E2, we can observe an increase in the means of the variables. A possible reason for that is the small sample size in both studies, together with the fact that they are not randomized. These facts probably indicate that we could find two groups with different backgrounds in our population. After removing the outliers, we also performed an ANOVA test with the data obtained in R2. Table 14 summarizes the results obtained.

Table 14 R2 ANOVA results

Again, in all cases (UEffec, UTrans and UReten), the results indicate that the variables are not significantly affected by the domain or the use of composite states. Once more, the test powers are low, so the possibility of producing an error by accepting the null-hypotheses is high. The results are therefore not conclusive.

6.3 E2 and R2 Conclusions

The main goal of E2 and R2 was, once more, to study the effect that the use of composite states had on the understandability of UML Statechart Diagrams. Considering the results obtained in E1 and R1, we revised the materials and introduced new variables, included in the CTML.

After reviewing the results obtained, we again cannot come to any definitive conclusions, as they were not statistically significant and the values of test power were low. Nonetheless, in this case we observed that the use of composite states does not seem to significantly improve the understandability transfer of UML statechart diagrams.

7 Third Experiment (E3)

In this section, we explain the process we followed when we carried out the third member of the family of experiments.

In this experiment, we further revised and improved the materials and tasks to perform. More importantly, software professionals were involved as subjects in this experiment.

The experiment was carried out in the facilities of the Soluziona Software Factory Company, located in Ciudad Real, Spain. Soluziona, which now belongs to the INDRA Corporation, currently holds a top position in the market of professional services in software, with a sales volume close to 800 million Euros. After a long period of expansion, the company has spread to over 28 countries in 4 different continents. The company has recently reached maturity level 3, according to the CMMi model, and plans to achieve level 4 in 2009.

7.1 E3 Design

Table 15 outlines the main features of E3.

Table 15 E3 main features

In this study, there was only one domain, a digital watch (Webb 2006), which has a size and complexity that are representative of a real-life case. In this case, we have used a randomized blocks design to control the effect of the subjects’ experience over the variable CS.

The working hypotheses and part of the procedure to follow were similar to E2 and R2. In the following sections, we explain in more detail the experimental procedure and the results we obtained.

7.2 E3 Procedure

The experiment was divided into two sessions, over 2 days. The first session took place on the afternoon of the first day and the second session the following morning.

In order for the subjects to have a homogeneous knowledge background, the first session began with a seminar about “Dynamic Modeling with UML.” Twenty-five professionals attended the first session and they were provided with a summary of the main concepts of dynamic aspects in modeling in general and in UML in particular. The last part of the seminar focused on UML statechart diagrams, although no explicit mention was made of anything that might lead the subjects to guess the relationship between the seminar and the ensuing experiment.

After the seminar, the instructor explained several UML statecharts with questionnaires, as examples of the test that the subjects were going to perform in that session (Test 0). These examples consisted of questions about navigation through several statechart diagrams. After this, the subjects performed Test 0. This test was used to put the subjects into balanced groups, depending on their knowledge and performance.

They also filled in an anonymous subjective questionnaire in which they included some personal data (age, gender…) and their experience in modeling, OO programming, use of UML, etc. These data indicated that although most of them had developed OO software, only half of the subjects had previously used UML in real projects, and this only once or twice. The average length of experience in OO development was 2 years.

This first session lasted approximately 2 h. After this, all of the Test 0 questionnaires were analyzed and the subjects were assigned to two groups, depending on their results.

The subjects were ordered according to the number of correct answers and the time spent on the questionnaire, as suggested in (Otero and Dolado 2004). After that, those subjects who occupied an odd rank were assigned to Group A, and the others to Group B. Thus, we obtained two balanced groups, as Table 16 shows.
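The ranking-and-alternation scheme just described can be sketched as follows. This is only a minimal illustration under stated assumptions: the `SubjectScore` record, the `assign_groups` function and the sample data are hypothetical, and the exact ordering rule (more correct answers first, ties broken by shorter time) is assumed, since the text does not state how the two criteria were combined.

```python
from typing import NamedTuple


class SubjectScore(NamedTuple):
    subject_id: str        # hypothetical anonymous identifier
    correct_answers: int   # number of correct answers in Test 0
    seconds_spent: float   # time spent on the Test 0 questionnaire


def assign_groups(scores):
    """Rank subjects, then send odd ranks to Group A and even ranks to Group B."""
    # Assumed ordering: more correct answers first, ties broken by less time.
    ranked = sorted(scores, key=lambda s: (-s.correct_answers, s.seconds_spent))
    group_a = [s.subject_id for rank, s in enumerate(ranked, start=1) if rank % 2 == 1]
    group_b = [s.subject_id for rank, s in enumerate(ranked, start=1) if rank % 2 == 0]
    return group_a, group_b


# Invented sample data, not taken from the experiment.
scores = [
    SubjectScore("s1", 9, 410.0),
    SubjectScore("s2", 7, 380.0),
    SubjectScore("s3", 9, 395.0),
    SubjectScore("s4", 5, 500.0),
    SubjectScore("s5", 7, 420.0),
]
group_a, group_b = assign_groups(scores)
```

Alternating along the ranked list keeps subjects of similar performance split across the two groups, which is what makes the groups balanced.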

Table 16 Means and standard deviations for groups after Test 0

The second session took place the following morning. One of the subjects who had been assigned to Group A did not show up for the second session. We decided not to rearrange the groups, which now had the same number of subjects. The subjects were informed that they had been grouped depending on their performance in the test collected the previous day.

The second session was composed of three tests, as in experiments E2 and R2. First, the subjects received one UML statechart diagram and a copy of Test 1. To avoid possible learning effects, we adopted a balanced, between-subjects, blocked design, i.e. each subject was assigned only one diagram. The subjects of Group A received a diagram that was modeled using composite states and those in Group B received exactly the same system, modeled without using composite states. The 10 questions in Test 1 were exactly the same for both groups.

As in E2 and R2, this test was used to measure the UEffec of the model. The questions in Test 1 covered all the different parts of the diagram, so that we could make sure that the subjects had gone through every part of the diagram before it was removed. This phase lasted for 25 min.

When that phase was completed, all the diagrams and tests were collected and Tests 2 and 3 were handed out. Test 2 was used to measure the UReten variable and consisted of a fill-in-the-blanks text with 10 gaps that the subjects had to complete in order to build the text with the specifications of the system. The subjects had 15 min for this phase.

Test 3 was used to measure the UTrans variable and consisted of a list of 6 tasks to perform, based on the information taken from the diagram. As these tasks were the most complicated part of the experiment, the subjects had 35 min to solve them.

At the end of Test 3, all the materials were collected and the subjects were handed a debriefing questionnaire, to collect their impressions about the difficulty of the tests and the main positive and negative points that they had found during the experiment.

Figure 3 summarizes and describes all this process in diagram form.

Fig. 3 E3 procedure schema

Appendix A contains the diagrams and tests 1, 2 and 3 of this experiment as an example. The diagrams and tests used during the experiment were in Spanish. We have translated them into English here for the reader’s convenience.

7.3 E3 Data Analysis and Interpretation

As in the previous experiments, we carried out an analysis of the descriptive statistics of the data. Table 17 presents the means and the standard deviations for the measures of the dependent variables studied in E3.

Table 17 E3 Means and standard deviations (in parentheses)

We can observe how, in this case, the results obtained for the understandability effectiveness and transfer variables are higher when the subjects worked with the diagram modeled with composite states, while the retention variable was higher in the diagram modeled without using composite states.

After removing the outlier values, we performed an ANOVA to test the two unmodified original hypotheses shown in Table 15 and the new H’0c null-hypothesis as we did with the previous analyses. When two groups are compared, an ANOVA produces the same results as a t-test (Kirk 1995; Winer et al. 1991), which is the most common statistical test used for analyzing two groups and one factor. The results obtained are set out in detail in Table 18.
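The equivalence between a one-way ANOVA on two groups and a t-test can be checked numerically: the F statistic equals the square of the pooled-variance t statistic. The sketch below is only an illustration; the function names are hypothetical and the scores are invented, not data from E3.

```python
from statistics import mean


def pooled_t(group1, group2):
    """Student's t statistic for two independent groups (pooled variance)."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    ss1 = sum((x - m1) ** 2 for x in group1)
    ss2 = sum((x - m2) ** 2 for x in group2)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / (pooled_var * (1 / n1 + 1 / n2)) ** 0.5


def one_way_f(*groups):
    """F statistic of a one-way ANOVA over any number of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Invented scores for two hypothetical groups.
with_cs = [0.8, 0.7, 0.9, 0.6]
without_cs = [0.5, 0.6, 0.4, 0.7]

t_value = pooled_t(with_cs, without_cs)
f_value = one_way_f(with_cs, without_cs)
# With exactly two groups, f_value matches t_value ** 2 up to rounding error.
```

Because F = t² with the same degrees of freedom in the denominator, both tests lead to the same p-value and therefore to the same decision about the null hypothesis.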

Table 18 E3 ANOVA results

The results shown in Table 18 indicate that there is a statistically significant effect of the use of composite states on the understandability of UML statechart diagrams, in all three variables used (UEffec, UTrans and UReten).

Using composite states improves the UEffec and UTrans of the diagram, and worsens the UReten. This means that composite states are useful for a better comprehension of the diagram (UEffec) and for performing tasks related to the diagram, but not directly answerable from it (UTrans). At the same time, they are not useful for memorizing the diagrams (UReten).

These results contrast with those obtained in the experiments presented previously. We again intended to assess how composite states affected the understandability of UML statechart diagrams, but in this case we used real practitioners instead of students. The complexity of the tasks to be performed was also increased. Our conviction is that these two factors must have affected the results obtained. In addition, it should be remembered that we used the skills of the subjects to balance their distribution into groups.

8 Threats to the Validity of the Family of Empirical Studies

In this section, we explain some issues that can threaten the validity of experiments, considering the four types of threats proposed in Wohlin et al. (2000).

8.1 Conclusion validity

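In E1, R1, E2 and R2, the statistical power was low. As we have already commented, this means that there is a high probability of failing to reject a null hypothesis that is actually false, so the non-significant results of these studies must be interpreted with a large degree of uncertainty.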

8.2 Internal validity

The number of subjects involved was not large. However, a clear trend was identifiable in only one case.

Our final consideration is that composite states seem to be a construct that requires a certain maturity level to be used properly. As we commented in section 2, experience is a factor to be taken into account when measuring comprehension (Arisholm and Sjøberg 2004; Bolloju and Leung 2006; Ricca et al. 2007; Yusuf et al. 2007). We believe that students probably have not acquired this maturity yet, while practitioners have. This may have been a determining factor in obtaining the results presented here.

8.3 Construct validity

Our measures were built on the basis of the guidelines provided in CTML, and we believe that in this way we have measured the variables appropriately.

8.4 External validity

The diagrams that were used in this study represent relatively simple models and it is certainly possible that if industrial-strength diagrams had been used, different results might have been obtained.

Our results may be applied to UML statechart diagrams and subjects with similar characteristics to those we have presented. These results may be generalized to the entire population of designers who use UML statechart diagrams only after further studies confirm them.

9 Meta-Analysis Study

In Sections 5, 6 and 7 of this work, we have presented 5 experiments to investigate the influence of composite states on the understandability of UML statechart diagrams. Table 19 summarizes the ANOVA results studying the effects of the domain and the use of Composite States (CS) on the three dependent variables.

Table 19 ANOVA summary (in bold, significant results)

Only in experiment E3 did we obtain statistically significant results that showed that use of composite states improves the way that subjects directly understand how the diagram works (UEffec), as well as the performance of tasks related to the diagram, acquiring knowledge from it (UTrans). But the use of composite states is not useful for memorizing the diagrams (UReten).

As we have seen, no conclusive results were obtained from the individual data analysis, so we decided to integrate them. There are several statistical methods that allow us to accumulate and interpret a set of results obtained through different experiments that are inter-related because they check similar hypotheses (Glass et al. 1981; Hedges and Olkin 1985; Rosenthal 1986; Sutton et al. 2001; Wolf 1986). In the present study, we use meta-analysis because it allows us to extract more general conclusions, even though some of the experimental conditions are not exactly the same.

Meta-analysis is a set of statistical techniques for combining the different effect sizes of the experiments to obtain a global effect of a factor. As measures may come from different environments and may not be homogeneous, a standardized measure needs to be obtained for each experiment, and these measures must then be combined to estimate the global effect size of the factor. In our study, the factor is the use of composite states and how it affects the understandability of UML statechart diagrams.

To carry out the meta-analysis presented in this work we used the Meta-Analysis v2 tool (Biostat 2006). In this meta-analysis we used the mean value for CS(with) minus the mean value for CS(without), and from these values we obtained Hedges’ g metric (Hedges and Olkin 1985; Kampenes et al. 2007), which we used as the standardized measure. This value expresses the magnitude of the treatment effect (CS, in our case) relative to the within-group standard deviations. It can be used to synthesize studies that have quantified treatment effects on different scales (Rosenthal 1994).
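As an illustration of this standardized measure, the following sketch computes Hedges’ g from the means, standard deviations and sample sizes of two groups, using the usual pooled standard deviation and the common small-sample correction factor 1 − 3/(4·df − 1). It is an assumption that this textbook formulation matches what the Meta-Analysis v2 tool computes internally; the function name and the example numbers are hypothetical.

```python
import math


def hedges_g(mean_with, sd_with, n_with, mean_without, sd_without, n_without):
    """Standardized mean difference CS(with) - CS(without), bias-corrected."""
    df = n_with + n_without - 2
    # Pooled within-group standard deviation.
    pooled_sd = math.sqrt(
        ((n_with - 1) * sd_with ** 2 + (n_without - 1) * sd_without ** 2) / df
    )
    d = (mean_with - mean_without) / pooled_sd
    # Approximate small-sample correction factor (assumed formulation).
    correction = 1 - 3 / (4 * df - 1)
    return correction * d


# Invented example: two groups of 12 subjects with equal spread.
g = hedges_g(0.6, 0.2, 12, 0.5, 0.2, 12)
```

A positive value means that the CS(with) group scored higher; swapping the two groups only changes the sign of the result.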

The Hedges’ g metric is a weighted mean whose weights depend on sample size (Eq. 1):

$$\overline{Z} = \frac{\sum_i w_i z_i}{\sum_i w_i} \qquad (1)$$

where $w_i = 1/(n_i - 3)$ and $n_i$ is the sample size of the i-th experiment.
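The weighted mean of Eq. 1 can be sketched in a few lines. The function name is hypothetical, and the weights are passed in explicitly so that the sketch stays independent of any particular weighting scheme; the effect sizes and weights in the example are invented.

```python
def weighted_mean(effects, weights):
    """Combine per-study effect sizes z_i with weights w_i as in Eq. 1."""
    if not effects or len(effects) != len(weights):
        raise ValueError("effects and weights must be non-empty and of equal length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * z for w, z in zip(weights, effects)) / total_weight


# Invented example: two studies, the first weighted three times as much.
overall = weighted_mean([0.2, -0.4], [3.0, 1.0])
```

With equal weights the function reduces to the ordinary arithmetic mean of the effect sizes.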

The higher the value of Hedges’ g is, the higher the corresponding mean difference is too. For studies in Software Engineering, we can classify effect sizes into three different values: small, medium and large (Kampenes et al. 2007).

Once the overall effect size is calculated, we can provide a confidence interval or a p-value which allows us to decide about the meta-analysis hypotheses, as can be found in other Empirical Software Engineering works (Dybå et al. 2007; Hayes 1999; Laitenberger et al. 1999; Miller and McDonald 1998; Porter and Johnson 1997).

Our meta-analysis hypotheses can be stated as:

  • H0a: using composite states (CS) does not influence the UEffec. H1a: ¬H0a

  • H0b: using composite states (CS) does not influence the UReten. H1b: ¬H0b

  • H0c: using composite states (CS) does not influence the UTrans. H1c: ¬H0c

Table 20 summarizes the results we obtained with our meta-analysis. For each study and domain within the study, we report the values of Hedges’ g and effect size. Specifically, the cells related to effect size contain two pieces of information.

  • An indication of the magnitude of effect size, classified as Small, Medium, or Large. The magnitude of effect size is computed based on the standardized difference between two means. For instance, an effect size of 0.5 indicates that the mean of CS (with) is half a standard deviation larger than the mean of CS (without). Considering the UEffec variable, a positive effect means that using composite states improves the understandability effectiveness, whilst a negative effect would mean the opposite. For instance, there is a negative effect size for UEffec in E1 with the ATM domain, as denoted by the negative value of Hedges’ g, while there is a positive effect size in R1 with the ATM domain, as shown by the positive value of Hedges’ g. The same applies to the other two variables. For studies in Software Engineering, we can consider that effect sizes between 1.01 and 3.40 are large, sizes between 0.38 and 1.00 are medium, and those between 0 and 0.37 are small (Kampenes et al. 2007).

  • An indication of whether the result is statistically significant (S) or not (NS). Note that the global effect is only significant for UEffec, in our study.
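The magnitude classification described above can be written as a small helper. This is only a sketch: the function name is hypothetical, the magnitude is taken as the absolute value of Hedges’ g, and two boundary decisions are assumptions, namely that values falling between the published ranges (e.g. between 0.37 and 0.38) go to the higher category and that values above 3.40 are still labeled Large.

```python
def classify_effect_size(g):
    """Label an effect size as Small, Medium or Large (Kampenes et al. 2007 ranges)."""
    magnitude = abs(g)  # the sign only indicates the direction of the effect
    if magnitude < 0.38:
        return "Small"
    if magnitude <= 1.00:
        return "Medium"
    return "Large"  # assumption: values above 3.40 are also treated as Large


label = classify_effect_size(-0.383)
```

The sign of g is deliberately ignored here, since it indicates whether composite states helped or hindered rather than how strong the effect is.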

Table 20 Hedges’ g metric values

For the reader’s convenience, we show our meta-analysis results in diagram form, as provided by the Meta-Analysis v2 tool (Biostat 2006). Figures 4, 5 and 6, one for each dependent variable (UEffec, UTrans and UReten, respectively), display Hedges’ g metric with a confidence interval of 95%. Not all the studies contribute equally to the overall conclusion, which is represented by the diamond in the last row of the figures. Each study’s effect size is represented by a square in the figures, and each study receives a specific weight in the meta-analysis. The estimations for studies with a large sample size are more accurate, so they contribute more to the overall effect. However, sample size is not the only factor contributing to the weight of a study. The weight of a study is proportional to the area of the corresponding square in the figures.

Fig. 4 UEffec meta-analysis

Fig. 5 UTrans meta-analysis

Fig. 6 UReten meta-analysis

After this process, we carried out a new meta-analysis based on two facts. Firstly, the description of E3 and its threats to validity make it different from the other studies, as we used practitioners and more difficult tasks, and the design was blocked by the subjects’ experience. Secondly, the error observed for E3 in the three meta-analyses was the highest.

Table 21 summarizes the results including and excluding E3. We can observe how Hedges’ g estimation is modified.

Table 21 Meta-analysis results with all the experiments and excluding E3

Nevertheless, the conclusions, based on the p-values are similar to those obtained previously. The meta-analysis conclusions are thus the following:

  • Using composite states makes the understandability effectiveness (UEffec) of UML statechart diagrams decrease, with a medium effect size (−0.383), i.e., the UEffec mean when not using composite states is larger by 0.383 times the standard deviation than when using CS, and the p-value is 0.000. This effect size is lower when we consider all the experiments, because in E3 the composite states improved the understandability. We suspect that this positive effect in E3 is due to the use of practitioners as experimental subjects.

  • Using composite states has no influence when performing tasks related to the diagram (UTrans p-value = 0.890). When we include E3, the effect size and the p-value are more in favour of the idea that composite states help improve the transfer, but these are not enough to be significant.

  • Finally, the use of CS has no influence on memorizing the diagrams (UReten p-value = 0.597). In this case, when including E3, the effect size and the p-value are inclined to indicate that composite states have an influence, but this is not significant.

10 Conclusions and Future Work

In this work, we have presented a family of empirical studies to investigate whether the use of composite states affects the understandability of UML statechart diagrams. We pursued this goal in order to obtain empirical evidence on whether the use of composite states is beneficial. This evidence can be used as advice to software engineers or modelers when they are modeling or maintaining object-oriented systems using UML.

As suggested in several empirical works related to model comprehension (Bodart et al. 2001; Gemino and Wand 2005) we measured understandability through three measures: Understandability Effectiveness, Retention and Transfer. Each of these measures captures different aspects of the understanding of models when modelers or software engineers deal with models.

In our empirical work, we have followed the steps suggested in the Empirical Software Engineering field (Tichy 2000), beginning our study with students, to test the original designs, and gradually improving the materials used and the experience of the subjects until performing the last study with practitioners.

The results obtained are valid in the context of relatively simple statechart diagrams (10 to 25 states) and undergraduate students and novice practitioners.

After testing our hypotheses in each individual study, through ANOVA, we could not reach conclusive findings, given that in some cases the results are conflicting across our experiments. We therefore decided to integrate the empirical data through a meta-analysis study.

The main findings obtained through the common family of empirical studies are:

  • Our first idea, and the one most commonly accepted in the Software Engineering field, was that using composite states helps make UML statecharts more comprehensible. But the meta-analysis results show that using composite states has a negative influence on the understandability effectiveness (UEffec) of the diagrams, i.e. the way that subjects directly understand how the diagram works. This finding goes against conventional wisdom. Nevertheless, the particular results of E3 are in favour of the conventional view. We suspect that the reason could be that experienced subjects are able to take advantage of the benefits of using composite states. When there is a lack of experience, it might be more difficult to understand and handle the use of composite states. It also seems that the more complicated the tasks to perform are, the more useful composite states become.

  • The overall results do not show a clear effect of using or not using composite states on Transfer, i.e., the ability to use knowledge gained from the material to solve related problems not directly answerable from it (UTrans), or on Retention, i.e., the ability to memorize the material being presented (UReten). But the particular results obtained in E3 show that composite states improve the transfer of the diagrams. As we have just commented, we suspect that the reason could be that when subjects are more experienced, they can take advantage of using composite states; otherwise, the effect is the opposite. When looking at retention (UReten), the results in experiment E3 present a negative effect and it seems that it is better not to use composite states for memorizing diagrams.

Even though the meta-analysis seems to improve the findings for the individual studies, after a rigorous and long period of experimentation we cannot provide conclusive findings on whether composite states are beneficial to the understanding of UML statechart diagrams or not, in the context mentioned. However, our results show that the use of composite states may not always be beneficial, as might have been believed after only casual consideration.

At any rate, further investigation is needed in the following directions:

  • study of the hypothesis that we have expressed about the effect that composite states may have on subjects without skills in their use;

  • extension of the number of practitioners in future studies, which would strengthen the validity of conclusions, as the sample size of E3 (24 practitioners) is small compared to other studies in the family;

  • use of more complex diagrams and tasks, from real projects, as we suspect that the use of composite states could be more beneficial when understanding more complex UML statechart diagrams used in real-time systems.