1 Introduction

In the process of software system development, software architecture is a key artefact that affects all later activities, such as design and implementation, and plays a crucial role in achieving the desired software qualities (Losavio et al. 2003). Software architecture focuses on a high-level view of a software system and is defined as: “the structure or structures of the system, which comprise software components, the externally visible properties of those components, and the relationships among them” (Bass et al. 1998).

According to the software architecture community, an architectural description can comprise multiple views, each concentrating on one of many system concerns, such as the logical, implementation, deployment, process, or architectural knowledge view, and on the viewpoint of different stakeholders, such as end-users, developers, project managers, and business analysts (Kruchten 1995; Clements et al. 2003). Architectural component and connector models (or component models for short), which are part of the implementation view, are frequently used as a central view in the architectural descriptions of software systems (Clements et al. 2003). Component models represent high-level abstractions of the system implementation and are often considered to contain the most significant architectural information (Clements et al. 2003). In this view, components can refer to different system entities such as processes, objects, clients, servers, data stores, modules, and subsystems, while connectors represent the interaction mechanisms between components (Clements et al. 2003). In this article, we consider a component more in the sense of a software module, adopting the definition of Clements et al., i.e. a component represents an implementation unit of software that provides a coherent unit of functionality at the first level of decomposition of the system (Clements et al. 2003). We adopt this definition because our work focuses on the understandability of component models, which mainly relates to understanding the functional decomposition of the system and the effect of modifying the system functionalities, i.e. impact analysis. Note that the component decomposition can be made independently of the type of functionality implemented in a component. For example, a decomposition can consider both technical functionalities (e.g. components for file access or network connection) and business functionalities (e.g. components for savings or accounts). Since a component in a component model represents a high-level abstraction of the entities in the source code of the system, it can be broken down into (i.e., is refined by) more fine-grained, technical components or classes that realize the component in the technical design or implementation of the system. In the context of the object-oriented software systems that we focus on, a component usually groups a set of source code classes and/or packages with similar functionalities, while a connector can represent any kind of dependency between classes, such as method calls, field accesses, etc.

Understandability is one of the most important characteristics of software quality (Pacione et al. 2004). Difficulty in understanding a software system limits its reuse and maintenance. Boehm defined software understandability as a software quality characteristic denoting the ease with which software systems can be understood (Boehm 1978). In the context of component models, understandability refers to understanding the functionalities of individual components together with the functional relatedness among them (Dugerdil and Niculescu 2014). Understandability is a critical aspect of component models, as their main purpose is to “... enable designers to abstract away fine-grained details that obscure understanding and focus on the “big picture:” system structure, the interactions between components, ...” (Oreizy et al. 1999). This, however, is not possible if the models themselves and/or the links to other design and code artefacts are hard to understand.

In our previous work (Stevanetic and Zdun 2016), we examined the relationships between the effort required to understand a component, measured through the time that participants spent on studying a component, and the hierarchical quality metrics originally designed to assess the understandability of the modular design of an object-oriented software system (Hwa et al. 2009). Those metrics refer to 6 design properties found to have an impact on the understandability of the modular design of a system: size, complexity, encapsulation (i.e. information hiding), coupling, cohesion, and modular abstraction. In the same study, we further examined the impact of personal factors (i.e. the participants’ experience and expertise), and compared the efficiency of both personal and system-related factors (metrics) with the prediction models obtained in our previous studies (Stevanetic and Zdun 2014a, 2014b). In another study, reported in a position paper (Stevanetic et al. 2014), we presented a tool for supporting software evolution by integrating a DSL-based architecture evolution approach with our empirically evaluated understandability metrics. In this article, we provide: 1) an extended description of the results obtained in our previous work (Stevanetic and Zdun 2016), consisting of a more detailed description of the studied metrics and applied statistical techniques as well as more detailed explanations and discussions of the obtained results, 2) a new metric for measuring the analyzability of component models, based on the integration of our empirical evaluations with the existing work on analyzability-related metrics proposed by Bouwers et al. (2011), and 3) significant tool extensions compared to our previous work reported in a position paper (Stevanetic et al. 2014), including the realization of the new analyzability metric by showing how much each architectural rule used in a DSL-based architectural abstraction specification contributes to the understandability of components, and by enabling change impact analysis, i.e. the identification of changes in the system that affect different analyzability levels of the component models.

The results of our empirical analyses show that the hierarchical understandability metrics can predict understandability with high practical significance. On the one hand, the obtained prediction models are significantly better than the models obtained using the graph-based metrics (examined in Stevanetic and Zdun 2014a), the package-based metrics (examined in Stevanetic and Zdun 2014b), or the models that use the participants’ experiences as predictors. On the other hand, those models are not significantly different from, or worse in prediction than, the models that combine both the system-related metrics (the graph-based, package-based, and hierarchical understandability metrics) and the participants’ experiences. This means that, of all studied predictors, the system-related metrics (i.e. the hierarchical understandability metrics) are sufficient for the prediction. We also find that the participants’ experiences are important and can predict a significant amount of variance in the data, but the obtained models are not as accurate as the models that use the metrics related to the software system itself (concretely, the hierarchical understandability metrics). Regarding the tool support, we demonstrate in a case study how it can be used to create component models with an appropriate analyzability level by incrementally improving an initial component model of the system. In addition, we show how the tool can be used for change impact analysis, i.e. for detecting the changes between different component models that affect their different analyzability levels.

This article is organized as follows: In Section 2, we discuss the related work. In Section 3 we describe the study design. Section 4 describes the statistical methods we applied and the analysis of our data. In Section 5 we discuss the threats to validity. Section 6 describes the tool we developed together with a case study on how the tool can be utilized in a practical context. In Section 7 we conclude and discuss future directions of our research.

2 Related work

So far, very few studies have investigated empirical evidence on architectural understandability. One of them examines the influence of package coupling on the understandability of software systems (Gupta and Chhabra 2009), while another examines the relationships between some package-level metrics and package understandability (Elish 2010). Neither of these studies examines the understandability of architectural components. In this section, we discuss existing work in several fields closely related to ours.

2.1 Measuring the understandability

Patig (2008) extracts the variables and tasks that have been proposed in cognitive psychology or applied in computer science to test understandability. Those variables and tasks are summarized in Fig. 1 and represent a theoretical framework for investigations on understandability. The variables have been theoretically justified by the authors who used them. In our case, the independent variables are the metrics that we collected (in the work by Patig they relate to abstract/concrete syntax, and this part of the figure is therefore adapted from the original). The dependent variable in our case is the understandability of components. As Fig. 1 shows, different measures can be used to quantify the dependent variable(s), such as frequency (the number of correct answers), selection (which of several answers participants choose), response latency (how quickly participants react), response duration (how long participants deal with a task), and amplitude (the strength of the response, e.g. brain activity while performing a task). In our case, we measured the correctness of the answers and the time that participants spent on resolving the questions. Regarding comprehension tasks, the participants of an experiment need to answer an appropriate set of questions. If the questions relate to the syntax of the model (the constructs of the model), the task is called syntactic. If the questions relate to understanding the described context, the task is called semantic. Both types of tasks address surface-level understanding. In problem-solving tasks, which address deeper understanding, participants have to determine whether and how certain information can be extracted from a model. In our case, problem-solving tasks are more suitable because the participants have to understand not just the component models themselves, i.e. how the components interact in the model, but also the relations between the components and the concrete system implementation. Modelling tasks are used more for measuring the general ease of use of a notation and are therefore not suitable for our case.

Fig. 1 Theoretical framework for investigations on understandability (adapted from Patig 2008)

In the work by Patig, all proposed dependent variables are measured externally, e.g. via the time that participants spend on answering the questions or the percentage of correct answers to those questions. Besides such external means, it is also possible to use the participants’ subjective ratings in the measurement process. In the context of model understandability, Moody proposes three ways to assess understandability: the model user’s rating of model understandability, the ability of users to interpret the model correctly, and the model developer’s rating of model understandability (Moody 1998). The first and the third are based on the subjective ratings of users/developers. However, Lindland et al. explain that the ability of model users to interpret the model correctly is the best operational test of whether the model is actually understood, rather than merely whether it is understandable (Lindland et al. 1994; Moody 1998).

2.2 Architecture and design metrics and their empirical evaluations

There is a large number of software metrics for measuring a system’s architecture, architectural components, and other high-level software artefacts and structures (packages, modules, graph-based structures). For example, metrics related to components and component models measure different attributes like size, coupling, cohesion, and dependencies of components, as well as the complexity of whole component models (Sharma et al. 2009; Sartipi 2001; Sengupta et al. 2011). For software packages, different metrics that measure size, coupling, stability, and cohesion have been proposed (Elish 2010; Gupta and Chhabra 2009; Martin 2003; Gupta and Chhabra 2012). Graph-based metrics measure the complexity of interactions between graph nodes (Bhattacharya et al. 2012; Ma et al. 2006; Allen et al. 2007). Certain graph-based metrics have been shown to be useful for measuring large-scale software systems, which have been observed to share properties that are common to complex networks across many fields of science (Ma et al. 2006). Most of the metrics mentioned above lack links to quality attributes. Stevanetic and Zdun (2015) present a systematic mapping study on software metrics related to the understandability concepts of software architectures with regard to their relations to the system implementation. In this article and the previous ones that empirically investigate the understandability of components, the examined metrics are chosen from that mapping study and tested in the given context.

Several studies empirically evaluate metrics. In contrast to our work, they usually evaluate the usefulness of a metric for its proposed purpose, but do not test relationships between specific metrics, as in our case the prediction of understandability using predictor metrics. Also, none of these studies focuses on architectural component models. Among many others, Basili et al. evaluate object-oriented design metrics as quality indicators (Basili et al. 1996). Albrecht and Gaffney provide one of many examples of a study on development effort metrics (Albrecht and Gaffney 1983). Similarly to our work, Moody presents an empirical evaluation of the use of data model quality metrics (Moody 2003). In that approach, a broad set of quality metrics is investigated. The result is that only a few of these quality metrics have an influence on the quality as perceived by the model users: the system complexity, the number of data items duplicated in existing systems, the development cost estimation, the reuse percentage, and the number of defects by quality factor.

2.3 Understandability of UML models and process models

A variety of studies in the literature examine the understandability of different UML models. Some of them examine the layout or visualization aspects of UML models. Purchase et al. (2001) show that certain visualizations are better than others depending on the kind of comprehension task used. Criteria and guidelines for creating effective layouts for UML class and sequence diagrams, based on perceptual theories, are established in the work by Sun and Wong (2005).

Other studies related to UML model understandability compare the effect of using different UML diagram types (e.g., sequence and collaboration diagrams). For example, Otero and Dolado take different UML diagram types (sequence, collaboration, and state diagrams) and evaluate the semantic comprehension of the diagrams when used for different application domains (Otero and Dolado 2004).

Some authors investigate styles and rigor in UML models and how they affect the understandability of the models. For example, Briand et al. (2005) investigate the impact of using OCL (Object Constraint Language) in UML models on defect detection, understandability, and impact analysis of changes. They find that the benefits for the individual activities are modest, but the overall benefits of using OCL on the aforementioned activities are significant. None of the aforementioned studies examines the understandability of architectural components, the central high-level organizational units of the architectural descriptions of software systems.

Work in the field of process model metrics emphasizes the importance of model characteristics for assessing model understandability. Such metrics measure structural properties of a process model, motivated by prior work in software engineering on lines of code, the cyclomatic number, and object-oriented metrics (McCabe 1976; Chidamber and Kemerer 1994; Fenton and Pfleeger 1998). Soo and Jung-Mo (1992), Nissen (1998), and Morasca (1999) focus on defining metrics. Different metrics have also been validated empirically. Cardoso adapts the cyclomatic number metric to business processes (calling it control-flow complexity (CFC)) and demonstrates the correlation of the metric with the perceived complexity of process models (Cardoso 2006). Canfora, Rolon, and Garcia analyse understandability as an aspect of maintainability using different metrics of size, complexity, and coupling in their experiments, and identify several significant correlations (Canfora et al. 2005; Aguilar et al. 2007). Other metrics are related to cognitive research, e.g. Vanderfeesten et al. (2008), or based on concepts of modularity, e.g. Vanhatalo et al. (2007) and van der Aalst and Bisgaard Lassen (2008).

Different empirical validations in the field of process models clearly show that size is an important model factor for understandability, but it does not fully determine the phenomena of understanding. This means that additional metrics like structuredness can significantly improve the explanatory power (Mendling 2008). In our case, we examine the effect of different metrics, which measure more or less the same concepts as those mentioned for process model understandability (size, coupling, complexity), on the understandability of components’ functionalities implemented by the corresponding sets of source code classes. We also show that size alone is not enough to fully determine understandability and that additional properties need to be taken into account. Similar to our work, Reijers and Mendling (2011) investigate the impact of personal and model-related factors on the understandability of process models. They show that expert modelers perform significantly better and that the complexity of the model affects understanding. A combined regression model is calculated that permits preliminary conclusions on the relative importance of both groups of factors. They find that personal factors (theoretical knowledge, practical experience, educational background) have a stronger explanatory power in terms of adjusted R² than model-related factors; however, they kept the size of the models constant by intentionally selecting models of equivalent size. We also find that the participants’ experiences are important, as are the system-related metrics, but in contrast to the work by Reijers and Mendling, we find that the system-related metrics have a significantly stronger explanatory power and can even be used alone for the prediction, i.e. combining them with the experiences does not produce a stronger explanatory power. Furthermore, we take size into account. Also, all our participants are students; we do not consider experts from industry, as is the case in the previous study.

2.4 Software quality models

To assess design quality, different object-oriented software quality models have been proposed and validated in the literature (Chidamber and Kemerer 1994; Bansiya and Davis 2002; Genero Bocco et al. 2005; Harrison et al. 1998; Basili et al. 1996). In those models, software quality is assessed using several software metrics that quantitatively assess design properties such as coupling and cohesion. However, those models are insufficient for managing understandability in high-level system representations such as the module, package, or component view, because they capture a software system as a set of classes and their relationships, not as a set of modules, packages, or components and their relationships.

In contrast to these quality models, Bansiya and Davis (2002) propose a hierarchical quality model for object-oriented design quality assessment (QMOOD) that is able to assess the understandability of a system. Their model extends Dromey’s quality framework for building product-based quality models (Dromey 1995; Dromey and McGettrick 1992). However, QMOOD only considers the dependencies between classes within a module, ignoring the dependencies between classes of different modules as well as the module hierarchy, and therefore cannot properly assess the quality of a modular design. Sarkar et al. (2008) examine different metrics that can be used to assess the modularization quality of a large-scale object-oriented software system, but the authors do not relate their metrics to high-level quality attributes; more investigations are therefore necessary to establish such links. Hwa et al. (2009) propose a hierarchical model to assess the understandability of modularization in large-scale object-oriented software. They define several design properties that capture the characteristics influencing understandability, and, based on these properties, design metrics that are used to quantitatively assess understandability. In this article, we use the concepts and metrics defined in the work by Hwa et al. to improve the explanatory power of our previously obtained models of the understandability of architectural components.

2.5 Other aspects related to architectural component models

Even though there is a lack of empirical studies on the understandability of architectural component models, other aspects, such as fault density and reuse of components, have been studied before. Fenton and Ohlsson examine the relations between fault density and component size (Fenton and Ohlsson 2000). Mohagheghi et al. use historical data on defects, modification rate, and software size to compare software reuse with defect density and stability (Mohagheghi et al. 2004). Malaiya and Denton study the factors that can be used to determine the “optimal” component size with regard to fault density (Malaiya and Denton 2000); they identify component partitioning and implementation as influencing factors. Graves et al. examine the software change history of components in order to create a fault prediction model (Graves et al. 2000). Metrics such as change times, the time elapsed since the last changes, and the number of changes are used in the model, while size and complexity metrics are not deemed useful. These and similar studies have in common with ours that a link is made between software quality or desired properties, such as fault density or reuse rate, and component properties, such as size, complexity, or change rate. They differ from our study in that they examine aspects that can be studied without human participants: they analyse only the software systems themselves and their historical data.

A number of authors propose ways to improve the understandability of architectural models through additional models or documentation artefacts. One major research direction deals with documenting architectural decisions and architectural knowledge in addition to component models (Babar and Lago 2009; Jansen and Bosch 2005; Zimmermann et al. 2007). Another deals with architectural views (Clements et al. 2002; Hofmeister et al. 2000; Kruchten 1995), which enable different stakeholders to view an architecture from different perspectives. Both research directions only complement component models with additional knowledge; neither studies the understandability of component models with regard to their relations to the system implementation.

2.6 Architecture abstraction and evolution

Several approaches support the abstraction of the architecture from other system artefacts as well as architecture evolution. Here, we discuss those approaches that are most closely related to the approach used in our tool.

Konersmann et al. (2013) describe the ADVERT approach, which provides support for software evolution at the architectural level. Their approach is based on two ideas: (1) maintaining trace links between requirements, design decisions, and architecture elements, and (2) explicitly integrating software architecture information into the code. Contrary to our approach, the ADVERT approach assumes that the architecture already exists (it is built from the design solutions) and does not provide architecture-level quality checks. Another approach that focuses on architecture evolution is proposed by Barnes et al. (2014). They support the modelling of different evolution paths and allow reasoning about architecture evolution based on these paths. Cuesta et al. (2013) extend the approach by Barnes et al. by proposing the documentation of architecture evolution using architectural knowledge. These approaches focus more on reasoning about architecture evolution, while our approach aims at supporting architecture evolution so that source code and architecture documentation evolve in a synchronized fashion, while also allowing architecture quality evaluation.

Several approaches focus on the automatic creation of source code abstractions using automatic clustering. A comparison and review of those approaches and the corresponding clustering measures can be found in the work by Maqbool and Babri (2007). They define a number of groups of clustering algorithms and compare their performance using different open source projects. The results show which approach works well for which application, but no conclusions are drawn regarding the overall effort necessary to correct the automatic clustering. Contrary to all these approaches, our DSL-based approach is semi-automatic, enables the checking of design constraints during the abstraction process, provides traceability between source code and models, and focuses on the evolution of the architecture (having an “up-to-date” architecture that reflects the source code) rather than the recovery of the architecture. Also, our approach provides quality checking of the generated architectural abstractions based on the corresponding empirical evaluations.

Egyed (2004) proposes an approach for model abstraction based on traceability information and abstraction rules. The author identified 120 abstraction rules for the example of UML class models, which need to be extended with a probability value because the rules may not always be valid. Our approach is based on architectural abstraction specifications that enable creating architectural models on different levels of abstraction, starting from the system implementation.

3 Empirical study description

For the planning of our study, data collection, and the analysis and interpretation of the results, we followed the experimental process guidelines proposed by Kitchenham et al. (2002). In particular, for the planning phase, the following guidelines were followed: experimental context setting guidelines (examining the related work, defining hypotheses, and considering the circumstances in which an empirical study takes place) and study design guidelines (defining the population of the study, administering the treatments, considering methods for reducing bias). For data collection and the analysis and interpretation of the results, the following guidelines were followed: data collection guidelines (defining the measures used in the study, ensuring their accurate calculation, considering which data should be excluded), analysis guidelines (choosing the appropriate statistical techniques, performing a data sensitivity analysis), and interpretation guidelines (defining the population and the circumstances for which the results apply, specifying study limitations and threats to validity).

3.1 Goals

As mentioned above, this article aims at further elaborating on the concepts and metrics related to the empirical evaluations of the understandability of components that we studied in our previous work. Namely, we examine the usefulness of the hierarchical understandability metrics proposed in the work by Hwa et al. (2009) as well as the participants’ experience and try to improve the prediction efficiency of our previous prediction models.

In the following paragraphs, we provide the notation and definitions of the metrics used in our previous work, as well as of the metrics from the discussed hierarchical model.

The metrics that we studied in our previous work include metrics adapted from the corresponding package-level metrics defined by Martin (2003) (studied in Stevanetic and Zdun 2014b) and metrics on graphs previously defined by Allen (2002) and Allen et al. (2007) (studied in Stevanetic and Zdun 2014a).

The metrics adapted from the package-level metrics defined by Martin are shown in Table 1. The first three metrics are adapted from the corresponding package-level metrics (number of classes in a package, package afferent coupling, and package efferent coupling) defined by Martin (2003). We consider the dependencies between components in terms of the dependencies between classes, while in the work by Martin the dependencies between packages are considered through the number of packages related to the given package (see Footnote 1). The first three metrics characterize the coupling and the size of a component, and the fourth metric is introduced to model the internal complexity of a component in terms of the number of dependencies between classes within the component.

Table 1 Metrics adapted from the package level metrics defined by Martin (2003)
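To make this adaptation concrete, the following minimal R sketch (a hypothetical illustration, not the tool used in the study) computes four such component-level values from a class-level dependency edge list and a class-to-component mapping; the column and metric names are assumptions and do not reproduce the exact names of Table 1.

```r
# Hypothetical sketch: component-level metrics in the spirit of Table 1,
# computed from class-level dependencies. Column and metric names are assumed.
component_metrics <- function(edges, mapping) {
  # edges:   data frame with columns 'from', 'to' (class-level dependencies)
  # mapping: data frame with columns 'class', 'component'
  edges$from_comp <- mapping$component[match(edges$from, mapping$class)]
  edges$to_comp   <- mapping$component[match(edges$to,   mapping$class)]
  do.call(rbind, lapply(unique(mapping$component), function(comp) {
    data.frame(
      component = comp,
      n_classes = sum(mapping$component == comp),                      # size
      afferent  = length(unique(edges$from[edges$to_comp == comp &     # external classes
                                           edges$from_comp != comp])), # depending on comp
      efferent  = length(unique(edges$to[edges$from_comp == comp &     # external classes
                                         edges$to_comp != comp])),     # comp depends on
      internal  = sum(edges$from_comp == comp & edges$to_comp == comp) # intra-component deps
    )
  }))
}
```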

Regarding the metrics defined by Allen (2002) and Allen et al. (2007), a graph composed of nodes and edges is considered as an abstraction of a software system, and a sub-graph represents a software module. In our case, nodes correspond to source code classes, while edges correspond to the relationships between those classes. Components (which group source code classes) in our case correspond to the modules in the work by Allen (2002).

In this paragraph, we provide the metrics’ definitions together with some explanations. The definitions of the graph-based metrics are shown in Table 2. The notation used for the metric definitions is the following (adapted from the work by Allen 2002): S – the whole system graph (all nodes and edges); S# – the edges-only graph (edges in S and their end points); Si – the node sub-graph (nodes in S# and edges incident to node i, where i = 0 for the environment node and i = 1,...,n for system nodes); MS – S partitioned into modules; mk – module k (nodes in a module and their incident edges); MS∗ – nodes in MS and the intermodule edges; MS0 – nodes in MS and the intramodule edges; Pr(i,j) – the path between nodes i and j (nodes and edges on the path between nodes i and j, including i and j); pL(i) – the proportion of the i-th row pattern in the nodes × edges table; nk – the number of nodes in a module; \(n_{e\_k} \) – the number of edges incident to nodes in a module; and \(m_{k}^{(n_{k})} \) – the module as a complete graph consisting of the nodes in the module and all possible edges between those nodes. The definitions of the length metrics are based on the notion of size, applied to paths (each path is considered to be a module in that case) (Allen 2002). The definitions of the coupling and cohesion metrics are based on the definition of complexity, whereby different graph abstractions are considered: for the complexity metrics the whole system graph is considered, while for the coupling and cohesion metrics the intermodule-edges graph and the intramodule-edges graph are considered, respectively. For instance, the counting coupling metric for a module is equal to the number of edges incident to the nodes in the module where only intermodule edges are taken into account, unlike the counting complexity metric, where the edges in the whole system graph are taken into account.
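As an illustration of the counting variants only (the information-theoretic variants replace these counts with entropy terms over the node–edge patterns, which we do not reproduce here), a hedged R sketch under an assumed data layout:

```r
# Hedged sketch of the counting variants described above; data layout is assumed.
# Counting complexity counts all edges incident to a module's nodes; counting
# coupling counts only the intermodule ones. Allen's cohesion additionally
# normalizes against the complete graph on the module's nodes, which is omitted here.
counting_metrics <- function(edges, mapping) {
  edges$from_mod <- mapping$module[match(edges$from, mapping$class)]
  edges$to_mod   <- mapping$module[match(edges$to,   mapping$class)]
  do.call(rbind, lapply(unique(mapping$module), function(m) {
    incident <- edges$from_mod == m | edges$to_mod == m      # edges touching module m
    intra    <- edges$from_mod == m & edges$to_mod == m      # intramodule edges
    data.frame(module      = m,
               size        = sum(mapping$module == m),       # number of nodes n_k
               complexity  = sum(incident),                  # all incident edges
               coupling    = sum(incident & !intra),         # intermodule incident edges
               intra_edges = sum(intra))                     # basis of the cohesion metric
  }))
}
```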

The metrics from the hierarchical understandability assessment model consider six design properties that affect the understandability of the modular design of a system. Hwa et al. (2009) systematically examined which properties can affect understandability; the six they found are: design size, complexity, encapsulation (i.e. information hiding), coupling, cohesion, and modular abstraction. Complexity, encapsulation, coupling, and cohesion come from general properties that should be managed for software quality (Ghezzi et al. 2002; Booch 1994; Bansiya and Davis 2002), while modular abstraction is a new design concept introduced by the module/package hierarchy (Lungu et al. 2006). Table 3 presents the metric definitions together with the corresponding notation. Note that modules in the work by Hwa et al. correspond to components in our case. Note also that the DMH metric (Depth in Module Hierarchy) might not always be directly applicable to components, since, e.g., one (big) component might contain classes located in several modules/packages with similar functionalities. In that case, similarly to Hwa et al., we can compute the average depth in the hierarchy over all classes in a component, with respect to the location of each class in the module/package hierarchy.
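For example, under the assumption that the depth of a class can be approximated by the nesting depth of its package name, the averaging just described could be sketched in R as follows (names are illustrative, not part of the original metric suite):

```r
# Illustrative sketch of the averaged DMH adaptation discussed above.
# 'classes' is assumed to be a data frame with columns 'package'
# (e.g. "com.soomla.store.data") and 'component'.
avg_dmh <- function(classes, comp) {
  pkgs  <- classes$package[classes$component == comp]
  depth <- lengths(strsplit(pkgs, ".", fixed = TRUE))  # nesting depth per class
  mean(depth)                                          # average depth for the component
}
```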

3.2 Variables

The variables used in our study can be divided into two sets: variables collected from the participants and variables collected from the studied system. All variables can also be divided into dependent and independent variables. The first set includes 7 variables, of which 5 are independent variables related to the participants’ demographic information: programming experience, Java programming experience, commercial programming experience, experience in programming computer games, and Android programming experience. The remaining two are the time required to study a component and the percentage of correct answers to the given questions. The time variable is used to measure the effort required to understand a component and is a dependent variable. The percentage of correct answers variable is introduced to help estimate the time variable in case the participants did not spend enough time to fully examine the given components and achieve a high percentage of correctness (see below for more explanation).

The second set of variables relates to the metrics that we aim to explore (see Tables 1, 2 and 3) and is calculated from the studied system. All these metrics are treated as independent variables.

Table 2 Graph based metrics definitions (adapted from Allen 2002 and Allen et al. 2007) (please note that Size(Si) is in principle calculated in the same way as Size(mk|S), just a different graph (in this case the one defined as Si) is observed)
Table 3 Notation for the hierarchical understandability metrics and their definitions (adapted from Hwa et al. 2009)

The dependent variables and their scale types, units, and ranges are shown in Table 4 while the independent variables together with their scale types, units, and ranges are shown in Table 5.

Table 4 Dependent variables and their scale types, units and ranges (reused from Stevanetic and Zdun 2014a)
Table 5 Independent variables and their scale types, units and ranges

3.3 Hypotheses

We expect that the given hierarchical understandability metrics can be used as good predictors of understandability. In addition, we expect that the participants’ experience is also significant in predicting the understandability effort; in other words, we expect that prediction models that use the participants’ experience provide better predictions than using the median as an estimate. For the experience variables, we do not expect that they can capture the variability of the measured understandability as well as the metrics related to the system itself. For example, if we have two components to be studied, one with 3 and the other with 15 classes, it is hard to believe that participants with the same experience would need the same effort to understand them; the bigger component would require much more effort than the smaller one, simply due to the variation in their sizes. Therefore, we do not expect the corresponding prediction models for the experience variables to be highly accurate. Finally, by combining the system-related metrics (the graph-based, package-level, and hierarchical understandability metrics) with the participants’ experience, we expect that more efficient prediction models can be obtained compared to those that separately consider the graph-based metrics, the package-level metrics, the hierarchical understandability metrics, or the participants’ experience.

Based on previous considerations we formulate the following set of hypotheses:

Hypothesis (H1): The hierarchical quality model metrics can be successfully utilized to predict the effort required to understand a component with high practical significance.

Hypothesis (H2): Prediction models created using just the participants’ experiences as predictors have at least one predictor with a non-zero coefficient, i.e. they can predict the understandability effort significantly well.

Hypothesis (H3): Combining both the system-related metrics and the participants’ experiences leads to a significantly increased efficiency of the obtained prediction models compared to the prediction models that use just the graph-based metrics.

Hypothesis (H4): Combining both the system-related metrics and the participants’ experiences leads to a significantly increased efficiency of the obtained prediction models compared to the prediction models that use just the package-level metrics.

Hypothesis (H5): Combining both the system-related metrics and the participants’ experiences leads to a significantly increased efficiency of the obtained prediction models compared to the prediction models that use just the participants’ experiences.

Hypothesis (H6): Combining both the system-related metrics and the participants’ experiences leads to a significantly increased efficiency of the obtained prediction models compared to the prediction models that use just the hierarchical understandability metrics.

3.4 Study design

3.4.1 Subjects

The participants of the study are 49 master students. The study took place within the Advanced Software Engineering (ASE) lecture at the University of Vienna in the Winter Semester 2013.

3.4.2 Objects

The object of our study was the Soomla Android store system (see Footnote 2), version 2.0. It is an open source, cross-platform framework that supports a virtual economy in mobile games and encourages better game design and faster development. We chose this system because of the following factors:

  • The system is open source which enables us to carry out the study and communicate its results.

  • The system is written in Java, with which the participants are sufficiently familiar.

  • The application domain of the system is probably known to the participants from similar game applications.

  • The system has industrial relevance since it is used in many real-world games.

  • The source code of the system consists of 54 classes within 8 packages. The system has in total 3623 LOC (excluding blank and comment lines), and is therefore likely understandable within a study session, but also not too simple.

3.4.3 Instrumentation

Architectural documentation about the Soomla Android store system

A UML component diagram representing the architecture of the system, its conceptual description, and the traceability links that relate the architecture to the system implementation (class design) were handed out to the participants.

The architecture of the system is shown in Fig. 2. There are in total seven architectural components: Security (C1), CryptDecrypt (C2), PriceModel (C3), GooglePlayBilling (C4), StoreController (C5), DatabaseServices (C6), and StoreAssets (C7). In addition, there are two external components: GooglePlayServer, the REST Web Services running at Google, and SQLLiteDatabase, the database accessed using JDBC. The architectural representation of the system was constructed by two experienced software architects, who fully studied the given system and its documentation and extracted its architecture together with the traceability links to the system implementation. Table 6 shows a short description of the roles that the components play in the system.

Fig. 2 Architectural description of the Soomla Android store system in the form of a UML component diagram (reused from Stevanetic and Zdun 2014b)

Table 6 Soomla Android store architectural components and their roles in the system (reused from Stevanetic and Zdun 2014b)

Source code access

Access to the source code of the system was browser-based, on prepared computers. By grouping the classes into the corresponding components, we enabled the participants to easily navigate through the components and open the source code of the classes realizing them.

A questionnaire to be filled-in by the participants

The first part of the questionnaire covers the participants’ self-rated experiences (programming, Java, commercial, game, and Android programming experience). The second part contains the understandability questions related to the 7 architectural components. Four true/false questions were provided for each component, and the participants had to check the right answers among them. In order to correctly answer the questions, the participants had to fully understand the functionalities of each component by examining the relationships (as well as the roles of those relationships) among the classes inside a component, and between the classes inside a component and the classes outside of that component. For bigger components, answering the questions requires analysing more classes and their relationships than for smaller components. Table 7 shows an example of two questions, one for Component GooglePlayBilling (Q1) and the other for Component Security (Q2). Component GooglePlayBilling (with 11 classes) is bigger than Component Security (with 2 classes), and therefore the corresponding question(s) require examining more classes and their relationships than the question(s) for Component Security. The order in which the seven components were studied varied across participants: 7 random orderings of the components were generated and assigned to the participants (the order of questions within a component remained the same). For example, one participant studied the components in one order, e.g. C2, C6, C1, C3, C5, C7, and C4, while another studied them in some other randomly generated order, e.g. C1, C5, C7, C3, C4, C6, and C2. The randomization allows us to obtain more or less balanced data for all components by equalizing fatigue effects and the potential lack of time to complete all required tasks.

Table 7 An example of two questions (one for Component GooglePlayBilling and one for Component Security)

In order to measure the time that the participants spent on analysing each of the components, we provided a table with time slots. Each slot contains a start and a stop time: the start time indicates when the participants started analysing a component, while the stop time indicates when they finished. Several slots were provided for each component in case the participants wanted to analyse a component several times. The format used for writing the time is hour:minute. The time limit for the whole study was 90 minutes. None of the participants had studied the system before, so the potential bias of some participants having spent additional time (besides the time written in the time slots) on examining the system is negligible. To ensure that there would be enough time to analyse all the components within the study session of 1.5 hours, we piloted the same study with several of our colleagues before running it within the course. All of them agreed that the given tasks are appropriate for the given time limit. All the instruments explained above are available at the Web address given in Footnote 3. The file containing our results, so that they can be assessed by others, is available on the same page.

3.5 Execution

3.5.1 Data collection

Figure 3 shows the data related to the participants’ demographic information.

Fig. 3 Participants’ demographic information

Based on the information in the figure, we can say that the programming experience of the participants is medium to high. Most of them have more than 3 years of programming experience. Many of the participants also have industrial programming experience, but only a few of them have game programming experience or experience with Android.

The descriptive statistics (mean, median, and standard deviation) for the time and the percentage of correct answers variables are shown in Fig. 4. From our results, we excluded the participants who have less than one year of programming experience (9 of them). Some participants did not specify both the start and stop time for all studied components; we excluded those results from the analysis as well (only for the components where the start and stop times were not specified). The total number of collected data samples for all components is 7 (components) × 49 (students) = 343, and the number of excluded data samples is 103.

Fig. 4 Descriptive statistics for the time and the percentage of the correct answers variables (reused from Stevanetic and Zdun 2014a)

The data related to the metrics we aim to explore are shown in Tables 8, 9 and 10. The graph-based metrics are automatically calculated from the corresponding graph abstractions of the system. The graph abstraction of the whole system is also utilized for the calculation of the package-based and hierarchical understandability metrics. The metrics were independently calculated by the two architects who studied the system, in order to avoid misinterpretations in their calculation. The accuracy of the graph-based metric calculations was additionally tested on the examples provided by Allen (2002).

Table 8 Package based component level metrics (reused from Stevanetic and Zdun 2014b)
Table 9 Graph based component level metrics (reused from Stevanetic and Zdun 2014a)
Table 10 Hierarchical understandability component level metrics (reused from Stevanetic and Zdun 2016)

Looking at Fig. 4, we can see that the obtained time for the first three components (C1, C2 and C3) is significantly lower than the time for the remaining four components. This observation is expected, since the first three components contain a smaller number of classes than the other four. Another observation concerns component C4: the average time needed to analyse this component is significantly higher than the time needed to analyse components C5, C6 and C7. Consequently, the percentage of correct answers for components C5, C6 and C7 is lower than for component C4, which has values more or less similar to those of the smaller components (C1, C2 and C3). It is to be expected that the percentage of correct answers decreases for components with many classes, simply because of the larger amount of information that needs to be handled, which increases the probability of missing some relevant information. Still, it also seems that the participants spent somewhat less time analysing components C5, C6 and C7 than necessary (or at least for component C7, which has the same number of classes as component C4) to score better and achieve a higher percentage of correct answers. In line with this and the discussion in Section 3.2, the percentage of correct answers variable is used to help estimate the time required to fully analyse a component and achieve the maximal correctness of 100%.

3.5.2 Validation

To prevent the participants from using forbidden materials or talking to each other, at least one observer was present in the lab during the study execution. This also enabled the participants to pose clarification questions. The materials given to the participants were collected before any of them left the lab. There were no cases in which the participants behaved unexpectedly.

4 Analysis

The following statistical tests are used for analysing the data.

  • Variance Inflation Factor (VIF) (O’brien 2007) and Condition Number (CN) (Belsley 1991) - Collinearity Analysis

  • Multiple Regression Analysis (MRA) (Rubinfeld 2000)

VIF and CN are commonly used to detect multicollinearity problems (see below). MRA is commonly used to examine the relationship between one dependent variable and more than one independent variable (predictors). The relationship is assumed to be linear, which makes a model easy to interpret. Furthermore, the “true” relationship is often at least approximately linear over the range of values that are of interest to us; even if it is not, the variables can be transformed in such a way as to linearise the relationship. The analyses are performed using the programming language R (R Development Core Team 2008).

4.1 Collinearity analysis

Collinearity analysis aims at identifying the variables that are highly correlated with other variables. Such variables should be excluded from the set of all possible predictors considered for the prediction. To test for possible correlations within the studied metric sets, we calculate the Condition Number (CN) and the Variance Inflation Factor (VIF). VIF values greater than 10 suggest high correlation, i.e. multicollinearity problems among the tested variables; CN values greater than 30 suggest the same (Belsley et al. 1980).
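A minimal R sketch of these two checks, assuming a data frame metrics whose column time holds the measured times and whose remaining columns are the candidate predictors (the car package is one common way to obtain VIF values; the article does not state which routines were used beyond R itself):

```r
# Sketch of the collinearity checks; the data frame 'metrics' and its column
# 'time' are assumptions for illustration.
library(car)                              # provides vif()

fit <- lm(time ~ ., data = metrics)       # regress time on all candidate predictors
vif(fit)                                  # VIF > 10 flags a multicollinearity problem

X <- scale(model.matrix(fit)[, -1])       # standardized predictor matrix (no intercept)
kappa(X, exact = TRUE)                    # condition number; > 30 flags multicollinearity
```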

Regarding the information theory based and the counting based graph metrics, we consider them as two separate sets of predictors because we have already seen that they are highly correlated in our case. Therefore, all potential predictors considered for generating the prediction models include either the information theory based metrics or the counting based metrics, plus the percentage of correct answers (see the discussion in Section 3.5.1). The VIF and CN values for the information theory graph based metrics and the package based metrics are shown in Tables 11 and 12, respectively.

Table 11 Condition number and variance inflation factor – information theory graph based metrics
Table 12 Condition Number and Variance Inflation Factor – package based metrics (reused from Stevanetic and Zdun 2014b)

Regarding the information theory graph based metrics, as we can see from Table 11, the greatest VIF value when all metrics (predictors) are included (column “VIF”) is that of the Length metric (54.54 > 10). The VIF value of the Size metric is very close to it (52.61). Therefore, in the first step we can exclude either the Length or the Size metric from the set of predictors. The VIF values and the CN value after excluding these metrics are shown in the third and fourth columns of the table. After excluding the Length metric, there are two predictors that can be further excluded, the Size or the Complexity metric (their VIF values are both greater than 10 and similar; see Footnote 4). After excluding the Size metric, only the Length metric has a VIF value greater than 10. We thus obtain two final sets of possible predictors that are used for creating the prediction models for the information theory based metrics: the excluded predictors are either the Size and Length metrics or the Complexity and Length metrics. The final sets of predictors have acceptable VIF and CN values (see for example those in Table 11). Using the same procedure for the counting based metrics, we obtain three final sets of possible predictors, i.e. the sets exclude either the Size and Length metrics, the Complexity and Length metrics, or the Size and Cohesion metrics.
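One automated way to approximate the manual exclusion procedure just described is to repeatedly drop the predictor with the highest VIF until all remaining values fall below the threshold; this greedy sketch is a simplification, since the article instead explores the alternative exclusion branches explicitly.

```r
# Greedy VIF pruning; a simplification of the branching procedure in the text.
prune_by_vif <- function(data, response, threshold = 10) {
  preds <- setdiff(names(data), response)
  repeat {
    v <- car::vif(lm(reformulate(preds, response), data = data))
    if (max(v) <= threshold || length(preds) <= 2) break
    preds <- setdiff(preds, names(which.max(v)))  # drop the worst offender
  }
  preds                                           # remaining predictor set
}
```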

For the package based metrics, as we can see from Table 12, the VIF coefficients when all predictors are included are all less than 10, with the greatest VIF value being 7.96 (for NIntD). Therefore, there is only a slight tendency towards multicollinearity between the variables. We nevertheless decided to exclude NIntD from the set of all predictors, after which we get acceptable results for both the VIF and CN values (see Table 12).

Regarding the hierarchical understandability metrics, there are no multicollinearity problems in that set of metrics: the highest VIF value is that of the MSC metric (3.87), and the CN value for this set of predictors is 13.11. The participants’ experiences also do not exhibit multicollinearity problems: the highest VIF value is that of the programming experience variable (1.62), and the CN for the whole set of variables is 8.29. As mentioned above, we would also like to examine the model in which all the studied variables are taken into account, i.e. the hierarchical understandability metrics, the participants’ experiences, the package based metrics, and the counting or information theory graph based metrics. Combining all 4 sets introduces multicollinearity problems, since there are metrics in multiple sets that measure the same concepts (size, coupling, and cohesion), even if different metrics for those concepts are used. After examining the VIF and CN values, all graph based and package based metrics can be excluded from the set. After excluding these metrics, the highest VIF value is that of the MSC metric (3.98), and the CN value for the remaining set of variables is 16.76.

4.2 Multiple regression analysis

In this part of the analysis, we create multiple regression models that can be used for predicting the time variable. They are also used to test our hypotheses described in Section 3.3. To prevent over-fitting of the data, i.e. to enable more efficient generalization of the results, we perform the Mallows’ Cp calculation when creating the prediction models (Kobayashi and Sakata 1990). If p is the number of predictors, including the constant predictor if it exists, all models that satisfy the inequality Cp ≤ p must be considered reasonably good fits with respect to preventing data over-fitting.
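A hedged sketch of this selection step using the leaps package (one common choice; the article does not name the package used) on the same assumed data frame metrics:

```r
# Enumerate predictor subsets and keep those with Mallows' Cp <= p,
# where p counts the coefficients including the intercept.
library(leaps)

search <- regsubsets(time ~ ., data = metrics, nbest = 3,
                     nvmax = ncol(metrics) - 1)
s <- summary(search)
p <- rowSums(s$which)                     # coefficients per candidate model
s$which[s$cp <= p, , drop = FALSE]        # the "reasonably good" subsets
```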

Before we move to the regression analysis, we briefly explain the role of the percentage of correct answers variable. In Section 3.2, we mentioned that this variable is used as an independent variable to help estimate the time as a dependent variable. There might exist a dependency between the time and the percentage of correct answers, because if the participants spend less than some minimum time required to analyse a component, the percentage of correct answers will probably decrease because of an incomplete insight into all relevant component parts. Therefore, with the help of the percentage of correct answers variable, we can estimate the time required to fully understand a given component, i.e., to achieve 100% correct answers (see Footnote 5). If we replace the value of the percentage of correct answers in the obtained prediction models (see below) with the constant value of 100%, we obtain the effort required to fully understand a component, which then depends only on the other factors included in the model. Note also that predicting the time for 100% correctness is not the most realistic requirement because of the limited data available for it. For example, we could also estimate the time for 75% correctness, which would be more accurate because more data exist for it. However, in our case we use 100% because the difference in the prediction is negligible.
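Concretely, with a fitted model this substitution amounts to predicting on data in which the correctness predictor is fixed at 100%; the sketch below reuses the lm object fit from the earlier collinearity sketch, and the column name correctness is an assumption.

```r
# Estimate the time to reach full correctness: fix the correctness predictor
# at 100% and keep the remaining predictors at their observed values.
full_understanding <- metrics
full_understanding$correctness <- 100
predict(fit, newdata = full_understanding)   # predicted time per observation
```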

To check the accuracy of the obtained prediction models, we calculated a goodness-of-fit measure using the following equation based on the absolute deviation from the median (Kampenes et al. 2007), assuming Xi is the prediction and Yi the actual value:

$$A \text{ (accuracy)} = \frac{\sum_{i}|Y_{i}-X_{i}|}{\sum_{i}|Y_{i}-\operatorname{median}(Y_{i})|} $$

The smaller the value of A, the better the prediction. If the value is greater than 1, the estimation is not working, i.e. there is no evidence that the prediction is better than using the median as an estimate. The value (1-A) represents the proportion of the variation in the Y variable explained by the predictions. (1-A) is a robust analogue of R², so the following guidelines, based on those proposed by Kampenes et al. (2007), can be used for the effect size calculation: (1-A) values in the range of 0 to 0.0372 represent a small effect size, values in the range of 0.0372 to 0.208 a medium effect size, and values in the range of 0.208 to 0.753 a large effect size. Furthermore, for good prediction models, the residuals have to be normally distributed, which is the case with our data. Influential points are points whose removal causes a large change in the fit; they can be detected using Cook’s distance contour lines (Cook 1977). When some points have a distance larger than 1, this suggests that the model might be poor or might have outliers. Our models do not have influential points. We further report the significance of the coefficient of determination (R²) for the obtained models, measured by the F-statistic (Dalgaard 2004).
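The accuracy measure and the effect-size bands can be computed directly from the actual and predicted times, for instance as in the following sketch (function names are illustrative):

```r
# Goodness-of-fit measure A and its effect-size interpretation (1 - A),
# following the bands of Kampenes et al. (2007) cited above.
accuracy_A <- function(actual, predicted) {
  sum(abs(actual - predicted)) / sum(abs(actual - median(actual)))
}

effect_size <- function(A) {
  es <- 1 - A
  if (es < 0.0372) "small" else if (es < 0.208) "medium" else "large"
}

# Influential observations can be screened via Cook's distance:
# any(cooks.distance(fit) > 1)
```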

In order to test our hypotheses described in Section 3.3, we first generate the prediction models that consider: 1) the package based understandability metrics, 2) the graph based understandability metrics, 3) the hierarchical understandability metrics, 4) the participants’ experiences, and 5) both the system related metrics (the package based, graph based and hierarchical understandability metrics) and the participants’ experiences, using the analysis explained above. The obtained models are then compared to check whether there is a significant difference in their prediction capabilities. The best 3 models in terms of the given accuracy measure (A) for each of the above cases that satisfy the explained criterion (Cp ≤ p) are shown in Tables 13 (the package based metrics), 14 (the counting graph based metrics), 15 (the hierarchical metrics), 16 (the participants’ experiences), and 17 (the participants’ experiences together with the system related metrics). For all shown models except the ones that consider the participants’ experiences as predictors, the effect size is in the range of 32% to 40%, which represents a large effect size. These results suggest that the obtained prediction models have high practical significance. Please note that the percentage of the correct answers variable is taken into account as an independent variable in the construction of the prediction models, based on the discussion provided in Section 3.1. With regard to that, we have to check whether this variable alone captures most of the variance in the measured understandability effort, in which case the studied metrics and participants’ experiences would not play an important role. This is not the case, since the prediction model that considers only the percentage of the correct answers variable has an accuracy measure (A) greater than 1 and does not provide a better prediction than using just the median as an estimate.

Table 13 Models’ parameters – package based metrics
Table 14 Models’ parameters – counting graph based metrics
Table 15 Models’ parameters – hierarchical understandability metrics
Table 16 Models’ parameters – participants’ experiences
Table 17 Models’ parameters – participants’ experiences together with the system related metrics

Another useful technique for overcoming the over-fitting problem is cross-validation analysis (Field et al. 2012). Besides the Mallows’ Cp analysis, we also applied the 10-fold cross-validation technique on our data (Footnote 6). The results of the cross-validation analysis corroborate the results of the Mallows’ Cp analysis and confirm their validity.
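For readers who wish to replicate this step, the following is a generic base-R sketch of 10-fold cross-validation; the exact setup we used is described in Footnote 6, and the data frame and predictors below are hypothetical.

# Minimal 10-fold cross-validation sketch, assuming a data frame `d` with the
# dependent variable `time` and hypothetical predictors m1, m2.
set.seed(3)
d <- data.frame(m1 = rnorm(70), m2 = rnorm(70))
d$time <- 8 + 2 * d$m1 - d$m2 + rnorm(70)

k <- 10
folds <- sample(rep(1:k, length.out = nrow(d)))   # random fold assignment
cv_err <- sapply(1:k, function(i) {
  fit <- lm(time ~ m1 + m2, data = d[folds != i, ])   # train on 9 folds
  pred <- predict(fit, newdata = d[folds == i, ])     # predict the held-out fold
  mean((d$time[folds == i] - pred)^2)                 # fold mean squared error
})
mean(cv_err)   # cross-validated prediction error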

Regarding the hypothesis H1, which considers the prediction models for the hierarchical understandability metrics, with respect to the analysis undertaken we can say that the hypothesis H1 is supported, i.e. the hierarchical quality model metrics can be successfully utilized to predict the effort required to understand a component with high practical significance.

Regarding the hypothesis H2, which considers the participants’ experiences as predictors, we see from Table 16 that the effect size of the obtained models (around 4%) is on the border between small and medium. Compared to the other obtained models that consider the system related metrics, these models are much less accurate and efficient. These results comply with the discussion provided in Section 3.3 that the participants’ experiences cannot capture the variability as well as the metrics related to the software model itself. Nevertheless, the hypothesis H2 is supported, i.e. the prediction models for the effort required to understand a component created using just the participants’ experiences as predictors have at least one predictor with a non-zero coefficient, i.e. they can predict the understandability effort significantly well. Please note that this does not mean that the obtained models are well-fitted (accurate); it just means that they can predict a significant amount of the variance compared to the remaining unexplained variance (Field et al. 2012). Based on this result, we can say that the participants’ experiences are important and can significantly improve the understandability, but they are not able to appropriately capture the variance in the data caused by the variation of the system’s structural properties (like size, coupling, cohesion, etc.). As mentioned above, this result is expected.

Finally, to test the hypotheses H3, H4, H5, and H6, we compare the efficiency of the obtained prediction models that use both the system related metrics and the participants’ experiences, on the one side, with the models that separately use the package based, graph based, and hierarchical understandability metrics, as well as the participants’ experiences, on the other. For that purpose, we calculate two parameters: the difference between the AICc (second-order corrected Akaike Information Criterion) values (ΔAICc) of the models to be compared and the corresponding evidence ratios (w). These parameters are commonly used for model comparisons in the case of non-nested models (Footnote 7) (Burnham and Anderson 2002). If the obtained difference (ΔAICc) is lower than 4, there is no significant difference in the prediction capabilities (power) of the two models (Burnham and Anderson 2002). If the difference is in the range [4,7], there is a significant difference in the prediction capabilities, and if the difference is greater than 10, a very strong difference exists (Burnham and Anderson 2002). The evidence ratio expresses how much more likely one model is than the other (for example, a model with AICc = 120 is nearly 150 times more likely than a model with AICc = 130). We compare the best models from each group in terms of the AICc measure. The results of the analysis are shown in Table 18.
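The AICc values and evidence ratios can be computed for any two fitted models. The sketch below uses base R and the standard small-sample correction; packages such as MuMIn provide AICc directly, but the formula is shown explicitly here. The models and data are hypothetical placeholders.

# Sketch of AICc-based comparison of two (non-nested) lm models on the same data.
set.seed(4)
d <- data.frame(a = rnorm(60), b = rnorm(60))
d$time <- 4 + 3 * d$a + rnorm(60)
m_a <- lm(time ~ a, data = d)
m_b <- lm(time ~ b, data = d)

aicc <- function(m) {
  k <- length(coef(m)) + 1                 # estimated parameters incl. residual variance
  n <- nobs(m)
  AIC(m) + 2 * k * (k + 1) / (n - k - 1)   # second-order (small-sample) correction
}

delta <- aicc(m_b) - aicc(m_a)   # Delta AICc relative to the better model
exp(delta / 2)                   # evidence ratio, e.g. exp(10/2) is roughly 148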

Table 18 Model comparisons

Column ΔAICc 1 shows the difference between the AICc values of each model and the model that includes both the hierarchical understandability metrics and the participants’ experience variables (Model 1). The corresponding evidence ratios are shown in column w1. From the obtained ΔAICc 1 values we see that there is a large significant difference (values greater than 10) in prediction capabilities between Model 1 and the last 3 listed models, Models 3, 4, and 5. Regarding the difference in prediction between Model 1 and Model 2 (which includes just the hierarchical understandability metrics), no significant difference exists (ΔAICc 1 = − 1.88). Based on the obtained results we can say that the Hypotheses H3, H4, and H5 are supported while the Hypothesis H6 is not supported, i.e. combining both the system related metrics and the participants’ experience variables leads to a significantly increased efficiency of the obtained prediction models compared to: 1) the prediction models that use just the graph based metrics, 2) the prediction models that use just the package based metrics, and 3) the prediction models that use just the participants’ experiences. The model with both the hierarchical understandability metrics and the participants’ experience is not significantly better in prediction than the model that includes just the hierarchical understandability metrics. It is even slightly worse (the AICc value is increased by 1.88) (Footnote 8). As a consequence of this last fact, we can also conclude that the model that includes just the hierarchical understandability metrics is significantly better than the last 3 listed models, i.e. Models 3, 4, and 5 (the differences of the AICc values are increased by 1.88 in comparison with ΔAICc 1). Columns ΔAICc 2 and w2 show the differences of the AICc values and the corresponding evidence ratios between each model and Model 2.

To summarize the obtained results, we can say the following. The introduced hierarchical understandability metrics can be used to predict the understandability effort of a component with high practical significance. On the one hand, those prediction models are significantly better in predicting the understandability effort than the models obtained using the graph based metrics, the package based metrics, or the participants’ experiences. On the other hand, those models are not significantly different in prediction from the models that combine both the system related metrics (the graph based, package based and hierarchical understandability metrics) and the participants’ experiences (and are even marginally better in terms of AICc). The participants’ experience can predict a significant amount of variance in the data, but the obtained models are not as accurate as the models that use the metrics related to the system itself (concretely, the hierarchical understandability metrics).

With respect to the discussion in Section 3.2, we can now calculate the effort required to fully understand a component by replacing the percentage of the correct answers variable in the obtained prediction models with the constant value of 100%. Figure 5 shows the predicted time variable obtained using the model with the highest effect size value (Model 3 from Table 15) together with the time variable obtained from the participants. The predicted time differs considerably from the time obtained from the participants only for the component StoreAssets (C7). This can be interpreted as the participants needing a bit more time for analysing the component StoreAssets (C7) in order to be able to answer all the questions correctly. This makes sense because the component StoreAssets (C7) has 11 classes, the same as the component GooglePlayBilling (C4), and we therefore expect that they require similar times to be fully studied.
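Mechanically, this substitution amounts to predicting from the fitted model with the correctness variable fixed at 100%. The sketch below illustrates the step with a hypothetical model and hypothetical coefficients; it is not Model 3 from Table 15, whose coefficients are reported in the table rather than here.

# Illustration of estimating the time to fully understand a component by
# fixing the correctness variable at 100%. Data and coefficients are hypothetical.
set.seed(5)
d <- data.frame(correctness = runif(50, 50, 100), msc = rpois(50, 10))
d$time <- 2 + 0.05 * d$correctness + 0.8 * d$msc + rnorm(50)
fit <- lm(time ~ correctness + msc, data = d)

# predicted effort for full understanding of each component
full_understanding <- transform(d, correctness = 100)
predict(fit, newdata = full_understanding)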

Fig. 5 The time from the participants and the time from the predicted model where the correctness of the answers is set to 100%

Before we move to the next section, let us examine one more interesting aspect. Namely, in the context of process model understandability (Canfora et al. 2005; Aguilar et al. 2007) (see Section 2 for more details), different empirical validations showed that size is not enough to fully determine the phenomena of understanding: additional metrics like structuredness help to improve the explanatory power significantly (Mendling 2008). We confirm this in the context of architectural components by comparing the prediction power of the model that considers just the size metric (the MSC metric) and the best obtained model in terms of the accuracy measure that considers the hierarchical understandability metrics. The obtained ΔAICc value is 59.707 and the corresponding evidence ratio is w = 9.2e+12. These results confirm that there is a strong significant difference in the prediction power between the mentioned models.

5 Validity evaluation

In this section we discuss how we tried to minimize the threats to validity. The following threats are taken into account:

Conclusion validity

The conclusion validity indicates to what extent the conclusions are statistically valid. The sample size is one of the possible threats to the statistical validity. In our case, 49 students answered the questions for the 7 components.

While the number of participants is fairly high, the dataset of 7 components is relatively small due to the limited time of the study session. However, after performing a power analysis in R (Kabacoff 2011), we found that the statistical power obtained for our sample with the medium effect size of 0.15, which corresponds to an expected R2 of around 0.4 (we assumed the effect size suggested by Cohen (1988)), is 0.99. This means that the likelihood of finding a prediction model, when there is one with the given effect size, is 99%. Therefore the total sample size is not considered a threat to the conclusion validity. Nevertheless, we plan to increase the number of studied components in our future work.
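The power analysis can be reproduced with the pwr package used in Kabacoff (2011), assuming it is installed. The number of predictors and the sample size below are illustrative placeholders rather than the exact values of our analysis.

# Sketch of a power analysis for multiple regression (Cohen's f2 effect size).
library(pwr)
u <- 5                          # number of predictors (hypothetical)
n <- 343                        # e.g. 49 participants x 7 components (illustrative)
pwr.f2.test(u = u, v = n - u - 1, f2 = 0.15, sig.level = 0.05)$power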

Construct validity

The construct validity describes the degree to which the used variables are accurately measured by appropriate instruments.

A possible threat to the construct validity might be related to the instruments for measuring the time variable. Namely, the participants might have forgotten to fill in the time slots appropriately, i.e. right before they start analysing a given component and right after they finish it. To minimize that threat we placed a note before the text related to each component reminding the participants to record the time appropriately.

For future replication studies in a browser-based environment it might be useful to set up a script that monitors the website being viewed and automatically collects the information per student. Another option would be to use IDE tracking tools, e.g. an Eclipse plugin.

The true/false questions might not seem to be a good choice for measuring the understandability, since the participants could get the right answer 50% of the time by guessing. However, the maximal likelihood that any number of participants from the given range (1–49) answer 2 or more questions (2, 3, or 4) correctly by guessing is only around 14%. Therefore the likelihood of obtaining a substantially higher score by guessing alone is very small (Ebel and Frisbie 1991).

The component level metrics are calculated automatically with the help of the tool ObjectAid UML Explorer (Footnote 9). The dependencies between the source code classes are visualized in the tool, and based on those visualizations the corresponding graph abstractions used for the graph based metrics calculations are manually generated. The hierarchical understandability metrics are directly calculated from the provided visualizations. The accuracy of the graph based metrics calculations is additionally tested on the examples provided in the work by Allen (2002). In addition, all metrics are independently calculated by two architects who also created the architecture of the studied system. Therefore, the threat that the metrics calculations are not valid is highly reduced.

Internal validity

The internal validity relates to the degree to which conclusions can be drawn about the cause-effect relationships between the independent variables and the dependent variables. The following threats are considered:

  • Participants’ competences and experiences. The participants’ competence might influence the study results. In our case all participants have knowledge about software development and software architecture, as well as about software traceability. Most of them have at least medium experience in programming. Regarding the participants’ experiences, we considered the years of experience (see Table 5). Some other potential variables related to the participants’ demographic information may affect the obtained results to a certain extent. For example, in addition to the considered variables we examined possible differences in our results when adding the participants’ final grades in the course and whether the participants successfully passed two other courses that might be relevant for the studied problem: Software Engineering, and Information Systems and Technologies. After considering these variables, the accuracy measure for the best prediction model that considers the experience variables changed only slightly. This result does not affect any of our hypotheses and considerations, and therefore we decided not to report it in detail. In our future work we plan to include experts who have many years of professional experience and to test whether different prediction models can be obtained.

  • Fatigue effects. The total time limit for the whole study was 1.5 hours, so fatigue was not very relevant. Also, the randomization of the tasks helped to cancel out these effects.

  • Question design. The fact that we used more complex questions in the case of larger components might have caused additional difficulties in answering them, because in practice people have a limit to the number of things they can keep in mind at a time. However, please note that each smaller question within the bigger one can be studied separately.

External validity

The external validity is related to the degree to which the results of the study can be generalized to the broader population. The greater the external validity, the more the results of an empirical study can be generalized to actual software engineering practice. We dealt with the following facts:

  • Components and their metrics.

    With respect to the time limitation of our study, we intentionally selected components that vary in size and in the other studied metrics to the extent possible, in order to cover different metrics values and to make our results more generalizable. There is a very low threat to the statistical validity of our results (see Section 5). The obtained prediction models are validated to prevent over-fitting of the data (see Section 4.2), i.e. to enable a reasonably well-fitting prediction in the case of new data. However, to examine more fine-grained distributions of the components’ metrics, and especially components whose metrics’ values significantly differ from the studied ones (i.e. are significantly bigger), more components need to be examined. In the case of bigger components it would be interesting to see to what extent the obtained prediction models would be affected. In that case the participants would require much more time to analyse the components. Furthermore, an architectural representation would probably require a hierarchical organization of the components, i.e. components having sub-components at different abstraction levels (starting from a set of high-level components that model high-level functionalities and ending in a set of low-level components that combine to perform the high-level functionalities). This representation complies, for instance, with the guides for software architecture definition in the series of guides for software engineering produced by the Board for Software Standardisation and Control (BSSC) of the European Space Agency (Mazza et al. 1996). Having in mind the above discussion, we are aware that our results (the obtained prediction models) might vary to a certain extent for new data. Accordingly, our tool support (see Section 6 for more details) is designed to treat the predicted components’ understandability values as relative rather than absolute, i.e. a component’s value is interpreted in comparison to the understandability of the other components in the system and is used to identify critical components that require more effort to be understood.

  • Studied system and its representations.

    Regarding the studied system, we chose a system that is written in Java (which the participants are familiar with), that has industrial relevance, and whose application domain is relatively well known to the participants (see more details in Section 3.4.2). The architecture of the system is represented in the form of a UML component diagram that the participants are also familiar with (see Section 5). Having these facts in mind, our results might be more or less different for other systems depending on the extent to which the assumptions related to the chosen system are violated. For example, the results might differ for a system written in some other language that the participants are not familiar with, or for domain-specific systems that the participants are totally unfamiliar with, etc. Also, architectural descriptions of software systems using component models could be created in different ways, ranging from simple descriptions of the system like box-and-line diagrams (Rozanski and Woods 2005), over semi-formal models (e.g. UML models) (Björkander and Kobryn 2003; Medvidovic et al. 2002; Robbins et al. 1998), to formal models in architecture description languages (ADLs) (Medvidovic and Taylor 2000) or domain-specific languages for architecture description (Völter 2010). More studies are necessary to examine how different architectural representations affect the understandability of components with respect to their concrete implementation.

  • Varying class sizes within components.

    As already mentioned above, in order to generalize our results we plan to increase the number of studied components. Besides that, we consider one more threat in this context: the size of the classes in a component. In the general case there might be some classes that are much bigger than other classes in the system. In that case the number of classes in a component will not appropriately capture the component size (in our case, as mentioned in Section 3.4.2, no big deviations in the sizes of the classes exist). However, that case might also be considered inappropriate design, i.e. big classes can be divided into smaller classes that consist of one or a set of closely related functionalities. Nevertheless, this observation can be further examined in order to see how deviations in the size of classes affect the obtained results.

  • Subjects.

    It has been shown in previous research that software engineering students may provide an adequate model for the professional population (Weber et al. 2014). Even though our participants have substantial experience, including industrial backgrounds, certain changes in the obtained results might be expected with experts. Studies with experts would enable us to conduct a more robust analysis.

To summarize the cases in which our findings, i.e. predictions, would work appropriately, taking into account the given threats to validity, we can say the following:

  • The studied system needs to be object-oriented and its application domain relatively known to the participants.

  • The architectural components need to have up to 15-20 classes that do not show big deviations in their size (e.g. one very big class and several very small ones).

  • The participants need to have at least a couple of years of appropriate programming experience, as well as basic knowledge in the software architecture and software engineering fields, so that they can easily understand the code of the system together with its architecture.

In other cases, the obtained results can vary from ours to a lesser or greater extent.

6 Tool support

6.1 Background

In our previous work (the position paper Stevanetic et al. 2014), we presented an integration of a semi-automated DSL-based abstraction of architectural component models and understandability related software metrics. An overview of our approach is shown in Fig. 6. The black part refers to the semi-automated DSL-based architectural abstraction, while the red part marked with dashed lines refers to the understandability metrics.

Fig. 6 Integration of the understandability related metrics in the DSL-based architecture abstraction approach (reused from Stevanetic et al. 2014)

Regarding the black part, we defined a DSL that enables architectural abstractions from class models, which can be automatically extracted from the source code, into architectural component models. First, a class model is extracted from the system’s source code. Starting from the class model, a UML component model is generated using the architectural abstraction specification defined in the DSL code. In this way the traceability information that links the class models and the component models can be preserved. Furthermore, the approach supports consistency checks that are based on the automatically generated traceability information linking the DSL, the class model, and the component model of the system. For instance, the checks detect source code classes that are not covered by the architecture abstraction specification, as well as connectors that are defined in the architecture specification but have no corresponding relation between the source code classes. This enables having an “up-to-date” component model that reflects the source code (i.e. all source code classes are mapped to their respective components). The given approach also supports the software architect throughout the evolution of a software system by allowing him/her to compare different component models (see the bottom of the figure) and to maintain them in correspondence with the source code over time.

The red part marked with dashed lines describes the integration of the understandability metrics for the generated component models. For example, they provide an indicator of whether a component model is growing too large, or other similar guidelines. First, the metrics are calculated from both the class model and the component model. The obtained metrics values are then evaluated with regard to different metrics constraints. Metrics constraints represent a set of rules defined on metrics values that need to be satisfied. In our case they are defined based on our empirical evaluations and also take into account some additional considerations (see Stevanetic et al. 2014). In case some metrics values do not satisfy the corresponding constraints, the architectural abstraction DSL or the source code can be improved in order to resolve the inconsistencies that occurred.
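To make the idea of metrics constraints concrete, the following R sketch checks a small table of per-component metric values against threshold rules. The component names, metric names, and thresholds are hypothetical; the constraints actually used by the tool are the ones derived from our empirical evaluations (Stevanetic et al. 2014).

# Conceptual sketch of evaluating metrics constraints on a component model.
metrics <- data.frame(
  component = c("Core", "Interpreter", "Parser"),
  msc       = c(12, 27, 9),          # number of classes (hypothetical values)
  coupling  = c(5, 14, 3)            # hypothetical coupling metric
)
constraints <- list(
  msc      = function(v) v <= 20,    # components should not grow too large
  coupling = function(v) v <= 10
)
violations <- sapply(names(constraints),
                     function(m) !constraints[[m]](metrics[[m]]))
rownames(violations) <- metrics$component
violations            # TRUE marks a metric value violating its constraint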

In this paper, we investigate how our empirical findings can be combined with existing empirical evaluations and how we can provide corresponding tool support. In that context, we draw on the work by Bouwers et al. (2011), who studied the analyzability of component models. Taking into account the finding of Bouwers et al. that the components should be balanced in size in order to facilitate the system’s analyzability, we have argued that balanced values for the components’ understandability effort can facilitate the analyzability of the whole system. In contrast to our previous position paper (Stevanetic et al. 2014), where the idea of balanced understandability of components is only mentioned, in this article we further elaborate on concrete calculations of the components’ analyzability based on the integration of our new understandability effort prediction models (i.e. the ones that use the hierarchical understandability metrics, see Section 4.2) and the metrics provided by Bouwers et al.

6.2 Architecture analyzability metric

In this section, we briefly explain our new analyzability metric for component models that is based on the integration of our understandability related prediction models into the analyzability metric defined by Bouwers et al. (2011). Furthermore, we elaborate on calculations of how much each of the rules used to specify architectural abstractions contributes to the overall understandability of components.

Namely, Bouwers et al. (2011) defined a metric for quantifying the analyzability of software architectures. The metric is called Component Balance (CB) and is defined as the product of two metrics: System Breakdown (SB), which measures whether a system is decomposed into a reasonable number of components, and Component Size Uniformity (CSU), which measures whether the components are all reasonably sized. The SB metric is based on the number of components in the system and is driven by the logic that both a high and a low number of components hinder analyzability. For example, having only one component is bad since the structure of the code does not provide any hints as to where functionality is implemented. On the other hand, many small components do not provide a software engineer with sufficient clues as to which component should be chosen for inspection.

The CSU metric captures how uniformly the volume of the system is distributed over its components and it is based on the Gini coefficient (see Bouwers et al. 2011). To provide maximal discriminative power to a software engineer, a system should be decomposed into a limited number of components of roughly the same size.

Now, we can explain our idea of combining the metric given above with our empirical findings. Namely, one of the main drawbacks of the given metric is that it captures the system’s structural decomposition using only the size of a component. Dependencies between components are not taken into account, which matters because size is not enough to fully determine the phenomena of understanding (see Section 4.2). To improve the situation, we propose to use our “understandability metric” based on the obtained prediction models instead of a simple size metric. In that case both the internal structure of a component and its dependencies to other components are captured (see Section 3.4.3).

Therefore, instead of using the size metric for the CSU metric, we can use the metric obtained from our prediction models as follows:

$$\begin{array}{@{}rcl@{}} Understandability(c)&=&1.1517 \times MSC(c)-0.7889 \times NAC(c)+ 0.5555 \times DMC(c) \\ &&+\,2.5473 \times CRW(c)-0.1371 \times DMH(c) \\ CSU(C)&=&1-Gini(\{Understandability(c) : c \in C\}) \end{array} $$

We picked the second prediction model from Table 15, but any of the models can be used since they have almost the same explanatory power (accuracy). Using the adapted CSU metric, we can now calculate the adapted CB metric as a new analyzability metric.
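The adapted CSU can be computed directly from the per-component metric values. The sketch below uses the coefficients of the equation above together with made-up metric values for four components; it computes the Gini coefficient explicitly rather than relying on an external package.

# Sketch of the adapted CSU calculation for a set of components with the five
# hierarchical metrics per component (the metric values below are illustrative).
comps <- data.frame(
  MSC = c(12, 27, 9, 15), NAC = c(3, 6, 2, 4), DMC = c(2, 5, 1, 3),
  CRW = c(0.4, 0.9, 0.3, 0.5), DMH = c(4, 9, 3, 5)
)
understandability <- with(comps,
  1.1517 * MSC - 0.7889 * NAC + 0.5555 * DMC + 2.5473 * CRW - 0.1371 * DMH)

# Gini coefficient of a set of non-negative values
gini <- function(x) {
  x <- sort(x); n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}
csu <- 1 - gini(understandability)   # adapted CSU; CB = SB * CSU as in Bouwers et al.
csu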

6.3 Integration of the metrics in the tool

In this section, we explain how our concepts are embodied in the tool using a concrete example. In particular, we demonstrate how to create component models with a reasonable analyzability level by incrementally improving an initial component model of the system. In addition, we show how the tool can be used to detect the changes between different component models that lead to their different analyzability levels. For the calculations given below we used prediction model 1 from Table 15 (any other model provided in Table 15 can be used).

The calculations of all required metrics are added in the Metrics Calculation part of our tool (see Fig. 6). The Metrics Calculation part is developed using the custom validation features for Xtext based projects (for more information please refer to http://eclipse.org/Xtext/documentation/). In particular, the DSL specification for the component model abstraction is written in a file. The file can then be explicitly validated via a menu option, in which case the validator class that performs the metrics calculations is called. The calculated metrics are written to a file which is then processed using an R script. The final output is a barplot of the CSU metric together with the hierarchical metrics used in the prediction models for all components in the system. Additionally, the CB and SB metrics are calculated for the whole system. Furthermore, our tool supports calculations of how much each of the architectural rules, used to specify a DSL-based architectural abstraction specification, contributes to the understandability of a given component.

To demonstrate our tool we use the example of the Frag system. Frag (Footnote 10) is a dynamic programming language implemented in Java, specifically designed for being a tailorable language, building Domain-Specific Languages (DSLs), supporting Model-driven Software Development, and being easily embeddable in Java. To generate the initial component model of the system, we studied the source code of the system and the corresponding class model that is automatically generated from the system’s source code. To ease this task, we imported the source code into an Eclipse IDE. The understanding of the system was facilitated by the fact that one of the authors of the paper is also the author of Frag. After the initial examination, we created the initial architecture abstraction specification of the system, consisting of 6 components (Core, Interpreter, Command Objects, Parser, Exceptions, and MDSD). The given 6 components were found to represent the major functionalities and/or concerns in the system. The DSL specification of the initial component model is shown in Fig. 7. Figure 8 shows the distribution of the calculated metrics values for the initial component model. The CB metric value for the model is 0.32.

Fig. 7 Initial component model – DSL specification

Fig. 8 Initial component model – metrics

From the DSL specification (developed using Xtext2), we can see that different architectural rules are used to write the architecture abstraction. For example, rules operating on source code artefacts relate different source code artefacts (packages, classes, and interfaces) to an architectural component (e.g. the Package rule shown in Fig. 7, which selects everything inside a specific package), while rules utilizing relationships between source code artefacts relate an architectural component to the source code artefacts that have specific relationships to a given source code artefact, like sub- and super-type relations for classes and interfaces, interface realizations, and other dependencies (e.g. the Uses rule shown in Fig. 7). Complex rule definitions are supported through the implementation of the three set operations union (or), intersect (and), and difference (and not), which are all used in Fig. 7. Detailed information about all rules and how they are derived can be found in Haitzer and Zdun (2014).

Here, we briefly explain how the metrics for each architectural rule in the DSL specification are calculated. In particular, each rule specifies a set of source code elements to be added to a given component. Hence, for each rule we can calculate the above mentioned set of metrics as we do for the system’s main components. Let us explain how exactly these calculations are done for Component Interpreter in the example from Fig. 7. The rules are evaluated in the following way: first, the highest level rule is evaluated. In our example, the highest level rule is the or composition rule, which actually specifies all classes in the component. Next, the left part of the previously analysed composition rule is evaluated, which is here the Class rule. Then the right part of the or rule is analysed. Since it consists of further composition rules, the next highest level rule is picked. In the given example it is the and not composition rule, which specifies all classes contained in both the root.frag.core.Interp and root.frag packages without the class root.frag.core.Dual. Next, the left part of the and not rule is evaluated, which corresponds to the and composition rule. Then the left and right parts of the and rule are examined, which correspond to the Uses rule and the Package rule, respectively. Finally, the right part of the and not rule is evaluated, which corresponds to the Class rule.
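Conceptually, this evaluation is a recursive walk over the rule tree in which leaf rules yield class sets and composition rules combine the sets of their children. The tool performs this inside the Xtext validator; the following is only a language-agnostic sketch of the idea in R, with hypothetical class names that do not correspond to the actual Frag classes.

# Conceptual sketch of recursive rule evaluation over sets of classes.
evaluate_rule <- function(rule) {
  switch(rule$op,
    "leaf"    = rule$classes,                                        # Class/Package/Uses rule
    "or"      = union(evaluate_rule(rule$left), evaluate_rule(rule$right)),
    "and"     = intersect(evaluate_rule(rule$left), evaluate_rule(rule$right)),
    "and not" = setdiff(evaluate_rule(rule$left), evaluate_rule(rule$right)))
}

# hypothetical rule tree: Class(...) or ((Uses and Package) and not Class(Dual))
interp <- list(op = "or",
  left  = list(op = "leaf", classes = c("frag.core.Interp")),
  right = list(op = "and not",
    left  = list(op = "and",
      left  = list(op = "leaf", classes = c("frag.core.Interp", "frag.core.Dual", "frag.A")),
      right = list(op = "leaf", classes = c("frag.core.Dual", "frag.A", "frag.B"))),
    right = list(op = "leaf", classes = c("frag.core.Dual"))))
evaluate_rule(interp)   # classes assigned to the component; metrics can be computed per rule node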

From Fig. 8 we can see that Component Interpreter has a significantly higher CSU metric than the other components, mainly because of the high number of classes that it contains (see the MSC metric in the figure). Namely, we want Component Interpreter to include all classes whose name ends with “Interp” or that are tightly coupled to the class Interp from the core package (see the DSL specification of Component Interpreter in Fig. 7). However, by examining how much each architectural rule in Component Interpreter affects the CSU metric value for the whole component, we find that the Uses and Package rules, and therefore the and rule that connects them, have high metric values because of the high number of classes that those rules produce (see Fig. 10). Consequently, a lot of classes are assigned to Component Interpreter that should not belong there. To improve the situation, we change the DSL specification for Component Interpreter so that the classes considered tightly coupled to the class Interp now include only those that both use the given class and are used by it. Furthermore, the Package rule now limits the search for tightly coupled classes to the core package. The metrics for the new component model are shown in Fig. 9. The CB metric value for the new component model is 0.37 (Fig. 10).

Fig. 9 Changed component model 1 – metrics

Fig. 10 The impact of each architectural rule on understandability – Component Interpreter from Fig. 7

By looking at Fig. 9 we can see that now Component Parser has a significantly higher CSU metric value than the other components. The high CSU metric value for Component Parser is mainly affected by the relatively high number of classes that this component contains (see the MSC metric values in Fig. 9) compared to the other components. Therefore, dividing it into several smaller components would probably improve the situation, i.e. increase the overall CB metric value for the system. From the point of view of the SB metric value, which decreases if the number of components is greater or smaller than 8 (see Section 6.2), dividing the Parser component into 2 or 3 smaller components would increase its value, since the current number of components in the system is 6.

By examining the Parser component we have found that it would make sense to divide the component into two components, ParserRules and ParsedObjects. Namely, the parser used in Frag uses a lexical parsing approach based on the composition of rule definitions that are similar to EBNF. A rule is a description of the situation in which the rule matches (a matcher) plus an action that is taken when the rule applies. The result is a tokenized list of parsed elements. Since the concept of rules is important, it makes sense to create a separate component for handling the parsing rules (Component ParserRules). The second new component, ParsedObjects, relates to the list of the parsed elements that roughly corresponds to the Abstract Syntax Tree (AST) in other parsing approaches. Having a separate component for the AST of the parsed code makes sense because the AST structure can contain additional information that needs to be managed, i.e. information related to subsequent processing, e.g. contextual analysis. The calculated metrics for the new component model are shown in Fig. 11. The CB metric for the new component model is 0.55, which is an increase of 0.18 compared to the previous component model.

The distribution of the metrics values for the new components can be further examined in order to make additional possible improvements. For example, we can examine whether it would make sense to divide Component CommandObjects, which has the highest CSU metric value in the newly generated component model. By examining the corresponding classes of the CommandObjects component we have found that it can be further divided into two components, FileCommands and NonFileCommands, which correspond to the commands for handling files and the other, non-file related commands. The calculated metrics for the generated component model that encompasses all mentioned changes are shown in Fig. 12. The CB metric for the component model that consists of all mentioned changes is 0.64, which is an increase of 0.09 compared to the component model it is adapted from.

Fig. 11 Changed component model 2 – metrics

Fig. 12 Changed component model 3 – metrics

The example given above shows how we can gradually improve the analyzability of the initial component model by making changes in the DSL and judge the analyzability of the component model created with the DSL using the given metrics calculations. It would also be possible to make source code changes and observe their effect on the analyzability of the generated component model.

Let us now explain how our tool can support finding the changes that lead to different analyzability levels of two different component models. To appropriately capture the different changes in the system that can affect the given metrics values, our tool supports finding changes at three different levels in the system: source code changes, component model changes, and DSL changes. Regarding the source code changes, by comparing different source code versions of the system an architect or developer can find which source code elements are added to the system, deleted from it, or changed. By comparing the component models, a user can find the differences in the realized classes contained in each of the components, i.e. whether some classes are added to a given component or removed from it (which cannot be seen using the source code comparisons). Furthermore, by comparing the component models it is possible to find new components in the system or those that are deleted from it. The DSL comparisons show the differences in the architectural rules used to specify which system parts are assigned to a component. By combining the given three kinds of comparisons, a user can precisely determine the differences between the compared system versions as well as in each of the compared components. To realize the mentioned kinds of comparisons we used and extended the Eclipse IDE’s features for comparing different resources. Furthermore, to enable finding the component model changes we created corresponding serialized representations of the examined component models that contain the fully qualified names of all source code artefacts related to each component.
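The component model comparison essentially reduces to set differences over the serialized component-to-class mappings. The sketch below mimics that logic in R; the mapping format and all component and class names are hypothetical, not the tool’s actual serialization.

# Sketch of a component-model comparison based on serialized class lists,
# assuming each model is a named list mapping component names to class names.
model_v07 <- list(Core = c("frag.core.Interp", "frag.core.Dual"),
                  MDSD = c("frag.mdsd.Gen"))
model_v08 <- list(Core = c("frag.fmf.Interp", "frag.fmf.Dual"),
                  DSL  = c("frag.dsl.Def"))

added_components   <- setdiff(names(model_v08), names(model_v07))
deleted_components <- setdiff(names(model_v07), names(model_v08))
shared <- intersect(names(model_v07), names(model_v08))
changes <- lapply(setNames(shared, shared), function(c) list(
  added   = setdiff(model_v08[[c]], model_v07[[c]]),   # classes added to the component
  deleted = setdiff(model_v07[[c]], model_v08[[c]])))  # classes removed from it
changes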

We compare two component models of the Frag system, one that corresponds to version 0.7 of the system and another that corresponds to version 0.8. The metrics values for the first component model are shown in Fig. 12, while those for the second component model are shown in Fig. 13.

Fig. 13 Component model of the Frag version 0.8 – metrics

To find which changes in the system caused the different metrics levels of the given component models we compared their DSLs, their source code, and the classes they contain. From the DSL comparison we can see whether there is a difference in the architectural rules used to specify the given views, whether new components are added, or whether some of them are deleted. By comparing the two DSLs we have found that Component MDSD from the first view (version 0.7) is replaced with 3 new components, DSL, FCL, and FMF, in the second view (version 0.8). Otherwise, no changes in the DSL of the other components have been found. By comparing the components’ contained classes we have found that the package core in the first view is renamed to the package fmf in the second view (Footnote 11). Using the same comparison we have found which classes are newly added to the components and which classes are deleted from them. Finally, by comparing the source code folders we can find which classes have changed source code. In our example we have found that all classes have been changed to a certain extent. Figure 14 shows the visualization of the given comparison views (DSL, classes, and source code comparisons) based on the Eclipse features for resource comparisons. Table 19 provides an overview of all found changes.

Fig. 14 Change impact analysis comparisons

Table 19 Overview of all found changes

By comparing the corresponding metrics values for the given two component models, we can say that, except for the newly introduced components, they show very small differences. This is not in accordance with the number of changes that we found (see Table 19). However, after examining the given changes we have found that the only real changes in terms of adding, deleting or changing some functionality in the system or its parts are related to the added or deleted classes in the components (columns “Added Classes” and “Deleted Classes” in Table 19). The changes in the other classes are mostly syntactic changes or code refactoring related changes that do not affect the classes’ external behaviour.

The integrated metrics benefit from the architecture abstraction tool in the sense that the latter provides an “up-to-date” architectural component model that reflects the source code (i.e. all source code classes are mapped to their respective components), which is necessary for the metrics calculations. This way, as we demonstrated, architects or developers can gradually improve the architecture by making changes in the source code or in the architecture abstraction DSL and judge the analyzability of the architecture created with the DSL. To perform such improvements the architect or developer can partially benefit from the given metrics calculations provided for each component. For example, as we demonstrated, components that have a large number of classes (i.e. a high MSC metric value) can be broken into several new components. Similarly, components with high coupling can be modified by rearranging their classes with other classes to which they have a strong coupling, or by refactoring the classes’ source code to reduce their coupling to the classes in other components. Such modification steps, of course, require manual effort and human expertise.

The metrics calculations for each component as a whole provide useful information on what its understandability level is and what it is affected by. The metrics calculations for each architectural rule related to a given component help an architect or developer to grasp how much the different source code artefacts that constitute a given component contribute to its understandability, which rules contribute the most to limited understandability, etc. (as demonstrated for Component Interpreter above). This can help when performing changes in the DSL of a component, since an architect can assess in which direction and approximately by how much the understandability of a component will change if some rules are changed.

7 Conclusions and future work

In this article, we provide an extended description of the analysis and results obtained in our previous work (Stevanetic and Zdun 2016) consisting of a more detailed description of the studied metrics, applied statistical techniques, and obtained findings. In addition, we present a new metric for measuring the analyzability of component models based on the integration of our empirical findings and the existing observations related to them, i.e. concretely the existing work on the analyzability related software metrics proposed by Bouwers et al. (2011). Furthermore, we present significant tool extensions compared to our previous work (Stevanetic et al. 2014) including the realization of the new analyzability metric by integrating our previous tools for supporting software evolution using a DSL-based architecture abstraction with the obtained empirical findings. Our tool extensions enable the calculations of how much each of the architectural rules used to specify a DSL-based architectural abstraction specification contributes to the understandability of components and also enable change impact analysis, i.e. the identification of changes in the system that affect different analyzability levels of the component models.

Regarding our empirical findings, we studied the understandability of architectural components using a number of component level metrics including the package based metrics defined by Martin (2003), information theory graph based metrics, and the corresponding counting-based graph based metrics defined by Allen (2002) and Allen et al. (2007), and hierarchical understandability metrics introduced in the work by Hwa et al. (2009), as well as the personal factors of participants like experience and expertise, and the combinations of both personal factors and component level metrics. The understandability of component models is measured through the time that the participants spent on understanding the components, and then predicted using the above given component level metrics and participants’ experiences. On the one hand, the prediction models that consider the hierarchical understandability metrics are significantly better in predicting the understandability effort than the models obtained using other component level metrics or the models that include the participants’ experiences. On the other hand, those models are not significantly different in the prediction from the models that combine both the component level metrics and the participants’ experiences. This means that from all studied models it is enough to consider the hierarchical understandability metrics for the prediction. This result is from our point of view intuitive, as those metrics are originally designed to assess the understandability of the modular design of a system. The participants’ experience is also important and can predict a significant amount of variance in the data but the obtained models are not as accurate as the models that use the component level metrics, i.e., the metrics related to the system itself.

The obtained empirical findings are integrated into the tool that supports the synchronized evolution of the architecture and the source code of the system. While the DSL-based architecture abstraction approach enables users to keep source code and architecture consistent, the given metrics extensions enable them, while working with the DSL or source code, to continuously judge and improve the analyzability of the architectural component model they create with the DSL. To further support users in performing adequate changes in the DSL or source code and understanding their impacts on the understandability of a given component, we calculate the given metrics for each architectural rule used to define the DSL-based specification of that component. In that way users can grasp how much the different source code artefacts that constitute a given component affect its understandability. Besides improving the analyzability of component models, our approach also supports change impact analysis, i.e. finding which changes in the system’s source code or the DSL-based architectural specification correspond to the changes in the observed metrics values. The applicability of our approach is shown using a case study of an existing open source system.

From the academic point of view, we believe that our study can serve as a good starting point for future studies on the understandability of architectural components and component models, but also of other kinds of software models. The used instruments and applied statistical techniques provide insight into how understandability can be appropriately measured and predicted, which can help in devising new empirical studies and experiments. From the practitioner’s point of view, the results of our study show which factors and metrics are important for assessing the understandability of architectural components in relation to the system implementation and to what extent those metrics can predict the understandability. The understandability effort (time) for new architectural components can be assessed based on the complexity of their implementation. Absolute values for the measured understandability effort of new components are considered accurate only for systems similar to the studied one. In other cases the assessment can vary to a lesser or greater extent. Our tool support facilitates the application of the obtained empirical findings in practice. It is designed to treat the predicted components’ understandability values as relative rather than absolute, i.e. a component’s value is interpreted in comparison to the understandability of the other components in the system and is used to identify critical components that require more effort to be understood. In that way, it can be more or less successfully applied to systems and components whose size and complexity differ from the studied ones.

In our future work, we plan to include experts with many years of experience and compare the results with the ones obtained here. We also plan to examine more components, including bigger ones, that would enable us to construct more robust prediction models. However, tackling these challenges is not an easy task since it requires a lot of resources in terms of time and money.