1 Introduction

Software architecture conformance is a key software quality control activity that aims to reveal the progressive gap normally observed between concrete and planned software architectures (Passos et al. 2010; Knodel and Popescu 2007; Ducasse and Pollet 2009; Brunet et al. 2014). More specifically, the activity aims to expose statements, expressions or declarations in the source code that do not match the constraints imposed by the planned architecture. The ultimate goal is to prevent the accumulation of incorrect implementation decisions and therefore to avoid the phenomenon known as architectural drift or erosion (Perry and Wolf 1992).

There are at least two main techniques for architecture conformance: reflexion models and domain-specific languages. Reflexion models compare a high-level model, manually created by the architect, with a concrete model, extracted automatically from the source code (Murphy et al. 1995). As a result, reflexion models can reveal two kinds of architectural anomalies: absences (relations prescribed by the high-level model that are not present in the concrete model) and divergences (relations not prescribed by the high-level model, but that are present in the concrete model). Alternatively, domain-specific languages focused on architecture conformance provide means to express, in a customized syntax, the constraints defined by the planned architecture (Terra and Valente 2009; Eichberg et al. 2008; Mens et al. 2006). However, applying current architecture conformance techniques requires considerable effort. For example, reflexion models may require successive refinements in the high-level models to reveal the whole spectrum of architectural violations (Koschke 2010; Koschke and Simon 2003), and domain-specific languages may require the extensive definition of constraints.

In a previous paper, we presented an approach that combines static and historical source code analysis to provide an alternative technique for architecture conformance (Maffort et al. 2013). The proposed approach includes four heuristics to discover suspicious dependencies in the source code, i.e., dependencies that may denote divergences or absences. The common assumption behind the proposed heuristics is that dependencies denoting architectural violations—at least in systems that are not facing a massive erosion process—are rare events in the space-time domain, i.e., they appear in a small number of classes (according to particular thresholds) and they are frequently removed during the evolution of the system (according to other thresholds). In this paper, we extend our previous work by proposing an iterative architecture conformance process based on the defined heuristics. By following this process, architects can experiment with and adjust the thresholds required by the heuristics, starting with rigid values. Basically, as the thresholds are made less rigid, more false warnings are generated. Therefore, the architect can finish the conformance activity when enough violations are detected or when the heuristics start to produce too many false positives. We also propose a strategy to rank the generated warnings, which is used to show first the warnings that are more likely to denote real violations.

We evaluated our work in three systems. First, we applied the proposed conformance process to two industrial-strength information systems. We were able to detect 389 and 150 architectural violations, with an overall precision of 62.7 % and 53.8 %, respectively. We also present and discuss examples of architectural violations detected by our approach and the architectural constraints associated with such violations, according to the systems’ architects. Finally, we relied on the proposed conformance process to evaluate the architecture of a well-known open-source system (Lucene). In this case, using as oracle a reflexion model independently proposed by another researcher, we found 264 architectural violations, with an overall precision of 59.2 %.

The remainder of this paper is organized as follows. In Section 2, we introduce the proposed approach for architecture conformance and the heuristics for detecting absences and divergences. Section 3 describes the architecture of the prototype tool that supports our approach. Section 4 describes an iterative conformance process based on the proposed heuristics. Sections 5 and 6 describe the usage of this process to evaluate the architecture of two proprietary information systems and an open-source information retrieval library (Lucene). Section 7 discusses the lessons learned with our work. Section 8 presents related work and Section 9 concludes the paper. There are also three appendices, presenting a formal definition of the proposed heuristics (Appendix A), the detailed results of the evaluation of one of the information systems considered in the paper (Appendix B), and the results achieved for Lucene (Appendix C).

2 Heuristics for Detecting Architectural Violations

Figure 1 illustrates the input and output of the proposed heuristics for detecting architectural violations. Basically, the heuristics rely on two types of input information on the target system: (a) its history of versions; and (b) a high-level component specification. We consider that the classes of a system are statically organized in modules (or packages, in Java terms), and that modules are logically grouped in coarse-grained structures, called components. The component model includes a mapping from modules to components, using regular expressions (complete examples are provided in Sections 5.1 and 5.3). Given the component model, the proposed heuristics automatically identify suspicious dependencies (or the lack thereof) in the source code by relying on frequency hypotheses and on past corrections made to these dependencies. In practice, the heuristics consider all static dependencies between classes, including dependencies due to method calls, variable declarations, inheritance, exceptions, etc.

Fig. 1 Input and output of the proposed heuristics

We make no attempt to automatically infer the high-level components because it is usually straightforward for architects to provide this representation. When architects are not available (e.g., in the case of open-source systems), a high-level decomposition in major subsystems is often included in developers’ documentation or can be retrieved by inspecting the package structure. In fact, as described in Section 6, we applied our approach to an open-source system (Lucene). In this case, we reused high-level models independently defined by other researchers using information available in Lucene’s documentation.

In the following sections, we motivate and describe the heuristics to detect absences (Section 2.1) and divergences (Section 2.2). We also propose a strategy to rank the warnings produced by the heuristics according to their relevance (Section 2.3). A complete formal specification of the heuristics is presented in Appendix A.

2.1 Heuristic for Detecting Absences

An absence is a violation due to a dependency defined by the planned architecture, but that does not exist in the source code (Murphy et al. 1995; Passos et al. 2010). For example, suppose an architectural rule that requires classes located in a View component to extend a class called ViewFrame. In this case, an absence is counted for each class in View that does not follow this rule.
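For illustration, consider a minimal Java sketch of such a rule; all class names here are hypothetical and are not taken from the evaluated systems. A class in View that extends ViewFrame conforms to the architecture, while a class that does not generates one absence warning.

```java
// Hypothetical base class prescribed by the planned architecture.
class ViewFrame { }

// Conforming class: extends ViewFrame, as required for classes in View.
class StudentView extends ViewFrame { }

// Absence: this View class does not extend ViewFrame, so one absence
// is counted for it.
class ReportView { }
```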

To detect absences, we initially search for dependencies denoting minorities at the level of components. We assume that absences are an exceptional property of classes and therefore minorities have more chances of representing architectural violations. Moreover, we rely on the history of versions to mine for dependencies dep introduced in classes originally created without dep. The underlying assumption is that absences are usually detected and fixed. The goal is to reinforce the evidence collected in the previous step by checking whether classes originally created with the architectural violation under analysis (i.e., absence of dep) were later refactored to include the missing dependency.

Figure 2 illustrates this heuristic for detecting absences. As can be observed, class C2 has an absence regarding TargetClass because: (a) C2 is the unique class in component cp that does not depend on TargetClass; and (b) a typical evolution pattern among the classes in cp is to introduce a dependency with TargetClass, when it does not exist, as observed in the history of classes C1, C4, and C5.

Fig. 2 Example of absence (C2 does not depend on TargetClass). The label Ins denotes a dependency inserted later in the class

Additionally, we consider specific types of dependencies. For example, the planned architecture might prescribe that a given BaseClass must depend on a TargetClass by means of inheritance, i.e., BaseClass must be a subclass of TargetClass. Table 1 reports the types of dependency considered by the heuristic.

Table 1 Dependency types, assuming that C1 depends on C2

Definition

The proposed heuristic for detecting absences relies on two definitions:

  • Dependency Scattering Rate—denoted by DepScaRate(c,t,cp)—is the ratio between (i) the number of classes in component cp that have a dependency of type t with a target class c and (ii) the total number of classes in component cp.

  • Dependency Insertion Rate—denoted by DepInsRate(c,t,cp)—is the ratio between (i) the number of classes in component cp originally created without a dependency of type t with a target class c, but that have this dependency in the last version of the system under analysis, and (ii) the total number of classes in component cp originally created without a dependency of type t with class c.

Using these definitions, the candidates for absences in component cp are defined as follows:

$$\begin{array}{@{}rcl@{}} \begin{array}{lll} Absences(cp) = \left\{(x,c,t)\ | \ \right. &&comp(x) = cp \wedge \neg depends(x,c,t,H) \wedge \\ &&DepScaRate(c,t,cp)\geq A_{sca}\ \wedge\\ && \left. DepInsRate(c,t,cp)\geq A_{ins}\right\} \end{array} \end{array} $$

According to this definition, an absence is a tuple (x,c,t) where x is a class located in component cp that, in the current version of the system in the version control repository (denoted by the symbol H), does not include a dependency of type t with the target class c, although most classes in component cp have this dependency. Moreover, several classes in component cp were initially created without this dependency, but have evolved to establish it. Parameters A_sca and A_ins define the thresholds for dependency scattering and insertion, respectively.
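The sketch below illustrates how these definitions can be computed, under simplifying assumptions: the dependency (c,t) is fixed, and the history of each class is summarized by two flags (whether it currently has the dependency and whether it was created without it). The code is illustrative and does not reproduce ArchLint's actual implementation, which relies on SQL queries over a relational database (see Section 3).

```java
import java.util.*;

// A minimal sketch of the absence heuristic, assuming the class and
// dependency model is reduced to per-class flags for a fixed (c, t).
public class AbsenceHeuristic {

    // Simplified view of a class in component cp: whether it currently has
    // the dependency and whether it was originally created without it.
    record ClassInfo(String name, boolean hasDep, boolean createdWithoutDep) {}

    // DepScaRate(c, t, cp): fraction of classes in cp with the dependency.
    static double depScaRate(List<ClassInfo> cp) {
        long with = cp.stream().filter(ClassInfo::hasDep).count();
        return (double) with / cp.size();
    }

    // DepInsRate(c, t, cp): among classes created without the dependency,
    // the fraction that has it in the last version of the system.
    static double depInsRate(List<ClassInfo> cp) {
        List<ClassInfo> createdWithout =
            cp.stream().filter(ClassInfo::createdWithoutDep).toList();
        if (createdWithout.isEmpty()) return 0.0;
        long inserted = createdWithout.stream().filter(ClassInfo::hasDep).count();
        return (double) inserted / createdWithout.size();
    }

    // Absences(cp): classes without the dependency, when both rates reach
    // the thresholds A_sca and A_ins.
    static List<String> absences(List<ClassInfo> cp, double aSca, double aIns) {
        List<String> result = new ArrayList<>();
        if (depScaRate(cp) >= aSca && depInsRate(cp) >= aIns) {
            for (ClassInfo k : cp)
                if (!k.hasDep()) result.add(k.name());
        }
        return result;
    }

    public static void main(String[] args) {
        // Component cp from Fig. 2: C2 lacks the dependency; C1, C4, and C5
        // were created without it and later changed to insert it.
        List<ClassInfo> cp = List.of(
            new ClassInfo("C1", true, true),
            new ClassInfo("C2", false, true),
            new ClassInfo("C3", true, false),
            new ClassInfo("C4", true, true),
            new ClassInfo("C5", true, true));
        System.out.println(absences(cp, 0.8, 0.7)); // prints [C2]
    }
}
```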

2.2 Heuristics for Detecting Divergences

A divergence is a violation due to a dependency that is not allowed by the planned architecture, but that exists in the source code (Murphy et al. 1995; Passos et al. 2010). Our approach includes three heuristics for detecting divergences, as described next.

2.2.1 Heuristic #1

This heuristic targets a common pattern of divergences: the use of frameworks and APIs by unauthorized components (Terra and Valente 2009; Sarkar et al. 2009). For example, enterprise software architectures commonly define that object-relational mapping frameworks must only be accessed by components in the persistence layer (Fowler 2002). Therefore, this constraint authorizes the use of an external framework, but only by well-defined components.

The heuristic initially establishes that the search for divergences must be restricted to dependencies present in a small number of the classes of a given component (according to a given threshold, as described next). However, although this is a necessary condition for divergences, it is not enough to characterize this violation. For this reason, the heuristic includes two extra conditions: (i) the dependency must have been removed several times from the high-level component under analysis (i.e., along the component’s evolution, the system was changed to fix the violation, but the violation was introduced again, possibly by another developer in another package or class that is part of the component); and (ii) the heuristic also searches for components where the dependency under analysis is extensively found (i.e., components that act as “heavy-users” of the target module). The assumption is that it is common to have modules that—according to the intended architecture—are only accessed by classes in well-delimited components.

Figure 3 illustrates the proposed heuristic. In this figure, class C2 presents a divergence regarding TargetModule because: (a) C2 is the only class in component cp1 that depends on TargetModule; (b) many classes in cp1 (such as C1, C4, and C5) had in the past established and then removed a dependency with TargetModule; and (c) most dependencies to TargetModule come from another component cp2 (i.e., cp2 is a “heavy-user” of TargetModule).

Fig. 3 Example of divergence (C2 depends on TargetModule). The label Del denotes a dependency removed in a previous version of the class

Definition

This heuristic relies on two definitions:

  • Dependency Deletion Rate of a component cp regarding a target module m—denoted by DepDelRate(m,cp)—is the ratio between (i) the number of classes in component cp that established a dependency in the past with classes in module m, but no longer have this dependency, and (ii) the total number of classes in component cp that have a dependency with any class in module m. As described before, a module is a set of classes (e.g., a package, in the case of Java systems).

  • HeavyUser(m) is a function that returns the component whose classes mostly depend on classes located in module m.

The candidates for divergences in a component cp are defined as follows:

$$\begin{array}{@{}rcl@{}} Div_{1}(cp)= \{\ (x,c) \mid &&comp(x) = cp \wedge mod(c) = m \wedge depends(x,c,\_,H) \wedge\\ &&DepScaRate(m,cp)\leq D_{sca} \wedge\\ &&DepDelRate(m,cp)\geq D_{del} \wedge\\ &&HeavyUser(m) \neq cp \ \} \end{array} $$

According to this definition, a divergence is a pair (x,c), where x is a class located in component cp that depends on a target class c located in a module m, when most classes in component cp do not have this dependency (i.e., the scattering rate is not greater than a maximum threshold D_sca). Moreover, the definition requires that several classes in the component under evaluation must have removed their dependencies with m in the past, as defined by a threshold D_del. Finally, there must be another component with a heavy-user behavior with respect to module m.
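A minimal sketch of this heuristic is shown below, under the same simplifications as before: the target module m is fixed, the per-class history is reduced to two flags, and HeavyUser is computed from a map of reference counts per component. All names and threshold values are illustrative.

```java
import java.util.*;

// A minimal sketch of heuristic #1 for divergences, assuming a fixed
// target module m and a summarized per-class history.
public class DivergenceHeuristic1 {

    record ClassHistory(String name, boolean dependedOnM, boolean stillDependsOnM) {}

    // DepDelRate(m, cp): among classes that ever depended on m, the
    // fraction that no longer has the dependency.
    static double depDelRate(List<ClassHistory> cp) {
        List<ClassHistory> ever =
            cp.stream().filter(ClassHistory::dependedOnM).toList();
        if (ever.isEmpty()) return 0.0;
        long deleted = ever.stream().filter(h -> !h.stillDependsOnM()).count();
        return (double) deleted / ever.size();
    }

    // HeavyUser(m): the component with most references to module m.
    static String heavyUser(Map<String, Integer> refsToM) {
        return Collections.max(refsToM.entrySet(),
            Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        // Component cp1 from Fig. 3: only C2 still depends on TargetModule;
        // C1, C4, and C5 removed the dependency in the past.
        List<ClassHistory> cp1 = List.of(
            new ClassHistory("C1", true, false),
            new ClassHistory("C2", true, true),
            new ClassHistory("C3", false, false),
            new ClassHistory("C4", true, false),
            new ClassHistory("C5", true, false));
        Map<String, Integer> refsToM = Map.of("cp1", 1, "cp2", 40);

        double scaRate = 1.0 / cp1.size();      // only C2 depends on m
        boolean divergence = scaRate <= 0.25    // D_sca (illustrative)
            && depDelRate(cp1) >= 0.6           // D_del (illustrative)
            && !heavyUser(refsToM).equals("cp1");
        System.out.println("C2 flagged: " + divergence); // true
    }
}
```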

2.2.2 Heuristic #2

Similarly to the previous case, this second heuristic restricts the analysis to dependencies established by few classes of a component and that were removed in the past (in other classes of the component). However, this heuristic has two important differences from the first one: (a) it is based on dependencies to a specific target class (instead of an entire module); and (b) it does not require the existence of a heavy-user for the dependency under analysis.

Figure 4 illustrates the proposed heuristic. In this figure, class C2 has a divergence regarding TargetClass because: (a) C2 is the only class in component cp that depends on TargetClass; and (b) a common evolution pattern among the classes in cp is to remove dependencies to TargetClass, as observed in the history of classes C1, C4, and C5.

Fig. 4 Example of divergence (C2 depends on TargetClass). The label Del denotes a dependency removed in a previous version of the class

This heuristic aims to detect two possible sources of divergences: (a) the use of frameworks that are not authorized by the planned architecture (e.g., a system that occasionally relies on SQL statements instead of using the object-relational mapping framework prescribed by the architecture) (Terra and Valente 2009); and (b) the use of incorrect abstractions provided by an authorized framework (e.g., a system that occasionally relies on inheritance instead of annotations when accessing a framework that provides both forms of reuse, although the architecture authorizes only the latter).

Definition

This heuristic relies on the Dependency Deletion Rate, as defined for the previous heuristic. However, it counts deletions regarding a target class c and a dependency type t—and not an entire module m. Thus, the heuristic is formalized as follows:

$$\begin{array}{@{}rcl@{}} Div_{2}(cp)=\{\ (x,c,t) \mid &&comp(x)=cp \wedge depends(x,c,t,H) \wedge\\ &&DepScaRate(c,t,cp)\leq D_{sca} \wedge\\ &&DepDelRate(c,t,cp)\geq D_{del}\ \} \end{array} $$

According to this definition, a divergence is a tuple (x,c,t), where x is a class located in component cp that has a dependency of type t with a target class c, when most classes in component cp do not have this dependency (as defined by the threshold D_sca). Moreover, the definition requires that several classes in the component under evaluation must have removed the dependencies (c,t) in the past, as defined by a threshold D_del.
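As an illustration of the kind of violation this heuristic targets, the hypothetical class below bypasses the prescribed object-relational mapping framework and issues a raw SQL query. If only a few classes in its component establish a dependency with java.sql.Statement, and similar dependencies were frequently removed in the past, heuristic #2 would flag it. The class name and query are illustrative.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical divergence: a service class relying on raw SQL instead of
// the object-relational mapping framework prescribed by the architecture.
class StudentReportService {
    ResultSet listStudents(Connection conn) throws Exception {
        Statement stmt = conn.createStatement();
        // Direct SQL access from the service layer: a dependency of type
        // MethodInvocation on java.sql.Statement that heuristic #2 can flag.
        return stmt.executeQuery("SELECT * FROM students");
    }
}
```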

2.2.3 Heuristic #3

This heuristic is based on the assumption that a common type of divergence is the creation of asymmetrical cycles between components. More specifically, as illustrated in Fig. 5, this heuristic aims to identify pairs of components cp1 and cp2 where most references are from cp2 to cp1, but there are also a few references in the reverse direction. The assumption is that the components were originally designed to communicate unidirectionally, and the dependencies in the “wrong” direction are highly likely to represent architectural violations (rather than exceptions authorized by the architecture, e.g., for performance reasons). This heuristic is particularly useful to detect back-call violations, a typical violation in layered architectures that occurs when a lower layer relies on services implemented by upper layers (Sarkar et al. 2009).

Fig. 5 Divergences due to asymmetrical cycles

Definition

To evaluate the third heuristic for divergences, we assume that rf(cp1,cp2) denotes the number of references from classes in component cp1 to classes in component cp2. We also define the Dependency Direction Weight between components cp1 and cp2 as follows:

$$\begin{array}{@{}rcl@{}} \mathit{DepDirWeight}(cp_{1},cp_{2}) = \frac{rf(cp_{1},cp_{2})}{rf(cp_{1},cp_{2}) + rf(cp_{2},cp_{1})} \end{array} $$

Using this definition, the heuristic is formalized as follows:

$$\begin{array}{@{}rcl@{}} Div_{3}(cp_{1})= & \lbrace \ (x,c)\ | & comp(x)=cp_{1} \wedge comp(c)=cp_{2} \wedge cp_{1} \neq cp_{2} \wedge\\ &&depends(x,c,\!\_,H) \wedge\\ &&D_{dir} \leq DepDirWeight(cp_{1},cp_{2}) < 0.5 \ \rbrace \end{array} $$

Basically, divergences are pairs of classes (x,c) where x is a class in component cp1 (i.e., the component under analysis) that has a dependency with a class c in component cp2, and the dependencies from cp1 to cp2 satisfy the following conditions: (a) they are not exceptions, since they occur in a proportion greater than the minimal threshold D_dir; (b) but they are also not dominant, since there are more dependencies in the reverse direction, as specified by a Dependency Direction Weight lower than 0.5.
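The sketch below computes the Dependency Direction Weight and the divergence condition for the asymmetrical cycle reported later in Section 5.2.4 (75 references from PersistenceLayer to ServiceLayer and 320 in the reverse direction); the threshold value is illustrative.

```java
// A minimal sketch of heuristic #3, using the reference counts of the
// asymmetrical cycle discussed in Section 5.2.4.
public class DivergenceHeuristic3 {

    // DepDirWeight(cp1, cp2): fraction of the references between the two
    // components that flow from cp1 to cp2.
    static double depDirWeight(int rfCp1ToCp2, int rfCp2ToCp1) {
        return (double) rfCp1ToCp2 / (rfCp1ToCp2 + rfCp2ToCp1);
    }

    public static void main(String[] args) {
        double w = depDirWeight(75, 320); // 75/(75 + 320) = 0.189...
        double dDir = 0.15;               // illustrative threshold
        // Flag the 75 references as divergences: they are not rare enough
        // to be authorized exceptions, but they are also not dominant.
        System.out.println(dDir <= w && w < 0.5); // true
    }
}
```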

2.3 Ranking Strategy

The proposed heuristics generate warnings for architectural absences and divergences. However, by their nature, they are subject to false positives. For this reason, it is important to report the warnings sorted by their potential to denote true violations. As usual in the case of heuristic-based results, the first presented warnings should ideally denote real violations, to increase the confidence of the architects in the heuristics.

To rank the warnings generated by our approach, the natural strategy is to rely on the scattering and change (insertion or deletion) rates of the dependencies that characterize an absence or divergence. For example, in the case of absences, we should first present the dependencies that are observed frequently in a component (i.e., have a very high Dependency Scattering Rate) and that are also introduced frequently (i.e., have a very high Dependency Insertion Rate). More specifically, the rank score of a given warning denoting an absence (x,c,t)—where x is a class that is missing a dependency of type t with a target class c—is defined as:

$$\begin{array}{@{}rcl@{}} {\mathit ScoreAbsence}(x,c,t)= \frac{\mathit{DepScaRate}(c,t,cp) + \mathit{DepInsRate}(c,t,cp)}{2} \end{array} $$

where cp = comp(x). This score represents the arithmetic mean of the scattering rate and the insertion rate of the dependency that characterizes the absence. The warnings denoting absences are presented according to their respective scores, the ones with the highest values first.

Additionally, the ranking scores of the warnings detected by heuristics #1 and #2 for divergences are defined as follows, respectively:

$$\begin{array}{@{}rcl@{}} {\mathit ScoreDiv_{1}}(x,m)= \frac{(1 - \mathit{DepScaRate}(m,cp)) + \mathit{DepDelRate}(m,cp)}{2} \end{array} $$
$$\begin{array}{@{}rcl@{}} {\mathit ScoreDiv_{2}}(x,c,t)= \frac{(1 - \mathit{DepScaRate}(c,t,cp)) + \mathit{DepDelRate}(c,t,cp)}{2} \end{array} $$

where cp = comp(x). In the first score, the pair (x,m) is used to express that a class x is incorrectly establishing a dependency with a class in module m. Analogously, in the second score, the tuple (x,c,t) is used to express that a class x is incorrectly establishing a dependency of type t with a target class c. In both cases, we assume that high-ranked divergences should have a low scattering rate and a high deletion rate.

Finally, divergences detected by heuristic #3 are ranked according to the Dependency Direction Weight between the components in a cycle, as follows:

$$\begin{array}{@{}rcl@{}} {\mathit ScoreDiv_{3}}(\mathit{cp}_1,\mathit{cp}_2)= \mathit{DepDirWeight}(\mathit{cp}_1,\mathit{cp}_2) \end{array} $$

where the divergences in this case denote a dependency between classes in components cp1 and cp2, representing the “wrong” direction of the interaction between these components. For example, consider two cycles, where the first cycle has 18 % and the second one has 15 % of its dependencies in the wrong direction. In this case, the dependencies responsible for the “wrong” interaction of the second cycle should be ranked before the dependencies of the first one.
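A minimal sketch of the ranking strategy for absence warnings is shown below; the warning data is illustrative. Each warning receives the arithmetic mean of its scattering and insertion rates, and the list is sorted in descending order of score, so the warnings most likely to denote real violations are presented first.

```java
import java.util.*;

// A minimal sketch of the ranking strategy for absence warnings.
public class WarningRanking {

    record AbsenceWarning(String cls, String target,
                          double depScaRate, double depInsRate) {
        // ScoreAbsence: mean of the scattering and insertion rates.
        double score() { return (depScaRate + depInsRate) / 2; }
    }

    public static void main(String[] args) {
        List<AbsenceWarning> warnings = new ArrayList<>(List.of(
            new AbsenceWarning("C2", "TargetClass", 0.99, 0.80),
            new AbsenceWarning("C7", "OtherClass", 0.88, 0.52)));

        // Highest scores first.
        warnings.sort(
            Comparator.comparingDouble(AbsenceWarning::score).reversed());
        warnings.forEach(w -> System.out.printf("%s -> %s (score %.3f)%n",
            w.cls(), w.target(), w.score()));
    }
}
```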

3 Tool Support

We implemented a prototype tool, called ArchLint, that supports the four heuristics for detecting architectural violations. As presented in Fig. 6, ArchLint’s implementation follows a pipeline architectural pattern with three main components:

  • The Code Extractor module is responsible for extracting the source code of all versions of the system under evaluation. Currently, our prototype provides access to svn repositories.

  • The Dependency Extractor module is responsible for creating a model describing the dependencies available in each version considered in the evaluation. Essentially, this model is a directed graph whose nodes are classes and whose edges are dependencies. To extract the dependencies from the source code, we rely on VerveineJ, a Java parser that exports dependency relations in the format for modeling static information assumed by the Moose platform for software analysis (Nierstrasz et al. 2005; Ducasse et al. 2011). Nevertheless, we modified VerveineJ to store this information in a relational database, to facilitate queries over the collected data.

  • The Architectural Violations Detector module implements the heuristics described in Section 2. Basically, the heuristics are implemented as SQL queries. Additionally, this module ranks the warnings of architectural violations—as described in Section 2.3—and reports them to the architect of the system under analysis.

Fig. 6 ArchLint architecture
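For illustration, the sketch below shows one possible in-memory representation of the dependency model extracted by the pipeline: a directed graph whose nodes are classes and whose edges are typed dependencies, indexed by version. This representation is an assumption made for presentation purposes; as described above, ArchLint actually stores the model in a relational database.

```java
import java.util.*;

// A minimal sketch of the per-version dependency model: a directed graph
// whose edges are typed dependencies (the types of Table 1).
public class DependencyModel {

    enum DepType { METHOD_CALL, VARIABLE_DECLARATION, INHERITANCE, EXCEPTION }

    record Edge(String from, String to, DepType type, int version) {}

    private final List<Edge> edges = new ArrayList<>();

    void addDependency(String from, String to, DepType type, int version) {
        edges.add(new Edge(from, to, type, version));
    }

    // depends(x, c, t, v): does class x depend on class c with type t in
    // version v?
    boolean depends(String x, String c, DepType t, int v) {
        return edges.stream().anyMatch(e -> e.from().equals(x)
            && e.to().equals(c) && e.type() == t && e.version() == v);
    }

    public static void main(String[] args) {
        DependencyModel model = new DependencyModel();
        model.addDependency("br.sga.dao.StudentDAO", "org.hibernate.Query",
            DepType.METHOD_CALL, 7692);
        System.out.println(model.depends("br.sga.dao.StudentDAO",
            "org.hibernate.Query", DepType.METHOD_CALL, 7692)); // true
    }
}
```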

4 A Heuristic-Based Architecture Conformance Process

In this section, we describe a process for architecture conformance, based on the proposed heuristics, as implemented by the ArchLint tool. Basically, this process addresses two central challenges regarding the practical use of our heuristics:

  • The heuristics rely on thresholds to classify a dependency as a rare event in space (scattering thresholds) and in time (insertion and deletion thresholds). Therefore, the thresholds must be defined before using a tool like ArchLint. Moreover, based on our initial experiments with the proposed heuristics (Maffort et al. 2013), we found that it is not possible to rely on universal thresholds that could be reused for any system. This is especially the case for the insertion and deletion thresholds, since they depend on how often architectural violations are detected and fixed, which varies from system to system.

  • By their own nature, the proposed heuristics may lead to false positive warnings. For this reason, it is important to avoid the generation of a massive number of warnings, possibly with many false positives. Moreover, when presenting the architectural warnings to developers or architects, it is important to present the true warnings before the false ones, following the ranking strategies defined in Section 2.3.

To tackle the aforementioned challenges, we advocate that an architecture conformance process based on the proposed heuristics should follow an iterative approach. More specifically, we propose that a tool like ArchLint must be executed several times, starting with rigid thresholds. After each execution, the new warnings, i.e., the warnings not raised in previous iterations, should be evaluated by the architect, in order to check whether they really denote true architectural violations. As a practical consequence of this evaluation step, the architect can, for example, request a change in the system to fix the detected violations. The architect may also decide to perform another iteration of the conformance process, with more flexible thresholds. The process stops when a relevant number of violations is detected, e.g., a number of violations that is feasible and worthwhile for the maintenance team to fix in a given time frame. Alternatively, the architect may decide to finish the conformance process when most of the warnings raised in an iteration are false positives—and hence it is not worth experimenting with new thresholds.

Figure 7 defines the key steps of the proposed iterative conformance process. Basically, the process consists of a main loop where a given heuristic is applied (Step 2) and the old warnings, i.e., the warnings already detected in a previous iteration, are discarded (Step 3). After that, if very few warnings remain as the result of the iteration (Step 4), a new iteration is automatically started with more flexible thresholds (Step 5). The rationale is that it is better to trigger a new execution immediately than to evaluate a few warnings that will be raised anyway by the next iteration. However, if enough warnings are produced, they are first ranked—as described in Section 2.3—and then presented to the architect for analysis and classification as true or false warnings (Steps 6 and 7). After that, if the architect evaluates that it is worthwhile to continue searching for new warnings, considering the current workload of the maintenance team and the precision achieved by the current iteration, the thresholds are adjusted (Step 5) and a new iteration is started.

Fig. 7 Architecture conformance using the proposed heuristics

It is worth noting that the proposed conformance process is not a fully automatic procedure, as is usual in architecture conformance. Particularly, the final word on when the process should stop depends on the architect’s judgment, based on an evaluation of whether it is relevant to fix the already detected violations and whether lower precision rates can be tolerated. Moreover, the process depends on a constant that defines the minimal number of warnings worth evaluating in a given iteration (constant MIN_RESULTS).

Finally, the process depends on the initial threshold values used by each heuristic and on a procedure to adjust such thresholds before a new iteration, in order to make them less rigid. Figure 8 presents the proposed initial threshold values and the threshold adjustment procedure for each heuristic. Basically, the initial values represent very rigid thresholds. For example, for absences, we recommend starting with a scattering rate of 95 % and an insertion rate of 95 %. Regarding the adjustment procedure, the insertion threshold is initially decremented in steps of 5 %, starting at 95 % and finishing at 35 %. When this lower bound is reached, the scattering rate is decremented by 5 % and the insertion rate is reset to 95 %.

Fig. 8 Initial threshold values and threshold adjustment procedures for each heuristic

5 First Study: Proprietary Systems

To start evaluating our approach, we conducted a first study using two real-world information systems: SGA and M2M (we omit the real names for confidentiality reasons). Our central goal is to perform experiments with the iterative conformance process described in Section 4. Accordingly, we report the number of iterations required by the process, the precision achieved after each iteration, and the effectiveness of the strategy proposed to rank the warnings raised by a given heuristic.

This first study is organized as follows. Sections 5.1 and 5.2 present the methodology and the results of using the conformance process on the SGA system. Likewise, Sections 5.3 and 5.4 present the methodology and the results on the M2M system. Last, Section 5.5 enumerates threats to validity.

5.1 Methodology for the SGA System

We followed the architecture conformance process defined in Section 4 to detect violations in the architecture of an EJB-based information system used by a major Brazilian university, which for confidentiality reasons we will just call SGA. The system includes functionalities for human resource management, finance and accounting management, and material management, among others. In this system, we considered 7,692 revisions (all available revisions), stored in a svn repository, from March, 2009 to June, 2013. After parsing these revisions, ArchLint—our prototype tool—generated a dependency model with more than 147 million relations, requiring 68 GB of storage in a relational database. The generation of this database took 72 hours and 27 minutes (on a six-core Intel Xeon 2.20 GHz server, with 64 GB RAM, running Ubuntu 12.04 and Java 1.7). Of this total time, 11 hours and 57 minutes were used by VerveineJ to parse the extracted versions.

All extracted versions were considered for computing the functions DepInsRate and DepDelRate, described in Sections 2.1 and 2.2. The last considered revision has 1,864 classes and interfaces, organized in 100 packages, comprising around 273 KLOC.

We initially asked SGA’s senior architect to define the system’s high-level component model. After a brief explanation on the purpose and characteristics of this model, the architect suggested the following components:

  • ManagedBean: bridge between user interface and business-related components.

  • IService: facade for the service layer.

  • ServiceLayer: core business process automated by the system.

  • IPersistence: facade for the persistence layer.

  • PersistenceLayer: implementation of persistence.

  • BusinessEntity: domain types (e.g., Professor, Student, etc.).

Table 2 shows the number of packages and classes in the high-level components defined by the SGA’s architect. As can be observed, the proposed components are coarse-grained structures, ranging from components with 15 packages and 286 classes (ManagedBean) to components with 17 packages and 330 classes (BusinessEntity). The table also shows the regular expressions proposed by the architect to define the packages in each component. We can observe that most expressions are simple, usually selecting packages with common names or prefixes.

Table 2 High-level components in the SGA system
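For illustration, a mapping like the one in Table 2 can be represented with java.util.regex patterns, as sketched below. The regular expressions shown are hypothetical, since the actual SGA package names are confidential.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// A minimal sketch of the module-to-component mapping via regular
// expressions; the expressions below are illustrative.
public class ComponentMapper {

    private final Map<String, Pattern> components = new LinkedHashMap<>();

    void define(String component, String packageRegex) {
        components.put(component, Pattern.compile(packageRegex));
    }

    // Returns the component of a package (the comp function), or null.
    String componentOf(String packageName) {
        for (Map.Entry<String, Pattern> e : components.entrySet())
            if (e.getValue().matcher(packageName).matches())
                return e.getKey();
        return null;
    }

    public static void main(String[] args) {
        ComponentMapper mapper = new ComponentMapper();
        mapper.define("PersistenceLayer", "br\\.sga\\..*\\.dao(\\..*)?");
        mapper.define("ManagedBean", "br\\.sga\\..*\\.bean(\\..*)?");
        System.out.println(mapper.componentOf("br.sga.core.dao"));
        // prints PersistenceLayer
    }
}
```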

Using as input the regular expressions specifying the high-level SGA components, we executed ArchLint multiple times, as prescribed by the conformance process described in Section 4. Particularly, for each heuristic, we considered the initial thresholds and the threshold adjustment procedure suggested in Fig. 8. Moreover, SGA’s architect was only requested to evaluate the warnings generated by iterations that produced at least 10 new warnings (constant MIN_RESULTS). When this happened, we asked the architect to carefully examine the new warnings and to classify them as true or false positives. Since the architect has complete command of SGA’s architecture and implementation, he is the right expert to play the role of an oracle in our evaluation. We did not measure recall because that would require finding the whole set of missing or undesirable dependencies, which in practice demands a detailed and complete inspection of the source code—certainly a hard task considering the size of the SGA system.

To evaluate the strategy used to rank the warnings generated by a given iteration, we relied on a discounted cumulative gain function, often used to evaluate web search engines and other information retrieval systems (Baeza-Yates and Ribeiro-Neto 2011). This function progressively reduces the value of a document—a warning, in our case—as its position in the rank increases. Basically, the value of a warning is divided by the log of its rank position, as follows:

$$\mathit{DCG} = \mathit{rel}_{1} + \sum\limits_{i=2}^{p}\frac{\mathit{rel}_{i}}{\log_{2}(i)} $$

where p is the number of warnings generated by the heuristic and rel_i is the relevance of the warning at position i. In our particular case, this relevance is a binary value: true positive warnings have a relevance of 1; false positive warnings have a relevance of 0.

More specifically, we report the effectiveness of the ranking strategy using a normalized discounted cumulative gain (nDCG) function, as follows:

$$\mathit{nDCG} = \frac{\mathit{DCG}}{\mathit{IDCG}} $$

where IDCG is the best possible value for the DCG function, i.e., the value generated by a perfect ranking strategy, considering a given list of warnings. Therefore, nDCG values range from 0.0 to 1.0, where 1.0 is the value produced by a perfect ranking algorithm.
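The sketch below computes DCG and nDCG for a ranked list of warnings with binary relevance; the example ranking is illustrative.

```java
import java.util.Arrays;

// A minimal sketch of the DCG/nDCG computation used to evaluate the
// ranking strategy, with binary relevance values (1 for true positives,
// 0 for false positives).
public class RankingEvaluation {

    // DCG = rel_1 + sum_{i=2..p} rel_i / log2(i)
    static double dcg(int[] rel) {
        double sum = rel[0];
        for (int i = 2; i <= rel.length; i++)
            sum += rel[i - 1] / (Math.log(i) / Math.log(2));
        return sum;
    }

    // nDCG = DCG / IDCG, where IDCG is the DCG of a perfect ranking,
    // i.e., all true positives presented first.
    static double ndcg(int[] rel) {
        int[] ideal = rel.clone();
        Arrays.sort(ideal);                          // ascending order
        for (int i = 0; i < ideal.length / 2; i++) { // reverse to descending
            int tmp = ideal[i];
            ideal[i] = ideal[ideal.length - 1 - i];
            ideal[ideal.length - 1 - i] = tmp;
        }
        return dcg(rel) / dcg(ideal);
    }

    public static void main(String[] args) {
        int[] rel = {1, 0, 1, 1, 0};          // warnings in ranked order
        System.out.printf("nDCG = %.3f%n", ndcg(rel)); // nDCG = 0.810
    }
}
```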

5.2 Results for the SGA System

This section presents the results achieved after following the proposed conformance process to detect absences (Section 5.2.1) and divergences (Sections 5.2.2, 5.2.3, and 5.2.4) in the architecture of the SGA system. Additionally, Section 5.2.5 summarizes the precision achieved by our approach for divergences. Next, Section 5.2.6 compares our results with reflexion models (RM). Finally, Section 5.2.7 evaluates how the proposed heuristics perform in different stages of the evolution of the SGA system.

5.2.1 Results for Absences

Table 3 presents the results achieved by each iteration of the conformance process, when it was used to provide warnings for absences. For each iteration, the table presents the following data: (a) the thresholds required by the heuristic for detecting absences; (b) the number of warnings produced in the iteration, including the number of new warnings and the number of warnings evaluated by the architect, if any; (c) the precision achieved by the current iteration and the overall precision up to this execution, i.e., considering the warnings evaluated in the current iteration and also in previous iterations. Precision is defined as usual, by dividing the number of true warnings by the total number of warnings. For the sake of clarity, we do not show data on thresholds that did not produce warnings or that produced exactly the same warnings as previous iterations. For example, the first execution was performed with A_sca = 0.95 and A_ins = 0.95. These thresholds did not generate warnings and therefore are not presented in Table 3. The same happened with the next two tested threshold combinations, i.e., (0.95; 0.90) and (0.95; 0.85). The first combination to generate warnings was (0.95; 0.80), which generated three (new) warnings. However, since we configured the process to only require the architect’s evaluation when a minimum of ten new warnings is generated by an iteration, these initial warnings were not presented to the architect. In the second iteration, 26 warnings were produced in total. Of these warnings, 23 are new and three correspond exactly to the warnings generated by the first iteration. Therefore, the 26 warnings were shown to and discussed with the architect, for classification as true or false positives. In this case, a precision of 100 % was achieved.

Table 3 Detecting absences in the SGA system

As can be observed in Table 3, we decided to stop the process after seven iterations, including iterations #1 and #4, which did not generate enough warnings for evaluation. In the remaining five iterations, the architect evaluated 128 warnings, with an overall precision of 86.7 %. In Table 3, we can also observe a downward tendency in the precision after each iteration. For example, in iteration #2 we achieved a precision of 100 %, whereas in the last iteration the precision was 74.3 %. Finally, by evaluating the nDCG results, we can conclude that the criterion used to rank the warnings generated by a given iteration was quite effective. As in the case of precision, the nDCG values in Table 3 show a tendency to decrease after each iteration. For example, in the last iteration the ranking strategy achieved 90 % of the effectiveness of a perfect ranking algorithm.

We finished after seven iterations because the architect considered that the true warnings detected by such iterations should be first addressed by the maintenance team before continuing with the conformance process.

Example 1

As an example of a true warning (detected in iteration #1), we can mention the following:

Component: IService
Class: br.sga.doc.ejb.facade.DictionaryService
Missing Dependency: javax.ejb.Remote (ClassAnnotation)
DepScaRate; DepInsRate: 0.990; 0.800

In the SGA system, the architect explained that interfaces in the IService component must receive a Remote annotation, which is an EJB annotation used to mark a remote business interface of a session bean. In fact, 99 % of the interfaces in IService have this annotation (DepScaRate). Moreover, 80 % of the interfaces originally created without this annotation were later maintained to include the annotation (DepInsRate). The lack of this annotation does not have an impact on the behavior of the system in its current version, because the classes implementing the interfaces missing the annotation are used only by local clients. However, according to their specification, they should also support remote accesses.

Example 2

As an example of a false warning, we can mention the following (detected in iteration #7):

Component: BusinessEntity
Class: br.sga.core.domain.FederatedUnit
Missing Dependency: br.sga.core.domain.AuditInfo (Inheritance)
DepScaRate; DepInsRate: 0.885; 0.524

The SGA system has an internal audit service, used to log changes in classes storing highly sensitive data, such as personal information. The classes subjected to this service must inherit from a special class, called AuditInfo. Particularly, in the BusinessEntity component, 88.5 % of the classes use this service (DepScaRate). Moreover, more than half of the classes in BusinessEntity were changed after their initial creation to inherit from AuditInfo (DepInsRate), because the audit service was introduced later in the system. For this reason, the heuristic incorrectly inferred that all classes in BusinessEntity must inherit from AuditInfo. However, there are classes that by their own nature do not need this service, such as FederatedUnit, a class that stores information about the Brazilian states (i.e., data that rarely changes and therefore does not need an audit service, according to SGA’s architect).

5.2.2 Results for Divergences - Heuristic #1

Table 4 shows the results achieved after each iteration of the conformance process, when configured to provide warnings using the first heuristic for divergences. As can be observed, we performed five iterations, but only in the last two was the architect’s evaluation required. We asked the architect to evaluate 92 warnings, with a precision of 100 %. We finished the process because the architect considered this number of true divergences worth handling before continuing to search for new warnings.

Table 4 Detecting divergences in the SGA system using Heuristic #1

Example 3

As an example of a true warning (detected in iteration #2), we can mention the following:

Component: PersistenceLayer
Target Module: br.sga.ejb.facade
DepScaRate; DepDelRate: <0.015; 0.750
HeavyUser: ManagedBean (73.4 %)

In this case, a DAO class in the PersistenceLayer has a dependency with a class in the SGA’s facade, which is not allowed by the architecture. In fact, less than 1.5 % of the DAOs establish a dependency with IService classes (DepScaRate). Moreover, in the past, 75 % of the classes that established such a dependency in a given version were later changed to remove it (DepDelRate). Finally, package br.sga.ejb.facade has a well-defined heavy-user in the system, which is the ManagedBean component. In fact, 73.4 % of the dependencies to this package are established by classes located in ManagedBean. This combined evidence is responsible for this true divergence. In fact, the architect commented that this divergence represents a back-call, because a lower layer (PersistenceLayer) is using a service from an upper module (br.sga.ejb.facade).

5.2.3 Results for Divergences - Heuristic #2

Table 5 shows the results achieved by the second heuristic for divergences. In six out of nine iterations, the evaluation of the architect was required. In total, we asked the architect to evaluate 325 warnings, with an overall precision of 34.2 %, which corresponds to the lowest precision in the conformance process. We finished the process because the architect considered this precision too low, especially the precision of the last iteration, which was 20.3 %. In summary, after nine iterations, the architect considered the process no longer productive, as it demanded the evaluation of many false positives per true warning discovered.

Table 5 Detecting divergences in the SGA system using Heuristic #2

Despite the lower precision, by analyzing the nDCG values in Table 5, it is possible to observe that the strategy to rank the warnings generated by the iterations was partially effective. In the last five iterations, for example, we achieved an average precision of 40.7 % with the nDCG values ranging from 0.44 to 0.92, with an average value of 0.68. In other words, the lower precision was compensated by a tendency to present the true warnings in the top ranked results.

Example 4

As an example of a false warning (detected in iteration #1), we can mention the following:

Component: ManagedBean
Target Class: br.sga.ejb.facade.EducLevelFacade
DepScaRate; DepDelRate: 0.003; 0.888

This particular false warning is due to two facts. First, among the 286 classes in ManagedBean, only a single class references a particular class in the SGA’s facade, called br.sga.ejb.facade.EducLevelFacade (DepScaRate = 0.003). Second, in the past, a common refactoring in SGA was removing the dependencies to this class coming from ManagedBean. In fact, 88.8 % of the classes that once had this dependency were later changed to remove it (DepDelRate). Despite these two pieces of evidence, the warning in this case is false, according to the architect. He explained that EducLevelFacade is a specific class in the system, responsible for very specific scholar degrees. However, in the past this class was also responsible for regular scholar degrees, and at a certain point in the system’s evolution a design change was made towards creating a new class to represent such degrees. Despite that, EducLevelFacade remained in the system, but it is used only for very specific degrees. In summary, the changes responsible for the high Dependency Deletion Rate were motivated by a design decision not related to removing architectural violations.

5.2.4 Results for Divergences - Heuristic #3

Table 6 shows the results achieved by the third heuristic for divergences. In this case, as defined in Fig. 8, we started searching for cycles where 45 % of the dependencies are in one direction and 55 % are in the reverse one, i.e., D_dir = 0.45. We found no pair of components meeting this precondition. The same happened as we reduced D_dir down to 0.20. However, when we set D_dir = 0.15, 75 warnings were generated for the first time, and they were all classified as true positives. Finally, in the next three iterations, no new warning was produced.

Table 6 Detecting divergences in the SGA system using Heuristic #3

Example 5

By analyzing the results with SGA’s architect, we discovered that all 75 warnings are between the components PersistenceLayer and ServiceLayer. Specifically, there are 320 dependencies from ServiceLayer to PersistenceLayer and 75 (unauthorized) dependencies in the reverse direction, which represents a DepDirWeight equal to 0.189 (75/(320 + 75)). For this reason, the warnings were only produced when we tested a minimal threshold of 15 % to classify dependencies in the “wrong direction” as divergences. Moreover, exactly the same warnings were generated again when this threshold was reduced to zero.

5.2.5 Overall Results for Divergences

Table 7 presents the precision achieved by our approach for divergences, considering the warnings evaluated for the three heuristics. As can be observed, both heuristics #1 and #3 achieved a precision of 100 %, while heuristic #2 achieved a precision of 34.2 %. Considering the results of all heuristics, we generated 278 true divergences and 214 false warnings in nine iterations, with an overall precision of 56.5 %.

Table 7 Precision considering the warnings evaluated for the three heuristics for divergences

5.2.6 Comparison with Reflexion Models

This section compares our results with reflexion models (RM) (Murphy et al. 2001; Murphy et al. 1995), which is a well-known and lightweight approach for architecture conformance. To make this comparison, we calculated a reflexion model for the SGA system, reusing the high-level model used as input by our approach. As illustrated in Fig. 9, we had to enrich our initial model in two directions. First, we defined six extra components, to denote external components used by the SGA implementation, including frameworks for presentation (Java Server Faces), for communication (Servlets), and for persistence (Java Persistence API and SQL). Second, we included 25 relations (edges) between the defined components. On the other hand, when using our approach, external frameworks and relations between components are automatically inferred by the considered heuristics. Using the enriched high-level model, we calculated a reflexion model, i.e., a model that highlights divergences.

Fig. 9 Enriched high-level model for the SGA system

Figure 10a compares the results for divergences achieved by RM and by our approach. As mentioned in Section 5.2, the proposed heuristics detected 254 true and unique warnings in the SGA system. On the other hand, RM was able to detect 75 divergences. For example, RM missed 57 divergences between ManagedBean and JavaIO, two divergences between IService and EJB, and 26 divergences between BusinessEntity and JPA. In fact, ManagedBean has a dependency with JavaIO, but with the wrong class in this component. Specifically, an architectural rule states that ManagedBean can only establish dependencies with a single class in JavaIO, called IOException. Despite this, there are 57 dependencies with other JavaIO classes, such as BufferedReader and File. To detect these divergences, the high-level model used by the RM technique must be further refined, by creating two nested components in JavaIO, one component with only the IOException class and another one with File, FileReader, BufferedReader, FileOutputStream, and OutputStream. After this modification, we must update the dependency from ManagedBean to reach just the IOException subcomponent. In fact, this need to refine reflexion models motivated the extension of the original proposal with hierarchical modules (Koschke and Simon 2003).

Fig. 10 Absences and divergences detected by RM and the proposed heuristics

Figure 10b compares the results for absences achieved by RM and by our approach. As reported in Section 5.2, the proposed heuristics detected 111 true absences in the SGA system. On the other hand, RM missed all of them. To explain the reason for this massive failure in detecting absences, we will consider the components PersistenceLayer and JPA. As illustrated in Fig. 9, the high-level model prescribes that there must exist a dependency from PersistenceLayer to JPA. However, PersistenceLayer is a coarse-grained component—with 311 classes. For this reason, a single class that relies on JPA is sufficient to hide all eventual absences in the remaining classes of the component. Of course, it is possible to refine the high-level model by creating a nested component in PersistenceLayer with exactly the classes that must depend on JPA and to establish an edge between each of such classes and JPA. However, the proliferation of nested components increases complexity and contrasts with the lightweight profile normally associated with RM-based techniques.

Finally, it is important to state that RM is a precise technique, assuming the relations defined by the architect reflect the idealized architecture. Therefore, the technique does not generate false warnings. On the other hand, for the 278 true divergence warnings raised by the proposed heuristics, there were also 214 false warnings (precision equals 56.5 %).

5.2.7 Historical Analysis

In this section, we evaluate how the proposed heuristics perform in different stages of the evolution of the SGA system. More specifically, we re-executed the heuristics that depend on historical information, i.e., the heuristic for absences and heuristics #1 and #2 for divergences, but considering a limited number of versions. In each execution, we discarded the versions of the first, second, third, and fourth years, respectively. Moreover, we reused the same thresholds from the first iteration of the process followed by the SGA architect when validating the results using the complete dataset. For example, when computing the heuristic for absences, we considered A_sca = 0.95 and A_ins = 0.55, which are exactly the first thresholds evaluated by the architect in the original study (see Table 3). We then checked whether each violation detected using the complete dataset is also detected when the first n initial years are discarded (1 ≤ n ≤ 4).

Table 8 reports the true warnings detected in each time frame. Considering the complete dataset, the heuristic for absences detected 26 violations, and heuristics #1 and #2 for divergences detected 11 and 15 violations, respectively. When we discard the first-year versions, there is a major reduction in the number of absences (from 26 violations to three violations) and in the number of divergences detected by heuristic #2 (from 15 violations to two violations). On the other hand, the number of violations detected by heuristic #1 remains exactly the same as with the full dataset (11 violations).

Table 8 Historical analysis results

To explain these results, we first characterize the changes that have an impact on the proposed heuristics. The heuristic for absences monitors changes that insert a missing dependency in the target class, which we will refer to as Insert Missing Dependency changes. In the case of divergences, the heuristics monitor changes that remove an undesirable dependency from a target class, which we will refer to as Remove Undesirable Dependency changes. Figure 11 reports the distribution of these changes in our dataset, over four years. We can observe that both changes happened most of the time in the first year of SGA’s evolution. For example, 53 % of the Insert Missing Dependency changes were performed in the first year. Regarding the Remove Undesirable Dependency changes, 56 % (for the ones associated with heuristic #1) and 46 % (for the ones associated with heuristic #2) happened in the first year. Therefore, when we removed the commits collected in the first year, we also removed most of the changes responsible for triggering the warnings of architectural violations, as considered by the three heuristics that depend on historical data. In the case of the heuristic for absences and heuristic #2 for divergences, the changes performed in the remaining years were not sufficient to meet the respective thresholds (D_sca = 0.05 and D_del = 0.70), which are very rigid. On the other hand, in the case of heuristic #1 for divergences, they were still sufficient to trigger the same 11 violations as when using the full dataset. The central reason in this case is the fact that this heuristic is computed with more flexible thresholds (D_sca = 0.10 and D_del = 0.60). Finally, in all cases, after removing four years of revisions, we were not able to detect violations anymore.

Fig. 11 Distribution of the change operations by year (for each operation, the bars show the percentage of changes performed in each year considered in the SGA conformance process)

Clearly, it is not possible to generalize the results of this subsection to other systems. However, in the specific case of the SGA system, they show that most changes the proposed heuristics depend on happened in the first year of the system’s evolution. Therefore, we can extrapolate that in that year the development team was not completely aware of SGA’s planned architecture. For that reason, many violations were introduced but also fixed, as the architecture quickly became clearer to the initial team of developers. Finally, the results reported in this historical analysis reinforce the importance of the thresholds when computing the heuristics. For example, heuristic #1 for divergences was not deeply impacted by removing the commits of the first year because it is evaluated with more flexible thresholds.

5.3 Methodology for the M2M System

M2M is an ERP management system designed for use by Brazilian government institutions. The system manages the administrative process of acquisition and distribution of products and services. The system also documents the entire process workflow and includes other features such as integration with governmental systems, reports, etc.

We considered 61,785 revisions available in the system’s version control repository (all available revisions), from November, 2010 to October, 2013. The last considered revision has 4,999 classes and interfaces, organized in 485 packages, comprising 610 KLOC. After parsing all revisions, the dependency model generated by our approach has 271.5 million relations and requires 107 GB of storage in a relational database.

Similarly to SGA, we asked M2M’s architect to define the system’s high-level component model. Table 9 presents the components suggested by the architect and the regular expressions that define the classes in each component, as well as the respective number of classes. We can observe that the regular expressions in M2M map classes to components, and not packages to components as occurs in the SGA system. The main reason is that classes associated with different components may be located in the same package. As an example, classes from components PersistenceLayer and IPersistenceLayer are located in the same package, called br.m2m.arq.dao.contract. Furthermore, the size of the proposed components ranges from nine classes (component Security) to 1,143 classes (component BusinessEntity).

Table 9 High-level components in the M2M system

The regular expressions in Table 9 were used as input to the heuristics. Each heuristic was executed several times, and the architect was only requested to evaluate the warnings raised by iterations that produced at least 10 new warnings. In these cases, the architect carefully examined the warnings and classified them as true or false positives.
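The iterative process can be summarized as in the sketch below, assuming each heuristic exposes a run(thresholds) function returning a set of warnings; the names and the threshold schedule are illustrative rather than ArchLint’s actual API.

    # Sketch of the iterative conformance process (illustrative API).
    def iterative_conformance(heuristic, schedule, min_new=10):
        """Run a heuristic with increasingly flexible thresholds; only
        iterations producing at least `min_new` new warnings are handed
        to the architect for classification."""
        seen = set()
        for thresholds in schedule:              # from rigid to flexible
            warnings = heuristic.run(thresholds)
            new_warnings = warnings - seen
            seen |= warnings
            if len(new_warnings) >= min_new:
                yield thresholds, new_warnings   # architect classifies these

The architect (or a reflexion model, as in Section 6) then decides when the precision of the new warnings no longer justifies further relaxation of the thresholds.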

5.4 Results for the M2M System

Table 10 summarizes the precision achieved by the proposed heuristics in M2M. In short, we achieved an overall precision ranging from 18.5 % (heuristic #2 for divergences) to 82.1 % (the heuristic for absences). However, heuristic #1 did not indicate any divergence in M2M. Considering the mean precision of the iterations, we achieved results ranging from 41.7 % to 81.5 %.

Table 10 Precision considering the warnings raised in the M2M system

To detect these violations, we executed seven iterations in total, raising 279 warnings with an overall precision of 53.8 %. Appendix B presents a detailed description of the warnings detected by each heuristic.

During the evaluation, the architect confirmed that the detected violations are, in fact, violations of relevant architectural constraints in M2M, such as the following:

  • All classes in PersistenceLayer must depend on class org.hibernate.Query (35 absences detected).

  • Only classes in IPersistenceLayer must depend on class org.hibernate.Query (three divergences detected by heuristic #2).

  • Classes in ServiceLayer cannot depend on class java.net.UnknownHostException as a CaughtException (four divergences detected by heuristic #2).

  • Classes in BusinessEntity cannot depend on classes located in PersistenceLayer (four divergences detected by heuristic #3).

  • Only classes in WEBController can depend on classes located in WEBController (18 divergences detected by heuristic #3).

  • Classes in PersistenceLayer cannot depend on classes located in ServiceLayer (three divergences detected by heuristic #3).

Therefore, we argue that the proposed heuristics were able to detect violations of well-known architectural patterns and rules in the M2M system, without requiring their explicit formalization, as required by other architecture conformance approaches.

5.5 Threats to Validity

The threats to validity are the same for both systems. We relied on a single architect per system to design the initial model and to classify the warnings. Therefore, as with any human-made artifact, the model and the classification are subject to errors and imprecision. However, we interviewed senior architects, with complete mastery of SGA’s and M2M’s architecture and implementation. Furthermore, one can argue that these architects might have been influenced to design a model favoring ArchLint. However, we never explained to the architects the heuristics followed by ArchLint to discover architectural violations.

6 Second Study: an Open-Source System

In this study, we report the application of the proposed heuristics to an open-source system, Lucene.

6.1 Study Setup

In this system, our evaluation is fully based on a reflexion model (RM) independently proposed by Bittencourt (2012). We reused the component specifications from this high-level model (HLM) as the input for the proposed heuristics. Table 11 lists the components defined by Lucene’s HLM.

Table 11 High-level components in Lucene

Because the HLM was carefully designed for architecture conformance purposes, we considered the computed reflexion models a reliable oracle for evaluating the precision of the heuristics. More specifically, we classify a warning as a true positive when it is also reported by the reflexion model. In other words, in this second study, we replaced the architect with a reflexion model. Moreover, we decided by ourselves when to stop the iterative process followed for each heuristic; basically, we targeted around 100 warnings per heuristic, stopping when this value was reached.
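Under the assumption that both the heuristics’ warnings and the reflexion model’s divergences can be represented as sets of (source class, target class) pairs, this oracle-based precision reduces to a set intersection, as in the following sketch (names are illustrative):

    # Sketch: oracle-based precision, with warnings and reflexion-model
    # divergences represented as sets of (source_class, target_class) pairs.
    def precision_against_oracle(warnings, reflexion_divergences):
        true_positives = warnings & reflexion_divergences
        return len(true_positives) / len(warnings) if warnings else 0.0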

In the case of absences, the reflexion model did not indicate absences in Lucene because, in reflexion models, a single class in a component satisfying the prescribed architectural rule is sufficient to hide all absences in this component. For instance, the HLM prescribes that a dependency from Search to Index must exist. However, Search is a component with 351 classes, and therefore a single class from Search that relies on Index is sufficient to hide possible absences in the remaining classes of the component.
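This granularity difference can be stated precisely: reflexion models check the prescribed relation at the component level, whereas our heuristic inspects each class. The sketch below contrasts the two checks, with deps assumed to map each class to the set of targets it depends on (all names are illustrative):

    # Component-level check (reflexion model): an absence is reported only
    # if no class in the component provides the prescribed relation.
    def rm_reports_absence(component_classes, deps, required_target):
        return not any(required_target in deps[c] for c in component_classes)

    # Class-level check (our heuristic): every class missing the relation
    # is a candidate absence, later filtered by the heuristic's thresholds.
    def class_level_absences(component_classes, deps, required_target):
        return [c for c in component_classes if required_target not in deps[c]]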

To evaluate the heuristics, we checked out 1,959 revisions, from March 2010 to July 2012. The last revision considered in the study has 336 KLOC.

6.2 Results for the Lucene System

Table 12 reports the precision achieved by the heuristics for divergences. The overall precision was 59.2 %. In 16 iterations, our approach raised 446 warnings, with the mean precision per heuristic (over its iterations) ranging from 7.0 % to 98.5 %. Appendix C presents a detailed description of the warnings detected by each heuristic.

Table 12 Precision considering the warnings raised in the Lucene system

An analysis of the divergences missed by our approach (i.e., divergences detected by the reflexion model but not by our heuristics) revealed that we missed many divergences with a high scattering and a low deletion rate. For example, the high-level model does not define a dependency between the components Search and Store. However, 81 dependencies like that are present in 32 % of the classes in Store, which exceeds by a large margin the scattering thresholds we tested. Moreover, only 6 % of such dependencies were removed along Lucene’s evolution. Stated otherwise, in Lucene it is common to observe divergences that are not spatially and historically confined in their source components. Therefore, we argue that Lucene’s architecture might have evolved during the time frame considered in our study. As a result, many dependencies that were not authorized by the initial high-level model might have turned into a frequent and enduring property of the system.
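For reference, the two rates mentioned above can be computed as in the following sketch, assuming per-class dependency sets and a record of which classes later removed the dependency; the function names mirror the metrics, but the data structures are illustrative.

    # Scattering: fraction of the source component's classes that hold a
    # dependency to the target component.
    def dep_sca_rate(source_classes, deps, target_classes):
        holders = [c for c in source_classes if deps[c] & target_classes]
        return len(holders) / len(source_classes)

    # Deletion rate: fraction of the classes that ever held the dependency
    # and later removed it during the analyzed time frame.
    def dep_del_rate(ever_held, later_removed):
        return len(later_removed) / len(ever_held) if ever_held else 0.0

    # For the Search -> Store dependencies discussed above, DepScaRate is
    # about 0.32 and DepDelRate about 0.06: too scattered and too stable
    # to be flagged by heuristics that look for rare, short-lived deps.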

6.3 Threats to Validity

It is possible that Lucene’s high-level model does not capture some (true) violations. However, we argue that the chances are low, since the models were carefully designed and refined to create a benchmark for architecture conformance.

7 Discussion

In this section, we discuss the main lessons learned in the studies reported in Sections 5 and 6.

7.1 Are our Results Good Enough?

We detected a relevant number of architectural violations with the proposed heuristics: 389 violations in the SGA system; 150 violations in the M2M system; and 264 violations in Lucene.

Furthermore, we achieved the following overall precision rates: 53.8 % (M2M), 59.2 % (Lucene), and 62.7 % (SGA). These precision values are comparable to the ones normally achieved by static analysis tools, such as FindBugs (Hovemeyer and Pugh 2004). For example, in a previous study, we found that precision rates greater than 50 % are only possible by restricting the analysis to a small subset of the warnings raised by FindBugs (Araujo et al. 2011). Clearly, such tools have different purposes than ArchLint, but our intention here is to show that developers tolerate a certain rate of false warnings when using software analysis tools.

According to the architects of the SGA and M2M systems, most warnings generated by our approach are in fact due to violations of meaningful architectural constraints. For example, SGA’s architect commented that a relevant architectural rule in his system prescribes that “all IService classes must have a Remote annotation”. The heuristic for absences was able to detect three violations of this rule.

Regarding the false positives generated by the heuristics, we observed that they can be due to a design or requirement change that implied a bulk insertion or deletion of dependencies in a component. For example, this happened in the SGA system when the audit service (a new requirement) was introduced, adding new dependencies to many classes. Finally, we also observed that we may miss many true warnings when the system under evaluation is facing a major erosion process or when its architecture has evolved. For example, in Lucene we missed many divergences that are not “minorities” in their components, i.e., the dependencies responsible for such divergences are not spatially and historically confined in their source components.

7.2 How Difficult is it to Set Up the Required Thresholds?

After applying the heuristic-based conformance process to three systems, we concluded that it is not possible to rely on universal thresholds that could be reused from system to system, especially in the case of thresholds denoting insertion and deletion rates. For example, Figures 12a and b present, respectively, the distribution of the scattering rates (DepScaRate) and the deletion rates (DepDelRate) regarding the true warnings detected by heuristic #2 for divergences. We can observe that the warnings usually present very low scattering rates. For example, the 3rd quartile values for DepScaRate are 2.7 % (SGA), 0.7 % (M2M), and 1.7 % (Lucene). On the other hand, there are more differences in terms of the deletion rates (DepDelRate). For example, the median values of DepDelRate are 50 % (SGA), 64 % (M2M), and 37 % (Lucene). Such differences reveal that the frequency with which true architectural violations are removed varies significantly among the considered systems.

Fig. 12 Thresholds distribution in heuristic #2 for divergences
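The distribution statistics behind Fig. 12 can be reproduced with standard quartile computations, as in this sketch; the input lists of rates per system are assumed to be available, and the numbers in the comment come from the text above.

    import statistics

    # Sketch: summarizing the rate distributions per system (Fig. 12).
    def summarize(rates):
        q1, median, q3 = statistics.quantiles(rates, n=4)
        return {"median": median, "q3": q3}

    # E.g., the median DepDelRate of the true warnings is 0.50 (SGA),
    # 0.64 (M2M), and 0.37 (Lucene): no single deletion threshold fits all.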

Therefore, the proposed conformance process, by allowing developers to gradually test and evaluate the required thresholds, proved to be the right strategy for applying the proposed heuristics. First, the process did not require many iterations: considering all systems and both absences and divergences, we counted 14, 7, and 16 iterations requiring feedback from the developers in the SGA, M2M, and Lucene systems, respectively. Second, we normally observed lower precision rates as new iterations were executed, as expected. For this reason, we claim that the detected true warnings are not mere coincidences, but the result of spatial and temporal patterns that characterize architectural violations.

7.3 How Much Overlap is there Among the Heuristics for Divergences?

In the specific case of divergences, since we have three heuristics, it is possible for a warning to be raised by more than one of them. However, we observed that such warnings followed different patterns in the three systems, especially in the case of true warnings. In the SGA system, as presented in Fig. 13a, there is some intersection between the true warnings raised by the heuristics for divergences, although it is not relevant. In the M2M system, we have not found true warnings raised by more than one heuristic, as shown in Fig. 13b. Finally, in Lucene, we found an expressive intersection between heuristics #1 and #3, as shown in Fig. 13c. Also, only in Lucene did we find warnings detected simultaneously by the three heuristics. In summary, our results show that each heuristic could detect real and unique violations in at least one of the evaluated systems.

Fig. 13 Warnings raised by more than one heuristic for detecting divergences
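The overlap analysis reduces to set operations over the warnings raised by each heuristic, as in this sketch; the three input sets are assumed to contain the dependencies flagged by heuristics #1, #2, and #3, respectively.

    # Sketch: pairwise and three-way overlaps among the divergence
    # heuristics, as visualized in Fig. 13.
    def overlaps(h1, h2, h3):
        return {
            "only #1 and #2": (h1 & h2) - h3,
            "only #1 and #3": (h1 & h3) - h2,
            "only #2 and #3": (h2 & h3) - h1,
            "all three":      h1 & h2 & h3,
        }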

7.4 What are the Most Common Dependency Types Responsible for Violations?

As defined in Section 2, the heuristic for absences and the second heuristic for divergences consider violations regarding a specific dependency type. Table 13 shows the most common dependency types among the true violations detected by these two heuristics in the SGA system. As we can observe, the most common dependency types were due to missing local variable declarations (absences) or unauthorized local variable declarations (divergences). In the case of absences, most missing local variables are related to the implementation of the audit service. In some cases, the classes subjected to this service must inherit from AuditInfo (as discussed in Example #2, Section 5.2.1). In other cases, the methods requiring auditing must declare a local variable of type AuditDAO and call a save method from this class. However, the proposed heuristic for absences detected many classes whose methods were supposed to use the audit service by declaring this local variable, but do not. Regarding the divergences detected by heuristic #2, many methods were using a local variable of an incorrect type to persist data. Specifically, in many cases classes from JPA (a Java API for persistence) should have been used, but instead the code used local variables of types supporting direct access to SQL. In the case of absences, we also detected classes that were not inheriting from br.sga.core.domain.AuditInfo, for example, and classes missing a javax.ejb.Local annotation. Finally, in the case of divergences, we also detected classes incorrectly using the javax.persistence.OneToMany annotation.

Table 13 Most common dependency types in the SGA system

8 Related Work

We divided related work into three groups: static analysis tools, software repository analysis tools, and architecture conformance tools. The tools in the first two groups detect program anomalies, but not at the architectural level. The tools in the third group target architectural anomalies, but are not based on the combination of static and historical analysis techniques used by our approach.

8.1 Static Analysis Tools

Starting with the Lint tool (Johnson 1977) in the late seventies, several tools have been proposed to detect suspicious programming constructs by means of static analysis, including PREfix/PREfast (Larus et al. 2004) for programs in C/C++, and FindBugs (Hovemeyer and Pugh 2004) and PMD (Copeland 2005) for programs in Java. Such tools rely on static analysis to detect problematic programming constructs and events, such as uncaught exceptions, null pointer dereferences, overflow in arrays, synchronization pitfalls, security vulnerabilities, etc. Therefore, they are not designed to detect architectural anomalies, such as the ones associated with violations of the planned architecture of object-oriented systems.

The dissemination of static analysis tools has motivated empirical evaluations of the relevance of the warnings raised by such tools. For example, in a previous study based on five stable releases of the Eclipse platform, we measured the precision of the warnings raised by two Java-based bug finding tools (Araujo et al. 2011). We defined precision by the following ratio: (#warnings removed after a given time frame) / (#warnings issued by the tool). We found that precision rates superior to 50 % are only possible by restricting the analysis to a small subset of the warnings raised by FindBugs (basically, high-priority warnings from the correctness category). For PMD, the precision was less than 10 %. In another study, Kim and Ernst define precision in a different way: (#warnings on bug-related lines) / (#warnings issued by the tool) (Kim and Ernst 2007). Using this strict definition, the precision was less than 12 %. Therefore, precision values ranging from 53.8 % (M2M) to 62.7 % (SGA), as the ones we achieved with ArchLint, are greater than the values typically provided by traditional static analysis tools.

8.2 Software Repository Analysis Tools

Many tools have been proposed to extract programming patterns from software repositories. DynaMine is a tool that analyzes source code check-ins to discover application-specific coding patterns, such as highly correlated method calls (Livshits and Zimmermann 2005). BugMem (Kim et al. 2006) and FixWizard (Nguyen et al. 2010) are tools that mine for repeated bug fix changes in a project’s revision history (e.g., changes where an incorrect condition is replaced with a correct one). Lamarck is a tool that mines for evolution patterns (i.e., not only bug fixes) in software repositories by abstracting object usage into temporal properties (Mileva et al. 2011). In Lamarck, to evaluate the tool’s effectiveness in detecting errors, precision is defined as: (#code smells and defects) / (#warnings issued by the tool). Using this definition, Lamarck’s success rate ranges from 33 % to 64 %. Hora et al. (2013) extract system-specific rules from source code history by monitoring how APIs evolve, with the goal of providing better rules to developers. They focus on structural changes performed to support API modification or evolution; in contrast to previous approaches, they do not focus only on mining bug fixes or system releases. Palomba et al. (2013) propose an approach called HIST to detect five different code smells (Divergent Change, Shotgun Surgery, Parallel Inheritance, Feature Envy, and Blob) by inspecting how the source code changed over time. Basically, they use change history information extracted from software repositories to detect bad smells by analyzing co-changes among source code artifacts. Using only historical analysis, their precision ranges from 61 % to 80 %, which is comparable to the precision achieved by our approach. The authors suggest that better performance can be achieved by combining static and historical analysis, as performed by our approach. Silva et al. (2014) rely on a sparse graph clustering algorithm to extract groups of classes that frequently change together, called co-change clusters. They also propose some patterns of co-change clusters, like well-encapsulated, octopus, and crosscutting, which are used to assess the traditional decomposition of systems in packages using historical information.

In summary, the aforementioned works adopt a vertical approach for discovering project-specific patterns in software repositories (in contrast to static analysis tools, which assume a horizontal approach based on a pre-defined set of bug patterns). Our approach is also vertical, but focused on architecture conformance.

8.3 Architecture Conformance Tools

Besides reflexion models, another common solution for architecture conformance is centered on domain-specific languages, such as SCL (Hou and Hoover 2006), LogEn (Eichberg et al. 2008), DCL (Terra and Valente 2009), Grok (Holt 1998), Intensional Views (Mens et al. 2006), and DesignWizard (Brunet et al. 2011). Certainly, by using such languages, it is possible to detect the same absences and divergences as ArchLint. On the other hand, even with a customized syntax, the definition of architectural constraints may represent a burden for software architects and maintainers. For example, in a previous experience with the DCL language, we had to define 50 constraints to provide just a partial specification of the architecture of a large information system (Terra and Valente 2009).

In a recent work, we used association rules to mine architectural patterns in version histories (Maffort et al. 2013). First, our goal was to investigate the automatic generation of architectural constraints in the DCL language. Second, we aimed to propose a theory to explain and support the heuristics proposed in this paper. On one hand, we found that the heuristic for absences and the first two heuristics for divergences can be modeled as a frequent itemset mining problem. On the other hand, the number of association rules produced by frequent itemset mining techniques is considerably large. Finally, Hammad et al. (2009) proposed a technique based on source code changes for extracting UML class diagrams, which could be used as a first approximation of the component model required by our approach.

9 Conclusion

We conclude with the main contributions of our research, both for practitioners and for software engineering researchers. First, for practitioners, especially ones who are not experts on the system under evaluation, we envision that a heuristic-based approach for architecture conformance can be used to rapidly raise architectural warnings, without deeply involving experts in the process. Moreover, after evaluating many of the warnings raised by the heuristics, practitioners can gain confidence in the most relevant architectural constraints, which can then be formalized using languages such as DCL (Terra and Valente 2009). Additionally, especially among developers who frequently use popular static analysis tools, ArchLint can be promoted as a complementary tool that elevates the warnings raised by such tools to the architectural level. Finally, for researchers, the approach described in this paper may open a novel direction for research on architecture conformance techniques, based not only on static information but also on information extracted from version repositories, which are ubiquitous in today’s software projects.

As future work, we plan to evaluate new heuristics, especially heuristics that take into account the age of the changes, which can mitigate the impact that changes in architectural decisions have on our current approach. We are also working on the integration of ArchLint with ArchFix (Terra et al. 2013; 2012), a recommendation tool that suggests refactorings for repairing architectural violations. ArchLint, our supporting tool, is publicly available at: http://aserg.labsoft.dcc.ufmg.br/archlint