1 Introduction

The popularity of microservices for the development of enterprise applications has increased tremendously over the past decade. Microservices are claimed to bring a wide range of benefits over monolithic applications (monoliths), such as language agnosticism and improved scalability and maintainability. However, adopting a microservices architecture (MSA) can be challenging. For instance, guaranteeing data consistency in an MSA with multiple databases may require significant effort [15]. Transactions in coarse-grained systems are encapsulated in a single service, which facilitates handling data consistency. However, if the granularity level of a coarse-grained system results in tightly coupled services, this can significantly reduce the maintainability and scalability of the system. Hence, one of the main challenges in MSAs is the definition of an appropriate level of granularity for the microservices [14].

Maintainability is strongly influenced by the granularity of the microservices [5]. It is generally expected that the maintainability of an application improves when an MSA is adopted, because different development teams can be assigned to specific microservices and work in parallel [16]. However, if the granularity of the microservices architecture is not properly designed, dependencies among microservices can result in a maintenance nightmare. Such tight coupling can force change propagation, requiring developers to update multiple services as a consequence of changes in one service. Since maintainability is a challenge in MSAs [5], we investigated the assessment of microservices granularity from a maintenance perspective. Domain-Driven Design (DDD) [9] is an approach that can be used to define the granularity of the microservices at design-time by determining the boundaries (the scope) of the services. However, DDD does not offer concrete decision support, since it only outlines how bounded contexts can be identified for each domain concept [22], relying on (expensive) experienced architects to get appropriate results, which are still prone to subjectivity.

This paper presents a method to assess the granularity of microservices with respect to maintainability. The method is based on metrics that are relevant for maintainability, namely change coupling, structural coupling, weighted service interface count, lines of code, service interface data cohesion, and change frequency. By evaluating these metrics, our method can assess refactors of a microservices architecture against maintainability requirements. To automate the execution of this method and capture these metrics, we reused available tools and developed features to extract relevant information from code repositories. We validated our method with three open source projects of different sizes and structures, two of which are discussed in this paper. Our results show that the design decisions identified by our method are mostly in accordance with the maintainability evolution as perceived by the experts involved in the investigated projects, which indicates that our method can potentially be used as a component of a design decision support tool, especially for enterprise applications.

This paper is further structured as follows: Sect. 2 discusses related work, Sect. 3 describes our research approach, Sect. 4 defines the metrics to assess the granularity of microservices from a maintenance perspective used in our method, Sect. 5 introduces our assessment method, Sect. 6 describes the validation of our method and Sect. 7 gives our final remarks.

2 Related Work

The popularity of microservices has led to intensive research on the topic. However, we found only a few publications that focus directly on assessing the granularity of microservices. A method to collect coupling metrics at runtime, with a focus on monitoring the evolution of maintainability in an MSA, is introduced in [3]. Although their results are promising, as the evolution in metrics seems to correspond with the architectural evolution of the system, their method does not assess cohesion and requires a test suite covering the complete system, which decreases the accessibility of their method as such a test suite is not always at hand. Our research addresses these limitations by introducing metrics that capture cohesion as well as coupling and size, providing a different perspective for evaluating maintainability in relation to granularity.

A method based on Model-Driven Engineering techniques is introduced in [8], which provides insight into the evolution of different quality attributes of an MSA in reaction to architectural changes. An obstacle to using their approach is that a model of the architecture is required. Although there are techniques to automatically recover architectures from MSAs [2, 10, 11], these techniques require specific input data that is not always available. In contrast, our work focuses primarily on maintainability and uses metrics that can be automatically obtained from the software and version control data. Our research complements existing works by offering a different method that uses a larger metric suite and improves the understanding and evaluation of microservice architecture quality.

3 Approach

This section discusses the problem context, our research steps and the selection of cases to develop and validate the method.

3.1 Problem Definition

Currently, practitioners identify service boundaries (and therefore their granularity) primarily based on their experience and insight without making use of any frameworks or tools, except for DDD [9, 22]. However, DDD fails to offer concrete decision support, leaving room for subjectivity, and we can never be sure that sufficient experience for determining microservice boundaries is available in each MSA-based project. Furthermore, it is not possible to define some generic optimal granularity values that apply in all circumstances, since varying system requirements often require different levels of granularity. For example, while finer-grained services are locally less complex, the whole application may become less flexible if these services are tightly coupled. Hence, a method to support granularity decisions cannot be agnostic with respect to system requirements, i.e., it should consider granularity from the perspective of some specific requirement.

Design decision support for the definition of microservices’ granularity is a relevant open issue for both practice and research [21], which calls for the development of a method that supports the definition of granularity for improved maintainability based on assessment of microservices. The goal of this research has been to develop such a method, which should be able to reduce the need for experience in making choices on microservice granularity.

3.2 Research Steps

Our research approach consisted of seven steps: Literature review, Requirement elicitation, Instrumentation, Case selection, Data collection, Data analysis and Validation. With a systematic literature review we identified a set of maintainability metrics that are applicable to MSA. We then selected metrics that complement each other, aiming to cover maintainability aspects in our assessment method.

Based on the maintainability metrics, we formulated two sets of requirements: (1) for the software projects we used as cases; and (2) for the instrumentation we selected. We identified three cases, selected four tools and developed a script for calculating change frequency and for data cleaning. Subsequently, we performed data collection from the selected cases. The preparation of the data (extracting and cleaning) was an essential step in our research. The compatibility of the tools with the selected cases is not always a given: it was not possible to derive certain metrics for some cases, while for others case-specific adjustments to the instrumentation had to be made. For some metrics, manual intervention was unavoidable, and the additional steps required were project-specific.

In the data analysis step, we analysed the different metrics obtained during our assessment, calculating the metrics per refactor type. Expert opinion was used to validate whether the findings of our assessment method corresponded with reality, where the experts were architects or developers involved in the real refactor(s) of the cases. During these semi-structured interviews, the experts were asked about the team’s intentions for a refactor, i.e., which system property they tried to improve by that refactoring, and the aftermath, i.e., the extent to which the refactor can be considered successful. Finally, the findings of our assessment method were compared to the statements of the interviewees for each case. The validity of our assessment method depended on whether it could reflect the same evolution in maintainability as experienced by the experts.

3.3 Case Selection

We selected three freely available cases, which are MSA-based open source application projects. An important requirement is that in each project different code releases were properly stored in a version control system (Git), so that we could access this code before and after each code refactor. In addition, each application should be implemented to address real-life business needs, i.e., it should not be a sample application for demonstration purposes. Each project should also provide refactor(s) in which the granularity of the application was affected.

To be able to draw conclusions on the generalisability of our method, we covered the following additional requirements: (i) a smaller application consisting of less than 3 services; (ii) a large application consisting of more than 10 services; (iii) an MSA using orchestration; (iv) an MSA based on choreography; (v) a system using synchronous API calls; (vi) a system communicating asynchronously via events (Pub-Sub message brokers); (vii) an open-source project; and (viii) a project developed in an enterprise context. In this way, we aimed at capturing development experience in terms of the architectural reasons for rearranging microservices in each of these refactors.

4 Maintainability Metrics

This section discusses the metrics and heuristics that we selected to assess maintainability in MSAs, based on [6].

4.1 Size Metrics

Lines of code (LOC) refers to the number of lines of code of the service implementation, excluding blank lines and comments. This is a popular but controversial metric, as it is influenced by the verbosity of the programming language [6]. However, we considered LOC due to its direct relation to complexity. Since the freedom of choosing a programming language (polyglot programming) is one of the claimed benefits of microservices, other complexity metrics such as cyclomatic complexity are hard to measure, as the required tooling is language-dependent. However, a strong correlation exists between LOC and cyclomatic complexity [12].

Since LOC may not be ideal to assess an MSA with microservices implemented with different programming languages, we complemented it with Weighted service interface count (WSIC), which is a size metric proposed in [13] that considers the number of operations exposed in the interface of a service. Intuitively, it can be an appropriate indication for maintainability, as a higher number of operations implies a more complex service, and higher complexity for the system as a whole, since more operations directly require larger implementation and testing efforts. This metric can be weighted in different ways to account for the number and complexity of the parameters of each operation, but in the absence of validated weighting methods, we used the default weight of 1. Thresholds to interpret WSIC values are given in [7].
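As an illustration of how WSIC with the default weight of 1 can be obtained, the following sketch simply counts the operations exposed in a service's OpenAPI specification; the file path is a hypothetical example, and in our study the actual values were produced by RAMA-CLI (see Sect. 5.2).

```python
# Minimal WSIC sketch with default weight 1: count the operations exposed in a
# service's OpenAPI specification. The spec path is a hypothetical example.
import yaml  # pip install pyyaml

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options"}

def wsic(openapi_file: str) -> int:
    """Each exposed operation contributes a weight of 1."""
    with open(openapi_file) as f:
        spec = yaml.safe_load(f)
    return sum(1
               for path_item in spec.get("paths", {}).values()
               for method in path_item
               if method.lower() in HTTP_METHODS)

print(wsic("orders-service/openapi.yaml"))  # hypothetical specification
```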

4.2 Coupling Metrics

Two software modules \(s_1\) and \(s_2\) are structurally coupled if there are code or structural dependencies between them. In the context of MSA, such a structural dependency can be, e.g., in the form of service calls or a producer-consumer relation. A definition of structural coupling (SC) specifically for microservices is given in formula (1), based on [18].

$$\begin{aligned} StructuralCoupling(s_1,s_2) = 1 - \frac{1}{degree(s_1,s_2)} * LWF * GWF \end{aligned}$$
(1)

This definition is based on the local weighting factor (LWF) and the global weighting factor (GWF), defined by formulas (2) and (3), respectively.

$$\begin{aligned} LocalWeightingFactor(s_1,s_2) = \frac{1 + outdegree(s_1,s_2)}{1 + degree(s_1,s_2)} \end{aligned}$$
(2)
$$\begin{aligned} GlobalWeightingFactor(s_1,s_2) = \frac{degree(s_1,s_2)}{max(degree(all\_services))} \end{aligned}$$
(3)

LWF considers the degree and the out-degree from service \(s_1\) to service \(s_2\), where \(degree(s_1,s_2)\) represents the total number of structural dependencies between \(s_1\) and \(s_2\), and \(outdegree(s_1,s_2)\) the number of those dependencies that are directed from \(s_1\) to \(s_2\). GWF weighs the degree between two services against the highest degree between any service pair of the application, considering all combinations of services in the application as possible pairs. SC is a normalised metric, and a value close to 1 indicates high structural coupling. Since this metric has been validated on 17 open-source projects, we considered it in our method as originally intended [18].
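As a sketch of how formulas (1)-(3) combine, the following function computes SC for a service pair from its directed dependency counts and the highest degree in the application; the numbers in the example are made up.

```python
# Structural coupling per formulas (1)-(3); dependency counts would come from
# MicroDepGraph output or from documented service dependencies.
def structural_coupling(out_deg: int, in_deg: int, max_degree: int) -> float:
    """SC(s1, s2) from dependencies s1->s2 (out_deg), s2->s1 (in_deg) and the
    highest degree observed between any service pair (max_degree)."""
    degree = out_deg + in_deg                # total dependencies between the pair
    lwf = (1 + out_deg) / (1 + degree)       # local weighting factor, formula (2)
    gwf = degree / max_degree                # global weighting factor, formula (3)
    return 1 - (1 / degree) * lwf * gwf      # structural coupling, formula (1)

# Example: 3 calls from s1 to s2, 1 call back, busiest service pair has degree 6.
print(round(structural_coupling(out_deg=3, in_deg=1, max_degree=6), 3))  # 0.867
```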

Change coupling (CC) between two software artefacts, also known as logical coupling, is defined as “the implicit and evolutionary dependency of two software artefacts that have been observed to frequently change together during the evolution of a software system” [17]. “Changing together” can be defined in many alternative ways, and the most appropriate definition depends on the purpose of the analysis. In our method we consider all revisions in the version control data of a service that were committed within an interval of a day as a logical change set, and artefacts that change simultaneously in a large number of change sets as change-coupled. This metric can uncover relations between software artefacts that are not explicitly present in the code of a system. This makes the metric appealing to apply to MSAs, as it is able to reveal hidden dependencies regardless of hindrances such as REST calls and event buses, which obstruct code-based analyses from discovering these relations [20]. Furthermore, this metric is programming language-agnostic, since it can be directly derived from the version control history [17].
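A minimal sketch of this daily-window interpretation of change coupling is shown below; in our study the analysis itself is performed with Code-Maat (Sect. 5.2), and the commit data in the example is fabricated for illustration.

```python
# Change coupling sketch: commits of the same day form one logical change set;
# two artefacts are change-coupled if they appear together in many change sets.
from collections import Counter, defaultdict
from itertools import combinations

def change_coupling(commits):
    """commits: iterable of (date, files) pairs, date formatted as 'YYYY-MM-DD'."""
    change_sets = defaultdict(set)
    for date, files in commits:
        change_sets[date].update(files)          # one logical change set per day
    co_changes = Counter()
    for files in change_sets.values():
        for a, b in combinations(sorted(files), 2):
            co_changes[(a, b)] += 1              # shared change sets per artefact pair
    return co_changes

commits = [("2021-03-01", {"journey-api", "journey-rule-engine"}),
           ("2021-03-01", {"journey-api"}),
           ("2021-03-09", {"journey-api", "journey-rule-engine"})]
print(change_coupling(commits))  # {('journey-api', 'journey-rule-engine'): 2}
```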

4.3 Cohesion Metrics

Change frequency (CF) is another metric that can be extracted from version control data. In MSAs, this metric corresponds to the number of times a service is modified (i.e., the number of commits) per time unit. A high CF of a service with respect to other services in the system is not a direct indication for low maintainability, but in combination with size metrics CF can help pinpoint low-cohesive services that are candidates for refactoring [20]. This is because large services with a relatively high change frequency could be covering multiple bounded contexts, and the developer should make an informed decision on whether to split these services. To allow for comparisons between different CFs, we consistently calculate CF by dividing the absolute number of changes by the number of months the changes were accumulated over.

Service interface data cohesion (SIDC) is another metric appropriate for MSAs as it is measured on an interface level. SIDC captures the degree to which the operations in an interface share the same parameter data types [4]: if all operations defined in an interface use a common parameter data type, then the corresponding service is considered highly cohesive [6]. SIDC is defined in formula (4) for an interface I of a service S as the number of operations with common data types (OC) divided by the total number of distinct operations (OD), assuming that this number is not zero, which normalises the metric to values between 0 and 1 [4, 6].

$$\begin{aligned} SIDC(S) = \frac{OC(I_\textrm{S})}{OD(I_\textrm{S})} \end{aligned}$$
(4)

Thresholds for the interpretation of SIDC values were defined in [7], which were calculated using a benchmark-based approach.
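To illustrate formula (4), the sketch below computes SIDC from a mapping of interface operations to their parameter data types, under the assumption that an operation counts towards OC when it shares at least one parameter data type with another operation of the same interface; the operation and type names are fabricated, and RAMA-CLI (Sect. 5.2) performs the actual calculation in our study.

```python
# SIDC sketch per formula (4): OC / OD for one service interface.
from collections import Counter

def sidc(operation_param_types):
    """operation_param_types: dict mapping each distinct operation name to the
    set of parameter data types it uses."""
    od = len(operation_param_types)              # distinct operations (OD)
    if od == 0:
        return 0.0
    usage = Counter(t for types in operation_param_types.values() for t in types)
    # Operations sharing at least one parameter data type with another operation (OC).
    oc = sum(1 for types in operation_param_types.values()
             if any(usage[t] > 1 for t in types))
    return oc / od

ops = {"createLoan": {"LoanRequest"},
       "updateLoan": {"LoanRequest", "LoanId"},
       "ping": {"string"}}
print(round(sidc(ops), 2))  # 2 of 3 operations share LoanRequest -> 0.67
```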

5 Assessment Method

Our assessment method consists of two steps, namely data preparation, in which the metrics values are collected and prepared for analysis, and data interpretation, in which each metric value is interpreted with regard to maintainability in different refactor contexts. This section discusses which versions of a system to analyse to properly capture the evolution of maintainability, gives guidelines to perform data preparation by prescribing tools and data cleaning steps, and presents the framework for metrics interpretation that is used to assess the maintainability of the MSAs.

5.1 Capturing Experience

Surveys show that approximately 90% of software projects use a version control system [1], which enables the retrieval of previous versions of an application. In our approach, version control systems allow us to apply our assessment to different versions of an application and learn about the evolution of maintainability in reaction to changes. For validation purposes we focused on refactors that affected the granularity of the system, analysing system versions before and after such a refactor to gain insight into how the changes in granularity influenced maintainability. This allows one to capture experience by learning from the past.

To accurately capture the impact of a refactor on maintainability, we need to make an informed decision regarding which versions of the application to analyse. We need to ensure that the refactor of interest is the only refactor implemented between the two analysed versions; if multiple refactors had been carried out between the analysed versions, we would end up measuring their combined impact instead of the isolated effect of the refactor under analysis. Refactors are not always implemented in atomic commits, but can entail a transition period. The assessment should be conducted both before and immediately after such a transition period, excluding the transition itself. This ensures that the entire impact of the refactor is accurately captured.

5.2 Data Preparation Guidelines

Our assessment method provides guidelines for data preparation and cleaning. Although we selected some specific tools, alternative instrumentation can be used as long as the necessary metric values can be obtained.

MicroDepGraph (https://github.com/clowee/MicroDepGraph) is used to capture the dependencies between microservices from the Docker Compose file (docker-compose.yml), and these dependencies are used to calculate structural coupling. Docker Compose files are commonly used to define the entire service composition of a system [19]. Since the tool focuses on Docker configurations, it has low applicability in event-driven architectures (EDA), in which services produce and consume events for and from a message broker. EDAs adhere to the loose coupling principle, in the sense that services are agnostic of which services consume their events and of which services produced the events they consume. In the case of an EDA, an alternative approach is to calculate the structural coupling manually, provided that documentation on the dependencies between the services is available and complete.
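The following sketch shows one possible way to approximate the dependency graph needed for SC directly from a docker-compose.yml file; it assumes that dependencies are declared via depends_on, which is not guaranteed, and it is not part of MicroDepGraph.

```python
# Approximate service dependency edges from depends_on entries in docker-compose.yml.
import yaml  # pip install pyyaml

def dependency_edges(compose_file: str):
    with open(compose_file) as f:
        compose = yaml.safe_load(f)
    edges = []
    for service, config in (compose.get("services") or {}).items():
        for dependency in (config or {}).get("depends_on") or []:
            edges.append((service, dependency))  # edge: service depends on dependency
    return edges

print(dependency_edges("docker-compose.yml"))
```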

Code-Maat (https://github.com/adamtornhill/code-maat) is an open-source tool for mining version control data that is able to perform code age analysis, ownership analysis and change coupling analysis. Code-Maat implements change-coupling analysis and allows the specification of temporal windows, within which all commits are considered part of the same logical change set. Temporal windows should be selected based on the behaviour of the developer(s) of an application. The version control data should be extracted for each service of interest, and bot commits and commits affecting more than 100 files should be deleted, since they negatively affect the accuracy of the change coupling analysis. In case a service of interest is renamed or refactored during the selected time interval, our labelling scripts should be used to label the new service(s) as part of the service of interest.
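A minimal sketch of these cleaning rules is given below; the bot author names and the commit representation are assumptions, since in practice the rules are applied to the git log that is fed to Code-Maat.

```python
# Cleaning sketch: drop bot commits and commits touching more than 100 files
# before running the change coupling analysis.
def clean_commits(commits, bot_authors=("dependabot[bot]", "renovate[bot]")):
    """commits: iterable of dicts with an 'author' string and a 'files' list."""
    for commit in commits:
        if commit["author"] in bot_authors:
            continue                     # bot commit: skip
        if len(commit["files"]) > 100:
            continue                     # very large commit distorts coupling: skip
        yield commit
```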

We implemented our own tool to calculate the change frequency (CF) of a service. The implementation is straightforward: the version control history of a service S is analysed and we count the number of commits (C) performed within the time interval under analysis. C is subsequently divided by the number of months (M) spanned by the time interval to determine the CF. CF should be calculated over the same cleaned version control logs as used for the CC calculations.
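A possible re-implementation of this change-frequency calculation is sketched below; the repository path, service directory and interval are hypothetical, and the script used in our study may differ in detail.

```python
# Change frequency sketch: commits touching a service directory in an interval,
# divided by the number of months that interval spans.
import subprocess

def change_frequency(repo, service_path, since, until, months):
    commits = subprocess.run(
        ["git", "-C", repo, "rev-list", "--count", "HEAD",
         f"--since={since}", f"--until={until}", "--", service_path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return int(commits) / months

print(change_frequency("loan-checker", "journey-api/",
                       "2020-01-01", "2020-07-01", months=6))
```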

cloc (https://github.com/AlDanial/cloc) is an open-source tool for measuring the Lines of code (LOC) of a service. cloc is able to recognize a wide range of programming languages and can differentiate between comment lines, code lines and blank lines for each of these languages. cloc was used to calculate the average LOC of all services in an application to identify services with a size substantially larger than the average, which might be candidates for refactoring. Using the git reset command, we were able to revert to older versions of a system and measure the LOC of these versions.
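The sketch below illustrates how cloc output can be used to flag services that are substantially larger than the system average; the service directories and the 1.5x threshold are illustrative assumptions, not values prescribed by our method.

```python
# LOC per service via cloc's JSON output, then flag services well above average.
import json
import subprocess

def loc(service_dir: str) -> int:
    out = subprocess.run(["cloc", "--json", service_dir],
                         capture_output=True, text=True, check=True).stdout
    return json.loads(out)["SUM"]["code"]    # code lines, excluding blanks/comments

services = ["journey-api", "journey-rule-engine", "email-service"]
sizes = {s: loc(s) for s in services}
average = sum(sizes.values()) / len(sizes)
candidates = [s for s, n in sizes.items() if n > 1.5 * average]  # illustrative threshold
print(sizes, candidates)
```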

RAMA-CLI (https://github.com/restful-ma/rama-cli) is a command-line tool to calculate maintainability metrics related to size, complexity and cohesion from interface specifications. The tool can parse three types of RESTful API specification languages and the metrics of our interest are service interface data cohesion (SIDC) and weighted service interface count (WSIC). As RAMA-CLI has been designed specifically for the analysis of RESTful APIs, the tool is not directly applicable to EDA, since services can produce and consume events in parallel and do not expose endpoints to each other, but to the message broker through the topics of the events. It is advisable to verify whether a specification represents a single service or multiple services, such as in the API gateway pattern, and whether it belongs to an actual microservice or to components of the message broker. In the latter cases, assessing such a specification cannot provide an accurate indication of maintainability.

5.3 Metric Interpretation Framework

This framework describes how each metric value should be interpreted with regard to maintainability in different refactor contexts, i.e., a merge (M1, M2) and a decomposition (D1, D2). For the hybrid refactors we studied, all presented interpretation guidelines are relevant, as hybrid refactors (H1, H2) encompass both a merge and a decomposition.

Table 1. Interpretation of metric values in different refactor contexts.

6 Validation

To validate our assessment method, we first applied it to each selected case, assessing pre-refactor and post-refactor versions of the systems by reverting to older versions using the version control system used in each project. After obtaining the information related to each refactor (pre-refactor and post-refactor metrics), we applied the interpretation framework presented in Table 1 to assess the maintainability of the resulting architectures. Here we only describe the refactors studied in two of the cases; for a complete account of these assessments we refer to our GitHub project site. MX, DX and HX denote a merge refactor, a decomposition refactor and a hybrid refactor, respectively.

6.1 Case 1: Metadata

Metadata is a microservice-based metadata-driven user interface (UI) generator. It is an open-source project, developed by a single developer. It allows its users to specify UI metadata via REST endpoints, as well as via GraphQL queries. The architecture of Metadata encompasses four microservices (metadata-rest, metadata-engine, metadata-graphql and metadata-deploy), while other modules are provided as binaries instead of services so that the user does not have to deploy yet another microservice.

The following refactors of this case have been considered:

  • D1. In this refactor, to increase the separation of concerns and in line with the single responsibility principle, the developer decided to decompose the ref-impl service, generating the metadata-deploy service and shifting some functionality to the service provider. By doing this, functionality related to data management was separated from the operations performed on the data.

  • H1. The application contains one REST-related service and one GraphQL-related service, which have a lot of functionality in common. Besides the REST and GraphQL-specific code, the operations available to both services are similar since in the end they implement the same features but using a different type of API. The developer pointed out the common functionality of the services to be a candidate for extraction into a separate service, as currently, the developer is required to make duplicate modifications in both services to maintain consistency. We included this hybrid refactor in our study, although it is not implemented yet, as we wanted to investigate to what extent our method could identify these services as candidates for refactoring.

6.2 Case 2: Loan Eligibility Checker

This case is an industrial project in which a bank wanted to automate the loan eligibility check for small and medium-sized enterprises (SMEs). The system has primarily been implemented in Java and has been developed by a team of over 40 engineers. The architecture of the system covers 13 services, among them the PSD2-service, which allows multiple banks to connect through an API Gateway (each bank as a service) that serves the journey-API service as well as the authentication-service and transaction-processing services. A message broker (Kafka) is used to enable event exchange with the journey-rule-engine, riskmodels and email-service services, as well as the termsheet-service, which interacts with the file-upload-service. The refactors that took place to improve maintainability during the life cycle of the project are the following:

  • M1. The journey-API and the journey-rule-engine services both keep track of the state of the customer journey. As they form the bridge between the front-end and the back-end of the system, the two services need to be in the same state for each process instance. To achieve this, they need to have the same data at their disposal, so complex state-carrying events are constantly being exchanged by the two services. According to the engineers, merging these services would directly increase the maintainability of the system, as changes to the state-carrying events would not require modifications in two services anymore. Although this refactor is acknowledged by the team to be beneficial, it is planned but not implemented yet.

  • M2. Due to a company-wide policy which initially prescribed a strict separation between business logic and the corresponding APIs, the system contains a group of three services that together form one bounded context: a service in which the business logic was implemented, a service that implemented the API of the business logic service and a service that contained the configurations to communicate with the API gateway of the system. The engineer pointed these services out as a textbook example of a bounded context divided over multiple services.

  • D2. By introducing the journey-rule-engine, a central point in the system was introduced from where all services were reachable. It was a convenient place to add new features, as the service handles the entire workflow. As a result the service grew over time. This became problematic when the bank wanted to reuse certain functionality that was now mixed into the journey-rule-engine in other applications. Using the same rule engine for multiple journeys would become too complex, so the team was forced to extract the new functionality from the rule engine into separate services.

  • H2. Employee journeys are initiated by employees of the bank and are orchestrated by the employee-rule-engine, while their API gateway is implemented by the employee-api-service. These services are similar to the rule-engine and journey-API, but are tailored towards the employee journeys. The services have been partially merged, to make the constant exchange of complex state events between the two services obsolete. The merge is partial as the employee-rule-engine and employee-api-service are involved in multiple employee journeys, and the merge is only done for one specific journey. This means that the employee-api-service, the employee-rule-engine and the new service in which the two are partially merged (review-flow-engine) co-exist after the refactor.

6.3 Results

We validated our assessment method by investigating the alignment between the assessment observations and the evolution in maintainability as perceived by the case experts. Table 2 gives an overview of this alignment, indicating for each refactor whether the assessed metric was in line (M) with the experiences of the expert, conflicted with them (C) or could not be determined or interpreted (NA). In some of the assessments, some metrics play a role both in identifying refactor candidates and in reflecting the evolution in maintainability. These entries, such as CC in the assessment of H2, contain two outcomes (M/C in this case). The first outcome refers to the ability of the metric to identify the involved services as refactor candidates in the pre-refactor assessment, which was successful in this case, while the second one refers to the alignment of the metric with the increase in maintainability as experienced by the case expert, which conflicted in this example.

Table 2. Matching between the assessment results of our method and the experiences of the three case experts.

We observed that CC is able to identify candidates for merging (M1, M2, H1, H2) and to reflect the perceived maintainability evolution (D1, D2). We discussed possible causes for conflicts with the case experts, since these causes can be valuable for learning about the applicability and limitations of our assessment method.

SC shows low applicability. This metric could not be measured during the assessment of the loan eligibility checker, as this system implements an event-driven architecture. In these architectures, services are agnostic with respect to the services with which they exchange data, so the actual inter-service dependencies cannot be extracted from the Docker Compose files used to calculate SC.

The interface-based metrics (WSIC and SIDC) frequently conflicted with the experiences of the expert for two reasons: (1) an interface does not map one-to-one to the functionality of a service, which sometimes caused a misalignment between our assessment and the expert’s experience (e.g., in the API gateway pattern) and (2) an interface is not always updated in parallel to a refactor.

LOC and CF were both well-aligned with the experiences of the case experts. Conflicts mainly arose because some deprecated code was not removed from a service (influencing the accuracy of LOC) and because of the “newness” of a service. For example, in H2 the new post-refactor service had a higher CF than the pre-refactor services; when we inquired about this, the case expert expected the CF to decrease over time, since a new service has more bugs that need to be fixed. We expect this to also be the cause of the conflict between the CC assessment and the expert's experience for H2, where the services that had just been refactored showed high change coupling, which the expert considers to be the aftermath of the refactoring.

7 Conclusions

This paper presented a quantitative method to assess the granularity of microservices in an MSA with respect to maintainability. We applied the method to selected projects to evaluate the impact of granularity on maintainability before and after refactors were performed, and compared the results of our assessment method with expert observations. Our assessments were aligned with the experts’ experiences in many cases, indicating that our quantitative assessment method often matches their intuitive understanding. Exceptions occurred particularly for metrics measured over a short interval or compared against system averages based on a small number of services. These factors should be considered when evaluating the usability of our assessment method in future studies.

Availability of suitable cases is a limitation that needs to be addressed. Finding cases was challenging as there are only a few open-source microservice-based projects available. A data set of projects implementing an MSA is presented in [19], but most of them are sample projects that demonstrate a design pattern or the use of a framework. Additional complicating factors were the inclusion criteria imposed by our validation strategy, which required contact with a case expert and implemented refactors in the history of the system that affected granularity.

A potential threat to the validity of our research is the lack of evidence demonstrating the correlation between grouping change sets based on temporal windows and the actual change sets in multi-repository microservice-based projects. Empirical validation of this correlation would be valuable, considering that many MSA projects use multiple repositories, which hinders change coupling analysis at the commit level.

Automatic identification of dependencies between microservices in existing systems is a relevant topic for future work. Currently available support (e.g., MicroDepGraph) can derive dependencies from Docker configurations, but its applicability is limited, especially in EDAs.

Our approach allows for the assessment of existing systems. At design-time, however, only SC, WSIC and SIDC can be determined. It would be valuable to investigate the usefulness of this smaller metric set in determining an appropriate granularity for greenfield applications. Finally, for our assessment method to ultimately lay the basis for a decision support tool for microservice granularity, future research should focus on empirically validating our findings on larger data sets (i.e., more projects and more refactors).