1 Introduction

To keep pace with ever-changing user, business, and technological requirements, the source code of a software system often needs to be changed. It has been observed that short project delivery deadlines, budget constraints, and unfamiliarity with the existing source code generally force developers to focus on functionality rather than design structure (Fowler et al. 1999; Mancoridis et al. 1998). Such maintenance practices increase the complexity of the design structure and degrade software quality. A software system with poor design quality is difficult to understand and evolve. The problem becomes harder for highly convoluted software designs when up-to-date documentation and the original developers are unavailable (Mkaouer et al. 2015).

There are many source code anomalies that contribute to the degradation of the design quality of a software system. In an object-oriented software system, the suboptimal placement of source code classes into packages is one of the crucial anomalies that cause this degradation (Bavota et al. 2014). To improve the design quality of software systems, various software remodularization approaches based on deterministic and search-based techniques have been proposed (e.g., Praditwong et al. 2011; Barros 2012; Prajapati and Chhabra 2017, 2017a; Parashar et al. 2016; Corazza et al. 2016; Bavota et al. 2010, 2013; Prajapati and Chhabra 2017b; Mkaouer et al. 2015a; Kumari et al. 2013; Doval et al. 1999; Prajapati and Chhabra 2014). Deterministic remodularization approaches perform well for small software systems; however, for large and complex software they become impractical and sometimes infeasible (Mancoridis et al. 1998; Harman et al. 2012).

For large and complex software, the search-based software engineering (SBSE) approach (Harman et al. 2012) has been found to be a good alternative for solving the software remodularization problem (Prajapati and Chhabra 2017; Bavota et al. 2013; Prajapati and Chhabra 2017b; Mkaouer et al. 2015a). The major advantage of the SBSE approach is that it can generate a near-optimal solution within a reasonable amount of time. The effectiveness of SBSE-based remodularization approaches depends on many factors, such as the search algorithm and the formulation of the fitness and objective functions. Of these, the fitness and objective function formulation is the most important factor in driving the remodularization process towards good solutions (Bavota et al. 2014; Anquetil and Lethbridge 1999). Hence, to achieve a good quality remodularization solution, appropriate formulations of fitness and objective functions need to be incorporated into search algorithms.

In the last two decades, unprecedented effort has been put into designing fitness and objective functions, along with search algorithms, to solve different aspects of single- and multi-objective software remodularization problems (e.g., Kumari et al. 2013; Prajapati and Chhabra 2014, 2018, 2018a). The majority of fitness and objective functions for software remodularization are designed in terms of direct-link coupling of software artefacts, such as method calls or inheritance (Mancoridis et al. 1998; Praditwong et al. 2011; Barros 2012; Prajapati and Chhabra 2017, 2017a). However, some researchers (Parashar and Chhabra 2016; Corazza et al. 2016) have attempted to design objective functions by analyzing sibling-link similarity based on lexical and change information. Others have tried to combine structural direct-link similarity with lexical sibling-link similarity in their remodularization techniques (Bavota et al. 2010, 2013, 2014; Prajapati and Chhabra 2017a). Still others have designed objective functions by combining change-history information with structural and lexical information (Mkaouer et al. 2015a).

Although the existing fitness and objective functions designed for search-based software remodularization have been reported to be quite effective, their major limitation is that they are not able to generate remodularization solutions that are meaningful from the developers' perspective (Mkaouer et al. 2015a). The main reason is that such fitness and objective functions do not comply with the developers' perspective. Therefore, to generate a remodularization solution that is meaningful to developers, the fitness and objective functions must consider factors that convey the developers' perception along with the optimization of design principles.

Further, existing software remodularization approaches commonly treat all sources of information about the entities to be modularized equally (i.e., as the presence or absence of a feature) when determining the fitness and objective functions. However, software developers usually give different importance to different types of features when designing fitness and objective functions (Bavota et al. 2013a). Therefore, the fitness and objective functions used for remodularization evaluation should consider different dimensions of information with their relative importance. Moreover, most search-based software remodularization approaches, except (Mkaouer et al. 2015a), ignore change-history dependency information. The change history may reveal many dependencies among software components that cannot be observed from structural or lexical information. The study (Bavota et al. 2013a) showed that change-history information is one of the factors that, to some extent, reflects the developers' perception of coupling. Therefore, remodularization fitness and objective functions should also consider change-history information along with structural and lexical information.

To address the above issues, this paper introduces a multi-objective formulation of the object-oriented remodularization problem in which an entropy-based similarity measure, together with inter-module class change coupling, intra-module class change coupling, module size index (MSI), and module count index (MCI), is used as the set of objective functions. In this contribution, the remodularization objective functions use different types of structural and lexical information that capture the developers' view of coupling in the computation of the similarity measure. Moreover, the approach weights the different types of structural and lexical information by their relative importance. Since the relative weights of different dimensions of information are subjective and depend on many factors (e.g., the quality measurement goal), this paper uses term frequency-inverse document frequency (TF-IDF) (Yates and Neto 1999; Corazza et al. 2016) to compute the weights. Using the different dimensions of structural and lexical information, an information-theoretic similarity measure (i.e., an entropy-based similarity measure) is constructed. Information-theoretic concepts have been successfully applied to other unsupervised machine learning tasks (e.g., data clustering) (Gokcay and Principe 2002; Sugiyama et al. 2014; Andritsos and Tzerpos 2005). Entropy measures the uncertainty about a random event, which can be used to design a remodularization criterion for restructuring the packages of software systems. In fact, when we assign a class to one of several modules we incur an entropy cost; minimizing this incremental entropy cost can be an effective evaluation criterion for software remodularization. The model also exploits change-history information stored in the version repository. In particular, the approach extracts the change dependencies between classes and uses them to make the remodularization solution consistent with the change history. The main idea of using such dependencies is to force the remodularization process towards a solution in which classes that change together are grouped together.

The rest of this paper is organized as follows: Section 2 presents related work. Section 3 provides background on information theory and structural/lexical coupling computation. Section 4 discusses the proposed approach. Section 5 presents the experimental setup. Section 6 presents results and discussion. Section 7 concludes with future directions.

2 Related Works

Automatic remodularization of software systems has become an interesting application for SBSE, in which different aspects of the software remodularization problem are formulated as search-based optimization problems (e.g., single-, multi-, or many-objective optimization) and solved using search-based meta-heuristics (Harman et al. 2012). The main attraction of SBSE for software remodularization is that the combinatorial and NP-hard nature of the problem makes SBSE approaches the best alternative (Praditwong et al. 2011). Recently, many researchers have applied various SBSE approaches, adopting different metaheuristics and single/multi-objective formulations, to address different aspects of software remodularization problems (Praditwong et al. 2011; Ouni et al. 2013, 2014, 2015; Prajapati and Chhabra 2017a; Kumari et al. 2013; Ouni et al. 2016, 2016a, 2016b, 2017; Mancoridis et al. 1998).

In the literature, the software remodularization problem has been defined in different ways according to different aspects of software restructuring, for example: 1) number of objectives: single-objective remodularization (Mancoridis et al. 1998, 1999; Doval et al. 1999), multi-objective remodularization (Praditwong et al. 2011; Barros 2012; Kumari et al. 2013; Prajapati and Chhabra 2014), and many-objective remodularization (Mkaouer et al. 2015, 2015a; Prajapati and Chhabra 2018, 2018a); 2) type of information: structural-based remodularization (Praditwong et al. 2011; Mancoridis et al. 1998, 1999; Mahdavi et al. 2003), lexical-based remodularization (Corazza et al. 2016), and combined structural + lexical remodularization (Mancoridis et al. 1998; Prajapati and Chhabra 2017a; Bavota et al. 2010); and 3) level of modification: moderate remodularization (Bavota et al. 2010; Prajapati and Chhabra 2017; Abdeen et al. 2009) and software clustering (Praditwong et al. 2011; Kumari et al. 2013; Ouni et al. 2016).

The application of SBSE techniques to software remodularization is approximately two decades old. Credit for the first search-based remodularization formulation goes to Mancoridis et al. (1998), who first applied SBSE concepts to cluster software entities into a more cohesive form. In that contribution, they also introduced the modularization quality (MQ) measure to evaluate clustering quality, defined in terms of two software quality attributes (i.e., inter-connectivity and intra-connectivity). MQ, a structural-information-based modularity criterion, has been widely used as the fitness function guiding the remodularization process (Doval et al. 1999; Harman et al. 2002; Mitchell and Mancoridis 2002; Mamaghani and Meybodi 2009). Mancoridis et al. (1999) customized different meta-heuristic search techniques, such as the Genetic Algorithm (GA), Simulated Annealing (SA), and Hill-Climbing (HC), to address the software module clustering problem.

Praditwong et al. (2011) used the MQ measure along with other software clustering criteria to formulate software clustering as a multi-objective optimization problem. They introduced two new multi-objective clustering formulations, namely the maximizing cluster approach (MCA) and the equal-size cluster approach (ECA). Each of the MCA and ECA formulations contains five partially conflicting objectives and is based on structural information. Other researchers (Barros 2012; Kumari et al. 2013; Prajapati and Chhabra 2014) have also used the MCA and ECA formulations to evaluate different meta-heuristic algorithms.

The MQ measure discussed above is based on direct-link coupling. Jinhuang and Jing (2016) defined a new MQ measure based on similarity coupling; their experimental results demonstrated that the similarity-based MQ outperformed the direct-link-based MQ. Prajapati and Chhabra (2017a) also used a similarity-based coupling measure to remodularize software systems, and their results demonstrated that it is able to generate good quality software modularization.

Even though structural-based modularity measures are able to drive search algorithms towards remodularization solutions that are good from the structural point of view, such solutions are not necessarily good from a semantic perspective or the developers' view. To remodularize software systems well from the semantic point of view, Corazza et al. (2016) used lexical information to compute the similarity between software entities. Their results showed that lexical-based remodularization is able to generate solutions that are better from the semantic perspective. Some researchers (e.g., Prajapati and Chhabra 2017a; Bavota et al. 2010, 2013, 2014) have used combined structural and lexical information to remodularize software systems.

3 Basic Concepts

This section presents a brief description of the information theory concepts used in our proposed many-objective remodularization approach. Information theory is a broad field and cannot be covered here in detail; the interested reader may find details in any information theory textbook (e.g., Cover and Thomas 1991). In addition to the basic concepts of information theory, this section briefly describes the various types of structural (e.g., calls, inheritance, contains) and lexical (e.g., method name, class name, parameter name) information used as features.

3.1 Minimum Entropy Concept

In this section, we explain the concept of software entropy for object-oriented software remodularization. Here the term feature refers to the different types of coupling information (e.g., structural and lexical) of a class, and the different values that each feature takes are referred to as feature values. We assume that the object-oriented software to be remodularized contains a set of N source code classes, C = {c1, c2, …, cN}, and that each class has a set of M features, F = {f1, f2, …, fM}, with feature value wi for the i-th feature. Our approach starts by representing the software system as a matrix M, as given in Table 1.

Table 1 Matrix M representing software system

The rows of the matrix represent the source code classes to be remodularized, while the columns show the values of the features that describe these classes. Each entry Mij of the matrix holds the coupling value of the jth feature in the ith class. Let X represent a discrete random variable taking its values from the set of classes C.

If p(xi) is the probability distribution function (pdf) of the values xi that X takes (xi ∈ C), the entropy H(X) of the variable X is defined as follows:

$$ H(X)=-\sum \limits_{x_i\in C}p\left({x}_i\right)\log \left(p\left({x}_i\right)\right) $$
(1)

Intuitively, entropy quantifies the disorder of a system: the higher the uncertainty, the higher the entropy. Since entropy measures the amount of “disorder” of a system, many approaches utilize some form of such a measure as a quality criterion (i.e., fitness function) for clustering different types of data (Cover and Thomas 1991; Gokcay and Principe 2002; Hino and Murata 2014). In clustering, a cluster containing elements with high similarity has low entropy.
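For illustration, Eq. 1 can be sketched over an empirical distribution. The sketch below assumes base-2 logarithms (the base only scales the value) and uses hypothetical feature values:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of the empirical distribution of `values`,
    H(X) = -sum p(x) log p(x) as in Eq. 1 (base-2 log assumed here)."""
    counts = Counter(values)
    total = sum(counts.values())
    # log2(total/c) == -log2(c/total), so every term is non-negative
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# A homogeneous group has zero disorder; a uniformly mixed one is maximal:
print(entropy(["a", "a", "a", "a"]))  # 0.0
print(entropy(["a", "b", "c", "d"]))  # 2.0
```

As the comments note, a module whose classes share the same feature values contributes no entropy, which is exactly the property the remodularization criterion exploits.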

In the context of the remodularization of object-oriented package structure, we assume that a module/package is a group of classes with minimum entropy. That is, a module is a partition of the set of classes with a minimum degree of “disorder”. The entropy of a module is directly related to its classes; in probabilistic terms, it depends on the probability distribution function of its classes. For an object-oriented software system, let N source code classes be distributed among M modules and let ci be the ith class. The pdf of ci is defined as follows:

$$ p\left({c}_i\right)=\sum \limits_{t=1}^Mp\left({c}_i|t\right)p(t) $$
(2)

where p(t) is the prior probability of the tth module and p(ci|t) is the probability of ci given the tth module. However, we would like to know the dependence of the pdf of the tth module with respect to ci. This dependency can be computed using Bayes' theorem (Joyce 2008):

$$ p\left(t|{c}_i\right)=\frac{p\left({c}_i|t\right)p(t)}{p\left({c}_i\right)} $$
(3)

When p(t|ci) is uniformly distributed over all t, the class ci could belong to any module and the uncertainty is maximum. On the other hand, if all p(t|ci) but one are zero (the remaining one having the value unity), then we are certain about the module to which the class ci belongs. Now, let C be a random variable whose possible values 1, 2, …, K represent the modules, and let X be a random variable whose possible values are the classes ci. Then the entropy of C given X is:

$$ H\left(C|X\right)=-\sum \limits_{t=1}^Kp\left(t|{c}_i\right)\log \left(p\left(t|{c}_i\right)\right) $$
(4)

where p(t|ci) is the a posteriori probability mass function. Our goal is to find this function such that H(C|X) is minimum. The entropy given by Eq. 4 is called the conditional entropy.

The information-theoretic similarity measure (i.e., H(C|X)) discussed above is a collective similarity measure rather than a traditional direct-link similarity measure. Direct-link similarity measures (e.g., inter-module class coupling and intra-module class coupling) may lead the optimization process towards remodularization solutions that are better from the coupling and cohesion perspective but not necessarily meaningful from the developers' perspective. The information-theoretic similarity measure, by contrast, encompasses many dimensions of similarity; hence, incorporating it into the remodularization process is expected to lead to more meaningful solutions. To compute the conditional entropy accurately, we need to compute the feature values of classes accurately. In this paper, we consider both structural and lexical features of the source code classes.
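The minimization target of Eq. 4 can be sketched as follows. The text does not prescribe how the posterior p(t|ci) is obtained, so this illustrative sketch assumes it comes from normalizing cosine similarities between a class's feature vector and hypothetical module centroids; the base-2 logarithm is also an assumption:

```python
import math

def module_membership(class_vec, centroids):
    """Hypothetical posterior p(t|c): cosine similarity of the class's
    feature vector to each module centroid, normalised to sum to 1."""
    sims = []
    for cen in centroids:
        dot = sum(a * b for a, b in zip(class_vec, cen))
        na = math.sqrt(sum(a * a for a in class_vec)) or 1.0
        nb = math.sqrt(sum(b * b for b in cen)) or 1.0
        sims.append(dot / (na * nb))
    total = sum(sims) or 1.0
    return [s / total for s in sims]

def conditional_entropy(class_vectors, centroids):
    """H(C|X) in the spirit of Eq. 4: sum over classes of the entropy
    of the posterior module-membership distribution p(t|c)."""
    h = 0.0
    for vec in class_vectors:
        for p in module_membership(vec, centroids):
            if p > 0:
                h -= p * math.log2(p)
    return h
```

A class whose features match one centroid exactly contributes zero entropy, while a class equally similar to two modules contributes one bit; the optimizer would favour assignments that shrink this total.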

3.2 Structural and Lexical Features

In this paper, we utilize various types of class information to determine the class features. In particular, the proposed approach considers 8 different types of structural information (e.g., calls, inheritance, contains) and 6 different types of lexical information (e.g., method name, class name, parameter name) as class features.

3.2.1 Structural-Based Features

The classes in an object-oriented software system may be connected with other classes by zero or more structural coupling relationships (e.g., method calls, inheritance, references). Hence, the structural features of a source code class can be defined in terms of the individual classes connected with that class. In this paper, the structural features of a class are the different structural relationships by which it is connected with another class. The combination of all structural relations between a pair of classes determines the connection strength between these two classes and is considered the feature value for this pair. To determine the coupling strength between classes, we use eight different types of structural relationships (Prajapati and Chhabra 2017a). Brief descriptions of these relationships are given in Table 2.

Table 2 Structural relationships existing between two classes

The connection strength between class ci and class cj is computed by aggregating the number of instances of each relationship with their relative weights. The connection strength (CS) from class ci to class cj is defined as follows:

$$ CS\left({c}_i,{c}_j\right)=\sum \limits_{r\in R}{w}_r\left({c}_i,{c}_j\right)\times {n}_r\left({c}_i,{c}_j\right) $$
(5)

where nr and wr represent the number of instances and the weight of an r-type relationship, respectively. To compute the relative weights wr, this paper uses the term frequency-inverse document frequency (TF-IDF) weighting scheme, the most widely used technique in data mining for assigning weights to document terms (Yates and Neto 1999). In this study, the documents are the source code classes and the terms are their relationships. The weight wr of an r-type relationship from class ci to class cj is given as follows:

$$ {w}_r\left({c}_i,{c}_j\right)= tf(r)\cdot \log \left( idf(r)\right) $$
(6)

where tf(r) (term frequency) is the frequency of the r-type relationship from class ci to class cj and idf(r) (inverse document frequency) is the fraction n/nr(cj), where n is the number of classes and nr(cj) is the number of classes connected with class cj by r-type relationships.
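A minimal sketch of Eqs. 5-6, using hypothetical relationship counts. Following the text, tf(r) is taken as the raw instance count nr(ci, cj), so the count enters both the weight and the aggregation:

```python
import math

def connection_strength(rel_counts, n_classes, n_connected):
    """CS(ci, cj) per Eqs. 5-6: sum over relationship types r of
    wr(ci, cj) * nr(ci, cj), with wr = tf(r) * log(idf(r)).
    rel_counts:  {r: nr(ci, cj)}  -- instances of r between ci and cj
    n_connected: {r: nr(cj)}      -- classes linked to cj via r."""
    cs = 0.0
    for r, nr in rel_counts.items():
        tf = nr                           # frequency of r from ci to cj
        idf = n_classes / n_connected[r]  # n / nr(cj)
        cs += (tf * math.log(idf)) * nr   # wr * nr
    return cs

# Hypothetical pair: 3 method calls and 1 inheritance link in a
# 100-class system, where cj is called by 10 classes and extended by 2.
cs = connection_strength({"calls": 3, "inherits": 1},
                         n_classes=100,
                         n_connected={"calls": 10, "inherits": 2})
```

The rarer inheritance relationship receives a higher idf weight than the common call relationship, which is the intended TF-IDF effect.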

3.2.2 Lexical-Based Features

In addition to the structural-based features of source code classes, the proposed remodularization approach also uses lexical-based features. The different types of unique terms present in a class are considered features of that class. Our approach uses six major categories of lexical features, similar to the approach reported in (Corazza et al. 2016). A brief description of the different categories is given in Table 3.

Table 3 Types of lexical class relationships

Similar to the structural-based features, each lexical-based feature (i.e., each term with its occurrences) is first computed, then the relative importance (weight) of each term is determined, and finally the corresponding weight is multiplied by the actual term occurrences. Given a class c, the weight of a term t in a particular zone z is computed as follows:

$$ w\left(t,c,z\right)= tf(t)\cdot \log \left( idf(t)\right) $$
(7)

where tf(t) (term frequency) is the frequency of term t in zone z of class c and idf(t) (inverse document frequency) is the fraction n/df(t, z), where n is the number of classes and df(t, z) is the number of classes in which the term t occurs within the zone z.
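Eq. 7 can be sketched analogously for a lexical zone; the corpus layout, zone name, and terms below are hypothetical:

```python
import math
from collections import Counter

def zone_term_weights(corpus, class_idx, zone):
    """w(t, c, z) per Eq. 7 for every term of one zone of one class.
    corpus: list of classes, each a dict mapping zone name -> term list.
    Weight = tf(t) * log(n / df(t, z))."""
    n = len(corpus)
    tf = Counter(corpus[class_idx].get(zone, []))
    weights = {}
    for term, freq in tf.items():
        # df(t, z): classes whose zone z contains the term
        df = sum(1 for c in corpus if term in c.get(zone, []))
        weights[term] = freq * math.log(n / df)
    return weights

# Hypothetical three-class corpus with a "method_name" zone:
corpus = [
    {"method_name": ["save", "load", "save"]},
    {"method_name": ["save", "close"]},
    {"method_name": ["render"]},
]
w = zone_term_weights(corpus, 0, "method_name")
```

Here the repeated but common term "save" and the rare term "load" end up with comparable weights, showing how idf dampens terms shared across classes.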

4 Proposed Approach

This section presents a detailed description of the major steps of the proposed information-theoretic software remodularization. The main objective of the proposed approach is to reorganize the source code classes of an object-oriented software system among packages/modules such that the regrouping is good from the quality-metrics point of view as well as meaningful from the developers' perspective. To achieve this goal, the proposed work defines five package quality evaluation criteria (i.e., package entropy (to minimize), inter-module class change coupling (to minimize), intra-module class change coupling (to maximize), Module Count Index (to minimize), and Module Size Index (to minimize)) that help guide the remodularization process towards more promising search regions.

The complete framework of the proposed software remodularization approach is presented in Fig. 1. Its activities are divided into three main phases. In the first phase, the required structural, lexical, and change-history information of the software is extracted. In the second phase, structural and lexical coupling are computed independently and then combined. In the third phase, the package entropy, inter-module class change coupling, intra-module class change coupling, Module Count Index, and Module Size Index are computed. Finally, a search-based algorithm is applied to generate the remodularization solution.

Fig. 1

Framework of proposed remodularization approach

4.1 Extraction of Software Information

In this phase, the software information is collected: structural information (e.g., classes, packages, and class relationships), lexical information (e.g., class names, attribute names, method names, parameter names, comments, and source code statements), and change-history information. Since the proposed remodularization approach is specifically designed for object-oriented software implemented in the Java programming language, the terminology and information used in this work are based on Java.

4.2 Remodularization Objectives

Automated software remodularization driven by a search-based meta-heuristic algorithm requires fitness functions that can force the optimization process towards the expected solution. To achieve a remodularization solution meaningful from the developers' perspective, it is necessary to incorporate into the remodularization process an evaluation model and metrics that reflect that perspective. The developers' perspective can be inferred from the source code, in which the developers' knowledge is embedded. Based on the various dimensions of structural and lexical information, change-history information, and module dispersion information, this paper designs the following objective functions:

  • Software Entropy: Entropy is a measure of disorder; the higher the entropy, the lower the certainty. A good software module should contain highly correlated classes. The entropy of the software system is defined as follows (Aldana-Bobadilla and Kuri-Morales 2011):

$$ H\left(C|X\right)=\sum \limits_{t=1}^KH\left(t|X\right) $$
(8)

where H(t|X) is the entropy of module t. The objective is to minimize the entropy for each module.

  • Inter- and Intra-module Class Change Coupling: That classes changed together should be grouped together is one of the core design principles for software systems. Here, we use the change-history information from the version repository and compute the change coupling between classes at the class level by mining their co-change patterns. The intra-module class change coupling refers to the total class change-strength within modules, and the inter-module class change coupling refers to the total class change-strength between modules (Parashar and Chhabra 2016). The change-strength between two classes is defined as follows:

$$ \mathit{Change}\text{-}\mathit{strength}\left({C}_i,{C}_j\right)=\frac{\mid {C}_i\cap {C}_j\mid }{\mid {C}_i\mid }+\frac{\mid {C}_i\cap {C}_j\mid }{\mid {C}_j\mid } $$
(9)

where |Ci ∩ Cj| is the count of change-commits in which both Ci and Cj changed together, and |Ci| is the count of change-commits containing Ci.
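Eq. 9 reduces to set operations over per-class commit histories; the commit ids below are hypothetical:

```python
def change_strength(commits_i, commits_j):
    """Change-strength(Ci, Cj) per Eq. 9, given the sets of
    change-commit ids in which each class was modified."""
    shared = len(commits_i & commits_j)  # |Ci ∩ Cj|
    return shared / len(commits_i) + shared / len(commits_j)

# Hypothetical histories: Ci changed in 4 commits, Cj in 3, sharing 2.
s = change_strength({1, 2, 3, 4}, {3, 4, 5})  # 2/4 + 2/3
```

Note that the measure is symmetric and reaches its maximum of 2 when the two classes change in exactly the same commits.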

  • Module Count Index (MCI): In search-based software remodularization, an optimization process driven only by similarity criteria may generate singleton modules (i.e., modules containing a single entity). To avoid this situation, this work uses two other conflicting objectives, namely the module count index (MCI) and the module size index (MSI), introduced by Prajapati and Chhabra (2017a). The MCI determines the deviation between the number of modules produced during optimization and the number of modules set by the developers. The MCI is defined as follows:

$$ MCI=\exp \left[-\frac{1}{2}{\left(\frac{\ln (m)-\ln \left({m}^{\ast}\right)}{w\,\ln (n)}\right)}^2\right] $$
(10)

where m, m*, and n represent the number of modules in the produced solution, the number of modules defined by the developers, and the number of classes, respectively. The parameter w is a penalty factor that penalizes remodularization solutions far from m*.

  • Module Size Index (MSI): The main goal of the MSI objective function is to prevent the generation of very large modules (i.e., modules containing a large number of entities). In other words, the MSI evaluates the deviation between the sizes of the generated modules and the sizes of the developers' modules. To determine the ideal module size, the Component Packaging Density (CPD) method is used (Abdeen et al. 2009). The MSI is defined as follows:

$$ MSI=\exp \left[-\frac{1}{2}{\left(\frac{\ln \left({s}_{avg}\right)-\ln \left({s}^{\ast}\right)}{w\,\ln (n)}\right)}^2\right]\quad \text{where}\quad {s}_{avg}=\sum \limits_{i=1}^n\frac{s_i}{n} $$
(11)

where si is the size of the module in which class i is located. The ideal module size is defined as s* = n/m*. As in the MCI, the parameter w is a penalty factor.
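Eqs. 10-11 can be sketched directly; the default penalty factor w = 0.5 below is an assumption, since its value is not fixed here:

```python
import math

def mci(m, m_star, n, w=0.5):
    """Module Count Index per Eq. 10: equals 1 when the produced module
    count m matches the developer-defined count m*; w is the penalty
    factor (default assumed)."""
    return math.exp(-0.5 * ((math.log(m) - math.log(m_star))
                            / (w * math.log(n))) ** 2)

def msi(per_class_module_sizes, m_star, n, w=0.5):
    """Module Size Index per Eq. 11: s_avg averages, over the n classes,
    the size of the module each class lives in; ideal size s* = n/m*."""
    s_avg = sum(per_class_module_sizes) / n
    s_star = n / m_star
    return math.exp(-0.5 * ((math.log(s_avg) - math.log(s_star))
                            / (w * math.log(n))) ** 2)
```

Both indices are Gaussian-shaped in log scale: they equal 1 when the solution matches the developer-defined target and decay as the module count or average module size drifts away from it.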

4.3 Remodularization Problem Encoding

To apply a search-based meta-heuristic, the problem needs to be encoded in a form on which the various manipulation operators of the meta-heuristic can act effectively. To encode the package organization of the existing software system, an n-sized integer array (n equal to the number of classes) is used, where the value v (0 < v ≤ p) of the ith element indicates the package number to which the ith source code class is assigned; p is the number of packages/modules. For example, in Fig. 2, array index 5 represents class 5, and the value 2 at index 5 shows that class 5 is assigned to module 2.

Fig. 2

An example of remodularization encoding

To initialize the population, the solutions are generated randomly within the lower and upper bounds of each decision variable.
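The encoding and random initialization described above can be sketched as follows; the population size and seed are illustrative:

```python
import random

def random_solution(rng, n_classes, n_packages):
    """One remodularization chromosome (Fig. 2 encoding): position i
    holds the package number (1..p) that class i is assigned to."""
    return [rng.randint(1, n_packages) for _ in range(n_classes)]

def initialize_population(pop_size, n_classes, n_packages, seed=None):
    """Random initial population within the decision-variable bounds."""
    rng = random.Random(seed)
    return [random_solution(rng, n_classes, n_packages)
            for _ in range(pop_size)]

pop = initialize_population(pop_size=50, n_classes=8, n_packages=3, seed=42)
```

With this representation, crossover and mutation operate directly on the integer genes, and every array is a valid (if not necessarily good) remodularization.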

4.4 Many-Objective Evolutionary Algorithm

There has been important progress in formulating real-world optimization problems as search-based multi-objective optimization problems and solving them using multi-objective evolutionary algorithms (MOEAs), e.g., NSGA-II (Deb et al. 2002), SPEA2 (Zitzler et al. 2002), and PESA-II (Corne et al. 2001). These algorithms use the Pareto-dominance concept to rank the solutions in the population for selection. However, studies (Bingdong et al. 2015; Jaimes et al. 2009; Wang et al. 2015) have demonstrated that Pareto-dominance-based MOEAs do not perform well when the number of objective functions gets large (specifically, more than three).

In our software remodularization, five objective functions have to be optimized simultaneously to improve the existing package structure; hence, a Pareto-dominance-based MOEA may not work effectively. To optimize these objective functions simultaneously and effectively, we adapt NSGA-III (Deb and Jain 2014), MOEA/D (Zhang and Li 2007), IBEA (Zitzler and Kunzli 2004), and TAA (Praditwong and Yao 2006), popular many-objective evolutionary algorithms designed to work effectively with a large number of objective functions. These algorithms have been applied successfully to different many-objective optimization problems (Jain and Deb 2014).

5 Experimental Setup

To evaluate the ability of our entropy-based software remodularization to generate good modularization solutions, we conducted a set of experiments on seven open-source software systems. The experimental setup includes 1) a description of the software systems on which the proposed approach is evaluated, 2) the result collection method, 3) the result evaluation criteria, 4) the existing remodularization algorithms, and 5) the statistical tests used.

5.1 Studied Software Projects

We chose seven object-oriented software systems, each characterizing a real-world software system with diverse complexity in terms of the number of connections, number of classes, number of modules, and lines of code (LOC). These software systems are open source and written in the Java programming language. The main reasons for considering them are that they are of different sizes and complexities and that they have also been used by previous researchers (Prajapati and Chhabra 2018; Erdemir and Buzluca 2014; Prajapati and Chhabra 2018a; Bavota et al. 2013) to evaluate similar approaches. Table 4 provides a complete description of their characteristics.

Table 4 Characteristics of the used software projects

5.2 Collecting Results

Since search-based software remodularization approaches are stochastic optimizers, they can generate different results for the same software instance from one run to another. For this reason, we collect results from our proposed software remodularization approach by applying it to each test software system in 31 independent runs. In each execution, the many-objective optimization techniques generate a set of non-dominated solutions. To select a single solution that provides the best trade-off among all the considered objective functions, we use the trade-off worthiness metric defined by Rachmawati and Srinivasan (2009).
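The paper uses the trade-off worthiness metric of Rachmawati and Srinivasan (2009) for this selection. As a simplified stand-in (not that metric), the sketch below picks the non-dominated solution closest to the ideal point after min-max normalization of each minimized objective; the front values are illustrative:

```python
def pick_balanced(front):
    """front: list of objective vectors (all objectives minimized).
    Returns the solution nearest the ideal point after normalization."""
    m = len(front[0])
    lo = [min(s[k] for s in front) for k in range(m)]
    hi = [max(s[k] for s in front) for k in range(m)]
    def norm(s):
        # scale each objective to [0, 1]; constant objectives map to 0
        return [(s[k] - lo[k]) / (hi[k] - lo[k]) if hi[k] > lo[k] else 0.0
                for k in range(m)]
    return min(front, key=lambda s: sum(v * v for v in norm(s)))

# The extreme solutions excel on one objective each; the middle one balances both.
front = [[0.0, 1.0], [0.4, 0.4], [1.0, 0.0]]
print(pick_balanced(front))  # → [0.4, 0.4]
```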

5.3 Result Evaluation Criteria

Evaluation of search-based remodularization results is generally performed by two approaches: internal and external assessment. Internal assessment evaluates the quality of the internal characteristics of the modules in the produced remodularization. A number of quality metrics exist to assess a remodularization internally, for example, coupling and cohesion (Cui and Chae 2011), modularization quality (Praditwong et al. 2011), size of clusters (extremity) (Glorie et al. 2009; Erdemir and Buzluca 2014), and the number of clusters (Wang et al. 2010). In this study, we use the size of clusters (extremity).

The aim of external assessment is to find the association between the obtained remodularization and the authoritative remodularization suggested by a human expert (e.g., an original developer); this approach is also known as authoritativeness. The produced remodularization solution should resemble the authoritative remodularization as closely as possible (Wu et al. 2005). Different measures may be used to compute authoritativeness, such as MoJo and MoJoFM (Wu et al. 2005; Tzerpos and Holt 1999; Andritsos and Tzerpos 2005; Bittencourt and Guerrero 2009) and precision and recall (Sartipi and Kontogiannis 2003). In this study, we used MoJoFM, the most widely used measure of authoritativeness. Brief descriptions of these metrics are given in the following sub-sections.

5.3.1 Non-Extreme Distribution (NED)

For a well-modularized software system, the size of any individual module should be neither extremely small nor extremely large (Erdemir and Buzluca 2014). Hence, an automatic software remodularization approach should generate a modularization solution with a balanced distribution of classes across modules. To evaluate the extremity of module sizes, Wu et al. (2005) defined the non-extreme distribution (NED) as follows:

$$ NED=\frac{\sum_{i=1,\; M_i\ \text{is not extreme}}^{k}\mid M_i\mid }{n},\quad M_i\ \text{is not extreme if}\ 5<\mid M_i\mid <1.5\times \mid MA_{\max}\mid $$
(12)

where k is the number of modules, n is the total number of classes, |Mi| is the size of module i, and |MAmax| is the size of the largest module.
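Eq. (12) can be sketched directly from module sizes. The following minimal implementation uses the extremity thresholds stated above (5 and 1.5 × the size of the largest module); the example sizes are illustrative:

```python
def ned(module_sizes):
    """Fraction of classes that reside in non-extreme modules (Eq. 12)."""
    n = sum(module_sizes)            # total number of classes
    largest = max(module_sizes)      # |MA_max|
    non_extreme = sum(s for s in module_sizes
                      if 5 < s < 1.5 * largest)
    return non_extreme / n

# One tiny module (size 3) counts as extreme, so its classes are excluded:
print(ned([3, 10, 12, 15]))  # → 0.925  (37 of 40 classes)
```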

5.3.2 Authoritativeness

Authoritativeness is a measure of the similarity between the remodularization solution generated by an automatic remodularization method and the remodularization solution suggested by experts (Erdemir and Buzluca 2014). Different measures may be used to compute it; in this study, we use the MoJoFM measure. Let M and Ma be two remodularization solutions, and let mno(M, Ma) be the minimum number of move and join operations needed to transform remodularization M into remodularization Ma, where a join operation combines two modules into a single module and a move operation moves a component from one module to another. MoJoFM(M, Ma) is defined as follows:

$$ MoJoFM\left(M, Ma\right)=100-\left(\frac{mno\left(M, Ma\right)}{\max \left( mno\left(\forall M, Ma\right)\right)}\times 100\right) $$
(13)

where MoJoFM(M, Ma) represents authoritativeness, M and Ma represent the modular structure generated by the approach and the authoritative modular structure suggested by the experts, respectively, and max(mno(∀M, Ma)) is the maximum possible distance of any remodularization M from the remodularization Ma.
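Computing mno(M, Ma) itself requires the MoJo distance algorithm (Tzerpos and Holt 1999); the sketch below takes the distances as precomputed inputs and only illustrates how Eq. (13) normalizes them onto a 0–100 scale, where 100 means the solutions are identical:

```python
def mojofm(mno_m_ma, max_mno):
    """Eq. (13): mno_m_ma is mno(M, Ma); max_mno is the worst-case
    distance of any remodularization from Ma."""
    return 100 - (mno_m_ma / max_mno) * 100

print(mojofm(0, 20))   # identical remodularizations → 100.0
print(mojofm(20, 20))  # worst possible remodularization → 0.0
```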

In practice, it is very difficult for academic researchers to reach the original developers who were engaged in developing the software systems being evaluated. To overcome this problem, we use the same method to obtain the authoritative remodularization as used by previous researchers (Erdemir and Buzluca 2014; Wu et al. 2005; Corazza et al. 2016). The process of obtaining the authoritative remodularization for a test software system is summarized as follows: 1) find the packages and the classes associated with each package, 2) validate the existing package organization against the comments written in the source code classes of these software systems, 3) merge a package with its closest package if it contains five or fewer classes, and 4) finally, develop the authoritative remodularization from the preliminary authoritative remodularization by involving external expert developers.

5.3.3 Stability

An automatic software remodularization approach should not generate a dramatically different modular structure for a similar version of the software with minor changes. Stability is formulated as Stability(Mn) = MoJoFM(Mn, Mn−1), where Mn and Mn−1 are the software remodularization results generated for two consecutive versions of a software system.

5.4 Rival Remodularization Approaches

Most of the existing software remodularization approaches use the modularization quality (MQ) metric (Mancoridis et al. 1998) as the remodularization objective function to drive the optimization process. In multi-objective formulations of the software remodularization problem, MQ is used as the core objective along with other supportive objective functions. Praditwong et al. (2011) redesigned MQ and used it as the core objective function along with four other supportive objective functions (e.g., inter-cluster coupling, intra-cluster coupling, number of clusters, etc.) to remodularize software systems. Similarly, other authors (Barros 2012; Prajapati and Chhabra 2017, 2017a; Kumari et al. 2013; Prajapati and Chhabra 2018, 2018a) used MQ as the core objective function in their multi-objective formulations of the software remodularization problem. MQ is defined as follows:

$$ MQ=\sum \limits_{k=1}^nM{F}_k $$
(14)

where n is the number of packages/modules and MFk is the modularization factor. The modularization factor MFk for module k is defined as follows:

$$ M{F}_k=\begin{cases}0,& if\kern0.24em i=0\\ \frac{i}{i+\frac{1}{2}j},& if\kern0.24em i>0\end{cases} $$
(15)

where i is the coupling among the classes within package k and j is the coupling between the classes of package k and the classes in the rest of the packages of the system. The coupling between classes can be determined from different types of information (e.g., structural information, lexical information, or combined structural and lexical information). To make a fair comparison between our entropy-based software remodularization (entropy as the core objective) and the existing MQ-based software remodularization (MQ as the core objective), we use the same supportive objective functions (i.e., minimize inter-module class change coupling, maximize intra-module class change coupling, minimize module count index, minimize module size index). Table 5 summarizes the objective functions used in the proposed approach and the existing search-based multi-objective remodularization approaches.
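Eqs. (14)–(15) can be sketched in a few lines. The (i, j) coupling pairs below are illustrative only; a cohesive, loosely coupled system (high i, low j per module) scores higher:

```python
def modularization_factor(i, j):
    """Eq. (15): MF_k from intra-coupling i and inter-coupling j of module k."""
    return 0 if i == 0 else i / (i + 0.5 * j)

def mq(modules):
    """Eq. (14): MQ is the sum of MF_k over all modules.
    modules: list of (i, j) coupling pairs, one per module."""
    return sum(modularization_factor(i, j) for i, j in modules)

print(mq([(10, 2), (8, 1)]))  # cohesive, loosely coupled
print(mq([(2, 10), (1, 8)]))  # scattered, tightly coupled (lower MQ)
```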

Table 5 Many-objective remodularization approaches

5.5 Statistical Tests

Meta-heuristic optimization algorithms are stochastic optimizers (i.e., they can generate different results for the same test problem from one run to another; Mkaouer et al. 2015). A result obtained from a single run cannot be used to draw any conclusion about the algorithms. Hence, it is necessary to obtain a set of results for the same problem instance over many runs. In this study, we collected the results by executing each algorithm 31 times on the same problem instance. The samples of 31 solutions were statistically analyzed using the Wilcoxon rank sum test (Arcuri and Briand 2011) with a 95% confidence level (α = 5%).
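For illustration, the rank sum test can be sketched with the standard library alone. This is a minimal version using the normal approximation without tie correction (in practice a statistics package such as `scipy.stats.ranksums` would be used); the two samples are made-up data:

```python
import math

def rank_sum_test(sample_a, sample_b):
    """Wilcoxon rank-sum test, normal approximation, no tie correction.
    Returns (z statistic, two-sided p-value)."""
    combined = sorted((v, i) for i, v in enumerate(sample_a + sample_b))
    ranks = {i: rank for rank, (v, i) in enumerate(combined, start=1)}
    n1, n2 = len(sample_a), len(sample_b)
    w = sum(ranks[i] for i in range(n1))       # rank sum of sample_a
    mu = n1 * (n1 + n2 + 1) / 2                # expected rank sum under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Two clearly separated samples yield p < 0.05 (reject H0 at α = 5%):
z, p = rank_sum_test([60, 62, 65, 66, 70], [50, 51, 52, 53, 54])
print(round(z, 3), round(p, 4))
```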

6 Results and Analysis

In this section, we present the authoritativeness, NED, and stability results achieved by the proposed and existing search-based software remodularization approaches.

6.1 Authoritativeness

Tables 6, 7, 8, and 9 present the authoritativeness results achieved by the proposed entropy-based approach and the existing remodularization approaches (i.e., structural, lexical, and structural + lexical based) on seven software systems for each considered many-objective meta-heuristic algorithm (i.e., NSGA-III, MOEA/D, IBEA, and TAA). The columns of each table labelled Wilcoxon test show the statistical results of the comparison between the proposed approach and the existing remodularization approaches. The symbol '+' denotes a statistically significant difference in favour of the proposed approach, '–' denotes a statistically significant difference in favour of the existing approach, and '≈' denotes no significant difference between the two. In the three-symbol sequence of the Wilcoxon test columns, the first symbol is the result of the statistical test between the structural-based approach and the proposed approach, the second between the lexical-based approach and the proposed approach, and the third between the structural + lexical based approach and the proposed approach. For example, in the sequence "+ + –" for the Weka software system in Table 7, the first two symbols indicate significant differences in favour of the proposed approach over the structural-based and lexical-based approaches, respectively, while the third symbol indicates a significant difference in favour of the structural + lexical based approach.

Table 6 Authoritativeness using NSGA-III
Table 7 Authoritativeness using MOEA/D
Table 8 Authoritativeness using IBEA
Table 9 Authoritativeness using TAA

The authoritativeness results presented in Tables 6, 7, 8, and 9, achieved by the proposed and existing approaches on seven software systems with the many-objective meta-heuristic algorithms (i.e., NSGA-III, MOEA/D, IBEA, and TAA), clearly indicate that the proposed approach outperforms the existing approaches by producing significantly better authoritativeness values in most cases. In particular, the proposed approach significantly outperforms the structural and lexical models on all test software systems. However, in some cases, the authoritativeness of the proposed approach is only competitive with the structural + lexical approach.

Apart from the comparison between the proposed approach and the existing approaches, we have also compared the authoritativeness of each many-objective meta-heuristic algorithm (i.e., NSGA-III, MOEA/D, IBEA, and TAA) for each remodularization approach (structural, lexical, structural + lexical, and entropy-based). These comparative results are shown in Fig. 3, which clearly indicates that the NSGA-III algorithm performs better than MOEA/D, IBEA, and TAA, whose results are competitive with one another.

Fig. 3

The impact of different many-objective meta-heuristic algorithms on remodularization

6.2 Non-Extreme Distribution (NED)

Tables 10, 11, 12, and 13 present the NED results achieved by the proposed entropy-based approach and the existing remodularization approaches (i.e., structural, lexical, and structural + lexical based) on seven software systems for each considered many-objective meta-heuristic algorithm (i.e., NSGA-III, MOEA/D, IBEA, and TAA). The meanings of the symbols used in the Wilcoxon test columns of each table are the same as described in Section 6.1. A NED value of 100% denotes a remodularization solution with no extreme-sized modules. The results presented in Tables 10, 11, 12, and 13 clearly show that the proposed entropy-based approach achieves 100% NED for every many-objective meta-heuristic algorithm on each of the test software systems. The existing remodularization approaches are competitive, producing only slightly lower NED values in some cases. The Wilcoxon test results also show that there is no significant difference between the proposed entropy-based approach and the existing approaches in any case.

Table 10 Non-extreme distribution (NED) using NSGA-III
Table 11 Non-extreme distribution (NED) using MOEA/D
Table 12 Non-extreme distribution (NED) using IBEA
Table 13 Non-extreme distribution (NED) using TAA

6.3 Stability

To assess the stability of the proposed approach, we analyzed 21 successive versions of JFreeChart. The stability results of the proposed entropy-based approach and the existing structural, lexical, and structural + lexical based approaches with the NSGA-III, MOEA/D, IBEA, and TAA algorithms are given in Tables 14, 15, 16, and 17. For each remodularization approach, the bold value indicates the best stability result. The meanings of the symbols used in the Wilcoxon test columns of each table are the same as described in Section 6.1.

Table 14 Stability results of NSGA-III over JFreeChart project
Table 15 Stability results of MOEA/D over JFreeChart project
Table 16 Stability results of IBEA over JFreeChart project
Table 17 Stability results of TAA over JFreeChart project

The stability results presented in Table 14 clearly show that, with the NSGA-III algorithm, the stability values of the structural, lexical, structural + lexical, and entropy-based approaches range between 51.35–76.53%, 68.51–92.75%, 74.65–98.65%, and 81.25–99.35%, respectively. These values show that the entropy-based evaluation model achieves higher stability than the existing approaches. Similarly, the results of Tables 15, 16, and 17 show that the entropy-based approach achieves higher stability than the existing approaches with the MOEA/D, IBEA, and TAA algorithms.

In summary, the results show that the information theoretic similarity measure has a significant impact on the generation of software modularization with better authoritativeness, NED, and stability values compared to other approaches. Therefore, we think our information theoretic many-objective approach can be useful for remodularizing object-oriented software systems.

6.4 Discussion

Even though the information theoretic similarity measure is not a widespread concept in SBSE, we believe it is very useful for remodularizing software systems. Remodularization of a software system using an information theoretic similarity measure has not been explored by many SBSE practitioners: in previous years, the software remodularization process was based on structural and lexical similarity measures, and the proper use of an information theoretic similarity measure for software remodularization is a novel concept, used for the first time in this paper. The usefulness of this concept is supported by the experimental results of our approach, which clearly show the advantages of the information theoretic similarity measure for software remodularization. Further, as there are five objective functions (i.e., more than three) to be optimized simultaneously in our remodularization approach, a many-objective meta-heuristic algorithm is a good choice.

7 Threats to Validity

In this section, we explore the factors that can influence the validity of the results reported in this paper. For software engineering experimentation, the work reported in literature (Wohlin et al. 2000) has divided threats to validity into four categories: conclusion, internal, construct, and external threats.

  • Conclusion threats to validity: These threats concern the relationship between treatment and outcome. The meta-heuristic algorithms use many random operators (e.g., random initial population generation) and may produce different results for the same problem instance on different runs. To mitigate this threat, we executed each algorithm 31 times on each problem instance and statistically analyzed the resulting samples using the Wilcoxon rank sum test with a 95% confidence level (α = 5%).

  • Internal threats to validity: This category of threats considers the effects of experimental design choices, the algorithms' parameter settings, and data collection. The parameter settings of the algorithms are based on similar previous remodularization studies (Mkaouer et al. 2015, 2015a), while for the other many-objective algorithms we used a trial-and-error calibration method.

  • Construct threats to validity: These threats concern the relations between theory and observation. The design of the fitness functions is based on previous and widely used software remodularization works (Prajapati and Chhabra 2017a; Corazza et al. 2016; Parashar and Chhabra 2016). For a proper comparison between two algorithms, we assigned each an equal number of fitness evaluations.

  • External threats to validity: These threats concern the generalization of the results achieved by the proposed approach. The approach has been evaluated on medium to large real-world object-oriented software systems with different complexities in terms of number of connections, number of classes, number of modules, and lines of code. The correctness of the authoritative remodularization might also affect the results. To obtain the authoritative remodularization, we followed the same approach as reported in the literature (Prajapati and Chhabra 2017a; Corazza et al. 2016; Erdemir and Buzluca 2014; Wu et al. 2005).

8 Conclusion and Future Directions

A new many-objective software remodularization approach for object-oriented software systems has been proposed in this paper. The approach uses an information theoretic proximity measure as a new objective function along with four other objective measures (i.e., inter-module class change coupling, intra-module class change coupling, module count index, and module size index). In addition, our approach utilizes different aspects of structural and lexical information with their relative weights. Information present in the change history of the software has also been integrated into the approach for identifying consistent modularization. The proposed approach has been compared with other variants of remodularization evaluation models on seven test software systems using different search-based meta-heuristics (NSGA-III, MOEA/D, IBEA, and TAA). The obtained results have been assessed in terms of authoritativeness, non-extreme distribution, and stability. The results of the evaluation clearly suggest that the proposed approach can be a good alternative for improving the quality of software systems whose quality is not up to the mark. As part of future work, we plan to perform an empirical study on more problem instances with different configuration settings.