
1 Introduction

1.1 General Background

As software systems grow, maintenance becomes challenging for incoming engineers unfamiliar with the original code, often leading to significant overhauls or the discontinuation of extensive legacy systems. To ensure sustainable management, creating modular subsystems is crucial. Rather than attempting to identify such modules directly in the source code, a more practical approach is to represent dependencies in graph form. Mancoridis et al. define the software modularisation problem as arising from the exponential complexity of interconnected software module relationships within evolving systems. It is often approached as a heuristic-search-based clustering problem that seeks optimal representations by clustering subsystems according to the strength of their relationships [26].

This escalating complexity is frequently addressed through evolutionary computation and is evident in a range of software implementations, including both single-objective [3, 16] and multi-objective [18, 27] approaches. Pioneering methodologies aim to enhance the structure of software systems. Optimisation of subsystems extends to diverse attributes, such as classes, methods, and variables. Methodological advancements now consider type-based dependence analysis [22], multi-pattern clustering [8], and effort estimation [32]. These efforts explore pre-processing and post-processing improvements alongside optimisation strategies.

The modularisation of software, mainly through heuristic search and evolutionary computation methodologies, extensively incorporates graph theory and data clustering. Academic works commonly use graph representations of software systems, employing data clustering for nodes and implementing algorithms to assess cluster quality [2, 19, 37]. Although constructing a graph does not in itself enhance a software engineer's understanding of the architecture, language-independent graphs can focus on specific relationships or on entire systems [10, 30, 35]. Clustering arrangements can be portrayed through various methods, such as a one-dimensional vector, a two-dimensional cluster-based structure, or a one-dimensional constrained representation known as a restricted growth function, which, despite its constraints, exhibits distinctive properties [7]. Measurement of a clustering arrangement typically addresses cohesion and coupling, striving for maximal cohesion within clusters and minimal coupling between clusters, fostering the creation of clearly defined groups [2].
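
To make these representations concrete, the short sketch below (our own illustration, not taken from [7]) encodes the same partition of six nodes as a one-dimensional label vector, as a two-dimensional list-of-lists, and as a restricted growth function; all names are ours.

```python
# The same clustering of nodes 0..5 into three clusters, expressed three ways.

# One-dimensional vector: position i holds an (arbitrary) cluster label for node i.
vector = [2, 2, 0, 0, 1, 0]

# Two-dimensional list-of-lists: each inner list is one cluster of node indices.
list_of_lists = [[0, 1], [2, 3, 5], [4]]

# Restricted growth function: a label vector constrained so that labels appear
# in order of first use, i.e. rgf[0] == 0 and rgf[i] <= max(rgf[0..i-1]) + 1.
rgf = [0, 0, 1, 1, 2, 1]

def is_restricted_growth(labels):
    """Check the restricted growth property for a label sequence."""
    highest = -1
    for label in labels:
        if label > highest + 1:
            return False
        highest = max(highest, label)
    return True

assert is_restricted_growth(rgf) and not is_restricted_growth(vector)
```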

1.2 Motivation

A recent study conducted by our research group explores varied representations for clustering arrangements and different starting points, providing insights into the search space of software systems [25]. The study highlights the list-of-lists representation as the most robust, emphasising its significance in problem-solving. Notably, the paper suggests that the choice of starting point is inconsequential, as the various representations, especially the one- and two-dimensional list-based ones, converge towards similar outcomes in terms of final fitness.

This paper is motivated by that observation of convergence across starting points. Our primary objective is to determine whether alternative starting positions can replicate or potentially improve previous findings. If diverse starting positions tend to converge toward a similar region in the search space, we aim to uncover the reasons behind this convergence. Is there a basin of attraction leading to a potential global optimum solution, or do these methods unintentionally get stuck in closely adjacent local optima?

In recent years, a discernible research gap has emerged in clustering arrangement representation and software graph representation. Additionally, to date, no publications in the field of software modularisation are specifically dedicated to the choice of starting points. While we recognise that meta-heuristics, such as Iterated Local Search [21], can generate seeded starting points based on previous experimental iterations, our reference pertains to the primary initial search, distinct from subsequent iterations.

Building upon this motivation, we aim to explore innovative approaches for generating starting points that surpass the performance of previous experiments. If our findings suggest the existence of a basin of attraction, our goal is to devise more efficient methods to reach this point faster than conventional approaches. However, even if the evidence points in a different direction, our overarching objective is to develop a more efficient method for navigating and exploring the search space.

This paper focuses on enhancing software system clustering by integrating graph partitioning techniques with seeded search methods applied to graph-based representations. Situated within Search-Based Software Engineering, our research particularly centres on software modularisation. To achieve our goal, we begin with a domain background, introduce innovative concepts, outline our experimental procedure, and present our results.

2 Related Work

2.1 Bunch and Munch

Exploring software modularisation can be achieved using tools such as Bunch [23] and Munch [4, 5]. Bunch, developed by Mancoridis et al., combines Steepest Ascent Hill Climbing (SAHC) and Genetic Algorithms for improved clustering arrangements [23, 24]. Arzoky et al.'s Munch, on the other hand, employs Random Mutation Hill Climbing (RMHC) for enhanced performance and ease of implementation [4]. The two tools use different fitness functions: Bunch utilises the MQ fitness function, while Munch employs EVM and EVMD [4, 24, 34]. Despite employing different measurement strategies, MQ, EVM and EVMD yield similar clustering results [17]. However, the exhaustive nature of Bunch may hinder performance when runtime is a critical consideration.

2.2 Starting Points and Search Space

In the context of a heuristic-search-based clustering problem, the quest for optimal solutions necessitates delving into the search space, which comprises all conceivable arrangements of a clustering configuration. This exploration entails generating an initial clustering arrangement known as the starting point. Subsequently, through mutation (searching), this arrangement is modified and compared to the graph representation of the software. The goal is to enhance the clustering of nodes that demonstrate robust relationships. Before embarking on a search, a crucial decision lies in determining the optimal starting point for seeking an improved clustering arrangement.

Several starting points are available when searching for a local optimum, which, in our context, represents the nearest approximation to the optimal clustering arrangement, one that maximises the cohesion of each cluster within the search space. We provide three illustrative examples: all nodes can be clustered individually for maximum coupling (Fig. 1), placed together in a single cluster for maximum cohesion (Fig. 2), or assigned randomly (Fig. 3); a brief sketch of these three constructions follows the figures.

Fig. 1. Independent

Fig. 2. All In One

Fig. 3. Random
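
As an illustration (ours, using the list-of-lists form introduced later in Sect. 4.1), the three starting points of Figs. 1-3 can be constructed as follows for an MDG with n nodes; the naive random scheme shown here is only a placeholder and differs from the uniform construction described in Sect. 4.3.

```python
import random

def independent_start(n):
    """Fig. 1: every node in its own cluster (maximum coupling between clusters)."""
    return [[node] for node in range(n)]

def all_in_one_start(n):
    """Fig. 2: every node in a single cluster (maximum cohesion within the cluster)."""
    return [list(range(n))]

def random_start(n, rng=random):
    """Fig. 3: a naive random assignment of nodes to a random number of clusters."""
    k = rng.randint(1, n)
    clusters = [[] for _ in range(k)]
    for node in range(n):
        clusters[rng.randrange(k)].append(node)
    return [cluster for cluster in clusters if cluster]  # drop empty clusters

print(independent_start(4))  # [[0], [1], [2], [3]]
print(all_in_one_start(4))   # [[0, 1, 2, 3]]
print(random_start(4))       # e.g. [[0, 3], [1, 2]]
```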

3 Research Questions

We aim to address research questions regarding our endeavour to discover improved starting points for software modularisation and to uncover more effective strategies for achieving optimal outcomes. In this paper, we outline the following research questions that we intend to investigate:

  1. What is the performance difference between graph-partitioned clustering arrangements and randomly generated ones when applied to large and small software systems?

     (a) When using hill climbing with various initial clustering arrangements on the same software system, do the solutions converge to similar outcomes or do disparities persist?

     (b) How do the runtimes of searches using graph partition and randomly generated clustering arrangements vary, and are there trade-offs between runtime and solution quality?

  2. Is there a significant disparity between the Weighted Kappa values of the final clustering arrangements and a gold standard, and what is the nature of this comparison?

Initially, we aim to evaluate whether graph-based initial clustering arrangements result in enhanced outcomes compared to randomly generated configurations when run through Munch. We aim to contrast the clustering patterns derived from graph partitioning with those generated randomly across software systems of varying sizes. Alongside assessing our fitness function, we analyse the documented improvements at the final iteration. This entails identifying the convergence point and scrutinising the runtime of the search, which encompasses both the initialisation of the starting configuration and the subsequent search process. With information about the final fitness value and the iteration at which it is achieved, we aim to discern the genuine impact of the initial clustering configurations on the search dynamics. We aim to determine whether specific clustering arrangements contribute to a faster convergence, enabling us to refine our search methodology so that the convergence point is reached earlier and the risk of wasted time is mitigated.

In addition to assessing the effectiveness of our initial clustering configurations based on fitness, convergence, and runtime, we also evaluate the final clustering arrangements against gold standards using Weighted Kappa (WK) [1]. WK serves as a measure of agreement between two clustering arrangements, explicitly focusing on modularisation. As the WK value increases, the level of agreement between the two solutions also rises. A WK value of 1 signifies identical clustering arrangements, while 0 indicates agreement no better than chance. A WK value of 0.5 or higher indicates a robust structural similarity between the two clustering configurations. We opt for WK over other methods, such as Adjusted Rand [29], due to its ease of implementation, longstanding presence in the field, and well-established interpretability/quality scale. The authors also note that WK and Adjusted Rand are identical.
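
For illustration only, and not the exact WK formulation of [1], the sketch below computes a chance-corrected pairwise agreement in the Cohen-kappa style: every pair of classes either is or is not co-clustered in each arrangement, and the observed agreement is corrected for the agreement expected by chance. All function and variable names are ours.

```python
from itertools import combinations

def pairwise_kappa(arrangement_a, arrangement_b):
    """Chance-corrected pairwise agreement between two list-of-lists clusterings.
    1.0 means identical partitions; 0.0 means agreement no better than chance."""
    label_a, label_b = {}, {}
    for cid, cluster in enumerate(arrangement_a):
        for node in cluster:
            label_a[node] = cid
    for cid, cluster in enumerate(arrangement_b):
        for node in cluster:
            label_b[node] = cid

    together_both = together_a_only = together_b_only = apart_both = 0
    for x, y in combinations(sorted(label_a), 2):
        same_a = label_a[x] == label_a[y]
        same_b = label_b[x] == label_b[y]
        if same_a and same_b:
            together_both += 1
        elif same_a:
            together_a_only += 1
        elif same_b:
            together_b_only += 1
        else:
            apart_both += 1

    total = together_both + together_a_only + together_b_only + apart_both
    observed = (together_both + apart_both) / total
    # Chance agreement from the marginal "together"/"apart" rates of each arrangement.
    p_a = (together_both + together_a_only) / total
    p_b = (together_both + together_b_only) / total
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)

print(pairwise_kappa([[0, 1], [2, 3]], [[0, 1], [2, 3]]))  # 1.0
```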

4 Methods

Our focus now shifts towards the methodologies aligned with our exploration of optimal starting points for software modularisation. We present our selected search method and detail our implementation of graph partitioning designed to yield appropriate starting points.

4.1 Munch

As previously indicated, software modularisation is characterised as a heuristic-search-based clustering problem. Therefore, our initial consideration lies in devising a strategy for heuristic search before discussing our implementation of graph partitioning for generating starting points. We adopt a reverse-engineered adaptation of Arzoky et al.'s Munch to address this [5]. This adaptation has been enhanced to give us the flexibility to choose both the starting point of our exploration and the nature of our search strategy. We will now examine the various components that constitute Munch.

First and foremost, Munch uses Module Dependency Graphs (MDG) as our graph-based representation of software systems. As defined by Mancoridis et al., MDGs illustrate subsystem connections to gauge relationships between components. In the context of our current research, we designate the nodes of the MDG as software classes, and the edges represent interconnected relationships. MDGs prove versatile, capable of describing software structure over time or facilitating the segmentation of extensive software systems for enhanced comprehension. Let the MDG M be an n by n symmetric binary matrix, where a 1 at row x and column y (\(M_{xy}\)) indicates a relationship between software components x and y, and 0 indicates that there is no relationship. To avoid confusion throughout this paper, MDG and graph are considered synonymous.

$$ M_{xy} = {\left\{ \begin{array}{ll} 1 &{} \text {if a relationship exists between } x \text { and } y \\ 0 &{} \text {otherwise,} \end{array}\right. } $$
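
A minimal sketch of how such an MDG can be held in memory, assuming a symmetric binary adjacency matrix built from pairwise class relationships; the pair list below is invented purely for illustration.

```python
import numpy as np

# Hypothetical pairwise relationships between five classes, e.g. as exported
# by a dependency-extraction tool. Each pair (x, y) means x and y are related.
relationships = [(0, 1), (0, 2), (1, 2), (3, 4)]

n = 5
mdg = np.zeros((n, n), dtype=int)
for x, y in relationships:
    mdg[x, y] = 1   # M_xy = 1 when a relationship exists ...
    mdg[y, x] = 1   # ... and the matrix is kept symmetric.

assert (mdg == mdg.T).all()
```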

For Munch, we adopt a list-of-lists cluster representation based on its ease of implementation. A list-of-lists clustering arrangement (C) is defined as a list (\([C_1, ..., C_k]\)), with each subset list/cluster (\({C_i}\)) containing elements drawn from 1, 2, ..., n. These subsets must be non-empty (\({C_i} \ne \emptyset \)), and they should not share any common items (\({C_i} \cap {C_j} = \emptyset \)) for different subsets. Effective optimisation problem-solving requires consideration of the search space, exploration strategy, and fitness function. Equation 1 illustrates the number of possible ways to arrange n elements across \(k\) clusters, summed over all \(k\), where \(1 \le k \le n\). We opt for lists over sets in the implementation because they are simpler to implement and carry lower computational complexity, particularly given that sets are not indexed. Since each cluster and cluster element requires indexing, the search space aligns with Eq. 1, deviating from the set-based count characterised by Bell(n).

$$\begin{aligned} \sum _{k=1}^{n} \left( \frac{n!}{k! \cdot (n - k)!} \cdot k! \right) \end{aligned}$$
(1)
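
To give a feel for the sizes involved, the snippet below evaluates Eq. 1 for small n and contrasts it with the Bell number counting unordered set partitions; the implementation is ours.

```python
from math import comb, factorial

def eq1(n):
    """Equation 1: sum over k of C(n, k) * k!, the indexed-list count used here."""
    return sum(comb(n, k) * factorial(k) for k in range(1, n + 1))

def bell(n):
    """Bell number via the Bell triangle, counting unordered set partitions."""
    if n == 0:
        return 1
    row = [1]
    for _ in range(n - 1):
        nxt = [row[-1]]
        for value in row:
            nxt.append(nxt[-1] + value)
        row = nxt
    return row[-1]

for n in (3, 5, 8):
    print(n, eq1(n), bell(n))   # e.g. n=3 -> 15 vs. 5
```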

Before delving into the search strategy of Munch, it is essential to define the fitness function. The primary goal of a search strategy is to uncover a clustering arrangement that most effectively aligns with the ideal modular structure of the software system. The assessment entails analysing the subsets \([C_1, ..., C_k]\), whose elements, drawn from 1, 2, ..., n, correspond to nodes whose relationships are recorded in the MDG. To avoid confusion, we will refer to these subsets as clusters.

For our replication of Munch, it is unsurprising that we introduce EVM as our selected fitness function. We opt for EVM over Bunch's MQ due to its demonstrated robustness against noise and suitability for real-world software systems, as substantiated by research [17]. When provided with an arrangement C and an MDG, EVM evaluates and scores each cluster by considering the number of intra-cluster relationships in the MDG. To prevent any potential confusion, we define EVM as the aggregate of individual cluster scores, denoted as SubEVM (refer to Eqs. 2 and 3). EVM aims to maximise the score of relationships within a specified clustering arrangement. However, a potential drawback exists, as EVM may mistakenly assign high scores to clustering arrangements with high cohesion, and even minor adjustments to a solution can significantly enhance its fitness.

$$\begin{aligned} \textit{EVM}(\textit{C}, \textit{MDG}) = \sum _{i=1}^{k} \textit{SubEVM}(C_{i},\textit{MDG}) \end{aligned}$$
(2)
$$\begin{aligned} \textit{SubEVM}(C_{i}, \textit{MDG}) = \sum _{a=1}^{|C_{i}|-1} \sum _{b=a+1}^{|C_{i}|}(2M(C_{ia}, C_{ib}) - 1) \end{aligned}$$
(3)
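
Equations 2 and 3 translate directly into code. The sketch below assumes the list-of-lists arrangement and the NumPy MDG sketched earlier, and uses zero-based indexing rather than the one-based notation of the equations.

```python
import numpy as np

def sub_evm(cluster, mdg):
    """Eq. 3: score one cluster, +1 for each related pair, -1 for each unrelated pair."""
    score = 0
    for a in range(len(cluster) - 1):
        for b in range(a + 1, len(cluster)):
            score += 2 * mdg[cluster[a], cluster[b]] - 1
    return score

def evm(arrangement, mdg):
    """Eq. 2: EVM is the sum of SubEVM over all clusters in the arrangement."""
    return sum(sub_evm(cluster, mdg) for cluster in arrangement)

# Tiny worked example on the five-class MDG sketched in Sect. 4.1.
mdg = np.zeros((5, 5), dtype=int)
for x, y in [(0, 1), (0, 2), (1, 2), (3, 4)]:
    mdg[x, y] = mdg[y, x] = 1

print(evm([[0, 1, 2], [3, 4]], mdg))   # 3 + 1 = 4
print(evm([[0, 1, 2, 3, 4]], mdg))     # 4 related pairs - 6 unrelated pairs = -2
```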

To enhance the efficiency of Munch, we incorporate Arzoky et al.’s EVMD. This method generates a score aligning with EVM by integrating past EVM outcomes and determining the new result based on the classes designated for exchange. It demonstrates computational efficiency by computing the new fitness before implementing any modifications. Throughout this paper, we choose to utilise EVM as a collective term, encompassing both EVM and EVMD, to prevent potential confusion in future discussions about EVM.

Concerning the mentioned modifications, the inclusion of EVMD enables the execution of “Try/Do Moves.” This variant of Small-Change, involving the random mutation of clustering arrangements, reduces computational overhead by initially testing the result of a small change (Try Move) before actual implementation (Do Move). To effectively utilise EVMD, the small-change process is limited to two elements simultaneously.
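
We do not reproduce the EVMD formulation of [4] here; the following sketch merely shows, consistently with the EVM definition above, how a "Try Move" can price a single-element relocation before it is committed, so that a rejected move never requires rebuilding the arrangement.

```python
def try_move_delta(node, src, dst, mdg):
    """Fitness change of moving `node` from cluster `src` to cluster `dst`,
    derived from Eq. 3: only pairs involving `node` change."""
    lost = sum(2 * mdg[node, other] - 1 for other in src if other != node)
    gained = sum(2 * mdg[node, other] - 1 for other in dst)
    return gained - lost

def do_move(node, src, dst):
    """Commit the move tested above (a 'Do Move')."""
    src.remove(node)
    dst.append(node)
```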

Finally, our focus shifts to the heuristic search. As mentioned earlier, Arzoky et al.'s Munch primarily employs RMHC as its heuristic search method. Despite implementing the ability to alter the heuristic search in our Munch, we opt to persist with RMHC. This choice is motivated by its reliability, ease of implementation, and superior performance compared to more exhaustive hill-climbing heuristics, such as SAHC. Below, we present Algorithm 1, elucidating how Munch searches for enhanced clustering arrangements. For practical reasons, we employ EVM in the pseudocode example, even though we leverage Arzoky et al.'s EVMD fitness function to enhance performance:

Algorithm 1. Munch
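
Algorithm 1 itself is not reproduced here; below is a compact sketch (ours) of an RMHC loop in the spirit of the algorithm, written against the EVM helper sketched above. The mutation shown, relocating one random node, is a simplification of Munch's Small-Change.

```python
import copy
import random

def rmhc_munch(start, mdg, iterations, rng=random):
    """Random Mutation Hill Climbing: keep a mutated arrangement only when its
    EVM fitness is no worse than the current one."""
    current = copy.deepcopy(start)
    current_fitness = evm(current, mdg)
    for _ in range(iterations):
        candidate = copy.deepcopy(current)
        # Simplified Small-Change: move one random node into a random cluster,
        # possibly a brand-new one, then drop any cluster left empty.
        source = rng.choice(candidate)
        node = rng.choice(source)
        source.remove(node)
        destination = rng.randrange(len(candidate) + 1)
        if destination == len(candidate):
            candidate.append([node])
        else:
            candidate[destination].append(node)
        candidate = [cluster for cluster in candidate if cluster]
        candidate_fitness = evm(candidate, mdg)
        if candidate_fitness >= current_fitness:  # accept ties to drift across plateaus
            current, current_fitness = candidate, candidate_fitness
    return current, current_fitness
```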

4.2 Graph Partitioning

So far, we have established the importance of graphs and clustering arrangements with regard to software modularisation. Now, we focus on using the structure of these graphs to discover new clustering arrangement starting points. Specifically, our focus shifts to the Fiedler Vector [12]. This vector is associated with the second smallest eigenvalue, the Fiedler Eigenvalue, of a Laplacian Matrix [13]. Denoted as \(L_{n \times n}\), a Laplacian Matrix is defined as \(L = D - A\), where \(D\) represents the degree matrix of \(A\), the adjacency matrix representing the connections between nodes [9]. In this context, \(A \equiv \textit{MDG}\). The Fiedler Vector is distinctive in its capability to enable a nearly balanced binary split of any given graph. With this characteristic in mind, we have developed a tool that generates starting points through the recursive decomposition of graphs until no more Fiedler Vectors can be produced.
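
A sketch of the spectral step under the definitions above, assuming a NumPy adjacency matrix: build \(L = D - A\), take the eigenvector of the second-smallest eigenvalue, and split the nodes by the sign of their entries.

```python
import numpy as np

def fiedler_split(adjacency):
    """Split a graph's nodes into two groups using the Fiedler vector of L = D - A.
    Returns two lists of node indices (either may be empty for degenerate graphs)."""
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)   # symmetric, so eigh
    fiedler_vector = eigenvectors[:, np.argsort(eigenvalues)[1]]
    left = [i for i, v in enumerate(fiedler_vector) if v < 0]
    right = [i for i, v in enumerate(fiedler_vector) if v >= 0]
    return left, right

# Two triangles joined by a single edge split cleanly into the two triangles.
a = np.zeros((6, 6), dtype=int)
for x, y in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    a[x, y] = a[y, x] = 1
print(fiedler_split(a))   # e.g. ([0, 1, 2], [3, 4, 5]) or the reverse
```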

We generate a tree structure to facilitate the recursive decomposition of input graphs. The root of the tree is our input software graph and clustering arrangement. The clustering arrangement must begin with all nodes placed in a single cluster. This initial cluster will be subsequently split alongside the graph, ultimately leading to our final clustering arrangement, representing a fully decomposed software graph.

Leveraging our understanding of the Fiedler Vector, at each tree node we identify the Fiedler eigenvalue of the node's associated graph, deduce the corresponding eigenvector, and establish a well-balanced, binary-split graph partition. Simultaneously, we split the associated cluster with each partition, ensuring a one-to-one relationship between the subgraph's nodes and the associated cluster with respect to the root MDG. This approach allows us to maintain traceability as we proceed with the decomposition. The new branches that emerge from the root node are reintroduced into a recursive function that continues to iterate until it identifies all possible partitions.
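
To illustrate the recursion, the sketch below (ours) applies fiedler_split repeatedly, stopping a branch when fewer than two nodes remain or the split degenerates; leaves are kept in terms of root-MDG node indices for traceability.

```python
import numpy as np

def decompose(adjacency, nodes=None):
    """Recursively bisect the (sub)graph over `nodes` with the Fiedler vector.
    A leaf is a list of root-MDG node indices; an internal node is a pair of subtrees."""
    if nodes is None:
        nodes = list(range(len(adjacency)))
    if len(nodes) < 2:
        return nodes                                   # nothing left to split
    subgraph = adjacency[np.ix_(nodes, nodes)]         # induced subgraph, local indices
    left_local, right_local = fiedler_split(subgraph)
    if not left_local or not right_local:
        return nodes                                   # degenerate split: stop this branch
    left = [nodes[i] for i in left_local]              # translate back to root indices
    right = [nodes[i] for i in right_local]
    return (decompose(adjacency, left), decompose(adjacency, right))

# Reusing the two-triangle adjacency matrix `a` from the previous sketch.
print(decompose(a))  # nested tuples whose leaf clusters together cover nodes 0..5
```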

4.3 Starting Points

After generating a tree, we have two starting point approaches. Algorithm 2 illustrates the first method, which creates a "Leaf" arrangement. We gather all leaf nodes from the tree, identified by their lack of children. Subsequently, we arrange these leaf nodes in ascending order based on SubEVM (see Eq. 3) and then incorporate only the nodes not yet placed in a cluster into our clustering arrangement. We organise all leaf nodes in ascending order to prevent branches that become disconnected at different depths from leading to duplicate values. In an ideal scenario, all leaf nodes, regardless of depth, should be unique, and therefore, we incorporate this logic for peace of mind.

Algorithm 2. BuildLeaf
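
The listing of Algorithm 2 is not reproduced here; the sketch below is our reading of the description, assuming the nested-tuple tree produced by the decompose sketch above and the sub_evm helper from Sect. 4.1.

```python
def collect_leaves(tree):
    """Flatten the nested-tuple decomposition tree into its leaf clusters."""
    if isinstance(tree, tuple):
        return collect_leaves(tree[0]) + collect_leaves(tree[1])
    return [tree]

def build_leaf(tree, mdg):
    """'Leaf' arrangement: leaf clusters sorted by SubEVM (ascending), each node placed once."""
    arrangement, placed = [], set()
    for leaf in sorted(collect_leaves(tree), key=lambda cluster: sub_evm(cluster, mdg)):
        fresh = [node for node in leaf if node not in placed]
        if fresh:
            arrangement.append(fresh)
            placed.update(fresh)
    return arrangement
```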

Algorithm 3 exemplifies our alternative approach to constructing a clustering arrangement. In this method, we recursively traverse the tree, evaluating the cohesion of each node in comparison to its children. This ensures the creation of a clustering arrangement containing all unique values, emphasising the highest possible cohesion within the context of the MDG for a given tree. We refer to this starting point as our “Max” arrangement.

Algorithm 3. BuildMax
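
Again without reproducing the listing, the sketch below is one plausible reading of Algorithm 3: a subtree is kept as a single cluster only when its own SubEVM is at least as high as the combined score of the best arrangements built from its children.

```python
def build_max(tree, mdg):
    """'Max' arrangement: keep a subtree as one cluster only when its own SubEVM
    is at least the combined score of the best arrangements of its children."""
    own_cluster = sorted(node for leaf in collect_leaves(tree) for node in leaf)
    if not isinstance(tree, tuple):
        return [own_cluster]                    # a leaf has no children to compare against
    from_children = build_max(tree[0], mdg) + build_max(tree[1], mdg)
    children_score = sum(sub_evm(cluster, mdg) for cluster in from_children)
    if sub_evm(own_cluster, mdg) >= children_score:
        return [own_cluster]
    return from_children
```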

Apart from Leaf and Max, our modified version of Munch can generate uniformly distributed random clustering arrangements, denoted as "Random." Due to publication constraints, we abstain from delving into the intricacies of this method. In summary, a "uniformly distributed random" arrangement is a clustering setup generated using Bell Numbers, Stirling Numbers of the Second Kind [20, 33, 36], and their interconnected relationships [11].
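
For completeness, one standard way (not necessarily the authors' implementation) of sampling a set partition uniformly at random uses the recurrence underlying the Bell numbers to weigh, for each element in turn, joining an existing cluster against opening a new one:

```python
import random
from functools import lru_cache

@lru_cache(maxsize=None)
def ways(remaining, open_clusters):
    """Number of ways to place `remaining` elements given `open_clusters` clusters
    already exist; ways(n, 0) equals the Bell number B(n)."""
    if remaining == 0:
        return 1
    return open_clusters * ways(remaining - 1, open_clusters) \
        + ways(remaining - 1, open_clusters + 1)

def uniform_random_partition(n, rng=random):
    """Sample a partition of nodes 0..n-1 uniformly over all Bell(n) partitions."""
    clusters = []
    for node in range(n):
        remaining_after = n - node - 1
        k = len(clusters)
        total = ways(remaining_after + 1, k)   # completions from this node onwards
        threshold = rng.randrange(total)
        # Joining any one of the k existing clusters carries weight ways(remaining_after, k);
        # opening a new cluster carries weight ways(remaining_after, k + 1).
        join_weight = ways(remaining_after, k)
        if threshold < k * join_weight:
            clusters[threshold // join_weight].append(node)
        else:
            clusters.append([node])
    return clusters

print(uniform_random_partition(5))
```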

5 Experimental Setup

Before presenting the Munch results of our graph partitioning tool, we need to establish an empirical framework.

5.1 Graph Collection and Pre-processing

First, we must collect software systems. Throughout our research, we developed a specialised tool that extracts open-source software systems using GitHub’s RESTful API [14]. GitHub is our platform of choice for several compelling reasons. With a substantial user base exceeding 94 million developers, a continuously growing number of 52 million open-source repositories, and a cumulative total of 413 million contributions [15], we have access to a wide and diverse range of graphs.

Collecting and forming these graphs is often neglected in academic literature, creating a challenge in determining the authenticity of these systems - whether they are genuinely open-source, artificially generated, or specific to certain industries. Generating MDGs requires understanding the relationships between each class within a given software system. This can be achieved using software metric tools such as SciTools Understand [31], which provide pairwise relationships to build a symmetric graph. After our extractor downloads the desired software system, we manually process each system using SciTools Understand. Future efforts will explore using GitHub’s TreeSitter parsing system [6] to automatically generate MDGs.

In this experiment series, we collect 50 “Small” open-source MDGs with class counts from 100 to 300, chosen based on relevance and high popularity (“stars”) using the GitHub API. Due to storage constraints and the laborious manual MDG creation, we aim to develop an automated MDG generator, contemplating additional storage allocation pending study outcomes. Additionally, we have five “Big” MDGs (class counts: 1000 to 1500) sourced from prior research and industry collaboration, allowing exploration of size and characteristic-based result variations. Refer to Appendix A for a detailed breakdown. The terms “Small 50” and “Big 5” distinguish the two MDG groups in this paper.

5.2 Experiment Setup

Our experiments are described as follows. First, we collect Munch results for each MDG using all starting point combinations and iterations, as outlined below. Secondly, we collect Gold Standard results involving high-iteration/high-fitness outcomes to compare with our initial experiments. Finally, we analyse the results and present our findings concerning our outlined Research Questions.

  1. For each experiment, for every graph (Small 50 and Big 5):

     (a) Select one of three starting points (Leaf, Max, Random)

     (b) Select one of three iterations (10k, 100k, 1m)

     (c) Run Munch

     (d) Document final iteration statistics and associated clustering arrangement

     (e) Repeat Steps a-c 250 times

  2. Repeat Step 1 until all starting points and iterations are explored.

For our gold standards, we generate a Random starting point for each graph and run Munch for 100 million iterations, collecting the same information as in our initial experiments. We repeat the process 250 times to ensure that we compare our initial experiment clustering arrangements to an absolute-best gold standard. Although conducting more iterations would have been preferable, it was impractical due to the extended runtime, taking several days per graph. To streamline experimental runs with our chosen iteration increments, we implemented parallel thread management, allowing multiple instances of Munch to run concurrently while optimising CPU and memory usage.
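
The harness itself is not described in detail; as an illustration of the parallel set-up, a few lines of Python's standard library suffice to fan the runs out over worker processes. run_munch is a hypothetical wrapper around a single Munch execution and is not part of the published tool.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def run_munch(graph_name, start, iterations, repeat_id):
    """Hypothetical wrapper: build the chosen starting point for `graph_name`,
    run Munch for `iterations`, and return the final statistics."""
    ...

def run_job(job):
    return run_munch(*job)

graphs = ["small_01", "big_01"]          # placeholder MDG identifiers
starts = ["Leaf", "Max", "Random"]
budgets = [10_000, 100_000, 1_000_000]

if __name__ == "__main__":
    jobs = product(graphs, starts, budgets, range(250))
    with ProcessPoolExecutor() as pool:  # one Munch instance per worker process
        results = list(pool.map(run_job, jobs))
```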

5.3 Data Collection and Analysis

We gather data on the fitness scores of the final clustering configurations, pinpoint the convergence point (the last iteration demonstrating improved fitness), and record the runtime. Furthermore, we document the final clustering configurations in text files. Using these files, we have crafted a bespoke tool to methodically evaluate the WK between the final configurations derived from our starting points and our gold standards.

Table 1. Average Final Fitness using Starting Point by Iterations

We have compiled a dataset of 275,000 files, combining the initial experiment results and gold standards. To enhance the manageability of these results for analysis, we employ MS Access, MS Excel, and Python for data processing. Due to the extensive volume of results and constraints in page space, our principal methodology involves computing averages across all data. Additionally, we streamline our findings by identifying and formatting the optimal results, providing a count of these instances per starting point type, thereby highlighting the suitability of each.

6 Results

Initially, we present RMHC results for each starting point category across the selected iterations. Our research evaluates the performance of diverse starting points in searches across graphs of varying sizes, considering fitness, convergence, and runtime. The goal is to identify similarities or disparities in these aspects based on our predefined research questions. The performance differences among Leaf, Max, and Random when applied to large and small software systems are apparent in Tables 1 and 2. Max consistently demonstrates superior fitness across iterations, as evidenced by the average final fitness values obtained from our three starting points over the specified iterations.

Table 3 details the average final convergence statistics across iterations, indicating the iteration at which the last improvement was observed. The strong resemblance between the average fitness and convergence figures implies a correlation, potentially indicating a basin of attraction towards which all solutions converge. Compared to Random in Tables 1 and 2, Leaf and Max achieve final fitness levels more rapidly across all iterations, notably enhancing results. For smaller graphs, convergence is reached well before the considered iterations. Although final iterations align for smaller graphs, more iterations could increase the likelihood of reaching local optima in larger datasets.

Table 2. Average Final Fitness using Starting Point by Iterations
Table 3. Convergence Statistics

In contrast to average final fitness and convergence, Table 4 highlights cumulative average runtimes presented for each start and subsequent search at various iterations, measured in milliseconds. Notably, these reported runtimes represent summed average runtimes, excluding additional computational overhead related to data I/O. While Random clustering allows faster processing, the overall statistical significance of runtimes is debatable. This prompts consideration of potential trade-offs between runtime efficiency and solution quality.

Table 5 displays WK results, juxtaposing clustering configurations resulting from our initial starting points against gold standards. In multiple statistics and iterations, Leaf and Max consistently surpass Random. The notable closeness between Max results and their corresponding gold standards in smaller graphs contrasts with the generally low agreement observed for larger graphs.

Table 4. Average sum of runtime in milliseconds
Table 5. WK against Gold Standard Statistics

7 Summary of Main Findings

In summary, we aimed to show that graph-partitioning can generate starting points capable of improving the results of software modularisation. We encapsulate the findings to address the research inquiries in the following manner:

  • Max starting point:

    • Attains the highest average fitness over 10k, 100k, and 1m iterations, with a pronounced emphasis on lower iteration counts.

    • Attains the highest count of average convergence across all iterations while sustaining the optimal average final fitness.

    • Attains the maximum average agreement (WK) with gold standards across 10k and 100k for both Small 50 and Big 5 graphs, highlighting noteworthy performance, especially in lower iterations.

  • Leaf starting point:

    • Demonstrates fitness levels equal to or surpassing Random across all iterations, especially in the early stages.

    • Surpasses Random with higher average fitness levels on large datasets at 10k and 100k iterations.

    • Consistently exhibits faster convergence compared to Random.

  • Random starting point:

    • Shows a quicker average total runtime in milliseconds compared to Leaf and Max.

    • Better suited for smaller datasets; however, an improvement over Max and Leaf necessitates higher iterations.

    • Demonstrates greater resemblance to the gold standard than Max in larger systems at 1m iterations.

Distinct fitness variations emerge among Leaf, Max, and Random, with Max consistently outperforming over 10k, 100k, and 1m iterations, notably in Small 50 vs. Big 5 comparisons. Random outperforms Max and Leaf at 1 million iterations for large datasets. However, Max proves to be more suitable for average fitness and faster convergence across iterations and graph sizes. Since there are currently no guidelines for determining the number of iterations based on the size or properties of an MDG, the most prudent approach is to seed with Max before executing Munch. WK comparisons show that Max starting points yield higher average agreements, with potential improvements around 70% and significant opportunities at 90% agreement at 1m iterations. Thorough exploration is vital for understanding the intricacies of software system graphs. Our commitment to accelerating software modularisation drives deeper exploration, with partition-based clustering performing strongly, especially at smaller iteration counts, making it compelling for future software optimisation.

8 Generalisability

This publication focuses on utilising graph partitioning for software modularisation. However, the application of graph partitioning for optimising initial positions can extend to other graph-based optimisation problems, contingent on the chosen fitness function. Although we prioritise EVM for its simplicity, other alternatives like MQ are viable. Our aim is to inspire exploration of graph partitioning for seeded optimisation.

9 Future Work

We plan to integrate our graph-based initial clustering with metaheuristics, specifically incorporating seeded starting points into the history of Iterated Local Search, as part of our ongoing investigation [28]. This initiative seeks to evaluate the potential improvement in the exploration of the search space and overall efficiency. Furthermore, our goals include delving deeper into software systems’ search space, exploring graph structure, convergence prediction, and other avenues for enhancing software modularisation.