1 Introduction

Industry practitioners perceive legacy systems as business-critical and reliable systems that have operated for more than ten to twenty years but are inflexible to adapt to new changes [1]. With respect to legacy system modernization, lack of knowledge [2] and high maintenance cost [3] are the main drivers [1]. Likewise, maintaining and supporting legacy systems is a challenge for software developers because up-to-date documentation is lacking and the complexity of the source code increases as time passes. Software redocumentation is therefore essential to rebuild the documentation of existing resources and give a better understanding of the system, in order to generate documentation for modified programs, record system changes, and create alternative views. The redocumentation process can be carried out at several stages of the software design process, such as source code, design, or requirements. However, according to a survey conducted by Souza et al. [4], source-code-level documentation, which may be classified as technical documentation or functional documentation, is the most relevant for program comprehension and maintenance tasks. As specified by Van Geet et al. [5], the technical document contains features such as a summary of the source code, source code metrics, forms, and method dependencies that are derived from current redocumentation tools [6].

The redocumentation process comprises five major components: source code, parser, system knowledge base, view composer, and software documentation [7]. We concentrated on the parser in this study because it is critical for extracting the information needed to build the documentation. The main limitation of existing parsers is their inability to extract the pieces required for technical documentation from huge amounts of legacy source code. Existing parsers are embedded in software tools that rely on traditional extraction methods, which reduces the effectiveness of extracting pertinent information when the source code is very large. Additionally, extraction should be performed in a timely manner in order to comprehend the program and assist with software maintenance tasks. The limitations of the parser in redocumentation tools in handling huge source code, the explosion of big data, and the evolution of data processing technologies all motivate the investigation of the proposed approach to software redocumentation.

Analysing software systems and extracting source code components requires processing the source code and rebuilding the structure of its information. Many existing studies address different techniques for extracting such data [8, 9]. Nevertheless, those approaches use Extract, Transform, Load (ETL) based on a relational query approach and are unable to handle streaming or near real-time data, or demanding environments that require high availability, low latency, and scalability [10]. Although traditional ETL may prove effective for managing structured, batch-oriented data that is up to date and within scope for corporate insights and decision making [10, 11], it is not suitable for source code, which consists of semi-structured or unstructured data [12].

Thus, we propose an approach that uses the Hadoop Distributed File System (HDFS) and Spark, which provide a cluster computing model on a distributed computing platform, to run the redocumentation process. The proposed approach would help software developers maintain the source code efficiently and effectively by solving the problems of processing, analyzing, and redocumenting massive source code.

This paper is organized as follows: the next section gives the background of software redocumentation and related work, providing an overview of studies involving big data processing. In the methodology section, we describe the proposed approach of distributed data processing for software redocumentation. The Case Study section presents initial work following the proposed approach. We end with a conclusion and outline future work to complete the redocumentation process.

2 Background and Related Work

As defined in the IEEE Standard for Software Maintenance (IEEE 14764-2006, ISO/IEC), after development and delivery, software maintenance follows a process similar to software development to modify the software product in order to correct faults, improve performance, or adapt the product to a modified environment. Software maintenance aims to preserve the software product over time. Christa et al. [13] indicated that legacy systems are a contributor to maintenance cost, effort, and productivity. A significant challenge for software developers is taking over development work when the source code is the only source for understanding the written code and the system documentation is out-of-date, lacking, or incomplete. Software developers spend more time with existing code than creating new software, as emphasized by [13].

Nallusamy et al. [7, 14] describe software redocumentation as an update of software documentation created at the same level of abstraction and in line with the latest state of the code. Additionally, it includes analyzing a static representation of a software system to give a different perspective. Earlier studies examined software redocumentation to support software evolution and software maintenance. Essentially, redocumentation is intended to help developers comprehend programs [15, 16]. The results of a four-year case study [17] demonstrate a significant decrease in maintenance cost and effort due to the redocumentation effort. Methods and approaches to support program understanding have attracted the software engineering community. Developments in this area can be seen in studies on categories of redocumentation approaches, comprising the XML-based approach, incremental redocumentation, model-oriented redocumentation, island grammar, doclike modularized graph, and ontology-based approaches [7], as well as reverse engineering to transform source code into UML diagrams [18,19,20].

Most evaluation studies of format, granularity, and efficiency showed that all approaches were of low quality with respect to granularity and efficiency [14]. A comparison of the incremental and model-oriented approaches showed different usage and purposes [21]. The incremental approach works when outdated or missing documentation is the issue and rebuilds the documentation incrementally, while the model-oriented approach produces models from the existing system and generates documentation based on those models. As a result, the model-oriented approach is best suited for redocumenting a legacy system from its source code. In other words, effective and efficient software maintenance is necessary, and code analysis keeps evolving in terms of techniques and application development.

As software becomes more complex over time, the number of lines of source code increases, particularly for huge legacy systems. Teams have spent several years writing and maintaining these program codes. When new programmers take over maintenance jobs, including change requests, they require an efficient tool to extract the software components from the source code, a task handled by the parser in the software redocumentation process. Current parsers in redocumentation tools may be incapable of handling this massive volume of data, as they were not designed for high-volume data processing [22].

As a result, the endeavor to leverage contemporary technology in the management of big legacy systems continues to evolve, as indicated by Wolfart et al. [23]. The work of Puri et al. [24] emphasized the necessity of artificial intelligence for acquiring information from huge amounts of source code in order to assist software maintainers in performing maintenance activities. Geist et al. [22] used machine learning to analyze source code comments. These studies focus on strategies to circumvent the current difficulties of digesting large volumes of source code in order to comprehend the system and conduct software maintenance activities on time. Additionally, several of these investigations used cutting-edge data processing methods to redocument the source code. Motivated by these investigations, we have begun to investigate data processing using distributed computing frameworks based on commodity cluster designs, such as Hadoop and Apache Spark. This approach is widely utilized in a variety of fields for processing large amounts of data and is constantly evolving [25]. Recent examples include processing and analyzing YouTube big data to determine the success of films and items in comparison to competitors [26] and analyzing airline delays using Spark [27].

Apache Spark is a fast, general-purpose distributed computing system designed to run on a cluster. Spark extends Hadoop's MapReduce model to efficiently support more types of computation, including interactive queries and stream processing [28]. One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also more efficient than MapReduce for complex applications running on disk. The core data units in Spark are called Resilient Distributed Datasets (RDDs): read-only collections of elements that are partitioned across multiple machines and can be rebuilt if partitions are lost. RDDs are distributed, immutable, and fault-tolerant, and they can be produced by performing a series of operations on stable storage data or on other RDDs. RDDs can be stored in memory, on disk, or in a combination of both. Furthermore, RDDs adopt lazy evaluation to complete action tasks, ensuring that compute and memory are used as little as possible. As RDDs are not cached in RAM by default, a persist method is required when data is reused, to avoid re-computation [27]. One advantage of the Spark environment is that it provides Application Programming Interfaces in Scala, Python, and Java. Furthermore, Spark provides the Spark Context on a master node, and work is distributed to worker nodes through a cluster manager. Spark allows each application to configure properties such as the number of executors, the number of cores per executor, and the amount of memory per executor [29].
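As an illustration of these concepts, the following minimal sketch (not taken from this paper's implementation) creates an RDD, chains a lazy transformation, persists the result, and then triggers execution with actions; the input path and the local master setting are assumptions for illustration only.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddBasicsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("rdd-basics-sketch")
      .setMaster("local[*]")                      // assumption: local mode, for illustration only
    val sc = new SparkContext(conf)

    // textFile is lazy: no data is read until an action is invoked.
    val lines = sc.textFile("path/to/source")     // hypothetical input location

    // Transformations are also lazy; they only extend the RDD lineage.
    val classLines = lines.filter(_.contains("class "))

    // Persist so the second action below does not recompute the lineage.
    classLines.persist(StorageLevel.MEMORY_ONLY)

    // Actions trigger distributed execution over the partitions.
    println(s"Total lines: ${lines.count()}")
    println(s"Lines with class declarations: ${classLines.count()}")

    sc.stop()
  }
}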

The Hadoop Distributed File System (HDFS) is a reliable and scalable storage and processing system for large volumes of distributed unstructured data. Apache Spark, in turn, speeds up processing, being up to 100 times faster in memory and 10 times faster on disk. Thus, HDFS is an ideal technology for developing a highly scalable application able to process massive data, compared with traditional methods such as database management systems [25]. To the best of our knowledge at the time of writing, no study has used distributed data processing techniques in the field of software maintenance. This paper presents an approach that uses the HDFS and Spark environment to generate documentation to assist in software maintenance.

3 Proposed Approach

Our main contribution is the development of a parser that extracts source code components using HDFS and Apache Spark via a distributed data processing technique. The parser processes raw source code using a distribution strategy to accelerate the extraction of source code components within the constraints of limited run times.

Fig. 1. Software redocumentation using distribution data technique

Figure 1 illustrates the system architecture for locating and generating source code components via the HDFS and Spark environments. Each component is thoroughly explained as follows:

3.1 Legacy Source Code (LSC)

A software work product (SWP) or artefact consists of source code, configuration files, build scripts, and auxiliary artifacts [7, 30]. However, this study looked only at the LSC during the redocumentation process [4]. Other SWPs were excluded for two main reasons. First, the source code is the most up-to-date and reliable source and offers greater precision than other SWPs. Second, other SWPs are poorly maintained compared to the source code [31]. Legacy systems undergo numerous changes, and these changes involve numerous software maintainers, who spend almost half of their total maintenance time understanding the program's functionality. This problem affects software maintenance efficiency.

3.2 Parser

The parser is used to extract the necessary information from the SWP and store it in the repository. The proposed approach utilizes HDFS for storing, processing, and analyzing the LSC across multiple nodes of commodity hardware. There is a master node (Name Node) and slave nodes (Data Nodes) [27]. The Name Node distributes the work to the Data Nodes at load time, and blocks from the same file are placed on different machines. Each block is replicated across multiple Data Nodes, so the data remains available when a Data Node fails. YARN acts as a distributed resource and container manager for the cluster. Figure 2 shows how the source code is distributed across the network through the Data Nodes. Distributed computing has multiple advantages: it is scalable, makes it easier to share resources, and speeds up computation tasks.

Fig. 2. Block replication of the source code dataset in Hadoop cluster

In our proposed approach, Spark plays an important role as the parser [29], performing ETL on the source code and producing source code components. YARN is a distributed container manager, like Mesos, whereas Spark is a data processing tool, and Spark can run on YARN. We used Scala as the programming language and the Databricks Community Cloud (DCC) platform [32], which includes a notebook (workspace) and a Spark session, to execute the Scala code. Figure 3 illustrates the Spark data processing flow in DCC:

Fig. 3. Spark data processing flow
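A minimal sketch of how such a Spark session could be configured is shown below. In a DCC notebook a SparkSession named spark is already provided, so the explicit builder and the configuration values here are assumptions used only to illustrate the executor properties mentioned above.

import org.apache.spark.sql.SparkSession

// In a Databricks notebook a SparkSession called `spark` already exists;
// building one explicitly is only required outside the notebook environment.
val spark = SparkSession.builder()
  .appName("redocumentation-parser")
  .config("spark.executor.instances", "2")   // assumed values, for illustration only
  .config("spark.executor.cores", "2")
  .config("spark.executor.memory", "2g")
  .getOrCreate()

val sc = spark.sparkContext                  // entry point for the RDD operations used later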

3.3 Knowledge Base

The knowledge base is a repository component used in software redocumentation to store the processed data and produce documentation, as defined by Nallusamy et al. [7]. It provides appropriate semi-structured data on the source code for building documentation content, as well as allowing browsing and searching for relevant content in the documentation [33]. In redocumentation, some current repositories use conventional models such as flat archives, databases, or knowledge bases [34]. These repositories are used to locate and create the different views or documents that software maintainers request. These capabilities save software maintainers time and effort in learning about the application domain. In our proposal, we used Comma Separated Values (CSV) files to store the flat files in the repository and convert them to a data frame during document generation. Data frames enable querying for the specific data to be used in the documentation.

3.4 View Composer

The View Composer is a user interface used to interact with the knowledge base in order to retrieve specific components [7]. In addition to a list of modules, the interface needs to show the dependencies between them. Understanding the dependencies between components is a crucial problem during software maintenance tasks: software maintainers must be aware of the change impact of improvements to a specific piece of source code. Therefore, it is critical to provide search and browsing functions to locate the relevant items.

3.5 Technical Documentation

According to a survey conducted by Souza et al. [4], the most important documents for software maintenance are the source code and its comments. However, outdated comments in the source code can lead to a wrong interpretation of the meaning of the code. Moreover, when experts are not available, new software maintainers may find it difficult to understand the current system well enough to carry out maintenance tasks. In this perspective, Van Geet et al. [5] emphasized a redocumentation technique to generate a detailed design document. This document is a technical document containing the structure with all the functions, database tables, screens, batch jobs, dependencies among the components, and the slices of the program [35].

In the proposed approach, as shown in Table 1, the generated documentation is a technical document consisting of elements such as the source code summary, source code metrics, classes, packages, and functional dependencies. The functionalities and elements in this technical documentation are defined to retrieve only relevant parts of the source code using HTML-based documentation, as suggested by Van Geet et al. [5].

Table 1. Software technical documentation schema
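As a rough sketch of how the documentation elements named above could be represented in Scala, a hypothetical record type is shown below; the field names are assumptions derived from the elements listed in the text, not the actual schema of Table 1.

// Hypothetical record type; the field names are assumptions based on the
// documentation elements named in the text, not the actual Table 1 schema.
case class TechnicalDocEntry(
  section: String,           // e.g. "Summary", "Metrics", "Classes", "Dependencies"
  packageName: String,
  className: String,
  linesOfCode: Long,
  dependsOn: Seq[String]     // components this class or function depends on
)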

4 Case Study

The implementation of this approach is still at an initial stage. At this stage, we have created a simple prototype and implemented each process specified in Fig. 1. The detailed implementation is described in the following sections.

4.1 Legacy Source Code

We used the legacy source code of a Restaurant Management System for this proposed model. This software, built ten years ago in Java, provides restaurant management for customers. This end-to-end restaurant management system manages orders, inventory, and employees, and all order and employee data are stored in a database. The application has 14,905 lines of code, and its Java files range from the back-end database code to the front-end GUI interfaces.

4.2 Parser

Once the LSC is identified, the first step is to load it into the HDFS environment. As specified earlier, we used Scala as the programming language and DCC as the cloud platform, and we created a cluster and a notebook workspace to execute the Scala commands. During data preparation, we identified and grouped the Java files of the Restaurant Management System described in the previous section and loaded them into the data directory of the cloud platform. Next, we loaded the source code into RDDs, the fundamental storage unit of Spark, in order to extract information from the source code for analysis. Spark automatically and transparently divides the data in an RDD into partitions, distributes them across the worker nodes in the cluster, and parallelizes the operations performed on these partitions, as illustrated in Fig. 2. Figure 4 shows the command that loads the source code into an RDD for each of the 14 Java files. After loading all the files into their respective RDDs, we created a list to store the RDDs so that they can easily be processed later with a loop structure instead of repeating a command for each RDD.

Fig. 4. Load source code into RDD
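A minimal Scala sketch of this loading step is given below; the file paths are hypothetical placeholders for the 14 Java files, and sc denotes the notebook's SparkContext (e.g. spark.sparkContext).

import org.apache.spark.rdd.RDD

// Hypothetical file paths standing in for the 14 Java files of the case study.
val javaFiles = Seq(
  "/FileStore/tables/Order.java",
  "/FileStore/tables/Inventory.java",
  "/FileStore/tables/Employee.java"
  // ... remaining files of the Restaurant Management System
)

// One RDD per source file, kept in a list so that later steps can loop over
// all files instead of repeating the same command for each RDD.
val sourceRdds: Seq[(String, RDD[String])] =
  javaFiles.map(path => (path, sc.textFile(path)))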

In the data transformation step, classification is done by filtering the data loaded in the RDDs using Transformation and Action operations. The main classification needed for this source code is based on the documentation sections specified in Table 1. RDD transformation commands such as filter, map, and flatMap are used in our proposed approach to filter the source code, extracting Java packages, classes, interfaces, and abstract classes within Java packages. RDD actions such as count and aggregate are used to perform aggregation functions that provide the source code metrics, such as the total lines of code, imported libraries, classes, packages, and other relevant components. Figure 5 shows partial Scala code that uses RDD Transformations and Actions to extract source code components in the Spark environment.

Fig. 5. Partial Scala code to extract component
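Since the actual code of Fig. 5 is given as a figure, the sketch below illustrates the same idea under simplified assumptions rather than reproducing the authors' code: it continues from the loading sketch above (sourceRdds), uses naive regular expressions for package and class declarations, and computes a few of the metrics mentioned in the text.

// Simplified patterns; the real parser may use more robust rules.
val packagePattern = """^\s*package\s+([\w.]+)\s*;""".r
val classPattern   = """\bclass\s+(\w+)""".r

for ((path, rdd) <- sourceRdds) {
  // Transformations: filter declaration lines and map them to component names.
  val packages = rdd.filter(_.trim.startsWith("package"))
                    .flatMap(line => packagePattern.findFirstMatchIn(line).map(_.group(1)))
  val classes  = rdd.flatMap(line => classPattern.findAllMatchIn(line).map(_.group(1)))

  // Actions: compute simple source code metrics for this file.
  val loc     = rdd.count()
  val imports = rdd.filter(_.trim.startsWith("import")).count()

  println(s"$path -> LOC: $loc, imports: $imports, " +
          s"packages: ${packages.collect().mkString(",")}, classes: ${classes.count()}")
}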

Another important component that needs to be extracted is the dependencies; our implementation emphasizes package, class, and function dependency analysis. Figure 6 shows partial Scala code to extract the functional dependencies.

Fig. 6. Extract the dependency code
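The actual dependency analysis of Fig. 6 is not reproduced here; as a rough approximation, the sketch below continues from sourceRdds and derives coarse dependencies from import statements, which is an assumption rather than this paper's exact technique.

// Coarse dependency extraction from import statements (an approximation of
// the package/class/function dependency analysis described in the text).
val importPattern = """^\s*import\s+([\w.]+)\s*;""".r

val dependencies: Seq[(String, Array[String])] = sourceRdds.map { case (path, rdd) =>
  val imported = rdd
    .flatMap(line => importPattern.findFirstMatchIn(line).map(_.group(1)))
    .filter(!_.startsWith("java."))    // assumption: keep only non-JDK dependencies
    .distinct()
    .collect()
  (path, imported)
}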

The next step is to load the dataset into a DataFrame. The extracted source code components and dependencies are loaded into a DataFrame. We performed column transformations and queried the DataFrame to obtain useful information, such as code metrics, the source code component list, and the component dependencies, which are saved into CSV files.
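Continuing the sketch, the dependency results could be flattened into a DataFrame and written to CSV as follows; the column names and output path are assumptions.

import spark.implicits._

// Flatten the per-file results into a DataFrame and persist it as CSV.
val componentsDf = dependencies
  .flatMap { case (path, deps) => deps.map(dep => (path, dep)) }
  .toDF("sourceFile", "dependsOn")

componentsDf.write
  .mode("overwrite")
  .option("header", "true")
  .csv("/FileStore/tables/component_dependencies")   // hypothetical repository location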

4.3 Knowledge Base

The extracted source code components are stored in several CSV files organized by documentation element, namely the source code metrics, the list of components, and the dependencies among the components.

4.4 View Composer

The View Composer, in our context a web-based user interface, provides the functions to retrieve the components and present them as HTML documentation. The user interface takes the data from the CSV files and loads it into a DataFrame.
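A minimal sketch of this step is shown below: the CSV repository is read back into a DataFrame and rendered as a plain HTML table; the path, column layout, and rendering are assumptions, not the actual View Composer implementation.

// Read the repository CSV back into a DataFrame.
val viewDf = spark.read
  .option("header", "true")
  .csv("/FileStore/tables/component_dependencies")   // hypothetical repository location

// Render the rows as a simple HTML table for the web-based interface.
val rowsHtml = viewDf.collect()
  .map(r => s"<tr><td>${r.getString(0)}</td><td>${r.getString(1)}</td></tr>")
  .mkString("\n")

val htmlTable =
  s"<table><tr><th>Source file</th><th>Depends on</th></tr>\n$rowsHtml</table>"
// displayHTML(htmlTable)   // available in Databricks notebooks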

4.5 Technical Documentation

As specified in Sect. 4.2, once the data is loaded into a DataFrame, Spark SQL is used to query and retrieve the relevant source code components, which are classified according to the documentation sections shown in Table 1. A sample of the generated technical documentation is shown in Fig. 7.

Fig. 7. Generated software technical documentation
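As an illustration of this querying step, the sketch below registers the DataFrame from the earlier sketch as a temporary view and runs a Spark SQL query to assemble one documentation section; the view name, columns, and query are assumptions rather than the exact queries behind Fig. 7.

// Register the DataFrame as a temporary view and query it with Spark SQL.
componentsDf.createOrReplaceTempView("components")

val dependencySummary = spark.sql(
  """SELECT sourceFile, COUNT(*) AS dependency_count
    |FROM components
    |GROUP BY sourceFile
    |ORDER BY dependency_count DESC""".stripMargin)

dependencySummary.show()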

The functional dependencies, generated by the Scala code in Fig. 6, are shown as a function call graph in Fig. 8.

Fig. 8. A function call graph

5 Conclusions and Future Work

In this paper, we have presented our proposed approach for software redocumentation, which employs a distributed data processing technique. As an initial effort, we presented the system architecture for locating and generating source code components through the HDFS and Spark environments. Our experiments focus on the parser used to extract the source code components from the SWP and store them in the repository. Spark plays an important role as the parser, performing ETL on the legacy Java source code. The experiment demonstrates the processing of raw source code using a distribution technique, which helps to speed up the extraction of source code components within limited run times. For future work, we plan to apply the same approach to different languages and other large legacy systems, with a precise evaluation to improve the efficiency of our proposed approach.