1 Introduction

Software Product Lines (SPLs) are families of software products with a focus on the reuse of artefacts (Batory et al. 2004). SPLs have been receiving wide attention in industry due to their advantages, among them higher software quality and shorter time-to-market for new products (van d. Linden et al. 2007). However, there are some challenges in the adoption of SPLs in industrial settings. One of the main challenges is how to consolidate existing system variants into a SPL. To tackle this challenge, reverse engineering strategies have been proposed to deal with different stages of SPL development (Assunção and Vergilio 2014). The starting point of this reverse engineering process is the extraction of Feature Models (FMs), the de facto standard for modeling variability and commonality in SPLs (Benavides et al. 2010), which denote the set of valid configurations of features that constitute the products of a SPL.

In recent years several approaches to reverse engineer FMs have been proposed. They are based on configuration scripts (She et al. 2011), propositional logic expressions (Czarnecki and Wasowski 2007; She et al. 2014), natural language (Weston et al. 2009), ad hoc algorithms (Acher et al. 2012; Haslinger et al. 2011, 2013), and search-based techniques (Linsbauer et al. 2014; Lopez-Herrejon et al. 2012, 2015; Thianniwet and Cohen 2015). However, they present two main limitations: they do not take a multi-objective perspective to capture the trade-offs that software engineers must make during the reverse engineering task, and they do not exploit any knowledge of how the system variants are actually implemented.

A multi-objective perspective has advantages in situations where there are conflicts among the goals of the software engineer. For instance, obtaining an FM that represents a set of desired product configurations can lead to a model that also generates a surplus of configurations, which are not desired. Nevertheless, if we tweak the FM to avoid such additional configurations we might lose some desired configurations. Here a multi-objective algorithm aims to optimize both goals and enables the engineer to make a decision based on an analysis of the different trade-offs of importance in the problem domain. In addition, the use of knowledge available in implementation artefacts is an important characteristic because it allows us to generate FMs in accordance with the already existing software variants.

To cope with these limitations, our previous work presented an approach to reverse engineer FMs based on the multi-objective algorithm NSGA-II (Assunção et al. 2015). This work used a graph to represent dependencies in the source code artefacts of the existing system variants, and provided software engineers with sets of FMs with different trade-offs. In this paper we extend our previous work by including: i) a second multi-objective evolutionary algorithm, namely the Strength Pareto Evolutionary Algorithm (SPEA2) (Coello et al. 2007), ii) a single-objective evolutionary algorithm that relies on a genetic programming representation for baseline comparison, iii) a detailed description of our problem representation and evolutionary operators, iv) seven new case studies, and v) a more thorough empirical evaluation and analysis. In summary, our current work addresses the following research questions:

  • RQ1: What are the benefits and advantages of using a multi-objective approach over a single-objective one to reverse engineer FMs?

  • RQ2: How does the performance of NSGA-II and SPEA2 compare for reverse engineering FMs?

  • RQ3: How could a multi-objective perspective be used in practice to support software engineers in the decision making process?

The remainder of this paper is structured as follows: Section 2 presents the fundamental concepts of feature models, a running example, and the definitions regarding the dependency graphs. The details of the proposed approach are described in Section 3. The setup used in the evaluation of the proposed approach is described in Section 4. Section 5 presents the results obtained in the evaluation and the corresponding analysis to answer the research questions. Related work is discussed in Section 6. Future research avenues and conclusions are presented in Sections 7 and 8, respectively.

2 Background

In this section we present an overview of feature models, introduce a running example used to illustrate the details of our approach, and define more precisely the notions of dependency and dependency graph that we use to represent the source code dependencies. We rely on these definitions when describing our multi-objective approach in Section 3.

2.1 Feature Models

Feature Models (FMs) are widely used to model the different combinations of features in a SPL (Kang et al. 1990), and are key for supporting variability and commonality management (Benavides et al. 2010). These models follow a tree-like structure where features are depicted as boxes, labelled with the feature name, which are connected by lines to their child features.

An FM always has a root feature that is present in all products of the SPL. The other features can be of type mandatory or optional, or a set of features can be organized in alternative groups or or groups. A mandatory feature is always selected when its parent feature is selected. An optional feature may or may not be selected when its parent is selected. A mandatory feature is represented with a filled circle at the end of the relation line and an optional feature with an empty circle. These two types of features are shown in Fig. 1a and b, respectively. An alternative group indicates that when the parent feature of the group is selected then exactly one feature of the group must be selected. When the parent feature of an or group is selected then at least one feature of the group must be selected. The alternative group is represented by an empty arc and the or group by a filled arc, as illustrated in Fig. 1c and d, respectively.

Fig. 1 Feature models graphical notation

In addition to the hierarchical relations between the features, the valid configurations of features can also be restricted by additional constraints, called Cross-Tree Constraints (CTCs) (Benavides et al. 2010). The most common types of CTCs are requires and excludes. If a feature A is selected and requires feature B, then feature B must also be selected. If a feature A excludes feature B then these two features cannot both be selected in the same feature combination.

These two types of CTCs are usually represented by single- and double-arrow dashed lines, respectively, as shown in Fig. 1e.

2.2 Running Example

In this subsection we present a running example to illustrate the details of our approach. We selected a set of variants of a drawing application. Our goal is to use these variants as a starting point to obtain a product line called Draw Product Line (DPL). This application offers users the ability to handle a drawing area (feature BASE), draw lines (feature LINE), draw rectangles (feature RECT), draw filled rectangles (feature FILL), select a color for the line or rectangle (feature COLOR), and clean the drawing area (feature WIPE). With these six features we have a total of 16 variants of the drawing application, presented in Table 1. The symbol \(\checkmark \) indicates the selected features. We refer to each feature combination as a feature set, formally defined following Linsbauer et al. (2014):

Table 1 Feature sets for DPL

Definition 1

Feature Set A feature set is a 2-tuple [sel,\(\overline {\texttt {sel}}\)] where sel and \(\overline {\texttt {sel}}\) are respectively the set of selected and not-selected features of a system variant. Let FL be the list of features of a feature model, such that sel, \(\overline {\texttt {sel}} \subseteq \) FL, sel \(\cap \overline {\texttt {sel}} = \emptyset \), and sel \(\cup \overline {\texttt {sel}}\) = FL.
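As a minimal illustration of this definition (not part of the original tooling), a feature set can be encoded as a pair of disjoint sets of feature names; the type and method names below are ours.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical sketch of Definition 1: a feature set as the pair [sel, sel-bar].
// Record equality is value-based, so two variants with the same selections compare equal.
record FeatureSet(Set<String> selected, Set<String> notSelected) {
    FeatureSet {
        // sel and sel-bar must be disjoint (Definition 1).
        if (!Collections.disjoint(selected, notSelected)) {
            throw new IllegalArgumentException("sel and sel-bar must be disjoint");
        }
        selected = Set.copyOf(selected);
        notSelected = Set.copyOf(notSelected);
    }

    // sel union sel-bar must equal the feature list FL of the feature model.
    boolean coversFeatureList(Set<String> featureList) {
        Set<String> union = new java.util.HashSet<>(selected);
        union.addAll(notSelected);
        return union.equals(featureList);
    }
}
```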

2.3 Source Code Dependency Graphs

One of the main contributions of our approach is to use knowledge from the implementation artefacts, in addition to the feature sets, to reverse engineer FMs. Based on this knowledge, we evaluate the variability safety of FMs, a property that guarantees that the feature sets denoted by an FM are structurally well-formed according to the source code. In particular, we exploit information about existing dependencies between source code fragments, as described next.

To represent the dependencies existing in the source code of the different system variants we use a weighted directed graph. To create this graph we use the terminology and the tool from our previous work (Fischer et al. 2014; Linsbauer et al. 2013). We identify modules of two kinds:

Definition 2

Base Module A base module implements a feature regardless of the presence or absence of any other features and is denoted with the feature name written in lowercase.

Definition 3

Derivative Module A derivative module \(m = \delta^n(c_0, c_1, \ldots, c_n)\) implements feature interactions, where each \(c_i\) is F (if feature F is selected) or ¬F (if it is not selected), and n is the order of the derivative.

These two types of modules are the basis of our extraction algorithm (Linsbauer et al. 2013). This algorithm computes traces from the modules to their implementing source code fragments and identifies dependencies between modules whose implementations depend on each other. The algorithm considers any granularity of the implementation artefacts, from class level to statement level. Figure 2 presents some examples of traces computed by the algorithm. The traces are indicated using comments at the end of the lines. For example, the field defined in Line 5 traces to the base module Color of the corresponding feature Color. This source code fragment is present in all variants with feature Color, regardless of the existence of other features. Another example of a base module is observed in Line 7 of the example. This method header and most of its implementation will be included in class Canvas whenever feature Line is present, regardless of any other feature. However, in Line 10 of the method we can observe a derivative module \(\delta^1\)(Color, Line), which means that the corresponding line of code will be part of its containing method only when the variant has both features Color and Line.

Table 2 Dependency matrix for DPL
Fig. 2 Source code snippet for DPL example

In order to more formally define dependencies and their representation as graphs, we first introduce the concept of module expression as follows:

Definition 4

Module Expression A module expression is the propositional logic representation of modules. For a base module b, the module expression is its own literal b. For a derivative module \(m = \delta^n(c_0, c_1, \ldots, c_n)\) its module expression corresponds to \(c_0 \land c_1 \land \ldots \land c_n\).

As an illustration, the module expression of base module Color is Color. For the derivative module \(\delta^1\)(Color, Line), which indicates the interaction between features Color and Line, the module expression representation is Color ∧ Line.

The traces produced by our trace extraction algorithm are used to identify the dependencies between fragments of source code. These dependencies and the dependency graph they form are formally defined next.

Definition 5

Dependency A dependency establishes a requirement relationship between two sets of modules and is denoted with a three-tuple (from, to, weight), where from and to are each a set of modules (or module expressions), and weight expresses the strength of the dependency, i.e. the number of dependencies of structural elements in the from modules on structural elements in the to modules.

We use the dot (.) operator to refer to elements of a tuple, e.g. the weight of a dependency dep is denoted by dep.weight. A dependency’s propositional logic representation is defined as:

$$\bigvee\limits_{mfrom \in dep.from} mfrom ~\Rightarrow ~ \bigwedge\limits_{mto \in dep.to} mto $$

Definition 6

Dependency Graph A dependency graph is a set of dependencies, where each node in the graph corresponds to a set of modules (or module expressions), and every edge in the graph corresponds to a dependency as defined above. Edges are annotated with natural numbers that represent the dependencies’ weights.
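As a hedged sketch of Definitions 5 and 6 (the data structures are ours, not the actual implementation), a module expression can be stored as the set of literals it conjoins, writing a negated literal with a "!" prefix, and a dependency graph as a list of weighted dependencies:

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of Definitions 5 and 6. A module expression is kept as the
// set of literals it conjoins, with a negated literal prefixed by "!", e.g.
// {"Color", "Line"} for the derivative module delta^1(Color, Line).
record Dependency(Set<Set<String>> from, Set<Set<String>> to, double weight) {}

record DependencyGraph(List<Dependency> dependencies) {

    // Total weight of all edges; used to normalize the weights as in Table 2.
    double totalWeight() {
        return dependencies.stream().mapToDouble(Dependency::weight).sum();
    }

    // Normalized weight of one dependency, so that all weights sum to 1.0.
    double normalizedWeight(Dependency dep) {
        return dep.weight() / totalWeight();
    }
}
```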

Returning to our running example of the drawing application, Fig. 3 presents the dependency graph considering all of its feature sets. To avoid clutter only the lowest-order modules are presented, since they are the most relevant to our approach. Self-dependencies are removed from the graph for better readability. To distinguish the different kinds of modules, the base modules have solid borders and the derivative modules have dashed borders. In the figure we can observe that the strongest dependencies (i.e. those with the highest weights) are those that go to base modules, e.g. Fill to Rect, and the core feature Base has the largest number of incoming dependencies. As mentioned before, weights in the graph represent the number of structural dependencies between source code elements belonging to different features. For instance, in Fig. 3 there are 10 dependencies from source code elements of WIPE to source code elements of BASE. These dependencies are field accesses (4) and containment relationships (6), i.e. a field belongs to a class.

Fig. 3 Dependency graph for DPL

Alternatively, the dependency graph can be represented as a dependency matrix, as presented in Table 2. The rows are the dependencies: the first column is a dependency identifier for easy reference, and the second and third columns contain the from and to modules of the dependency, respectively. The weight of the dependency is displayed in the fourth column. In the fifth column the weight is normalized so that the sum of all weights of the graph equals 1.0. This normalization enables a better interpretation of the values in the optimization process.
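In symbols (the notation \(w_i\) for the raw weight of the i-th dependency is ours), the normalized weight in the fifth column is

$$\tilde{w}_i = \frac{w_i}{\sum\limits_{j} w_j} $$

so that \(\sum_i \tilde{w}_i = 1.0\), which is the form used when illustrating the variability safety objective in Section 3.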

To illustrate the propositional logic representation of dependencies let us use again the source code shown in Fig. 2. Firstly consider the dependency that exists between the module \(\delta^1\)(Color, Line) and the module Color. This dependency exists because the field color defined in Line 5 belongs to the module Color and it is used by newLine = new Line(color,start); in Line 10, which belongs to the module \(\delta^1\)(Color, Line). The propositional logic expression for this dependency is (Color ∧ Line) ⇒ Color. In the same code snippet of Fig. 2 we can see that module \(\delta^1\)(Color, Line) depends on module Line, because the statement in Line 10 is contained in the method void mousePressedLine(MouseEvent e) which belongs to module Line. In this case, the propositional logic expression is (Color ∧ Line) ⇒ Line.

An important point to observe is that the propositional logic constraints of some dependencies are tautologies, hence they always hold. The two examples above illustrate this situation, since, for instance, ((Color ∧ Line) ⇒ Line) ⇔ TRUE. This indicates that the implementation artefacts are consistent with the feature combinations represented in the FM.

3 Multi-Objective Approach to Reverse Engineer Feature Models

In this section we describe the details of our multi-objective search-based approach, which relies on and extends our previous work (Linsbauer et al. 2014). We further describe the three requirements identified by Harman et al. (2012) for implementing a search-based solution: (i) an adequate representation of solutions, (ii) a set of operators to improve the solutions and explore the search space, and (iii) an adequate way to evaluate the quality of the solutions, the fitness functions.

3.1 Feature Model Representation

The feature model representation uses a simplified version of the SPLX metamodel. This metamodel, presented in Fig. 4, defines both the structure and the semantics of the FMs. In the figure, the elements in the left part describe the tree-like structure of the FM and the elements in the right part describe the CTCs between the features. The tree nodes Root, Mandatory, Optional, and GroupedFeature inherit from Feature. The tree is composed of exactly one Root feature. Mandatory and Optional features have a cardinality of zero or more. The tree can also have an arbitrary number of Alternative and Or groups, and each group must have at least one GroupedFeature. In the right part of the figure we can observe that an FM has exactly one ConstraintSet. This ConstraintSet describes the propositional formula in CNF. Zero or more Constraint elements are allowed; each constraint is a clause of the CNF expression. Each Constraint has exactly one OrClause that has at least one Literal. A literal can either be an Atom, which refers directly to a feature, or a Not, which refers to an Atom.

Fig. 4 Feature model metamodel, extracted from Linsbauer et al. (2014)

Following this representation, the initial population is created by generating random feature trees and random CTCs. Some additional domain constraints are also taken into account; they are presented in the next subsection. The tools FaMa (Benavides et al. 2007) and BeTTy (Segura et al. 2012) were used to create the initial population.

3.2 Evolutionary Operators

We adopted the evolutionary operators from our previous work (Linsbauer et al. 2014), and employ standard tournament selection as selection operator. There are some domain constraints that should be taken into account in the evolutionary process to guarantee the semantics of FMs and to avoid generating invalid solutions:

  • Each feature is identified by its name, so every feature appears exactly once in the FM tree;

  • All FMs have a fixed set of feature names, so in different FMs only the relations between features are different;

  • CTCs can only be either requires or excludes, i.e. exactly two literals per clause with at least one being negated;

  • CTCs must not contradict each other, i.e. the corresponding CNF of the entire constraint set must be satisfiable;

  • There is a maximum number of CTCs (given as a percentage of the number of features) that must not be exceeded.

Our domain constraints do not consider the rare case of contradictions between the CTCs and the FM tree, for which detection and repair are computationally expensive. Individuals in that case are weeded out by the evolutionary process itself because of their bad fitness values.
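The requires/excludes restriction above can be checked structurally: a clause is an admissible CTC exactly when it has two literals, at least one of which is negated. The sketch below is a hypothetical illustration of that check, not the actual implementation; satisfiability of the whole constraint set and the maximum number of CTCs are checked separately, as stated above.

```java
import java.util.List;

// Hypothetical structural check for cross-tree constraints: a CTC is a CNF
// clause with exactly two literals, at least one of them negated.
// "A requires B" corresponds to (!A or B); "A excludes B" to (!A or !B).
record Literal(String feature, boolean negated) {}

record Clause(List<Literal> literals) {

    boolean isValidCtc() {
        return literals.size() == 2
                && (literals.get(0).negated() || literals.get(1).negated());
    }

    boolean isRequires() {  // exactly one negated literal
        return isValidCtc() && (literals.get(0).negated() ^ literals.get(1).negated());
    }

    boolean isExcludes() {  // both literals negated
        return isValidCtc() && literals.get(0).negated() && literals.get(1).negated();
    }
}
```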

3.2.1 Mutation

The mutation operator applies small changes in randomly selected parts of the tree or in the CTCs of the feature model. The mutation probability is used to decide whether the change is applied in the tree part, in the CTCs, or in both (a minimal sketch of this dispatch is shown after the lists below). The kind of change is randomly selected from the following lists:

  • Mutations performed on the tree:

    • Randomly swaps two features in the feature tree;

    • Randomly changes an Alternative relation to an Or relation or vice-versa;

    • Randomly changes an Optional or Mandatory relation to any other kind of relation (Mandatory, Optional, Alternative, Or);

    • Randomly selects a subtree in the feature tree and puts it somewhere else in the tree without violating the metamodel or any of the domain constraints.

  • Mutations performed on the CTCs:

    • Adds a new, randomly created CTC that does not contradict the other CTCs and does not already exist;

    • Randomly removes a CTC.
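The sketch below illustrates one plausible reading of this dispatch; it assumes hypothetical FeatureModel operations for the concrete tree and CTC mutations listed above and is not the authors' actual implementation.

```java
import java.util.Random;

// The concrete mutations (swap features, change relation kinds, move subtrees,
// add/remove CTCs) are assumed to be implemented behind this interface.
interface FeatureModel {
    void applyRandomTreeMutation();
    void applyRandomCtcMutation();
}

// Hypothetical dispatch of the mutation operator: with the given probability,
// mutate the tree part and/or the CTC part (possibly both, possibly neither).
class FeatureModelMutation {
    private final Random random = new Random();
    private final double mutationProbability;

    FeatureModelMutation(double mutationProbability) {
        this.mutationProbability = mutationProbability;
    }

    void mutate(FeatureModel fm) {
        boolean mutateTree = random.nextDouble() < mutationProbability;
        boolean mutateCtcs = random.nextDouble() < mutationProbability;
        if (mutateTree) fm.applyRandomTreeMutation();  // one of the four tree mutations
        if (mutateCtcs) fm.applyRandomCtcMutation();   // add or remove a CTC
    }
}
```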

3.2.2 Crossover

Just like the mutation operator, the crossover must generate offspring in conformance with the metamodel and the domain constraints. The steps of the crossover process are:

  1. 1.

    The offspring is initialized with the root feature of Parent 1. If the root feature of Parent 2 is different, it is added to the offspring as a mandatory child feature of the offspring's root feature.

  2. 2.

    Traverse the first parent depth first, starting at the root node, and add to the offspring a random number r of features that are not yet contained in it, appending each one to its respective parent feature (which is already contained in the offspring) using the same relation type between them. The parent feature of every visited feature is guaranteed to be contained in the offspring due to the depth-first traversal order.

  3. 3.

    Traverse the second parent exactly the same way as the first one.

  4. 4.

    Go to Step 2 until every feature is contained in the offspring.

The second child is obtained by performing the same process with the parents reversed, i.e. the roles of the parents are swapped.

The offspring's CTCs are obtained by merging all the constraints of both parents and then randomly selecting a subset that is assigned to the first offspring, with the remaining CTCs assigned to the second offspring.
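The CTC part of the crossover is simple enough for a brief sketch. The code below is a hypothetical illustration (the Ctc type parameter and class names are ours); the domain constraints on the resulting constraint sets are assumed to be enforced by a subsequent check that is not shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the CTC crossover: merge the parents' constraints,
// shuffle them, and split the merged set randomly between the two offspring.
class CtcCrossover {
    private final Random random = new Random();

    <Ctc> List<List<Ctc>> cross(List<Ctc> parent1Ctcs, List<Ctc> parent2Ctcs) {
        List<Ctc> merged = new ArrayList<>(parent1Ctcs);
        merged.addAll(parent2Ctcs);
        Collections.shuffle(merged, random);

        int split = random.nextInt(merged.size() + 1);  // random cut point
        List<Ctc> offspring1Ctcs = new ArrayList<>(merged.subList(0, split));
        List<Ctc> offspring2Ctcs = new ArrayList<>(merged.subList(split, merged.size()));
        return List.of(offspring1Ctcs, offspring2Ctcs);
    }
}
```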

3.3 Multi-Objective Perspective

In this section we describe the three objective functions used in our approach and present an illustrative example.

3.3.1 Auxiliary Functions Definitions

In order to compute the objective functions of our approach we need some auxiliary functions. Let us consider \(\mathcal {F}\mathcal {M}\) as the universe of feature models, \(\mathcal {S}\mathcal {F}\mathcal {S}\) as the universe of sets of feature sets, and sfs as a set of feature sets defined by the software engineer. An example of sfs is the set of feature sets presented in Table 1 for the drawing application. Based on this terminology, we introduce the function featureSets:

Definition 7

featureSets. Function featureSets returns the set of feature sets denoted by a feature model.

$$featureSets : \mathcal{F}\mathcal{M} \rightarrow \mathcal{S}\mathcal{F}\mathcal{S} $$

To measure the variability safety of an FM we have to check if its feature sets are in conformance with the dependencies of the dependency graph. To check this we define the function holds:

Definition 8

holds. Function holds(dep, fs) returns 1 if dependency dep holds on the feature set (of a system variant) fs, and 0 otherwise. A dependency dep holds for a feature set fs if:

$$\left( \bigwedge\limits_{f \in fs.sel}f \land \bigwedge\limits_{g \in fs.\overline{sel}}\neg g\right) \Rightarrow (dep.from \Rightarrow dep.to) $$

To illustrate this function recall our running example. Let us consider the dependency Fill ⇒ Rect and a system variant with the features Base and Line. In this case the function holds returns 1, since the propositional logic formula (Base ∧ Line ∧ ¬Fill ∧ ¬Rect ∧ ¬Wipe ∧ ¬Color) ⇒ (Fill ⇒ Rect) is true.
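Because a feature set fixes the truth value of every feature, the implication in Definition 8 can be evaluated directly: the dependency holds unless some module in dep.from is satisfied while some module in dep.to is not. The sketch below is hypothetical and reuses the FeatureSet and Dependency sketches from Section 2 (literals encoded as feature names, optionally prefixed with "!").

```java
import java.util.Set;

// Hypothetical evaluation of Definition 8 under the full truth assignment
// induced by a feature set: literal "F" is true iff F is selected, and
// literal "!F" is true iff F is not selected.
final class Holds {

    static boolean literalHolds(String literal, Set<String> selected) {
        return literal.startsWith("!")
                ? !selected.contains(literal.substring(1))
                : selected.contains(literal);
    }

    // A module expression (a conjunction of literals) is satisfied iff all of
    // its literals hold under the feature set.
    static boolean moduleHolds(Set<String> module, Set<String> selected) {
        return module.stream().allMatch(l -> literalHolds(l, selected));
    }

    // holds(dep, fs): (OR over dep.from) implies (AND over dep.to).
    static int holds(Dependency dep, FeatureSet fs) {
        Set<String> sel = fs.selected();
        boolean fromActive = dep.from().stream().anyMatch(m -> moduleHolds(m, sel));
        boolean toSatisfied = dep.to().stream().allMatch(m -> moduleHolds(m, sel));
        return (!fromActive || toSatisfied) ? 1 : 0;
    }
}
```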

3.3.2 Fitness Functions Definitions

Considering the two auxiliary functions presented before, we are able to introduce the three objective functions of our approach. The first two measures are based on information retrieval metrics (for further details refer to Manning et al. 2008), and the third measure is based on variability safety.

Definition 9

Precision (P). Precision expresses how many of the feature sets denoted by a reverse engineered feature model fm are among the desired feature sets sfs.

$$precision(sfs,fm) = \frac{|sfs \cap featureSets(fm)|}{|featureSets(fm)|} $$

Definition 10

Recall (R). Recall expresses how many of the desired feature sets are denoted by the reverse engineered feature model fm.

$$recall(sfs,fm) = \frac{|sfs \cap featureSets(fm)|}{|sfs|} $$

Definition 11

Variability Safety (VS). Variability Safety expresses the degree of variability-safety of a reverse engineered feature model fm with respect to a dependency graph dg.

$$variabilitySafety(fm,dg) = \sum\limits_{dep \in dg} dep.weight \times \left( \frac{\sum\limits_{fs \in featureSets(fm)} holds(dep, fs)}{|featureSets(fm)|}\right) $$
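Once the feature sets denoted by a candidate FM have been enumerated (e.g. with an off-the-shelf feature model analysis tool), the three objectives reduce to simple set operations. The sketch below is hypothetical and reuses the FeatureSet, Dependency, DependencyGraph and Holds sketches given earlier.

```java
import java.util.Set;

// Hypothetical computation of the three objectives (Definitions 9-11).
// fmSets stands for featureSets(fm), assumed non-empty and already enumerated.
final class Objectives {

    static double precision(Set<FeatureSet> sfs, Set<FeatureSet> fmSets) {
        return intersectionSize(sfs, fmSets) / (double) fmSets.size();
    }

    static double recall(Set<FeatureSet> sfs, Set<FeatureSet> fmSets) {
        return intersectionSize(sfs, fmSets) / (double) sfs.size();
    }

    static double variabilitySafety(Set<FeatureSet> fmSets, DependencyGraph dg) {
        double total = 0.0;
        for (Dependency dep : dg.dependencies()) {
            long satisfied = fmSets.stream()
                    .filter(fs -> Holds.holds(dep, fs) == 1)
                    .count();
            // Weights are assumed normalized so that the maximum value is 1.0.
            total += dg.normalizedWeight(dep) * ((double) satisfied / fmSets.size());
        }
        return total;
    }

    private static long intersectionSize(Set<FeatureSet> a, Set<FeatureSet> b) {
        return a.stream().filter(b::contains).count();
    }
}
```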

3.3.3 Fitness Functions Illustration

To illustrate our three objective functions let us consider the feature sets of sfs in Table 1, the dependency graph dg in Fig. 3, and the normalized weight values for dg from Table 2. Figure 5 presents three examples of FMs extracted from the drawing application variants. We computed the values of precision, recall and variability safety for these FMs.

Fig. 5 Examples of extracted feature models for DPL

In Fig. 5a the feature model FM1 is an ideal solution for the sfs in Table 1. This feature model has |featureSets(FM1)|=|sfs|=16, leading to precision and recall equal to 1.000, which means that its valid configurations are exactly the same as the desired feature sets. The value of variability safety for this FM is also 1.000, indicating that the valid configurations of FM1 do not break any dependency.

The feature model FM2, presented in Fig. 5b, denotes |featureSets(FM2)| = 12 feature sets, of which |sfs ∩ featureSets(FM2)| = 10 are in the desired feature sets. With these values we have precision = 0.833 and recall = 0.625. Some feature sets denoted by this feature model do not satisfy all dependencies, leading to a value of variability safety = 0.996. To illustrate some broken dependencies we use the feature set [{Base, Line, Rect, Fill},{Wipe, Color}]. For this feature set the dependency with ID 7 (Fill ⇒ Color), shown in Table 2, is not satisfied, because the dependency indicates that when Fill is in the feature set the feature Color must also be included. When a dependency is not satisfied, its normalized weight is not added to the accumulated weight of the satisfied dependencies, effectively decreasing the value of variability safety.
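Spelled out, the precision and recall of FM2 follow directly from Definitions 9 and 10, with |sfs| = 16:

$$precision(sfs,FM2) = \frac{10}{12} \approx 0.833, \qquad recall(sfs,FM2) = \frac{10}{16} = 0.625 $$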

Let us now consider the feature model FM3 presented in Fig. 5c. This feature model denotes six feature sets and all of them are desired feature sets. Hence, |featureSets(FM3)| = 6 and |sfs ∩ featureSets(FM3)| = 6, leading to precision = 1.000 and recall = 0.375. Furthermore, no dependency is broken by its feature sets, so the value of variability safety is 1.000. For instance, considering the single feature set [{Base, Line, Rect, Color, Fill},{Wipe}], the dependencies with IDs {1, 3, 4, 5, 6, 7, 8, 10, 12, 13} are satisfied because both the from modules and the to modules are contained. In addition, the dependencies with IDs {2, 9, 11, 14, 15, 16} are also satisfied because their from modules are not part of the feature set. Since the from modules are not included in the feature set, the to modules are not required to be contained.

These illustrative examples show the possible diversity of values among the three objectives. From the decision maker's point of view FM1 is the best one, since it has the best values for the three objectives. However, in most situations such an ideal solution does not exist, and hence many trade-offs must be considered.

4 Experimental Description

In this section we present our experimental setup and describe the case studies used for our evaluation. The implementation and data are available online for replication.

4.1 Experimental Setup

In order to answer our research questions we performed an experiment, described as follows. As mentioned before, our focus is to solve the problem of reverse engineering FMs using three objective functions: Precision (P), Recall (R), and Variability Safety (VS). We designed these measures to be normalized values in the interval between 0 and 1, and the goal is to maximize the values of all objective functions. Hence, the ideal solution has P = 1.0, R = 1.0, and VS = 1.0.

In Section 3.1 we described the genetic programming representation we used. In addition to our previous work (Assunção et al. 2015), where we applied only the Non-Dominated Sorting Genetic Algorithm (NSGA-II) (Deb et al. 2002), here we consider two additional algorithms: the single-objective Genetic Programming (GP) algorithm from related work (Linsbauer et al. 2014), and the Strength Pareto Evolutionary Algorithm (SPEA2) (Zitzler et al. 2001), which is characterized by the external archive used to build the front of non-dominated solutions in each generation. The GP algorithm serves as a baseline for comparison. It uses a weighted measure of precision and recall with the same weight for both values, called F1, based on information retrieval theory (Manning et al. 2008). The implementation is the same as in our previous work; for further details, refer to Linsbauer et al. (2014). SPEA2 was added because it is widely used alongside NSGA-II (Coello et al. 2007) and commonly applied in search-based software engineering approaches (Harman et al. 2012). For the experimentation we used the ECJ framework. The parameter settings used to configure the algorithms are shown in Table 3. We performed 30 independent runs of each algorithm for each case study. The runs were performed on a machine with an Intel Core i7-4900MQ CPU at 2.80 GHz and 16 GB of memory, running on a Linux platform.

Table 3 Algorithm’s parameters

4.2 Case Studies

To evaluate the proposed approach we used the case studies presented in Table 4. ArgoUML is an open source tool for UML modelling (Couto et al. 2011). Draw Product Line (DPL), briefly presented in our running example, is a small drawing application. Game Of Life (GOL) is a customizable game. Video On Demand (VOD) implements video-on-demand streaming. ZipMe is an application for file compression. MobileMedia (MM) is an application to manipulate media files, such as photos, music, and videos, on mobile devices. In our evaluation we used seven versions of MobileMedia (Figueiredo et al. 2008).

Table 4 Case studies overview

5 Results and Analysis

The average runtime per run of each algorithm is presented in Table 5. GP is the fastest algorithm in all case studies, followed by NSGA-II and SPEA2, respectively. As expected, the multi-objective algorithms are slower than the single-objective GP because of the computation needed to compose the Pareto fronts in each generation. Nonetheless, next we elaborate on the advantages of a multi-objective approach for our reverse engineering task.

Table 5 Average runtime per run

For the analysis of the results obtained by the algorithms we computed two different sets of solutions (a sketch of the underlying dominance filtering is shown after the list below). Our terminology is based on our previous work (see Assunção et al. 2014) and standard multi-objective optimization literature (Coello et al. 2007).

  • Pareto Front True (PF True ): since we do not know the real Pareto-optimal front, we use PF True as an approximation of the best solutions. This set consists of the best solutions reached by all algorithms for each case study. These best solutions are found by merging all the solutions of all runs of the three algorithms and then keeping only the non-dominated ones. The cardinality of PF True for each case study is presented in the second column of Table 6.

  • Pareto Front Known (PF Known ): this set contains the best solutions reached by each algorithm for each case study. To compute PF Known we merged all the solutions of all runs of each algorithm and then kept only the non-dominated solutions. The cardinality of PF Known for each case study and each algorithm is presented in the fourth column of Table 6.
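Both fronts are obtained with the same dominance filter: merge the relevant solutions and discard every solution dominated by another one (all three objectives are maximized). The sketch below is a hypothetical illustration of this filtering step, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical non-dominated filtering used to build PF_True and PF_Known:
// merge all solutions of the relevant runs and keep only those not dominated
// by any other solution (all objectives are maximized).
final class ParetoFilter {

    record Solution(double precision, double recall, double variabilitySafety) {}

    // a dominates b iff a is at least as good in every objective and strictly
    // better in at least one.
    static boolean dominates(Solution a, Solution b) {
        boolean geqAll = a.precision() >= b.precision()
                && a.recall() >= b.recall()
                && a.variabilitySafety() >= b.variabilitySafety();
        boolean gtAny = a.precision() > b.precision()
                || a.recall() > b.recall()
                || a.variabilitySafety() > b.variabilitySafety();
        return geqAll && gtAny;
    }

    static List<Solution> nonDominated(List<Solution> merged) {
        List<Solution> front = new ArrayList<>();
        for (Solution candidate : merged) {
            boolean dominated = merged.stream()
                    .anyMatch(other -> other != candidate && dominates(other, candidate));
            if (!dominated) front.add(candidate);
        }
        return front;
    }
}
```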

Table 6 Pareto fronts

5.1 Answering RQ1

Recall that RQ1’s purpose is to find what the benefits are of a multi-objective perspective for reverse engineering FMs. We do so by comparing solutions obtained from multi-objective algorithms against solutions from a single-objective algorithm. From Table 6 we can observe that in seven case studies there is only one single solution in PF True , so all the three objectives could be optimized independently. For the other five case studies with more than one solution there are conflicts among the objectives. Taking into account the seven versions of Mobile Media, we can observe that the three objectives became conflicting in MM-V6. Besides, we can note that MM-V6 and MM-V7 have a large number of solutions in comparison with the other case studies with cardinality larger than one in PF True . We found out that this happens because MM-V6 introduced big changes in the source-code of MobileMedia (Figueiredo et al. 2008). As explained in Figueiredo et al. (2008), MM-V6 and MM-V7 introduced the two alternative features Music and Video, and the mandatory Photo feature was made optional leading to a big impact on the whole system. These changes lead to a larger number of nodes and edges, see Table 4, which makes the dependency graph more complex and impacts the number of solutions found.

For the analysis of the ability of each algorithm to find non-dominated solutions we consider two values presented in Table 6: (i) the number of solutions found by each algorithm, i.e. in PF Known , that are in PF True (fifth column); and (ii) the average number of solutions found per run that are in PF True (sixth column). To illustrate the meaning of these two values we consider ArgoUML, whose PF True is composed of four solutions. For this case study GP found two solutions that are in PF True , NSGA-II found three solutions, and SPEA2 two. On average GP was able to find almost one solution in PF True per run, which is expected because it is a single-objective approach. On the other hand, NSGA-II found on average 1.9 solutions per run, and SPEA2 2.0 solutions. For this case study GP is able to find good solutions in comparison with the multi-objective algorithms; NSGA-II found the largest number of solutions in PF True after the thirty runs, but SPEA2 finds more non-dominated solutions on average in each run.

Regarding the number of solutions in PF True (fifth column of Table 6), we can observe that the three algorithms have the same results for five case studies: DPL, MM-V1, MM-V2, MM-V4, and MM-V5. NSGA-II and SPEA2 outperform GP for two case studies: VOD and MM-V3. NSGA-II found more solutions in PF True for five case studies: ArgoUML, GOL, ZipMe, MM-V6, and MM-V7. On the other hand, considering the average number of solutions in PF True per run (sixth column of Table 6), the three algorithms have the same results for four case studies: ArgoUML, ZipMe, MM-V1, and MM-V2. NSGA-II and SPEA2 are better than GP in five case studies: GOL, VOD, MM-V3, MM-V4, and MM-V5. NSGA-II outperforms GP and SPEA2 in three case studies: DPL, MM-V6, and MM-V7.

In summary, from Table 6 we can observe that GP achieves results similar to those of NSGA-II and SPEA2 in only two case studies and never reaches the best results. NSGA-II and SPEA2 reached the same results and outperformed GP in four case studies, and in six case studies NSGA-II is better than both GP and SPEA2.

RQ1 Discussion

From our set of case studies we identified that five of them have conflicting objectives. This means that when we get better values for one objective, another objective is penalized, leading to a set of good solutions with different trade-offs. In such situations the multi-objective algorithms NSGA-II and SPEA2 are better than the single-objective GP. From the results we observed that on average the multi-objective algorithms found a set of solutions per run, whereas the single-objective algorithm is able to find only one. Furthermore, considering the set of solutions obtained after the 30 runs, GP was still not competitive with the multi-objective algorithms. GP tends to find the same single solution in each run, without exploring other parts of the search space as the multi-objective algorithms do. For seven case studies the three objectives are not conflicting; however, NSGA-II and SPEA2 found the same good solutions as GP. In summary, we can conclude that a multi-objective approach is the best choice to reverse engineer FMs considering our case studies.

5.2 Answering RQ2

Recall that RQ2 aims at comparing the performance of NSGA-II and SPEA2 for the three selected objective functions. An interesting characteristic to be analysed is the position of the solutions in the search space. Figure 6 presents three graphs for each case study with more than one solution in PF Known . The first column of graphs presents the objectives precision and recall, the second column variability safety and precision, and the third column variability safety and recall. As already analysed above, for all these case studies NSGA-II reached a larger number of solutions than SPEA2. However, in the graphs we can observe that, despite the smaller number of solutions, SPEA2 has solutions spread over an area of the search space very similar to that explored by NSGA-II. In the previous analysis of Table 6 we observed that in six case studies NSGA-II reached better solutions than SPEA2, probably because of the larger number of solutions returned by the former algorithm. On the other hand, the smaller number of solutions reached by SPEA2 can help in situations such as those observed in the case studies MM-V6 and MM-V7, where a large number of solutions was returned and the decision maker has to choose only one to be used in practice.

Fig. 6 Solutions on the search space

Statistical Analysis

To reason about the differences between NSGA-II and SPEA2 we used the well-known quality indicator Hypervolume (Zitzler et al. 2003). Table 7 presents, in the second and third columns, the average Hypervolume and the standard deviation (in parentheses) over the 30 runs. To compute the Hypervolume, the reference point for all case studies was P = 1.1, R = 1.1, and VS = 1.1. Since our problem is a maximization problem, lower values of Hypervolume are better. To check for statistical difference we applied the Wilcoxon test (Bergmann et al. 2000). The p-value obtained for each case study is presented in the fourth column of Table 7. To corroborate our analysis we also computed the effect size with the Vargha-Delaney Â12 statistic (Vargha and Delaney 2000), used for assessing randomized algorithms in Software Engineering (Arcuri and Briand 2014), presented in the last two columns of Table 7.
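For reference, the Â12 statistic is commonly computed as follows (the notation is ours): given the m Hypervolume values \(x_1, \ldots, x_m\) of one algorithm and the n values \(y_1, \ldots, y_n\) of the other,

$$\hat{A}_{12} = \frac{\#\{(i,j) : x_i > y_j\} + 0.5 \cdot \#\{(i,j) : x_i = y_j\}}{m \cdot n} $$

that is, the probability that a randomly chosen run of the first algorithm yields a higher Hypervolume than a randomly chosen run of the second; values close to 0.5 indicate no effect.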

Table 7 Hypervolume and effect size

From the results in Table 7 we observe that there is a statistically significant difference between the two algorithms only for the case studies GOL and MM-V7. The boxplots for these two case studies are presented in Fig. 7. In these two systems the best algorithm was SPEA2. The results for the case studies VOD and ZipMe are exactly the same. Despite some differences in the average Hypervolume for the other case studies, the algorithms are statistically equivalent for them.

Fig. 7 Hypervolume boxplots

RQ2 Discussion

Regarding the number of good solutions found on average and after the 30 runs, NSGA-II outperformed SPEA2. However, both are able to widely explore the search space. Using the well-known Hypervolume indicator we observed that in ten case studies there are no significant differences. In the two case studies with a significant difference, SPEA2 was the best algorithm. In conclusion, both algorithms are suitable for solving the problem of reverse engineering FMs using our three objectives.

5.3 Answering RQ3

Recall that the purpose of RQ3 was to explain how our approach could be used in practice. Let us now illustrate how. In Table 8 we present the values of the three objectives for all solutions that compose the sets PF Known . The exceptions are the case studies MM-V6 and MM-V7, for which we selected only the solutions with a value of 1.0 for at least one of the objectives. Here the goal is to qualitatively analyse the solutions reached by the algorithms NSGA-II and SPEA2.

Table 8 Non-dominated solutions

As mentioned before, the solutions of NSGA-II and SPEA2 are the same for seven case studies, namely DPL, VOD, MM-V1, MM-V2, MM-V3, MM-V4, and MM-V5. For the remaining case studies it is possible to observe how conflicting the objectives can be. For example, in ArgoUML we can observe that the conflict occurs only between recall and variability safety, since all solutions have a value of 1.0 for precision. A similar situation happens for GOL, but in this case study the variability safety is 1.0 in all solutions. For the case studies ZipMe, MM-V6 and MM-V7 there are solutions with different trade-offs among the three objectives. These solutions, with different values for each objective, help software engineers in decision making: they can decide which measure is more important and then select the corresponding solution.

To illustrate this process of selecting one solution, in Fig. 8 we present four FMs from MM-V7 obtained by SPEA2, with their tree-like structure and cross-tree constraints. We selected these FMs based on two criteria: i) solutions with a value equal to 1.0 for at least one objective, and ii) among those, solutions with the best value for a second objective. For example, the solution presented in Fig. 8a has P = 1.0, satisfying the first criterion, and R = 0.8000000119, the best value of recall among solutions with P = 1.0, satisfying the second criterion.

The FM presented in Fig. 8a has 192 valid configurations and all of them are desired products, so the value of precision is 1.0. However, the total number of desired products for this case study is 240, so the recall of this FM is not 1.0. The variability safety value of 0.999162264 means that 324 dependencies from the dependency graph are broken by the valid configurations. The FMs in Fig. 8b and c have the best values for variability safety, however with very low values of recall. In the FM of Fig. 8b only four valid configurations are possible, mainly because of the constraint "Include_CopyMedia EXCLUDES Capture_Photo", which makes all the configurations below the third level of the tree invalid, because the feature Include_CopyMedia is a mandatory child of Capture_Photo. In Fig. 8c only 16 valid configurations are possible, because the constraint "Include_Video EXCLUDES Include_Music" makes all the configurations below the second level of the tree invalid, since the feature Include_Music is a mandatory child of feature Include_Video. In Fig. 8d we have an FM with recall equal to 1.0, which means that all desired products are valid configurations here; however, it denotes more valid configurations than the input, hence decreasing the value of precision. In this same FM, the number of broken dependencies is 624. These are four examples of multi-objective solutions that are given to the software engineers, so that they can analyse the trade-offs and select the one that best meets their needs.

The use of dependency graphs to compute variability safety also helps to identify another characteristic of the implementation artefacts: the existence of compilation errors or incoherences in the source code. For example, looking at the FM presented in Fig. 8a, the precision is 1.0, which means that all feature sets denoted by the FM belong to the set of desired feature sets; however, there are configurations that, even though valid, do have broken dependencies. In the dependency graph of this case study there is the dependency Include_SMS ⇒ Capture_Photo, which means that in the source code the feature Include_SMS depends on Capture_Photo; however, in the set of desired configurations there are 72 configurations that have the feature Include_SMS but do not have Capture_Photo, decreasing the value of variability safety.

Fig. 8 MM-V7 feature models

RQ3 Discussion

As discussed before, the main advantage of a multi-objective approach is to find a set of good solutions with different trade-offs among the objectives. From a practical point of view, these solutions allow the software engineer to reason about the different measures used to evaluate the FMs. In the end, software engineers can select the solution that best fits their needs. In addition, multi-objective algorithms can return solutions that draw the decision maker's attention to characteristics that were not considered at the beginning of the optimization process. We discussed this point regarding the possible identification of inconsistencies between the source code and the feature sets used as input, e.g. invalid references. Identifying such inconsistencies was not the goal of the software engineers when starting the optimization process; however, our approach further enables them to reason about inconsistencies by looking at the solutions with different values among the objectives.

5.4 Threats to Validity

The first threat to validity we identified is the parameter settings of the algorithms. We addressed this threat by using conventional values for our parameters, as we have done in our previous work (Assunção et al. 2015); however, we increased the population size to allow the algorithms to generate more individuals.

The second threat regards the case studies used. Despite using only twelve case studies, these systems are from different domains and have different sizes. We argue that they are representative enough to evaluate our approach.

A third threat concerns the baseline comparison. To the best of our knowledge, our approach is the first to use information from implementation artefacts and a multi-objective perspective to reverse engineer feature models. Therefore, as a baseline we used the genetic programming algorithm from our previous work (Assunção et al. 2015). Since this single-objective algorithm returns a single solution per run, we used averages per run and the set composed of the solutions of the 30 runs to reason about the differences in comparison with the multi-objective algorithms.

The fourth threat is related to the set of measures we used to evaluate the reverse engineered FMs. Different measures and metrics could produce FMs with other characteristics. However, we believe the three measures we considered are well designed for our goals of representing the actual set of product variants and taking the source code structure into account.

The last threat refers to the validation of the reverse engineered feature models by developers. Certainly, because there are no canonical representations of feature models, domain knowledge can play a significant role when choosing among different feature models. However, this threat is mitigated by considering real case studies that have been extensively studied by us and others for different purposes.

6 Related Work

Search-based techniques have been applied in a wide range of SPL activities such as feature selection, architectural improvement, SPL testing, and feature model construction (Harman et al. 2014; Lopez-Herrejon et al. 2015). To reverse engineer FMs, search-based algorithms were explored in the work of Lopez-Herrejon et al. (2012, 2015). In this work an evolutionary algorithm uses as input a set of desired feature sets and, as objective, a function that maximizes the number of the desired feature sets contained in a feature model, disregarding any surplus feature sets that the model could denote (Lopez-Herrejon et al. 2012, 2015). This work was extended by Thianniwet and Cohen (2015) to reverse engineer complex feature models. The authors designed a fitness function to balance both additional and missing products and created a representation and evolutionary operators to support complex cross-tree constraints. However, none of these works includes information from the implementation artefacts or a multi-objective perspective.

Feature sets were also used in the work of Haslinger et al. (2011, 2013). The authors used an ad hoc algorithm to identify patterns in the selected and not-selected features that map to parent-child relations of feature models (Haslinger et al. 2011). An extension was made to consider the CTCs requires and excludes (Haslinger et al. 2013). Again, the authors do not consider the implementation artefacts.

Czarnecki and Wasowski used a set of propositional logic formulas as input to the reverse engineering task (Czarnecki and Wasowski 2007; She et al. 2014). They propose an ad hoc algorithm to extract multiple feature models from a single propositional logic formula, preserving the original formulas and reducing redundancy (Czarnecki and Wasowski 2007). Recently, the algorithm has been improved based on CNF and DNF constraints (She et al. 2014). In contrast to our work, their starting points are configuration files, documentation files, and constraints expressed in propositional logic, instead of feature sets and source code.

Acher et al. proposed an interactive process to map each feature into a feature model, after which all the feature models are merged into a single feature model (Acher et al. 2012). Sannier et al. consider product matrices from Wikipedia as a starting point (Sannier et al. 2013). These matrices can have other values besides selected or not selected. From these matrices an analysis is performed to identify variability patterns. The authors only mention the benefits of exploiting that information to extract models such as feature models.

Genetic Programming is also the focus of other work on software development. An example is the paper by Chan et al., where the authors deal with the problem of product planning and customer satisfaction using a GP-based method (Chan et al. 2011).

There is extensive work using multi-objective optimization algorithms in the field of search-based software engineering, for instance in software module clustering, integration testing, testing resource allocation, protocol tuning, software project scheduling, software project effort estimation, and software defect prediction (Harman et al. 2012; Yao 2013). However, these works do not address the reverse engineering of FMs.

7 Future Directions

In this section we describe some research opportunities and trends in the task of reverse engineering feature models.

Note that the variability safety measure is not restricted to implications obtained from a dependency graph. A similar metric can be computed on arbitrary constraints, as long as it can be determined whether they hold or not. This is important because constraints may not only be provided in the form of a dependency graph produced by a code analysis tool, but also come from other sources, such as constraints defined by domain experts. Generally speaking, this measure expresses the conformance of the candidate feature models to a set of given constraints.

In this sense, the exploration of different sources of information is a future direction. For instance, the use of test cases is a possible target, since they are usually available in most projects. The comments in the source code are also a possible target for new research. Design models, like UML diagrams, are another interesting starting point to be considered.

Besides the use of different artefacts, another research opportunity is to design new measures and metrics to evaluate the feature models. For example, graph similarity could be applied to the dependency graphs of each variant. The use of semantic similarity measures among elements is another research opportunity. Non-functional characteristics of systems could be a further measure to be included in the reverse engineering process. We also identified a lack of complementary metrics to evaluate and validate the reengineered SPL artefacts; such metrics could facilitate the comparison between obtained and expected results.

There is a lack of tools that support the reverse engineering of feature models in practice. Recently, tools like BUT4Reuse (Martinez et al. 2015), a framework that provides technologies for leveraging the commonality and variability of software artefacts, have appeared. However, more tools with different purposes and focuses are needed; for example, tools covering different programming languages, different artefact types, etc.

Interactive approaches that include the user in the loop are an alternative strategy to extract information that sometimes is only present in the user's mind. Furthermore, interactive approaches allow the user to guide the reverse engineering process in preferred directions. Hence, performing empirical studies that involve users providing feedback on the generated models is an avenue for future research.

In the related work we identified, we did not find any strategy to deal with ambiguity in the input artefacts; in other words, existing strategies work only with well-defined and structured inputs. Dealing with incomplete or unsound input is an open challenge.

8 Concluding Remarks

This paper presents an approach to reverse engineer feature models from a set of feature sets and a dependency graph, extracted from the source code. The set of feature sets is used to compute the precision and recall of the feature models. The dependency graph is used to compute the variability safety, i.e. how well-formed the feature model is regarding the implementation artefacts. Furthermore, the approach takes a multi-objective perspective to deal with the three different measures independently, supporting the software engineers in the decision making process.

To evaluate the proposed approach we designed an experiment to answer questions regarding the benefits and advantages of using our multi-objective approach, the performance of two multi-objective evolutionary algorithms for reverse engineering FMs, and the practical use of a multi-objective perspective by software engineers. The experiment was conducted with twelve case studies and used NSGA-II, SPEA2, and a single-objective GP algorithm.

The results indicate that the multi-objective algorithms are better than the single-objective one, which was expected, because in five case studies the three measures are in conflict. From the number of solutions obtained after 30 runs and the average number of good solutions found per run, we observe that NSGA-II outperforms SPEA2. However, the statistical test on the Hypervolume values indicates no significant difference between them in ten case studies, and in the other two SPEA2 has better Hypervolume values.

Reverse engineering of feature models has recently received increasing attention from the research community on SPLs; nevertheless, our work also revealed several research avenues worthy of further investigation. Among them, we can mention: using different types of constraints in addition to the source code dependencies, exploiting new sources of information like non-functional properties, devising new metrics as objective functions, dealing with ambiguity in input artefacts, and developing more robust and interactive tools and techniques.