1 Introduction

Software quality is one of the most important issues in software engineering, drawing attention from both practitioners and researchers. Developing quality software is important, but preserving or increasing quality during maintenance is even more so. Controlling maintenance processes is crucial, and it is more difficult than controlling other phases of the software development cycle for several reasons: source code size and complexity, the number and frequency of maintenance tasks, the assortment of tools for controlling, managing, and documenting modifications, lack of staff experience, staff instability, etc. Maintenance processes consume most of the software development cost (Mens and Tourwé 2004). In order to improve and manage system maintainability, appropriate strategies, integrated with expert effort, should be applied.

Software pieces that suffer from poor programming, bad design, or other problems in source code or design should be identified and improved. The identification of these pieces is what we call Design Smell Detection. A Design Smell does not produce compile-time or run-time errors, but it does negatively affect system quality attributes, such as understandability, testability, extensibility, reusability, or maintainability in general (Pérez-García 2011; Yamashita and Counsell 2013). Design Smells are alerts, refactoring opportunity indicators (Bavota et al. 2015), or even alarm signals about an increase in Technical Debt that can lead to project failure.

Ward Cunningham presented the metaphor known as Technical Debt in 1992. Cunningham himself relaunched the concept in 2008 in a series of conferences around the world. Technical Debt is associated with a lack of quality due to, for example, the presence of Design Smells. Budgen et al. (2008) corroborated the evidence that a poor structure in source code or design (which is precisely the definition of a Design Smell) is one of the most important factors contributing to Technical Debt.

Several terms have been used in the literature to describe Design Smells. Code smell was the first related term, coined by Kent Beck as a hint that something has gone wrong somewhere in the code, when Beck and Fowler prepared a chapter, subsequently called "Bad Smells in Code," of the Refactoring book (Beck and Fowler 1999, Chapter 3, pp. 83–93). They set out a list of 22 Bad Smells and a description of how to recognize them. The authors even related each Bad Smell to the refactoring catalog (the sets of steps that can be applied to remove each Bad Smell from code). Robert C. Martin (2003) used the term Design Smells in reference to higher-level smells that cause the decay of the software system's structure. Other relevant terms can be found in the literature, such as Antipatterns (Brown et al. 1998), Disharmonies (Lanza and Marinescu 2006), Design Flaws (Salehie et al. 2006), Design Defects (Moha 2007), Code Anomalies (Wasylkowski et al. 2007), and Design Debt (Zazworka et al. 2011; Suryanarayana et al. 2014).

The term Design Smell was defined as a unifying term in Pérez-García (2011), along the lines of Martin (2003). Consequently, we adopt Design Smell in this paper as a concept similar to Code Smell or Bad Smell, as in Hassaine et al. (2010), Líška and Polášek (2011), Polásek et al. (2012), Suryanarayana et al. (2014), and Alkharabsheh et al. (2016a, b), but in a more general sense, covering the whole range of problems related to software structure, i.e., the design part.

Design Smells can be detected in different software artifacts from fine-grained to coarse-grained, including variables, instructions, operations, methods, classes, packages, subsystems, layers, and their dependencies. We can find several examples of Design Smells affecting different granularity levels, from methods (e.g., Long Parameter List from Bad Smells (Fowler et al. 1999)) to the whole system architecture (e.g., Stovepipe System from Antipatterns (Brown et al. 1998)). This could be one of the reasons why some authors have made further additions to smell terminology.

In the state of the art of Design Smell Detection, we have noticed a lack of systematic mapping reviews. Meanwhile, the volume of research in the domain has multiplied over the more than 17 years of activity in the field. This has led to the need for critical integration and evaluation of the available research in Design Smell Detection. In our opinion, understanding this area is important because, nowadays, a considerable number of software projects have huge dimensions, so manual Design Smell Detection is not realistic. Problems remain latent in code; detection usually occurs very late, and, by then, solutions are very complex. As a consequence, software quality is negatively affected and technical debt increases, so redoing the software becomes the most realistic option. From a comparative perspective, we believe that while refactoring has been extensively adopted by the software industry, Design Smell Detection is far from that level of adoption.

In our experience, this lower adoption of Design Smell Detection, compared with that of refactoring, seems contradictory, given the intensive activity in both fields and the intimate relationship between refactoring and Design Smell Detection. A systematic analysis of the state of the art is essential to identify and evaluate the areas that require additional research, compared with those where more research is available, and thus help to define future work with this target.

The methodology followed in this article includes planning (determining the research questions), the conditions and criteria for identifying and selecting primary studies, and a final report that provides an extensive overview of Design Smell Detection by answering the research questions. We explore different types of Design Smells and, in a broad sense, the different approaches and strategies in detection, tools, techniques, methods, and their validation. We follow the same methodology as Laguna and Crespo (2013) used in their systematic mapping study of software product line evolution and refactoring, which is based on the seminal papers of Kitchenham et al. (2006) and Kitchenham and Charters (2007). Moreover, we have taken a step further and applied conceptual modeling techniques to organize the knowledge in the area as a domain model.

The rest of the article is structured as follows: Section 2 comments on related work. Section 3 presents how the mapping study protocol has been planned (research questions, research strategy, classification schema, data extraction, conceptual model). Section 4 presents an overview of the study results obtained at each step of the process described in the previous section and an executive summary of the results. Section 5 presents the detailed study to answer the research questions. Section 6 discusses the findings and their implications. Section 7 presents the threats to the validity of this mapping study. Section 8 presents the conclusions, identifying tendencies and open problems. Appendix A lists the references of all the papers included in the systematic mapping.

2 Related work

Zhang et al. (2011) published a literature review identifying current knowledge concerning the 22 Bad Smells described by Beck and Fowler. They reviewed 319 papers (published from 2000 to 2009) and analyzed in detail 39 articles related to these Bad Smells. They found that some Bad Smells, such as Duplicated Code, have attracted the most attention from researchers, while others, such as Message Chain, receive the least. They attribute this to the fact that Duplicated Code has a completely different nature from other Bad Smells, is easy to understand, is widespread in code, and developers are aware of the maintenance problems it can cause. They also found that the focus of the research papers can be split into two parts: developing novel tools/methods for Bad Smell detection and improving the current understanding and knowledge of Bad Smell refactoring.

Rattan et al. (2013) presented a systematic literature review of software clone detection (another name for Duplicated Code detection). This code smell can be understood as the copying and pasting of code fragments during software development, with or without modification. They used the search string "code clone" in the title or abstract of articles. A set of 213 articles was analyzed in detail (published from the initial date of the digital library to 2011). The existing literature about Code Clones was classified broadly into different categories: semantic and model-based clone detection, empirical evaluation of clone detection, tools/techniques, clone management, etc. They found contradictory studies: several stating that Code Clones are harmful and others stating the contrary.

Rasool and Arshad (2015) published a review study of the tools and techniques used for what they call code smell mining. The study covered the period from 1999 to 2015, with 42 articles reviewed and analyzed in detail. The authors focused only on Fowler's 22 Bad Smells and classified the tools and techniques based on their detection methods. They then compared a set of tools for detecting four Bad Smells and found that different tools and techniques rendered different results.

Fernandes et al. (2016) performed a review-based study summarizing and comparing the available Design Smell Detection tools. They analyzed 107 papers relevant to detection tools published from 2000 to 2015. They found 84 tools that detect different types of Design Smell, support programming languages such as Java, C, C++, and C#, and are distributed as commercial or open-source releases. They then conducted a comparative study of four detection tools in detecting two of Fowler's Bad Smells and found (in contrast to Rasool and Arshad (2015)) high agreement between the tools.

Singh and Kaur (2017) conducted a systematic review of refactoring concerning code smells and Antipatterns. Two hundred and thirty-eight papers were reviewed and analyzed, from the initial dates of the digital libraries to September 2015. The authors focused on identifying the current status of refactoring, types of approaches, and tools. They identified six types of approaches for performing refactoring: automatic methods, metric-based methods, traditional methods, visualization-based techniques, semi-automatic methods, and empirical studies. Different tools were also used for detecting Design Smells.

Recently, Gupta et al. (2017) published a literature review studying code bad smells in Java code. Sixty papers, from 1999 to 2016, were analyzed in depth. The authors focused, on the one hand, on the detection techniques and the correlation between them and, on the other hand, on identifying Design Smells that require more investigation from the research community. In our study, we have discarded the term "code clone" from the search string, because we assume that it is sufficiently covered by Rattan et al. (2013), to which any interested researcher can be redirected. Our research interest focuses on Design Smell Detection in a broad sense.

This paper presents a systematic mapping that differs from the previously mentioned studies by focusing, on the one hand, on all types of Design Smell (Bad Smells, Antipatterns, Disharmonies, etc., using Design Smell as a unifying term) and, on the other hand, on the detection activity and some other related activities, such as specification, correction (refactoring), and prioritization. Consequently, the main goal of our systematic mapping study is to collect and organize the knowledge on Design Smell Detection in general (approaches, tools, techniques, datasets, quality factors, etc.) and not just tools, as is the case in some related work. Table 1 shows a brief comparison between our work and the previous works regarding the goal, period of study, number of selected papers, and total number of Design Smells (DS) covered in each study.

Table 1 Related works summary

3 Systematic mapping methodology

Researchers have employed several research methods, such as experiments, surveys, case studies, and systematic reviews. In our study, to assess the current status of the Design Smell Detection domain, we conducted a comprehensive systematic mapping study. In this section, we describe the phases of the method we use and how the mapping study has been planned.

The methodology includes defining and answering the research questions regarding the topic of research; defining the research strategy (scope of information, search string, period of study, search method, and inclusion and exclusion criteria); the classification of the included works; and the synthesis of the results. Figure 1 presents the flowchart of our systematic mapping process. The full details of the steps in the chart are described in the following subsections.

Fig. 1 Systematic mapping process performed for this paper

3.1 Research questions

This study aims to organize the knowledge on the software Design Smell Detection domain. To clarify the scope of this knowledge, several concepts should be identified before formulating the research questions. The core concepts identified mainly concern the type of detected Design Smell, the different approaches to detection, the availability of detection tools, the validation evidence for the proposed approaches, and the influence on quality factors. The goal of the study is thus broken down into five research questions (RQs):

  • RQ1 Which types of Design Smell are detected?

This question can be completed with two subquestions:

  (a) Which types of Design Smell are dealt with in papers and tools?

  (b) Which types of Design Smell are more frequently detected in software?

Different types of smell have been found in the literature. We want to identify which Design Smells have gained the focus of researchers, determine which types of smell are widely detected in applications, and measure the frequency with which each Design Smell appears. This information can help to reduce the gap in Design Smell Detection research, identify directions for future work, and plan educational actions to improve developers' programming skills in specific aspects.

  • RQ2 What approaches have been proposed to detect Design Smells in software?

Several Design Smell Detection approaches have been proposed to detect Design Smells manually, semi-automatically, and automatically. We want to know which approaches are used, how often each approach is used, whether compound approaches exist that work together, and the influence of these approaches on smell detection. This information helps to identify differences in results across approaches, as well as new trends.

  • RQ3 Which (prototype) tools have been used to detect Design Smells in software?

Different (prototype) tools have been developed to detect Design Smells in software. The goal of this question is to find out which tools are used and how often, and whether each tool depends on a particular type of Design Smell, representation model, or software artifact. We can thus map the type of smell, the approach, and the kind of artifact to the tools.

  • RQ4 Is the (prototype) tool validated by experts, by benchmarks, or by comparison with other tools? Does the tool or strategy assess results and measure performance, precision, and recall?

This question aims to assess the results that the proposed tool or prototype has produced. This assessment can be done through human experts, benchmarking, or comparison with previous results from other tools/prototypes. We want to know which types of validation evidence were used (case studies, experiments, surveys) and which types of performance evaluation were applied (precision, recall, false positives, false negatives, etc.). We also look at whether there are guidelines from human experts to enhance the results of the automatic detection provided by tools.

  • RQ5 Is Design Smell Detection related to the quality attributes of a quality model?

This question explores whether Design Smell Detection is useful for solving a particular quality problem, determining which internal or external attributes of the quality model are affected. Relating Design Smell Detection to quality factors could be interesting for developers: they can then know whether detecting, and later removing, a Design Smell will have an impact on quality factors, and this can also help in prioritizing smells.

3.2 Search strategy

The search strategy addresses the search string we propose, the scope of the search, the search period, the search method, and the inclusion/exclusion criteria we identify. In order to achieve broad coverage of the state of the art, the search string must be carefully defined.

3.2.1 Search string

We constructed the search string according to the main goal and the research questions. The string should be simple, so as to return broad results while still covering the topic precisely. We use the OR Boolean operator to link the main terms and their synonyms. The final search string is:

(“Design Smell” or “design-smell” or “bad smell” or “bad-smell” or “code smell” or “code-smell” or “design defect” or “design-defect” or “design flaw” or “design-flaw” or “antipattern” or “anti-pattern” or “disharmony” or “disharmonies”) and (“detection” or “detecting” or “identification” or “identifying” or “finding” or “empirical”)

In order to apply the string composed with the AND/OR Boolean operators, we used the advanced search available in each database.
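To make the composition concrete, the short sketch below (ours, purely illustrative; it is not part of the original search protocol) assembles the final string from its two OR-groups, so it can be pasted into each database's advanced search form:

```python
# Illustrative sketch: build the search string from its two OR-groups.
SMELL_TERMS = [
    "Design Smell", "design-smell", "bad smell", "bad-smell",
    "code smell", "code-smell", "design defect", "design-defect",
    "design flaw", "design-flaw", "antipattern", "anti-pattern",
    "disharmony", "disharmonies",
]
ACTIVITY_TERMS = [
    "detection", "detecting", "identification", "identifying",
    "finding", "empirical",
]

def or_group(terms):
    """Quote each term and join the group with OR, wrapped in parentheses."""
    return "(" + " or ".join(f'"{t}"' for t in terms) + ")"

# AND-combine the two groups, exactly as in the final search string above
search_string = or_group(SMELL_TERMS) + " and " + or_group(ACTIVITY_TERMS)
print(search_string)
```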

3.2.2 Search scope

To obtain high coverage of the relevant studies and publications, six electronic databases were included in the search scope. According to Dyba et al. (2007), Laguna and Crespo (2013), Novais et al. (2013), and Vasconcellos et al. (2017), in their systematic mapping studies, these electronic databases are the most popular and efficient for conducting systematic studies in the context of software engineering and reengineering, and they are strongly recommended for searching widely. These databases are Science Direct (www.sciencedirect.com), Scopus (www.scopus.com), Web of Science (www.isiknowledge.com), IEEE Xplore (https://ieeexplore.ieee.org/Xplore/home.jsp), ACM Digital Library (www.acm.org), and Springer (www.springerlink.com). In order to find further useful studies that do not appear in the standard search process, the Google Scholar (https://scholar.google.com/) database was also included, despite the fact that its results tend to overlap with those of the selected databases.

3.2.3 Search period

The time period covered all related papers published in books, journals, conferences, symposiums, and workshops from January 2000 up to the end of December 2017. Design Smell Detection became popular around the beginning of 2002; nevertheless, the seminal paper on Smells is considered to be Fowler and Beck's chapter on Bad Smells, published in 2000, so we chose this date as the starting point. We started working on this systematic mapping in January 2018.

3.2.4 Search method

We used both automatic and manual searches in the study. In the automatic search, the search engine provided by each electronic database checks the search terms against the metadata of each paper in the database, including the title, abstract, and keywords. In the manual search, on the one hand, we looked inside each conference and workshop proceedings listed in the databases that was related to the search string, looking for papers closely relevant to our mapping study that did not appear in the automatic search. On the other hand, we adopted the snowballing technique (Wohlin 2014), analyzing the reference lists of the studies finally selected from the automatic search (electronic databases) so that relevant cited papers could also be included in our mapping study.

3.2.5 Inclusion/exclusion criteria

The selection criteria aim to find all papers relevant to the topic of our systematic mapping, as follows.

  • Inclusion criteria:

  • Papers published from January 2000 to December 2017

  • Full papers published in conferences, journals, workshops, symposiums, or as book chapters

  • Papers where the search string appears in the title, abstract, or keywords

  • Exclusion criteria:

  • Papers that turn out to be unrelated to the search string once their text is explored manually

  • Reports, position papers, PhD theses, research proposals, projects

  • Duplicates of the same study in different versions, journals, conferences, and workshops

  • Papers whose full text is not available

  • Papers not written in English

Each selected study must satisfy all inclusion criteria and must not satisfy any of the exclusion criteria. Despite applying the inclusion/exclusion criteria to select the relevant papers, we still found papers that do not answer the research questions, because the keywords in the search string may have different meanings or may be used in studies beyond the Design Smell Detection topic; for example, the word "smell" can be related to sensors. A cursory reading was therefore required to obtain the final relevant papers, after which the number of papers decreased significantly, as can be seen in Fig. 2.

Fig. 2 Flowchart of the search and filtering process

Figure 2 shows the filtering process after searching the electronic databases and obtaining the primary studies. The first search process returned 3962 papers; discarding duplicated papers reduced the list to 2643. Discarding non-English papers, papers without full text, reports, position papers, PhD theses, and proposals reduced the list to 2476. The decision to exclude PhD theses could be controversial. We decided to exclude them because we found that most theses relevant to the search string include a list of publications (proceedings and journal papers) already included in this work. Furthermore, we think the final number of selected studies is enough to identify the current knowledge of the Design Smell Detection field.

Finally, after manually exploring the text based on the title, abstract, and keywords to discard non-related papers (those not relevant to any of the research questions), we obtained 328 selected papers; after snowballing, we ended up with 395 relevant papers, which represent 10.7% of the initial 3962. These papers are listed in Appendix A.
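For clarity, the screening logic can be read as a single predicate over each candidate paper. The sketch below is a hypothetical rendering of the criteria above: the field names (e.g., full_text_available) are ours, and the keyword test is a simplification of the real search-string matching.

```python
# Hypothetical rendering of the inclusion/exclusion criteria; field names
# and the simplified keyword test are ours, not the authors' actual tooling.
SMELL_TERMS = {"design smell", "bad smell", "code smell", "design defect",
               "design flaw", "antipattern", "anti-pattern", "disharmony"}
ACTIVITY_TERMS = {"detection", "detecting", "identification", "identifying",
                  "finding", "empirical"}
EXCLUDED_TYPES = {"report", "position paper", "phd thesis",
                  "research proposal", "project"}

def is_selected(paper):
    """paper: dict with year, type, language, full_text_available,
    title, abstract, and keywords (a list of strings)."""
    if not 2000 <= paper["year"] <= 2017:
        return False                      # outside the study period
    if paper["type"].lower() in EXCLUDED_TYPES:
        return False                      # excluded publication type
    if not paper["full_text_available"] or paper["language"] != "English":
        return False
    text = " ".join([paper["title"], paper["abstract"],
                     *paper["keywords"]]).lower()
    # the search string: one smell term AND one activity term must appear
    return (any(t in text for t in SMELL_TERMS)
            and any(t in text for t in ACTIVITY_TERMS))
```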

3.3 Classification schema

According to Kitchenham and Charters (2007), in phase two of a systematic literature review (SLR), a quality assessment process is used for analyzing and assessing the selected papers to be included in the data extraction and reporting process. Classifying the papers into facets provides a good basis for answering the research questions. Each facet is defined by means of different related keywords. As a result of the cursory reading, a set of facets into which the papers can be classified was determined. In our case, the facets were also inspired by the classification framework for Design Smell Management, based on features, proposed by Pérez, Moha, and Mens in Pérez et al. (2011).

Table 2 shows the facets we originally defined from this process and the selected keywords we used to categorize the selected papers in the field of study.

Table 2 Facets and keywords initially used to categorize the selected papers in the fields of study (inspired by Pérez et al. 2011)

3.4 Data extraction

One of the most important aspects of a systematic mapping is extracting data from the relevant papers after the final search process is completed. A data extraction form is used to extract information relating to the research questions and to the facets stated in Section 3.3. The following items were included in the data extraction form:

  • Paper reference data: title and year, author’s name and email, author’s institution

  • Publication data: type (conference, journal, workshop, symposium, chapter in a book), characteristics (full name, acronym, series, publisher)

  • Facets defined in Section 3.3: type of Design Smell detected and their description, scope of Design Smell, level of Design Smell, type of target artifact, development paradigm of the target artifact

  • Validation evidence reported by the authors of the papers included in the study, in terms of the type of validation and indicators of performance evaluation (precision, recall, false positives, false negatives, etc.)

  • Regarding the validation, we found it useful to extract the following information: name and version of the software used in validation; source of that software (open source software from SourceForge, GitHub, or other public repositories; proprietary software from industry; or some benchmark definition, if any); URL (if any); some metrics characterizing the software (lines of code; number of packages, classes, and methods); the total number of Design Smells detected in the software (if any); the count of each type of detected Design Smell (if any); and the implementation language of the software used in validation.

After the final search process, the data extraction procedure can be summarized as follows to extract the required information:

  • Reviewing in depth the abstract and conclusions of each article

  • If the extracted information is not found in the abstract and conclusion, the full paper should be read in detail

In order to answer some of the research questions, and after the first data extraction, we introduced extra information items into the data extraction form. These items can be very useful for organizing knowledge in this field. Hence, we iterated on the data extraction in order to complete all this information:

  • Quality model and quality factors regarding the Design Smell Detection (maintainability, reliability, etc.), if mentioned or analyzed

  • Approach

  • Techniques (algorithms, heuristics, strategies)

  • Degree of automation

  • Related activity with Design Smells (specification, detection, etc.). This was introduced in order to have information not just on the main focus (already collected in the first stage), but also on other activities tackled in the paper

  • The internal representation of the software under analysis (Graph, AST, etc.)

  • Related automatic Design Smell Detection tool (name, URL, free, open source, supported language, automation)

  • Regarding the presence of tools in papers, we collected the kind of presence (i.e., introducing the tool for the first time, comparing tools, improving an existing tool, etc.) and the list of detected Design Smells.

In this step, we used an Excel spreadsheet to document all items obtained from the extraction process. Then, we analyzed the extracted data and applied domain modeling techniques in order to obtain a conceptual model and produce a domain model of Design Smell Detection.

The accomplished domain modeling makes the task more systematic and provides information for the quantitative and qualitative analysis of the domain of knowledge, as described in the next subsection. The domain model was the basis for the relational database design. We designed a preliminary Access database as a prototype to store all items obtained from the extraction process. After that, a full review of the extracted data was done on the Excel spreadsheet in order to populate the Access database and to check whether some information was missing or misplaced according to the obtained model. A data migration process was then applied from the Access database to a fully revised MySQL database.
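As a rough illustration of how the domain model maps onto relational tables, the following sketch creates a small subset of such a schema in SQLite; the table and column names are our assumptions for illustration, not the actual Access/MySQL schema used in the study.

```python
# Minimal sketch (assumed names, not the study's schema) of the mapping
# from domain concepts to relational tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    year  INTEGER
);
CREATE TABLE smell (
    id            INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    smell_concept TEXT,  -- e.g. 'Code Smell', 'Antipattern'
    smell_scope   TEXT   -- e.g. 'class', 'method'
);
-- many-to-many association: which articles deal with which smells
CREATE TABLE deals_with (
    article_id INTEGER REFERENCES article(id),
    smell_id   INTEGER REFERENCES smell(id),
    PRIMARY KEY (article_id, smell_id)
);
""")
```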

Finally, the MySQL database was included in a web application with a twofold purpose. On the one hand, it is a convenient way to make all this knowledge available to the community. On the other hand, we designed the website as a collaborative application where researchers can comment, propose different classifications for existing data, and introduce new knowledge in future work on this topic.

3.5 Conceptual model of the knowledge domain

Figure 3 shows the conceptual model represented as a UML class diagram. This conceptual model describes the Design Smell Detection domain of knowledge as an evolution of the previously described facets and the results of several iterations on the data extraction process. According to the lessons learned from analyzing the extracted data, we proceeded to apply domain modeling techniques. As the figure shows, one corner of the model is devoted to articles and authors. This part was modeled in such a way that it takes into account cases where the same author moves, over time, from one institution to another.

Fig. 3 Design Smell Detection knowledge as a conceptual model

Although 1-to-1 associations are not considered good practice in domain modeling, we separated Article and Approach into two concepts in order to keep a kind of descriptive card of the Article in the Approach. As can be seen in the attributes of Approach, we used the enumerated types shown in Fig. 4. The type of some attributes is marked with an asterisk ("*"), signifying in conceptual modeling that there can be more than one value. For instance, an article can present an approach that combines different types of approaches (e.g., metric-based and machine learning-based) or can deal with different types of artifacts (e.g., source code and test cases); the sketch after Fig. 4 illustrates this.

Fig. 4 Enumerated types as a result of evolving facets
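To make the "*" convention concrete, here is a minimal sketch of how such a set-valued attribute translates into code; the enum literals shown are only a few illustrative values, not the full enumerated types of Fig. 4.

```python
# Sketch of a "*"-marked (multi-valued) attribute as a set-valued field.
# Enum literals below are illustrative; Fig. 4 lists the real ones.
from dataclasses import dataclass, field
from enum import Enum

class TypeOfApproach(Enum):
    METRIC_BASED = "metric-based"
    RULE_BASED = "logical/rule-based"
    MACHINE_LEARNING = "machine learning-based"
    # ... remaining literals of the enumerated type

class TypeOfArtifact(Enum):
    SOURCE_CODE = "source code"
    TEST_CASE = "test case"
    # ...

@dataclass
class Approach:
    # attributes typed "TypeOfApproach*" in the model hold a set of values
    types: set = field(default_factory=set)
    artifacts: set = field(default_factory=set)

# an article combining two approach types over two artifact types
a = Approach(types={TypeOfApproach.METRIC_BASED,
                    TypeOfApproach.MACHINE_LEARNING},
             artifacts={TypeOfArtifact.SOURCE_CODE,
                        TypeOfArtifact.TEST_CASE})
```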

Articles can mention tools. Each mention of a tool in an article can be characterized in terms of the values represented by the enumerated type KindOfPresence, such as introducing the tool for the first time, improving an existing tool, comparing the tool with other tools, or reviewing some tool in depth.

Tools are described in terms of some plain data modeled as attributes, but also by means of the association with the Smells the tool can detect. The attribute called "automation" seems to be duplicated in Tool and Approach, but this is intentional. The degree of automation of a tool is considered different from the degree of automation of an approach described in an article that does not mention any particular tool, or that mentions several tools in a comparison or related-work review.

Smells deserve their own concept in the model, with attributes describing, for instance, the scope of the smell (such as package, class, method, …). We also considered it important to represent the Concept the authors use to refer to a particular smell. As observed previously, authors may use the terms Code Smell, Bad Smell, Design Flaw, …, and we wanted to analyze this and make some clarifications.

In order to answer some of the research questions, such as RQ4, we model whether an article shows any validation evidence. If so, we describe, following Wohlin et al. (2012), whether the authors conducted a case study, a survey, an experiment, etc. Projects are also associated with the validation evidence.

Finally, RQ5 is concerned with quality factors. In the model, we captured the possibility of relating the detection of some smells to certain quality factors. This relation is characterized in terms of the impact the smell has on the quality factor. We considered the possibility that a smell can relate to more than one quality factor and that, while the presence of a smell usually affects a quality factor negatively (the intuitive case), it can also affect one positively (for example, the presence of a smell may improve efficiency but hinder maintainability).

Projects are used in case studies, experiments, and so on. Projects are characterized by attributes recording the project domain, lines of code, and the number of classes. We also describe, on the one hand, which Smells are detected in the projects used in validation and, on the other, which indicators are presented as results of the validation.

Once the model was considered definitive, a relational database design was derived, and the physical design of a MySQL database was carried out. Then, we migrated the data from the original Access database containing the data extracted at the first stage and completed it with the new data required to populate the domain-of-knowledge model. The data were double-checked in order to guarantee the quality of the data extracted and stored in the database. The purpose of the final migration to the MySQL database was to obtain a web application providing the knowledge in this domain, which would then be put in the hands of the community, as mentioned before.

In order to give a descriptive analysis of the collected data and to answer the research questions, several SQL queries were defined, and the results were processed either with Excel and R or with the tool ConExp, which was used to obtain the Galois lattice when formal concept analysis (FCA) was applied to discover structure and relationships in the data.
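For example, one of these descriptive queries might look like the following. This is our sketch, written against the illustrative tables from Section 3.4 rather than copied from the study; it ranks smells by the number of selected papers dealing with them (cf. Fig. 12).

```python
# Our sketch, assuming the illustrative smell/deals_with tables from
# Section 3.4: rank smells by how many selected papers deal with them.
MOST_CITED_SMELLS = """
SELECT s.name, COUNT(DISTINCT d.article_id) AS n_papers
FROM smell AS s
JOIN deals_with AS d ON d.smell_id = s.id
GROUP BY s.name
HAVING n_papers > 25
ORDER BY n_papers DESC;
"""
```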

4 Descriptive analysis

We conducted the mapping study according to the steps described in Section 3. After the search, filtering, selection, and data extraction procedures, we have a global view of the investigated field. First, we present an overview of the results based on different aspects of the mapping study. Then, we analyze the data from the selected papers through statistics and FCA operations to answer the research questions. In this descriptive analysis, we focus on the most used approaches and tools, as well as the most detected Design Smells. For more details on the rest of the study results, all extracted information is available on the web application connected to the fully populated database at https://smellswisdom.herokuapp.com/.

4.1 Overview of results

In the following sections, we describe the results regarding the study scope, study selection, and demographic dimensions (time period, publication type, geographical area).

4.1.1 Search scope results

Table 3 shows the total of 3962 papers resulting from the first search process, the number of papers per electronic database, and the percentage each represents. It can be observed that Springer, Scopus, and ACM DL returned the largest sets of papers. In these databases, we found many papers from other areas that were not related to our study. The IEEE Xplore database returned just 173 papers, but a majority of them were relevant to the study, so it was, proportionally, more effective than the remaining databases. Regarding the Google Scholar database, 322 relevant papers were returned, all of which were duplicates of results from the six main databases.

Table 3 Distribution of selected studies over electronic databases

4.1.2 Time period results

Figure 5 presents the distribution of relevant papers over the time period from 2000 to 2017. During the year 2000, no papers were published in journals or proceedings, only a book chapter. In fact, this book chapter is not a result of the systematic mapping search; it was included manually, as a reference, because it represents the seminal work of Kent Beck and Martin Fowler (Fowler et al. 1999), introducing the term Bad Smell, describing the hints for detecting Bad Smells, and relating them to Refactorings.

Fig. 5 Distribution of relevant papers over the time period, per publication type

From 2000 to 2003, the number of published papers is fewer than 5 per year, increasing slowly. As Fig. 5 shows, from 2004 to 2009, the increase is much clearer. From 2010 until the end date of this study (December 2017), there has been a big leap compared with the years before 2009, with a peak in the last 3 years, specifically in 2016.

4.1.3 Publication type results

The selected 395 papers were published as journal articles, proceedings papers (conferences, workshops, and symposiums), or book chapters. Figure 5 shows the distribution of the selected papers over publication types and years. Approximately 76% (301 studies) of the selected papers were published in proceedings: 200 in conferences, 64 in workshops, and 37 in symposiums. A further 22% (89 studies) were published in leading journals, while only 1% (5 studies) were published as book chapters.

Figure 6 shows the proceedings that published five or more papers included in this study, and Fig. 7 shows the journals that published three or more papers. Table 4 gives the full name corresponding to each proceedings/journal acronym. As can be seen, the ICSE conference, published by ACM/IEEE, has the highest number of selected publications (17 studies) of all the proceedings. More than 80% of the relevant papers in proceedings are covered by ICSE, CSMR, ICSM, ICPC, ICASE, ICSME, WCRE, SAC, and SANER. On the other hand, more than 80% of the relevant papers published in journals are covered by TSE, JSS, IST, ESE, SEN, and SQJ. The quality quartile ranking (JCR) for the selected papers ranged from Q1 to Q3 for all journals along the study period, as follows: Q1 (17 studies), Q2 (13 studies), Q3 (5 studies). Only four papers were published as book chapters.

Fig. 6 Distribution of papers in proceedings (having published three or more papers included as relevant in this study)

Fig. 7 Distribution of papers in journals (having published three or more papers included as relevant in this study)

Table 4 List of proceedings/journal acronyms and the corresponding full names

4.1.4 Authors and institutions results

The authors of the relevant papers in the study are from 284 institutions in 48 different countries. Table 5 shows the distribution of the number of institutions and authors by country. Most of the authors (94.5%) were from the academic sector, while 5.5% were from industry. In some cases, authors from the academic and industry sectors worked together to publish papers; there are 28 collaborative academic/industry papers in the study: [S21, S34, S37, S55, S68, S85, S107, S113, S142, S182, S189, S193, S202, S211, S233, S239, S240, S255, S262, S263, S274, S284, S302, S329, S383, S389, S391]. The highest numbers of articles were published by Yann-Gaël Guéhéneuc (27), Francesca Arcelli Fontana (23), Marouane Kessentini (18), Naouel Moha (15), and Houari Sahraoui (12). By institution, the highest numbers of articles were published by authors from the Université de Montréal in Canada (30 papers) and the Università degli Studi di Milano-Bicocca in Italy (21 papers).

Table 5 Distribution of the number of institutions and authors by country

More than half of the papers (207) were written by authors of the same institution. Hence, the remaining 188 papers were written in collaboration across institutions.

There are several examples of collaborations within the same country, but also across different European countries (35 papers), across institutions in North America (USA and Canada: 6 papers), across institutions in South America (Brazil and Argentina: 3 papers), and across Asian countries (Japan and Qatar: 1 paper). A further 39 papers (10%) were produced as cross-continental collaborations, namely: [S5, S11, S22, S24, S51, S62, S72, S81, S82, S85, S95, S96, S106, S107, S113, S116, S118, S134, S135, S137, S151, S154, S155, S159, S160, S161, S193, S40, S238, S218, S264, S301, S314, S315, S316, S318, S339, S378, S381].

4.1.5 Key terms in search string

The search terms appear clearly in the titles of 284 articles and in the keywords of 226 articles. Most authors used the concepts "Antipattern" and "Code Smell" to describe Design Smells. More details on this are presented in Section 5.1.1.

Figure 8 shows a cloud of tags produced using https://tagcrowd.com/. TagCrowd was configured to show words appearing more than 5 times in the text, grouping similar English words (e.g., families of words such as learn, learned, and learning into a single word), ignoring common English words (prepositions, articles, etc.), and removing some other words, such as "based" or "towards" (a rough analogue of this configuration is sketched after Fig. 8). The aim of showing this cloud of tags is to offer a picture of the concepts that were relevant to authors when titling their papers in this domain of knowledge.

Fig. 8 Cloud of tags produced from joining the titles of the relevant papers
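For readers who want to reproduce a similar cloud from paper titles, a rough analogue of that configuration is sketched below; the stopword list and the crude suffix folding are our simplifications, not TagCrowd's actual algorithm.

```python
# Rough analogue (ours, not TagCrowd's algorithm) of the cloud settings:
# count words in the joined titles, fold simple inflections together,
# drop stopwords and extra exclusions, keep words appearing > 5 times.
from collections import Counter
import re

STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "or", "to",
             "with"}
EXTRA_EXCLUSIONS = {"based", "towards"}

def stem(word):
    # crude suffix folding so 'learning'/'learned' both count as 'learn'
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tag_frequencies(titles, min_count=5):
    words = re.findall(r"[a-z]+", " ".join(titles).lower())
    counts = Counter(stem(w) for w in words
                     if w not in STOPWORDS | EXTRA_EXCLUSIONS)
    return {w: c for w, c in counts.items() if c > min_count}
```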

5 Answering the research questions

5.1 RQ1

Which types of Design Smell are detected?

In order to answer the first research question, we tackled the answer from different perspectives: the set of concepts and terms that authors used to describe problems in software structures (code and design); the scope of the Design Smells dealt with in the studies; and the most mentioned and detected Design Smells in the selected papers. The information was extracted, according to the conceptual model, from the "deal with" association between the concepts "Article" and "Smell," the attributes "smellConcept" and "smellScope" of "Smell," and the attribute "absoluteFrequency" of the association class "Details," which stores information on the association between the concepts "Smell" and "Project" in which each smell is detected.

5.1.1 Design Smell concepts

From our previous experience, we had noticed the variety of terms referring to Design Smells, so we introduced them all in the search string. During the data extraction process, we confirmed that several terms in the state of the art were used to describe similar concepts. Figure 9 presents the distribution of the different terms for Design Smells in the selected papers. As can be seen, more than 80% of the smells cited or detected in the selected papers were referred to by their authors as Antipattern, Code Smell, or Bad Smell. Nevertheless, there is a consistency problem in the terminology used by authors: the same Design Smell is described by several different terms. Examples of this are The Blob, which is mentioned in different papers as an Antipattern or a Design Defect, and Data Class, which is mentioned as a Code Anomaly, Design Flaw, Bad Smell, etc.

Fig. 9 Classification of Design Smell concepts based on the terminology

In the relevant papers obtained, we found various efforts to classify Design Smells. The authors of [S117, S131, S208, S219] used different classifications.

In [S208], the authors proposed a classification of Design Flaws into three categories: Structural, Behavioral, and Architectural flaws. These categories can intersect with each other. In [S117], Design Defects are classified similarly, into Structural, Behavioral, and Semantic smells. According to that study, Data Clumps and Long Parameter List are considered Structural, Comments are considered Semantic, and God Package, God Class, and Feature Envy are considered Behavioral Design Defects.

In [S136], the authors used the Design Smell concept to describe Code Smells and Antipatterns. They classify Design Smells based on the level of granularity as Inter-Class and Intra-Class. In addition, at each level, the Design Smells are divided into three groups according to their characteristics: Structural, Lexical, and Measurable. Functional Decomposition, The Blob, Duplicate Code, and Message Chain are considered Inter-Class, while Data Class, Large Class, Spaghetti Code, and Swiss Army Knife are considered Intra-Class Design Smells.

In [S219], the authors introduce a view of different Design Smell concepts. In their study, Bad Smells is the most general concept, involving both Code and Architectural Smells. Hybrid Smells combine characteristics of both Code and Architectural Smells. Finally, the term Variability Smells, a subset of Hybrid Smells, is specific to software product lines, combining characteristics of both Code and Architectural Smells.

In their book, Brown et al. (1998) classify Antipatterns into three categories: Code development, Architectural, and Management. In this context, the Management Antipatterns are not taken into account, as they are not related to Design Smells.

Another classification of smells we found is given by Pérez-García (2011), according to the problem level; smells are categorized as low-level and high-level smells. The low-level smells are related to particular problems in the code, such as Large Class or Long Method. The high-level smells are related to more complex problems that may be detected in the structure and code, such as The Blob or Stovepipe System Antipatterns. Some of the low-level smells are equivalent to the Code Smell classification, some of the high-level smells are equivalent to the Architectural Smell classification, and some others can be considered a kind of Hybrid Smell.

As shown, some of these classifications have in common the recognition of what the authors call "Code" and "Architecture" categories. It is also interesting to consider the possibility of a "Hybrid" category. In light of the above, as part of the conceptual modeling, we organized the terminology used in the different included studies. Figure 10 shows this organization, which cannot be considered a taxonomy because it is based on the terminology and descriptions used by the authors of the different papers. Design Smell is the most general concept, unifying all other smell types.

Fig. 10 Classification of Design Smells based on the authors' terminology and smells

In the figure, Design Smells are classified into three main types: Code Smells, Hybrid Smells, and Architectural Smells. Code Smells are defined at the implementation (lower) level and include what the authors of the papers call Bad Smells in Code, Code Antipatterns, Code Anomalies, Lexical Smells, or Change Smells. Architectural Smells are defined at the architecture (higher) level (components, connectors, styles, packages, subsystems, communications) and involve what the authors call Architectural Antipatterns and Architectural Bad Smells. The Hybrid Smells combine the previous two types, Code and Architectural Smells; this category includes what authors call Design Defects, Disharmonies, Design Flaws, Variability Smells, and Usability Smells.

5.1.2 Design Smell scopes in code

Different sets of Design Smells are detected with different scopes at the code level. Smell scopes range from a small operation (instruction) to large systems. The part of the conceptual model describing the enumerated types shows the scopes we considered (system, subsystem, package, class, method, operation). Figure 11 presents the distribution over these scopes of the different Design Smells mentioned in the selected papers. As can be seen, most Design Smells focus on the class (246) and method (183) levels, while few smells focus on the package (4) and subsystem (5) levels.

Fig. 11 Distribution of Design Smells over scopes in code

5.1.3 Joint Design Smell catalog

In the selected papers, we found 662 different Design Smells, present in the different catalogs and assigned by authors to the different smell concepts set out in Fig. 10. Putting them all together in a joint Design Smell catalog, they can be analyzed as a whole. We are particularly interested in which smells attract the attention of the research community and which smells are the most detected in software.

Figure 12 presents the most cited Design Smells in the selected papers; we selected for the figure those smells cited in more than 25 papers. The group formed by Feature Envy, God Class, Long Method, Data Class, Shotgun Surgery, The Blob, Long Parameter List, Refused Bequest, Large Class, Duplicate Code, Message Chain, and Data Clumps makes up 80% of the most cited smells (more than 40 papers each). It can be said that they have attracted the attention of the research community more than other smells.

Fig. 12 Distribution of most cited Design Smells over the selected papers

We have also noticed, from the definitions of the different Design Smells, that different authors have used different smell names to describe the same problem in design/code. In our experience, the God Class Disharmony, The Blob Antipattern, and the Large Class Code Smell describe similar problems that can essentially be considered the same. From this set, the most cited Design Smell in the selected papers is God Class. However, several studies treat these smells (The Blob, God Class, Large Class) independently, such as [S136, S237, S301, S318, S380], and the authors of [S135, S298, S299] noted that The Blob is also called God Class but differs from Large Class. Therefore, we decided to present the relevant information for these smells separately in this work.

Figure 13 presents the distribution of the most detected Design Smells over the total number of projects used in validation (case studies, experiments, etc.) in the relevant papers. The figure only shows those smells detected in more than 40 projects. Long Method and Feature Envy are detected in the highest number of projects, with God Class, Long Parameter List, and The Blob coming next. As can be seen, Long Method, Feature Envy, and God Class have been detected in 65% of the projects used in validation.

Fig. 13 Distribution of the most detected Design Smells over the projects used in validation (case studies, experiments, …)

Figures 14 and 15 show, respectively, the class-level and method-level Design Smells with the highest detection frequencies, that is, the total number of classes/methods in which each smell was detected. All smells in the figures appeared more than 1500 times at the class level and more than 600 times at the method level across all the projects used in validation in the papers included in this study. Among the class-level smells, Lazy Class was detected in 100 projects, with 80,597 classes reported as having this smell. Among the method-level smells, Long Method was detected in more than 290 projects, with 133,001 methods reported as having the smell. Note that detection of a Design Smell in a high number of projects does not mean that it has the highest frequency. For example, The Blob Antipattern, detected in more than 143 projects, appears in 16,994 classes, while The Blob Operation, detected in fewer than 20 projects, appears 6359 times, because there are many more methods than classes.

Fig. 14 Frequency of the most detected class-level Design Smells over the selected projects

Fig. 15 Frequency of the most detected method-level Design Smells over the selected projects

In the data extraction process, we found a set of papers, such as [S7, S29, S48, S54, S74, S113, S167, S225, S234, S237, S264, S327], interested in studying and identifying the relationships between different types of Design Smells and whether there exists a possible coupling between some types. In [S132], the authors find relations between the following pairs of smells: (The Blob, Data Class) and (The Blob, Large Class). In [S236], relations were found between (God Class, God Method), (God Class, Feature Envy), (God Class, Data Class), (God Class, Duplicate Code), (Data Class, Data Clumps), and (Interface Segregation Principle Violation, Shotgun Surgery). In [S264, S327], the authors call a group of related Design Smells detected together in the same software artifact (class, method, …) an "agglomeration," as with Divergent Change and Shotgun Surgery [S264], and the group of Divergent Change, Feature Envy, Long Method, and Shotgun Surgery [S327].

5.2 RQ2

What approaches have been proposed to detect Design Smells in software?

In order to answer this research question, we analyzed it from different aspects related to the types of approaches, the degree of automation, the activities, and the types of artifact dealt with. This information was extracted, according to the conceptual model, using the attribute "type" of the concept "Approach," the association called "characterization" with each "Article" (which has the "year" of publication as an attribute), and the attributes "automation," "mainActivity," "otherActivities," and "typeOfArtifact" of the concept "Approach."

5.2.1 Types of approaches

Several approaches based on different techniques and strategies have been proposed in the literature for Design Smell Detection. In the conceptual model, described in Figs. 3 and 4, each type of approach we identified during the processes of data extraction, classification, and iteration on conceptual modeling is represented in the enumerated type called "TypeOfApproach."

Figure 16 shows the distribution of the approaches over the publication years. The most frequent approaches found are metric-based, logical/rule-based, machine learning-based, Context/Feedback-based, graph-based, and visualization-based. In the years 2000, 2001, and 2002, Design Smell Detection (when done at all) was carried out by manual inspection based on some metrics, rules, and visualization. From 2003 to 2009, the number of different proposed approaches increased. In 2007, we found 4 papers applying a combination of different approaches. From 2010 to the end date of the study (December 2017), the number of different types of approach rose. As can be seen, there is a big leap from the year 2010, with peaks in the last 3 years of the study (2015 to 2017).

Fig. 16 Distribution of approaches over the publication years

More than 60% of the studies applied metric-based, logical/rule-based, or machine learning-based approaches; 34% (134 studies) of the selected papers proposed a metric-based approach to detect different types of Design Smells. The studies [S8, S31, S61, S149, S199, S207, S208, S276, S314, S330, S362, S377, S379] used different sets of defined or proposed quality metrics, such as size, cohesion, inheritance, dependency, and coupling metrics, for smell detection. The relationship between metrics and Design Smells is based on establishing relative threshold values using different techniques and strategies. A substantial effort is needed to identify the right threshold value for each metric, and several studies, such as [S26, S36, S55, S86, S111], investigate different methods to obtain these thresholds.
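To illustrate the flavor of such detection strategies, the sketch below encodes a God Class rule in the style of Lanza and Marinescu (2006), combining three class-level metrics with thresholds. The threshold constants are indicative values for illustration only, not those used by any particular surveyed paper.

```python
# Illustrative metric-based detection strategy in the style of Lanza and
# Marinescu's God Class rule; threshold values are indicative, not the
# calibrated ones any surveyed study uses.
FEW, VERY_HIGH, ONE_THIRD = 5, 47, 1 / 3

def is_god_class(atfd, wmc, tcc):
    """atfd: accesses to foreign data; wmc: weighted methods per class;
    tcc: tight class cohesion (0..1). The class is flagged when it uses
    much foreign data, is complex, and is poorly cohesive."""
    return atfd > FEW and wmc >= VERY_HIGH and tcc < ONE_THIRD

# e.g. a class with 12 foreign-data accesses, WMC 60, cohesion 0.1
print(is_god_class(atfd=12, wmc=60, tcc=0.1))  # True
```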

Regarding the logical/rule-based approach, 15% (59 studies) of the selected papers focused on rules for smell detection [S27, S32, S95, S106, S117, S169, S204, S263, S341, S358]. These rules are specified using either a combination of metrics and threshold values or other types of facts directly related to the definition of the Design Smells. Each rule is specific to particular Design Smells and can be defined manually or automatically using different techniques.

Note that the types of approaches applied in the papers are not mutually exclusive; combinations of some of them can lead to better solutions. Between 2010 and 2012, the research focus moved to machine learning approaches in order to improve smell detection; the authors of 11% (45 studies) of the selected papers applied machine learning. Another interesting point can be found in 2015, when the application of search-based approaches gained interest. However, machine learning and search-based approaches have not been widely exploited in smell detection compared with metric-based and logical/rule-based approaches. Several learning techniques based on specific classifiers were proposed in the literature [S9, S57, S79, S99, S119, S120, S151, S290, S294, S312, S320, S334, S365, S382] to solve the classification problem.
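As a minimal sketch of this classification setting (the toy data and feature choice are ours), class-level metrics serve as features and a standard classifier learns the smelly/clean boundary:

```python
# Toy sketch of smell detection as supervised classification; the data
# and features are invented for illustration.
from sklearn.tree import DecisionTreeClassifier

# one row per class: [lines_of_code, wmc, atfd, tcc]
X_train = [[1200, 55, 10, 0.10],   # labeled smelly
           [2300, 80, 14, 0.05],   # labeled smelly
           [150,   8,  1, 0.70],   # labeled clean
           [300,  12,  0, 0.60]]   # labeled clean
y_train = [1, 1, 0, 0]             # 1 = smelly, 0 = clean

clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print(clf.predict([[900, 50, 9, 0.2]]))  # -> [1], flagged as smelly
```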

In 8% (33 studies) of the papers, the authors focused their attention on the role of human experts in different Design Smell activities, from detection to prioritization to refactoring [S84, S158, S313, S364, S380]. The Context/Feedback-based approach considers these human factors important, and they should be taken into account when suggesting different techniques.

The authors of 6% (23 studies) of the selected papers [S11, S70, S98, S154, S335, S366] used the search-based approach to detect different types of Design Smells. Several techniques were proposed for extracting the rules specified to detect smells, based on genetic and heuristic search algorithms.

Regarding the historical information-based approach, 6% (22 studies) of the selected papers adopted historical changes in source code as another source of information to identify different types of Design Smells [S155, S160, S260, S264, S307, S353]. The evolution of a software artifact over a long period assists software maintainers in prioritizing the most critical parts of the code.

The applications of graph-based and visualization-based approaches are very similar: diagrams and graphs are used to support the detection process. Graph-based approaches were applied in 7% (26 studies) of the selected papers, for instance [S147, S211, S277, S278, S347]; another 7% (26 studies) were applications of the visualization-based approach, for instance [S55, S140, S322, S326, S346, S378, S386, S391]. When the software is very complex, a graphical representation of the software artifact arises as a way to deal with that complexity. These approaches have been integrated with some detection tools, as presented in [S47, S48].

The remaining approaches proposed for Design Smell Detection are each applied in 16 papers or fewer: Model-based (16 studies) in [S37, S221, S287], Dependency analysis-based (7 studies) in [S108, S167, S286, S293], Generative from Specification (7 studies) in [S51, S102, S298, S343], and Collaborative-based (4 studies) in [S195, S332, S333]. There are also examples of Syntactic-based [S181, S228], Textual-based [S157, S163], and Clustering Analysis-based [S49, S125] approaches, with 2 studies each, while the Probability matrix-based [S174] and Filter-based [S53] approaches are singular cases applied in just one study each.

Since these types of approach are combined to improve smell detection, it is interesting to analyze the relationships between the different types across the papers in the study. We applied formal concept analysis (FCA) (Ganter and Wille 1999) to study these relationships. As a result of FCA, we obtained a Galois lattice representing the relations between types of approaches and the papers that applied them. The construction of a Galois lattice guarantees that each attribute is introduced at a single point (node). Therefore, each node in the lattice represents the objects (papers) with the same attributes (the types of approach applied in the paper); when a paper combined more than one type of approach, it has all of them as attributes. Hence, a hierarchical partial order is established. As can be seen in Fig. 17, nodes in the lower levels of the lattice, which are connected to higher-level nodes, “inherit” the attributes of those nodes and introduce new attributes. In this way, the nodes in the lower positions represent the sets of papers that combine approaches, while the nodes in the higher positions represent the sets of papers that applied one type of approach. Descending from the top to the bottom of the lattice, following the connections, we can find the different combinations applied. For example, the authors of studies [S84, S158] applied the Context/Feedback-based approach only, while the authors of [S223] combined two types of approaches: Context/Feedback-based and Historical information-based. Also, the authors of the studies [S6, S17, S52, S73, S89, S101, S103, S105, S107, S124, S126, S130, S133, S135, S137, S138, S139, S177, S187, S222, S272, S285, S306, S316] combined the metric-based approach with other types, such as logical/rule-based, visualization-based, search-based, Historical information-based, graph-based, and/or machine learning-based.

Fig. 17

Galois lattice for combinations of approaches
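For readers unfamiliar with FCA, the following toy sketch shows the core computation behind such a lattice: the concept intents of a binary context relating objects (papers) to attributes (types of approach) are exactly the closures of the object intents under intersection. The context below is a fabricated example, not the study data:

```python
# Binary context: which (toy) paper applies which types of approach.
CONTEXT = {
    "P1": {"metric"},
    "P2": {"metric", "rule"},
    "P3": {"metric", "machine-learning"},
    "P4": {"context-feedback", "historical"},
}

ALL_ATTRS = frozenset().union(*CONTEXT.values())
intents = {ALL_ATTRS}                      # intent of the bottom concept
for obj_intent in map(frozenset, CONTEXT.values()):
    intents |= {i & obj_intent for i in intents}

# Each (extent, intent) pair is one node of the Galois lattice.
for intent in sorted(intents, key=len):
    extent = {g for g, attrs in CONTEXT.items() if intent <= attrs}
    print(sorted(extent), "<->", sorted(intent))
```

Running this on the real papers-by-approaches context yields the lattice of Fig. 17, where shared intents such as {metric, rule} group the papers combining those approaches.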

5.2.2 Degree of automation

The proposed approaches vary in their degree of automation, from manual inspection to fully automated. Manual detection is time-consuming and error-prone because it is applied by software developers/maintainers without any tool assistance to find Design Smells, so it is not efficient for large projects. The computer-aided degree focuses on using the computer to obtain numerical measurements, such as metrics or statistical values, or a graphical representation of the software, in order to assist in the detection task; detection is still done by the developer/maintainer, but with some aid. This degree lies between manual guidelines and semi-automated approaches, and this kind of process is less repeatable and efficient than the semi-automatic and automatic ones. Semi-automatic approaches perform the detection tasks but also require human intervention: some feedback or interaction with experts is needed. Finally, the fully automatic degree is the highest degree of automation, where no human intervention is required.

Figure 18 shows the distribution of the different degrees of automation over the papers in the study, classified according to what the authors stated in their papers. As can be seen, 63% of the selected papers focused on fully automated detection, 19% are considered semi-automated, and a few approaches used manual guidelines (12%) or were computer-aided (4%).

Fig. 18

Degree of automation over the selected papers

During the data extraction, we found contradictions between authors when classifying the work done as semi-automated or fully automated. For example, the proposal in [S132] is considered by the authors of [S158, S228] as fully automated, while the authors of [S223] consider it semi-automated; we consider the proposal in [S132] to be fully automated. Another example can be found in [S126, S139], which are considered in [S100] to be manual detection, while in [S35] the authors consider them semi-automated; we considered both to be semi-automated. Also, in [S100] the authors considered the proposal in [S174] as semi-automated, while in [S158] the authors considered it fully automated; we consider the work in [S174] to be semi-automated. In summary, some papers conflict in how they classify approaches by degree of automation, and the criteria on which the distinction between automation concepts is based are often not made explicit.

5.2.3 Activities

The work done in the different papers included in the study focused on one or more activities related to Design Smells. As a result of the effort made in the conceptual modeling, the activities developed in the selected papers are divided into six main types: Specification, Detection, Correction, Visualization, Prioritization, and Impact Analysis. As can be seen in Fig. 19, most of the proposed approaches focused on smell detection more than on any other activity.

Fig. 19

Distribution of the Design Smell-related activities over the selected papers

The specification activity is related to specifying new types of Design Smells, improving the definition of existing smells, or producing a new taxonomy. Papers using domain-specific languages that allow the specification of several Design Smells can be found in [S102, S135, S137, S144, S171, S228, S298, S306, S367].

The majority of the works in the study address the detection activity, which is also the main focus of this work.

The correction activity relates to proposed methods for removing Design Smells and improving the software design. This activity modifies the software artifact that includes the detected Design Smells without changing its behavior, which is precisely what refactoring does. It is implemented in various tools and supported in a semi-automated or fully automated manner. The following studies focused on this activity: [S18, S191, S72, S118, S127, S130, S152, S153, S155, S173, S244, S255, S320, S335, S362, S387].

The visualization activity, such as that developed in the studies [S13, S28, S47, S48, S125, S147, S211, S322, S346, S378, S386], presents a graphical representation of the specified software artifact, which helps in the detection or graphical representation of the presence of the smell itself.

The Impact Analysis activity focused on analyzing the impact of relations between different types of Design Smells, or the impact of changing specific Design Smells on different software quality factors, such as maintainability and understandability. The authors of [S52, S59, S89, S190, S236, S240, S245, S268, S307, S327, S369] focused their attention on this activity.

Finally, the prioritization activity relates to the ability of a Design Smell Detection proposal to rank the detected smells based on their impact on software quality or other factors. This activity can be supported in a semi-automated or fully automated way. Some works have appeared in the last 5 years, such as [S17, S152, S186, S192, S222, S223, S239, S275, S282, S292, S352, S360], focusing on this activity in parallel with other smell activities, such as detection, correction, visualization, and smell impact analysis.

5.2.4 Types of artifacts

Design Smells can be detected in different types of software artifacts, such as executable or binary code, source code, UML diagrams (class, communication, activity), workflow or process diagrams, test cases, and ontologies. As a result of the conceptual modeling, after several iterations on data extraction, we defined the enumerated type TypeOfArtefacts previously shown in Fig. 4.

Design Smells are detected in different types of software artifacts, as shown in Fig. 20. The majority of the papers in the study focus on detecting Design Smells in the source code, while a few detect them in UML diagrams (such as class or communication diagrams), binary code, ontologies, or test cases. The nature of a Design Smell influences the type of artifact used in detection.

Fig. 20

Distribution of the software artifacts over the selected papers

For example, some of the smells related to the software architecture can be detected using UML class and component diagrams.

5.3 RQ3

Which (prototype) tools have been used to detect Design Smells in software?

In this section, we describe the results obtained from the data collected as part of the concept “Tool” in the conceptual model, such as availability and licensing (attributes “isFree” and “isOpenSource”), internal software representation (attribute “internalSoftwareRepresentation”), the languages that can be processed (attribute “supportedLanguages”), and the Design Smells detected (obtained through the association “can be detected with” with the concept “Smell”). The attribute “kind” of the association class “PresenceDescription” stores information regarding the association called “mention” between each article in the study and the tools mentioned.

Several tools/prototypes have been developed in the literature for semi-automatic or fully automatic smell detection. We found that, in general, the proposed tools/prototypes report Design Smells as binary decisions (the artifact has the smell or not).

Different techniques are applied in tools/strategies to identify candidate smells. Figure 21 presents a cloud of tags produced using TagCrowd with the same settings as those used in Fig. 8. The text is extracted from the string field called “technique” that we included in the “Approach” concept when elaborating the conceptual model; this free text was reserved for annotating algorithm names and other details. The size of a word indicates its frequency and importance. Different types of mining algorithms, dynamic or static analysis, meta-modeling, genetic algorithms (genetic programming), visualization, etc., were used.

Fig. 21

Cloud of tags produced from joining all the strings collected in attribute “techniques” from concept “Approach”

During the data extraction, we found 148 different tools and prototypes for Design Smell Detection. The first Design Smell Detection tool to appear was reported in 2002 (jCOSMO [S220]). New detection tools appeared continuously between 2004 and 2010, such as Analyst4j, Cultivate (from the jTransformer suite), DÉCOR, iPlasma, inCode, inFusion, JDeodorant, an emerging PMD, Reek, RevJava, SA4J, and Together.

As stated before, between 2010 and 2015, research activity regarding Design Smell Detection experienced rapid growth. A new group of tools emerged, including Stench Blossom, ConcernReCS, SourceMiner, BSDT, JCodeCanine, GrouMiner, CodeVizard, JSNose, Hist-Inspect, SVMDetect, the PTIDEJ suite (containing DÉCOR and its evolution DETEX), BLOP, and an evolution of the previously emergent PMD with a new set of rules for Design Smell Detection. There is also a set of research prototypes without any particular name that implement the techniques reported by their authors in their publications.

Figure 22 shows the most mentioned detection tools in the selected papers; the tools in the figure have more than 25 mentions in different papers. As can be seen, JDeodorant (Fokaefs and Tsantalis 2007) and DÉCOR (Moha et al. 2010) are the most mentioned, with a significant gap to the other tools: JDeodorant appears in 65 studies and DÉCOR in 63. The group formed by JDeodorant, DÉCOR, PMD, and inFusion covers more than 80% of the selected papers.

Fig. 22

Frequency of the most mentioned detection tools

5.3.1 Reasons for the presence of a tool in a paper

There are different reasons for the presence of a tool in a paper. We classify the kinds of presence into five categories, as shown in Fig. 23. A tool is mentioned for comparison purposes 33% of the time, i.e., to compare detection results with other tools or prototypes as part of the evaluation process of a particular tool, or in studies focused on comparing different tools.

Fig. 23

Distribution of the kinds of presence of tools across the papers in the study

Mentions aimed at introducing a novel tool represent 25% of the total, while mentions aimed at improving a tool (detecting a new type of smell, obtaining better performance indicators, etc.) represent 19%. Mentions of tools in review studies account for 12%.

We found that DÉCOR and JDeodorant were the tools with the greatest presence in comparisons, with DÉCOR appearing in 27 studies and JDeodorant in 20. On the other hand, iPlasma and inCode were the most frequent in papers aimed at improvements (7 studies each), while, regarding presence in review studies, JDeodorant is the most mentioned.

5.3.2 Characteristics of the most cited tools

Table 6 summarizes some characteristics of the most mentioned detection tools (those shown in Fig. 22). The tools are characterized by whether they are free, whether they are open source or proprietary, their supported languages, the terms used to describe the Design Smells, the internal representation of the software artifact, the degree of automation, the ability to also perform refactoring, the way the tool is run (execution environment), the ability to generate metrics, the type of input source, the output format, whether they offer a Textual User Interface (TUI, command line) or a Graphical User Interface (GUI), and the list of Design Smells the tool can detect.

Table 6 Characteristics of the most cited detection tools

As can be seen in the table, inFusion and inCode are commercial tools which include more features than the open source tools. Looking at the origin of this pair of tools, we find they were developed by the same institution, the intooitus company (https://www.intooitus.com/), a startup that originally arose from the LOOSE research group at the Politehnica University of Timisoara in Romania. iPlasma was developed by the same LOOSE research group. Therefore, iPlasma is the origin of inCode and inFusion, and their techniques have iPlasma as their base.

Each tool was developed to detect a particular set of Design Smells. Analyzing the most cited tools in detail, the majority adopted the metric-based approach to detect Design Smells, except JDeodorant, which uses clustering methods, and PMD, which uses a rule-based approach.

The inFusion tool can detect 22 Design Smells in different software artifacts. DÉCOR can detect both higher-level smells (architectural Antipatterns) and a set of lower-level smells (Code Smells).

The relation between detection tools and Design Smells will be explained in more detail with the FCA graph later in this section.

5.3.3 Programming languages that can be processed

Figure 24 shows the relation between the most cited detection tools and the supported programming languages. In this Galois lattice, the objects are detection tools and the attributes are programming languages; the context is defined by indicating that an object (tool) has an attribute (language). The objects (tools) at the lower levels inherit all the attributes (languages they can process) of the objects connected with them at the higher levels. At the top of the lattice we find JDeodorant and DÉCOR, which have the single attribute Java; the structure of the lattice shows that Java is supported by all these tools. PMD, at the lowest level, supports all the programming languages (Java, C, C++) introduced at the higher levels plus the languages that only PMD can process (C#, Groovy, PHP, Ruby, Fortran, JavaScript, PLSQL). Eighty-one percent (120 tools) of the detection tools cited in the selected papers (including all the tools shown in the figure) support detection in Java source code.

Fig. 24

Relation between most cited detection tools and supported programming languages

C/C++ are the second most supported languages, while C# is the third. Other languages not mentioned in the figure are supported by some tools, such as Python by Pysmell and PMD-CPD, Smalltalk by JSpIRIT and SpIRIT, COBOL and Visual Basic by CCFinder, ObjectPascal/Delphi by Sissy,Footnote 5 and UML by Together and the PCM Bench tool. However, we did not find any detection tool that can handle source code written in multiple programming languages within the same project, except Understand,Footnote 6 which is a commercial tool.

5.3.4 Internal software representation

In order to perform detection and determine the candidate Design Smells in software artifacts, each tool/prototype was designed to represent the artifact internally in a particular manner. In the selected papers, we find eight different types of representation that the tools adopted to internally represent software artifacts.

Figure 25 presents the frequencies of the most used internal representations across all the tools cited in the selected papers (148 tools). As can be seen, the majority of the cited tools adopted graph and Abstract Syntax Tree (AST) representations. The group of inFusion, inCode, JDeodorant, and PMD uses ASTs, while the group of SourceMiner, Stench Blossom, Understand, and jCOSMO adopts a graph representation. On the other hand, the group of iPlasma, DÉCOR, Together, and JSpIRIT uses complex object models. Some tools use more than one representation, such as Sissy, which uses graph and AST, and FxCop,Footnote 7 which uses object model and graph.

Fig. 25

Distribution of internal software representation over the detection tools

All the most cited tools support a fully automated approach for Design Smell Detection. Their input is source code and their output is text in different formats, such as CSV, txt, or XML, with the exception of the group of inFusion, inCode, and iPlasma, which include a visualization environment, since a visual representation of the source code can make Design Smells easier to identify. This group has the ability to generate a wide set of code metrics, including size, cohesion, and coupling metrics, among others: inFusion provides 60 different metrics, inCode 50, and iPlasma 80. However, they do not allow users to modify the metric threshold values.

Also, most of these tools focus only on smell detection, except JDeodorant, which includes refactoring operations that can be performed after the Design Smells have been identified. Besides, some of these tools can be integrated as an Eclipse plug-in or are available as standalone tools. Nearly all of them must be used through a GUI, except PMD, which is the only tool developed to support both a Textual User Interface (TUI) and a GUI. Having a TUI available is an advantage when looking to automate the smell detection process, for example in tool integrations or as part of the continuous integration and continuous delivery processes of software projects.

5.3.5 Design Smells detected: analyzing the most cited tools

Figure 26 shows the relation between the most cited tools and the set of Design Smells that they can detect. The relation is introduced by constructing the Galois lattice from a formal context (FCA), where the objects are detection tools and the attributes are Design Smells; the context is defined by indicating that an object (tool) has an attribute (a Design Smell the tool can detect). Lower-level nodes (tools) connected with higher-level nodes “inherit” their attributes; that is, the tool can detect the Design Smells tagged directly at its node plus all the “inherited” ones.

Fig. 26

Relation between most cited detection tools and set of Design Smells detected

The lowest-level nodes (tools) show Design Smells that can be detected only by those tools. For example, the attribute TC (Type Checking) at the lowest-level node representing the JDeodorant tool indicates that only JDeodorant can detect TC. Similarly, DÉCOR shows a list of Design Smells that can only be detected with this tool.

On the other hand, at the highest-level nodes, the set of attributes (GC: God Class, FE: Feature Envy) is hierarchically connected with more than one tool at the lower levels, which means these smells can be detected by all of them. For example, the node tagged with the (GC, FE) attributes is connected (directly or indirectly) with the nodes representing tools such as JDeodorant, inCode, iPlasma, and inFusion. Also, the LM attribute is connected with JDeodorant, PMD, and DÉCOR, while the attribute B is only connected with DÉCOR and inFusion. The main observation is that there are no common attributes (Design Smells) at the top of the lattice, which means there is no smell that can be detected by all the tools.

5.4 RQ4

Is the (prototype) tool validated by expert or benchmark or by comparison with other tools? Is the tool or strategy assessing results and measuring performance, precision, and recall?

In this section, we answer the fourth research question RQ4 regarding validation and performance indicators. The information used in this section is obtained from the attribute “type” of the concept “ValidationEvidence,” the “uses” association between concepts “Project” and “ValidationEvidence,” and the association of this last concept with “Measure” (having the attributes “type” and “reportedValue”).

5.4.1 Types of validations used in the studied papers

During data extraction, we found a great variety of evidence that the authors of the tools/strategies in the study adopted to validate their proposals. We model this with an enumerated type in the conceptual model, called TypeOfStudy, whose values are case study, experiment, systematic literature review, survey, and non-empirical study.

Figure 27 shows the distribution of the different types of validation evidence over the selected papers. As can be seen, most of the authors adopted experiments or case studies to validate their tools/strategies. The authors of 41% of the selected papers validated their work by conducting experiments, as in [S9, S48, S60, S96, S217, S332, S364], or controlled experiments (in the authors’ words), as in [S123, S190, S188, S189, S289, S317, S338, S395]. Another group of studies (29%), as in [S71, S154, S163, S213, S223, S257, S273, S283, S330, S388], focused on analyzing one or more case studies to validate the proposed tool/strategy. In a few studies (6%), validation was performed using surveys; in this type of study, human experience plays the main role, as in [S43, S154, S160, S175, S250, S268, S297, S313, S328, S355, S380]. In 2% of the studies, such as [S14, S110, S142, S154, S371, S390], the authors conducted a systematic literature review to identify the current state of the art on Design Smell-related topics, while in less than 1% of the studies [S270, S321, S357] the authors used post-mortem analysis for validation. The remaining 16% of papers are considered theoretical.

Fig. 27

Distribution of validation evidence over the selected papers

5.4.2 Projects used in validation processes

A large number of projects of different domains, sizes, statuses, and versions have been used to validate the proposed tools/strategies. The vast majority of the projects used are open source, obtained from different public repositories. Looking at the repository source mentioned by the authors, we found that most of the projects were obtained from SourceForge (24%) and GitHub (20%), and less than 1% from the PROMISE repository.Footnote 8 For the remaining projects used in validation, the authors did not mention the repository source.

In most of the papers, the authors give some metrics on the analyzed projects, such as the number of lines of code (LOC), while a few mention other metrics, such as the number of packages, classes, and methods, or other nominal information, such as the project domain, the project status (beta, production/stable, etc.), or the URL of the analyzed project.

Figure 28 shows the distribution of implementation languages over the projects used for validation in the selected papers. As can be seen, most of the projects are implemented in Java and a few in other languages. Figure 29 shows the distribution of the most commonly used projects over the selected papers; all projects in the figure are used in 15 papers or more. As can be seen, the group of projects that includes GanttProject, JFreeChart, Xerces, and ArgoUML was the most used, especially the versions GanttProject v1.10.2, JFreeChart v1.0.9, ArgoUML v0.19.8, and Xerces v2.7.0.

Fig. 28

Distribution of implementation languages over the number of projects

Fig. 29

Most used projects in validation of tools/strategies

We found that the validation processes conducted in the papers in the study fit into one of the following categories: a group of different projects, several versions of the same project, benchmarks, or a quality corpus. Most of the studies were performed using different groups of projects, for example [S14, S26, S45, S56, S106, S107, S116, S118, S122, S147, S154, S163, S176, S278, S302, S316, S320], while the authors of [S71, S113, S186, S218, S225, S234, S237, S260, S387, S392] conducted their studies using several versions of the same project. A limited number of studies, such as [S53, S54, S55, S57, S60, S121, S272, S274, S275], validated the proposed work using a quality corpus (a corpus including a large number of projects, ranging from 74 to 500); almost all of them used the same corpus, except [S121]. The main observation was a lack of benchmark definitions and, hence, of validation based on benchmarking. From all the selected papers, only the following studies are validated on benchmarks: [S21, S55, S64, S98, S101, S115, S144, S155, S159, S160, S171, S186, S227], and the same benchmark was used in [S98, S187].

5.4.3 Performance indicators and quality measurements reported in validation processes

Several types of indicators or quality measurements are used to evaluate the performance of the proposed tools or strategies. Figure 30 presents the distribution of the most used quality measurements or performance indicators over the selected papers. From the figure, it can be seen that Precision, Recall, False Positives (FP), Accuracy, F-Measure, and False Negatives (FN) are most often given as performance indicators; in particular, Precision and Recall are by far the most commonly used. Precision and recall values range from 0 to 1. The highest reported precision is 1, in the studies [S69, S91, S107, S144, S209, S228, S277, S323, S326, S348], and the lowest is 0.25, as reported in [S45]. On the other hand, the studies [S2, S80, S91, S99, S107, S134, S135, S144, S209, S228, S259] obtained the highest recall value, 1, while the lowest is 0.50, reported in [S157].

Fig. 30

Distribution of performance indicators or quality measurements used as evidence over the selected papers

A few studies use parametric or non-parametric statistical tests, such as the T test, Wilcoxon test, ANOVA, Mann-Whitney test, Spearman’s rank test, Fisher exact test, and principal component analysis (PCA). Other studies aimed to identify the degree of agreement between different raters; in this type of study, the authors used the Kappa [S287, S288, S290, S308, S337, S338] and Finn tests.
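As a short illustration of these indicators (assuming scikit-learn; the verdict vectors are invented), Precision, Recall, F-Measure, and the Cohen's kappa agreement between two raters can all be computed from binary verdicts over the same smell candidates:

```python
from sklearn.metrics import (precision_score, recall_score,
                             f1_score, cohen_kappa_score)

oracle = [1, 1, 0, 0, 1, 0, 1, 0]   # expert-validated ground truth per candidate
tool_a = [1, 0, 0, 0, 1, 1, 1, 0]   # binary verdicts of one detector
tool_b = [1, 1, 0, 1, 1, 0, 0, 0]   # binary verdicts of another detector

print("precision:", precision_score(oracle, tool_a))   # 0.75
print("recall:   ", recall_score(oracle, tool_a))      # 0.75
print("f-measure:", f1_score(oracle, tool_a))          # 0.75
print("kappa(a,b):", cohen_kappa_score(tool_a, tool_b))  # inter-rater agreement
```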

5.5 RQ5

Is Design Smell Detection related to the quality attributes of a quality model?

As a result of the data extraction, we found 662 Design Smells mentioned or detected in the selected papers. A limited number of the selected papers discuss the relationship between software quality factors and Design Smells. In the conceptual model, we state that a Design Smell can negatively affect a quality factor, have no effect on other quality factors, and positively affect others (for example, the presence of a Design Smell can negatively affect reusability but positively affect efficiency). According to the conceptual model, the information used in this section was obtained from the concepts “QualityFactor” and “Smell,” related by the “affects” association, which has an association class called “Impact” storing information regarding the “type” of impact (positive, negative, no impact, or unknown impact).

5.5.1 Mentioned quality models and quality factors

We found 22 software quality factors mentioned in the analyzed papers as being affected by different Design Smells. These factors can be classified into internal or external features belonging to popular quality models such as ISO/IEC 9126, the Test Specification Model, the FURPS Model, and McCall's Factor Model. Figure 31 shows the distribution of the quality factors mentioned in the relevant papers regarding Design Smells, grouped by quality model. The vast majority of the software quality factors mentioned in the selected papers are related to the ISO/IEC 9126 model.

Fig. 31

Distribution of quality factors mentioned regarding the Design Smells in the relevant papers grouped by quality models

McCall's Factor ModelFootnote 9 was proposed in 1977. It includes three main quality features related to software: revision, transition, and operation. Each feature includes a set of sub-features, for a total of 11 sub-features related to internal quality.

The FURPS ModelFootnote 10 is named after the acronym of its five main quality factors: Functionality, Usability, Reliability, Performance, and Supportability. Each factor is related to several sub-features, for a total of 28 sub-features.

The ISO/IEC 9126 model (Jung et al. 2004) was issued in 1991. This model includes six main features classified as external software quality: Functionality, Reliability, Usability, Efficiency, Maintainability, and Portability. Each external feature involves a set of sub-features related to internal software quality; the total number of sub-features in the whole model is 27.

The author of [S143], in 2008, proposed a Test Specification Model that adapted the ISO/IEC 9126 model. In this model, seven main external features are proposed: Test Effectivity, Reliability, Usability, Efficiency, Maintainability, Portability, and Reusability. Each main feature includes a set of sub-features, for a total of 28 internal sub-features.

Figure 32 shows the quality factors most mentioned as being affected by Design Smells in the relevant papers. As can be seen, the Maintainability factor was analyzed as affected by a wide number of Design Smells compared to the other factors. More than 70% of the Design Smells were analyzed as influencing the Maintainability, Performance, Understandability, Changeability, Reusability, Complexity, Efficiency, and Usability factors.

Fig. 32

Quality factors most mentioned as affected, ranked by the number of Design Smells whose impact on them is analyzed

According to the previous quality models, Maintainability, Performance, Efficiency, Usability, Functionality, Reusability, and Reliability are considered external quality factors, and the remainder as internal quality factors. Also, regarding the ISO/IEC 9126 model, the group of internal factors, which include Changeability, Testability, Analyzability, and Stability, is related to the Maintainability external factor, while Understandability is related to the Usability factor.

5.5.2 Impact of design smells on quality factors: analyzing the most cited design smells

Approximately 52% (345 of 662) of the Design Smells present in the study were mentioned in some paper(s) as having a negative impact on quality factors, while less than 1% (4 of 662) were mentioned in some paper as having a positive impact on some quality factor. For the remaining 47%, there is no mention of their influence on quality factors. Table 7 presents the relationship between the most cited/detected Design Smells and the set of quality factors analyzed as affected by these smells across the relevant papers. Most of these Design Smells are analyzed as affecting Maintainability, as can be read in [S153, S231, S232, S233, S236, S245, S357].

Table 7 Relationship between Design Smells and software quality factors

6 Discussion

This section discusses the results obtained from this systematic mapping study and their implications for researchers interested in Design Smell Detection.

Regarding the study presented in Section 4, and observing the problems related to Design Smell definition and detection that still need the attention of the research community, the opportunity arises of launching a series of specific workshops or symposiums centered on this topic. These could be similar to the Workshop on Refactoring Tools, started in 2007, or the International Workshop on Refactoring, which appeared in 2016. Another possibility could be to define a track on Design Smell Detection as part of the recently launched International Conference on Technical Debt, given the correlation between the presence of Design Smells and Technical Debt.

Concerning authors and institutions, two important foci of activity can be observed at the Université de Montréal in Canada (30 papers) and the Università degli Studi di Milano-Bicocca in Italy (21 papers), which have contributed to this topic with important publications and results continuously over time.

The importance of reaching a consensus on several aspects related to Design Smell Detection makes studies obtained through collaboration between different institutions particularly interesting, even more so when the collaboration spans different countries and continents. Approximately 47% of the selected papers are authored in collaboration across institutions of different countries, and only 14% are cross-continental collaborations. For these reasons, we consider that such collaboration should be encouraged, specifically through the creation of groups of experts with standardization purposes.

6.1 RQ1

It was confirmed that different concepts are used in the selected papers to describe Design Smells. The variety of Design Smell concepts led the researchers to classify the detected Design Smells into different categories based on different criteria, such as structure, scope, level, variability, behavior, etc., as shown in Sections 5.1.1 and 5.1.2. Most of the Design Smells are included in more than one of these classifications.

To organize this knowledge, we show a joint classification that includes all the others, as well as the most used terms to describe different types of Design Smell, as seen in Fig. 10. In this paper, we have used the concept “Design Smell” as the main concept that unifies all other kinds of smells. It is worth noting that the terms and classifications used by different authors are not consistent. On the one hand, there are works where different classifications (different terms) are given for the same Design Smell (same smell name) while, on the other hand, the same Design Smell (if we attend to its definition, i.e., what it means) is presented under different names.

This requires a standardization process that allows the terminology and its precise meaning to be unified. With this standard, it should be possible to catalog all the Design Smells defined up to the present time, determining those that refer to the same smell under different names, and presenting (in the same way as the Design Patterns catalog does) a unique entry in the catalog enriched with “other names” or “also known as.” Each catalog entry should present a precise description of the Design Smell, along with its variants and idioms.

We found a total of 662 different Design Smells mentioned in the selected papers. Nevertheless, most of the selected studies focused on some groups of Design Smells more than on others, as shown in Figs. 12 and 13. Feature Envy, Long Method, and God Class (Blob, Large Class) are representative of this: these smells are frequently mentioned and are detected in many projects and by several tools.

There can be several reasons behind this: a better understanding of the smell definition; the identification of adverse impacts of these smells on different software quality factors, such as maintainability, understandability, and changeability; and the detection of groups of related Design Smells, i.e., smells that are detected together because they are somehow tightly coupled, such as God Class with Data Class, or Feature Envy with God Method, where the presence of one is closely correlated with the presence of the other.

6.2 RQ2

Different types of Design Smell Detection approaches were proposed in the selected papers. In [S98], the authors classify the detection approaches into seven categories: manual, symptoms, metric, probabilistic, visualization, search, and cooperative approaches. In this systematic mapping, we classify the state-of-the-art approaches, based on the technique, method, or algorithm used to detect the Design Smells, into Metric-based, Logical/Rule-based, Search-based, Machine learning-based, Graph-based, Visualization-based, Model-based, Clustering Analysis-based, Collaborative-based, Dependency Analysis-based, Context/Feedback-based, Filter-based, Historical information-based, Probability matrix-based, Generative from specification-based, Syntactic-based, and Textual-based. This classification includes the above-mentioned categories, except “manual” and “symptoms.” We think the categories “manual” and “symptoms” are orthogonal to the type of approach because they refer to the degree of automation, manual and computer-aided, respectively.

It seems difficult to classify proposals into a single type of approach. Many authors use diverse techniques, methods, or algorithms belonging to different approaches, i.e., a combination of different types of approaches to detect similar Design Smells.

Most authors focused their attention on six of these types of approach: metric-based, logical/rule-based, machine learning-based, search-based, graph-based, and visualization-based, as Section 5.2.1 shows. It is hard to be categorical, but the higher number of papers based on a particular approach could be evidence of better detection results. In order to measure the effect of a type of approach on detection, we would need benchmarks with which to obtain comparable performance indicators.

Design Smell Detection using a metric-based approach combined with knowledge rules (rule-based, machine learning-based) has gained in importance over the period of study, and most researchers tend to work with these approaches. This may be because new metrics, rules, and classifier algorithms can be defined to detect smells more accurately. Therefore, these approaches are combined with each other more than with other approaches.

Regarding the activities related to Design Smell Detection, most of the effort is focused on detection activities. However, as stated above, it is important to increase efforts in specification in order to help with the standardization of the concepts, which would also allow an increase in detection consistency. Moreover, improving the comprehension of the impact of Design Smells should support the applicability of these results. Impact analysis should be guided by quality models, as well as by Technical Debt models. Hence, both the Specification and Impact Analysis activities should gain greater importance: specification should be tackled in a standardized way, and empirical studies should be conducted in this line.

Prioritization deserves special attention. This kind of activity has been observed over the last 5 years. Software projects grow in size and complexity, and therefore, Design Smell Detection reports are huge. Accordingly, these reports should be configured in such a way as to first of all allow an analysis of those smells with a higher priority, i.e., those which, if not removed, could cause such a large technical debt increase as to lead to major problems.

In addition, advances in how to remove the Design Smells are needed to guide refactoring processes.

6.3 RQ3

Several tools/prototypes are proposed for Design Smell Detection in semi-automated or automated ways. Despite the increasing number of proposed tools, the adoption of these tools in the industry is still weak in comparison with refactoring tools.

In our study, we found a group of tools that are cited or used more than others, in particular DÉCOR and JDeodorant. The main reasons might be related to the variety of Design Smells they can detect, their usability, their integration with other tools, or their input and output formats.

Some of the tools are designed for both detection and refactoring, such as JDeodorant [S49, S216]. Also, some detection tools use another metrics tool to generate the metrics required for detecting Design Smells, or define and collect their own set of metrics.

We found some limitations related to the existing detection tools that greatly affect their adoption for Design Smell Detection; these can be summarized in the following points:

  • Most of the tools and prototypes focus on object-oriented languages, especially Java.

  • We did not find any tool that can analyze source code implemented in more than one programming language in the same system, except the Understand detection tool.

  • Only a few tools can analyze very large projects (millions of lines of code).

  • Most of the tools do not take into account expert feedback or the influence of the context: the characteristics of the organization, the project domain, its status, etc.

  • In order to cover a wide range of Design Smells, several tools should be used.

  • The common corpus of Design Smells detected by several tools is very narrow. This hinders tool comparison and, hence, tool choice.

In a production environment, increasing the number of tools that should be used, and therefore integrated into the process workflows and with other tools, increases the complexity and the entropy.

Analyzing the kinds of mention a tool receives in the relevant papers, the fact that “improving” a tool represents just 19% is rather surprising given the huge set of tools/prototypes found and the almost two decades of work in this area.

6.4 RQ4

Empirical studies, mainly based on case studies and experiments, are the preferred approaches for validating the proposed tools/prototypes. There is a lack of validation by experts in this field. QA experts from industry and academia are needed to assess the results of the tools in terms of false positives and false negatives. It is also important to have expert opinions regarding smell prioritization and smell impact on product quality and technical debt.

We have observed a set of projects that are frequently used in validating the proposals. Nevertheless, there is a lack of benchmark definitions for Design Smells validated by experts: only 5% of the works conduct validation based on benchmarks or corpora defined for this purpose. The PROMISE repository is defined as “a research dataset repository specializing in software engineering research datasets,” and two datasets related to Bad Smells can be found in it. Incorporating Design Smell Detection benchmarks/corpora into this repository should contribute to advances in research on this topic.

Furthermore, the same set of performance indicators should be used to assess the results of the tools/prototypes. We have noticed that the most used indicators are Precision and Recall, together with False Positives and False Negatives. Usage of the Kappa, ROC Area, Specificity, and Sensitivity indicators should be promoted: it is important to measure and compare the degree of agreement (Kappa-based) among different tools and between tools and human experts. Moreover, we have observed that it seems easy to obtain Precision values near 1, which is why other indicators should be reported to evaluate the goodness of the “diagnostic test.”

6.5 RQ5

Different software quality models are found in the literature. Each model defines a set of main software quality features, and each feature includes a group of sub-features that, together, affect the main feature; these sub-features are common to different models. A few studies identify the relationship between the detected types of Design Smell and the quality features. The nature of this relationship is identified from the authors’ point of view and varies from one study to another.

Most of the Design Smells, in particular the group of Design Smells defined by Brown et al. (1998) and Fowler et al. (1999), affect more than one quality factor, as we showed before. Therefore, some quality factors are influenced more than others. The most affected quality factors, such as maintainability, performance, and understandability, play a major role in the software maintenance cost. In this case, the set of Design Smells related to these quality factors should have the highest priority for removal from the software.

6.6 Smells wisdom app

The conceptual model obtained was captured in a relational database design, and the data extraction and collection materialized in the population of this database. The compiled knowledge was put in the hands of the research community interested in this topic by means of a web application. The reason is twofold: (1) the app should allow the community to contribute their knowledge, as well as to criticize or comment on the information already collected in the app database; (2) the app should allow the knowledge to be updated in a collaborative way over the coming years. For this reason, a social but supervised philosophy was chosen: users must sign up to the app to give their opinions regarding the information already collected and to contribute new information. Both opinions and contributions are supervised by the app admins, who authorize insertions and updates to the app knowledge base with the aim of maintaining the soundness, consistency, and quality of the data.

The web application is available at https://smellswisdom.herokuapp.com/.

7 Threats to validity

In our mapping study, the obtained results might be affected by several factors, such as the scope of the study, the coverage of the search, bias in study selection, and the accuracy of the data extraction process. In the following subsections, four types of threat to validity, according to Wohlin et al. (2012), are discussed, covering the mentioned factors.

7.1 Conclusion validity

Conclusion validity regarding a systematic mapping study refers to the extent to which the obtained conclusions can be considered as correct and to how rigorous and repeatable the process to obtain the conclusions was.

In this case, reliability is ensured by the selection of the six databases considered the most efficient and popular according to previous systematic mapping studies on related topics. In addition, Google Scholar and the snowballing technique were included for completeness, despite leading to repetitions. Moreover, the considered period (2000 to 2017) includes all the work done up to the present time in the domain of Design Smell Detection, which reinforces the validity of the conclusions. Our work also includes concept definitions and the organization of the terminology in order to avoid ambiguities. The process is repeatable and precisely defined, detailing the protocol we followed to conduct the systematic mapping, as can be read in Section 3 and summarized in Fig. 1.

The quality of the conclusions is enriched, in a first stage, by the definition of facets and, in a second stage, by the application of domain modeling techniques, which include the definition of enumerated types that incorporate and enrich the previous facets. The articles examined can be classified using these facets and can be precisely described in terms of the concepts and relationships included in the conceptual model representing this knowledge domain.

The data extraction was accomplished by a PhD student with 3 years’ experience in Design Smells and checked by a PhD supervisor with more than 15 years’ experience in Design Smells. This process increased the quality of the data extracted. An expert in Empirical Software Engineering supervised and checked each part of the process.

Data integrity was guaranteed by conducting a process of data migration from the initial prototype in an Access database to a fully designed MySQL database with integrity restrictions, available through a web application. It was a semi-automated process with more than three iterations, ending when no errors remained in data migration and integrity checking.

7.2 Construct validity

Construct validity is related to the process followed to select the information to be included in the study and related to the suitability of the research questions. In this systematic mapping study, the different types of Design Smell, approaches in detection, detection tools, and quality attributes related to Design Smell Detection in software are studied, defining and limiting the core concepts in the study. All these are reflected in the research questions. The domain model obtained is readable and represents knowledge that allows the research questions to be answered. Moreover, it allows future researchers to elaborate further research questions on the knowledge modeled in this way.

Our experience in the area helps guarantee that the concepts are interpreted correctly and that all the relevant papers were collected and analyzed.

We used both automatic and manual searches to ensure the completeness of the search. The search string we used in the automatic search is comprehensive, and we used the manual search to discard the papers that are not related to the topic of this study.

7.3 Internal validity

Internal validity focuses on the analysis of the extracted information. The process we follow is detailed in Section 3. The reproducibility of the study is guaranteed by the detailed specification of the search engines used, as well as of the search string and the inclusion and exclusion criteria. Possible limitations of the search results were overcome by including the different terms used by different authors for the same or similar concepts. All documents included are available in full English text, and all those that are not relevant or do not guarantee peer review were excluded.

In this study, descriptive statistical analysis is used to cover all stages of the systematic planning protocol. Also, formal concept analysis (FCA) was used to uncover underlying structure and relationships in the collected information.

7.4 External validity

External validity is concerned with the representativeness of the selected papers regarding the main goals of the systematic mapping study. The findings of the study cover the overall knowledge of the Design Smell Detection domain (approaches, tools, types of Design Smell, quality attributes, and validation evidence indicators). The systematic protocol is helpful for obtaining a comprehensive representation of the selected papers in the selected study period, and the domain modeling techniques and iterations on the process were helpful in obtaining a good representation of the knowledge in the field. Therefore, the classification schema of the selected papers and the conclusions are valid for all studies related to Design Smell Detection and closely similar topics.

8 Conclusion

This systematic mapping corresponds to the study interval from 2000 to 2017, 18 years of research starting from the year 2000, when the seminal chapter by Fowler and Beck was published as part of the well-known Refactoring book. We identified 395 articles published in seven well-known electronic databases in the domain of Design Smell Detection. The number of relevant studies in this domain increased throughout the study interval: in the first 5 years (including the year 2000), we found 16 articles (considering the five from 2001, 27); in the next 5 years, more than 75 articles; and in the last 5 years, more than 210 articles.

We accomplished a comprehensive data extraction process which, in the first steps, included 17 facets for each article. We extracted important preliminary results and then applied conceptual modeling techniques to obtain a domain model. The domain model was the basis for designing a relational database, and a reviewed and enriched version of the extracted data was systematized and stored in this database. The database thus populated was included in a web application to make it available to the scientific community, so that researchers can exploit, enrich, and suggest modifications to enhance the data in a collaborative way.

Our systematic mapping distinguishes itself from other closer studies, such as (Rattan et al. 2013; Zhang et al. 2011), on the one hand, by defining five research questions that guide the contributions of the present study and, on the other, by tackling the problem in a broader sense.

In this way, we have identified a large number of terms and concepts in the domain of Design Smells that are used inconsistently by different authors. These terms and concepts are characterized according to several classifications, which also differ among themselves. As a conclusion, the need to standardize such concepts and classifications is clear, for which international collaboration between research teams is essential. This study shows that this kind of collaboration has been weak until now, and an international panel of experts (a standards committee) is desirable.

We have classified the detection approaches into 18 possible categories, showing their use over time, identifying those that have attracted more interest in recent years, and highlighting 6 of them as the most used. As a result of our study, we cannot establish any correlation between the techniques used and the efficiency of the results, due, among other issues, to the frequency with which these techniques are used not independently but in combination. Promising combinations can be identified, such as jointly applying Historical information-based and Context/Feedback-based approaches together with the most used ones.

Despite the high number of Design Smells studied (662), most of the works focus on a dozen of them. In fact, an important problem we have found is the absence of an extensive Design Smell corpus common to several detection tools. Compared with the refactoring area, most refactoring tools implement a common corpus of well-known refactorings; some tools introduce small variations in common operations and add particular refactoring operations. The same does not occur with Design Smell Detection tools. As a consequence, adoption in industry is hindered, because it is necessary to incorporate a set of different tools into the production workflows to allow the detection of multiple Design Smells or to work with different languages, which significantly complicates the production environment. The emergence of tools such as SonarQube, which allows the execution and reporting of several tools in the same environment, helps to minimize this problem and has facilitated its success.

Among the tools used or proposed for Design Smell Detection, we highlight DÉCOR and JDeodorant, whose citations stand out from the others. However, in general, there is a lack of maturity in the tools identified, with limitations that restrict their use and adoption by industry. Despite this, there are few studies focused on improving existing tools.

In all cases, the tools offer only a binary response to the presence of a Design Smell, but the current context, with increasingly large projects where not all corrections are possible, makes it necessary to evolve towards classification on a scale of values (fuzzy classification) that allows the prioritization of the detected Design Smells, based on their impact on quality or on the technical debt generated. In particular, there is not enough work on the impact of Design Smells on quality attributes either, and it may be important to take this impact into account when considering prioritization.

In the current context, it is very complicated to compare the results obtained by different proposals, given the absence of benchmarks and the lack of homogeneity in the performance indicators. It is true that, in many cases, a common set of projects is used, mostly from open repositories such as SourceForge or GitHub. However, the comparative studies we found analyze an average of 4 tools and, what is worse, an average of 4 Design Smells per comparison.

Validation of tools and prototypes is mainly based on empirical studies, case studies, experiments, or combinations of them. To improve validation processes, it would be appropriate to have a reference repository labeled by experts. However, many difficulties arise in this task, such as the absence of a common corpus of Design Smells across tools or the need for highly specialized staff to evaluate a significant volume of information. These reasons also explain the lack of validation by human experts in most of the works. We encourage the community to drive research in these directions over the next few years.