1 Introduction

In today’s global markets, where change is the only constant, organizations have no choice but to excel at transforming themselves [1]. However, organizational transformations are not easy, and more than 50% of them fail to deliver the expected benefits [2]. Accordingly, practitioners and researchers have proposed a number of best practices and tools to help organizations achieve such a feat despite the numerous challenges it implies. Considered by many as the art of organizational design, the enterprise architecture (EA) approach is one of these practices.

The EA approach rests on three key components: (1) a target enterprise architecture that defines how the organization and its technology assets will need to function in the future, (2) a transformation plan that determines and schedules the transformation projects that the organization will need to execute to implement its target enterprise architecture and (3) an enterprise architecture team that is responsible for creating the target enterprise architecture and the transformation plan [3]. Accordingly, the EA approach allows an organization to transform itself effectively, efficiently and with agility. The EA approach has gained a lot of popularity over the last decade, especially since the adoption of the Service-Oriented Architecture (SOA) style and the Business Process Management (BPM) approach, which have become its pillars. According to the Open Group [4], SOA improves the alignment between the business and information technology communities and facilitates the creation of flexible and reusable assets for enabling end-to-end business solutions. When the SOA style is applied to the EA approach, we talk about Service-Oriented Enterprise Architecture (SOEA).

To help organizations structure and guide their implementation of the EA approach, numerous EA frameworks (EAFs) have been developed over the years. An EAF is defined as a coherent set of principles, methods and models used by practitioners to design, implement and maintain an enterprise’s organizational structure, business processes, systems and infrastructure [5]. EAFs provide organizations with (1) one or more metamodels to describe the architecture, (2) one or more methods to design and maintain the architecture and (3) a common vocabulary and optional reference models used as templates or blueprints [6]. EAFs can also be used as tools to access, organize and communicate various architectures that describe key components of the enterprise [7, 8].

Yet, despite all the frameworks available, most organizations fail to implement the EA approach. One of the main reasons for such failures is that most enterprise architects are unable to select the EAF that will best address the particular needs of their organization. Indeed, because the selection of an EAF is generally one of the first activities conducted when implementing the EA approach, enterprise architects often lack the knowledge and expertise required to make a good decision. To make matters worse, there are over 25 different EAFs available today, and current EAF evaluation tools, which should help organizations select the right EAF, have important limitations. First, they compare only a limited number of frameworks. Second, their relevance over time is limited as the EAFs they compared continue to evolve. Third, they rely on somewhat different criteria, making it difficult for architects to identify the right set of criteria to guide their EAF selection process [9, 10]. Fourth, they limit themselves to the best-known criteria and thus omit several other criteria that may be critical for architects, especially when designing the systems and data architectures (e.g., support of the SOA style and the BPM approach). Finally, and not least, they rely on overly simplistic and subjective operationalizations of the chosen criteria [11].

To help enterprise architects (1) evaluate currently available EAFs and (2) select the EAF that will best address the particular needs of their organization, the present paper develops an EAF evaluation artifact using the design science research (DSR) approach [12, 13]. DSR aims to create and evaluate artifacts and tools that solve problems identified in organizations. Evaluating and selecting an EAF is one of the most important issues when implementing the EA approach, and appropriate tools to help organizations assess currently available EAFs are lacking. Our proposed EAF evaluation tool, which identifies, defines and operationalizes a comprehensive set of 14 criteria, should therefore allow EA practitioners to select the right EAF more effectively and efficiently and, in turn, help them design and implement an EA approach that will act as a catalyst for the future transformations of their organizations.

In this paper, we extend the work presented in [11] by (1) including the principles and the design of a proof-of-concept prototype that explains how the artifact is concretely operationalized, (2) refining the proposed EAF evaluation criteria through explicit definitions and metrics and (3) reporting the results of an empirical experiment that validates the relevance, usability and correctness of the proposed evaluation criteria.

The remainder of this paper is structured as follows. Section 2 presents a literature review on EAF evaluation criteria. Section 3 describes each step and related outputs of the methodology we used to design, develop and evaluate our EAF evaluation criteria. The evaluation addressed three aspects, namely the relevance, usability and correctness of the proposed criteria. Section 4 concludes the article by highlighting the study’s contributions, limitations and directions for future research.

2 Literature review

Among the different types of literature reviews available, we relied on a scoping review, as we wanted to examine the extent, range and nature of research activities on EAF evaluation criteria and grids [14] while focusing more on the breadth of coverage of the literature than on its depth [15]. The findings presented in this section summarize a previously published conference article entitled Enterprise Architecture Framework Selection Criteria: A Literature Review [11]. Our scoping literature review enabled us to identify nine criteria that are generally used to evaluate and select an EAF.

Table 1 presents a definition of each criterion as well as the articles that discuss it.

Table 1 Synthesis of main EAF evaluation criteria in the literature

Taken as a whole, the literature on EAF evaluation criteria has some strengths as well as important shortcomings. On the plus side, this literature identifies and defines several criteria that might be useful when evaluating and selecting an EAF. On the minus side, past efforts do not provide a comprehensive set of evaluation criteria and, most importantly, do not provide appropriate scales to evaluate EAFs along these criteria. Indeed, currently available evaluation tools often limit themselves to very few ‘best-known’ criteria and thus omit several other criteria that may be critical for organizations, especially when designing the systems and data architectures (e.g., support of the Service-Oriented Architecture style and the Business Process Management approach). In addition, among the 18 articles that identified EAF selection criteria and proposed EAF comparison matrices, only nine provided corresponding operationalizations [7, 8, 10, 16, 17, 20,21,22,23]. In other words, only nine articles provided actual scales to help EA practitioners evaluate currently available EAFs against these criteria. While these operationalizations certainly represent a step in the right direction, the scales are very simplistic and thus fail to provide any real support when evaluating an EAF. Specifically, the operationalization of the selection criteria in these nine articles is based on a subjective assessment rather than an objective instantiation or threshold. Take, for example, the comparison matrix proposed by [10], which ranks the main EAFs available at the time along several criteria using a scale ranging from 1 (very poor) to 4 (excellent). While each EAF is given a score, no objective reason is given as to why any given framework receives a high or a low score on a given criterion; the scores are based solely on the subjective assessment of the researchers. As such, although past studies have ranked currently available EAFs along certain criteria, they make it very difficult to understand the scores and ranking of each EAF. Accordingly, these evaluation criteria do not allow EA practitioners to make their own assessment of an EAF. This important limitation, in turn, undermines the relevance over time of these criteria. Indeed, since an EAF continues to evolve over time, it is impossible to know whether modifications brought to an EAF at certain points in time would yield a different evaluation and, most importantly, which scores on which criteria would improve or worsen.

3 Methodology and results

The DSR approach is adopted here to develop and test our new artifact: an EAF evaluation tool. Based on the work of [13], our approach comprised four steps: (1) problem identification and motivation, (2) definition of the objectives for a solution, (3) design and development and (4) evaluation. During steps 1 and 4, members of the industry and experienced enterprise architects were consulted and asked to provide key suggestions, comments and feedback in order to help us develop and evaluate our new artifact. More precisely, experienced architects were consulted to clarify the artifact’s objectives and to assess its relevance, usability and correctness [12, 30]. Semi-structured interviews were used to collect their suggestions, comments and feedback during these steps. The following paragraphs detail what was done and the key findings of each step.

Table 2 Taxonomy—architecture layers
Table 3 Taxonomy—architecture aspects
Table 4 Metamodel—complexity
Table 5 Metamodel—completeness
Table 6 SOA models
Table 7 Reference models
Table 8 Development process
Table 9 Governance process
Table 10 Supporting software
Table 11 Availability of free information and supporting software
Table 12 Architecture practice guidelines
Table 13 Principles
Table 14 Adaptability
Table 15 Usability

3.1 Step 1—problem identification and motivation

During step 1, exploratory interviews were conducted with seven experienced practitioners to assess (1) whether the EA approach was a preoccupation for them, (2) the challenges they faced in regard to implementing such an approach, (3) if they knew which EAF they would use to guide their efforts and (4) if an EAF evaluation tool could help them select the EAF that would best address the needs of their organization. Participants included senior enterprise architects from large organizations and other experienced EA and BPM specialists who had already implemented the EA approach. Among the key findings from this step, transcripts from the exploratory interviews indicated that all respondents agreed that digital transformations have rendered the EA approach a central organizational preoccupation. All participants mentioned that some of their initiatives to implement the EA approach had failed in the past. For example, EA consultants explained that their customers’ first attempt to implement the EA approach often failed because they did not know where to start and lacked the knowledge and experience required to properly coach all the stakeholders involved. In addition, most participants mentioned that leaders had a hard time determining which EAF they should use to guide their efforts when implementing the EA approach. These participants emphasized that selecting an EAF is usually one of the first key decisions to make when implementing the EA approach and that having to make such an important decision, at such an early stage, without enough time to get familiar with the strengths and weaknesses of the different EAFs, was extremely difficult. They also mentioned that although they had come across a few EAF comparison matrices, these were of little value as they did not provide scales that would have allowed them to make their own assessment of currently available EAFs. Finally, all participants mentioned that an objective EAF evaluation artifact like ours would be of great value to enterprise architects. They perceived our DSR effort as a way to obtain such a tool and a means by which they could make an informed decision when selecting an EAF.

3.2 Step 2—definition of the objectives for a solution

During step 2, the findings from the interviews conducted in step 1 and the knowledge gathered during our review of the relevant literature were used to infer the objectives of our artifact. In broad terms, we wanted to develop an artifact to help enterprise architects evaluate and select the EAF that best addresses their needs. Specifically, our objective was threefold. First, our artifact had to identify and define a comprehensive set of criteria that not only assess the ‘best-known’ characteristics of EAFs but also assess their more practical and technical aspects. Second, our artifact had to provide objective scales for the criteria. Third, our artifact had to allow the assessment of currently available EAFs (e.g., TOGAF, DODAF, FEAF). With this aim and these objectives in mind, we elected to design an EAF evaluation tool.

3.3 Step 3—design and development

During step 3, because no such tool had been previously developed, we used a ‘general solution’ strategy to develop the first version of our EAF evaluation artifact [44]. To do so, we first anchored our efforts on the literature reviewed previously as well as the interviews conducted during step 1. In addition, considering the important limits of the literature on EAF evaluation criteria, we conducted additional literature reviews on several closely related EA topics (e.g., EA frameworks, metamodels, SOA, BPM). After doing so, we were confident that our EAF evaluation artifact would not omit any important criteria and that it would be compatible with existing EAFs. Since none of the existing EAF evaluation tools identified in the academic and practitioner literature provide a comprehensive set of EAF evaluation criteria and appropriate scales, the missing criteria and scales were then developed.

3.3.1 Design of the artifact

To design these criteria and their respective scales, existing EAF criteria and EAF comparison matrices were used. Specifically, we first analyzed the different components and particularities of existing EAFs as well as the various EAF evaluation criteria and scales provided in existing comparison matrices. As mentioned above, we also conducted additional literature reviews on several closely related EA topics to ensure that we would not omit any important criteria. In total, 14 criteria were developed. Furthermore, for each of these criteria, we provided the definitions of the key terms required to properly assess it, its metrics, its overarching assessment logic and its objective scale. The 14 identified criteria are specified in Tables 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15.
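As a purely illustrative example of what such an operationalization can look like, the sketch below (in Java) packages one criterion as a set of metrics plus a score on an objective scale; the class, the record fields and, above all, the cut-off values are hypothetical and do not reproduce the actual scales given in Tables 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15.

// Hypothetical sketch of one operationalized criterion (Metamodel-Complexity):
// a set of raw metrics plus an objective mapping onto a 0-3 score.
// The thresholds below are invented for illustration only.
public class MetamodelComplexityCriterion {

    // Raw metrics computed from the EAF metamodel (see Sect. 3.3.2 and Table 4).
    public record Metrics(int numberOfClasses, int numberOfAssociations,
                          int generalizationHierarchies, int maxDepthOfInheritance) {}

    // Maps the metrics onto the criterion's scale; cut-off values are illustrative.
    public int score(Metrics m) {
        int size = m.numberOfClasses() + m.numberOfAssociations();
        if (size <= 30 && m.maxDepthOfInheritance() <= 2) return 3;  // low complexity
        if (size <= 60 && m.maxDepthOfInheritance() <= 4) return 2;  // moderate
        if (size <= 120)                                  return 1;  // high
        return 0;                                                    // very high
    }
}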

Fig. 1 Architecture of the proposed artifact

3.3.2 Prototype architecture: a first sketch

This section discusses the design and implementation of a proof-of-concept prototype that supports the proposed artifact. This first sketch of the artifact was designed and developed to help enterprise architects evaluate EAFs by computing the metrics and assigning a score to each of the criteria.

We designed the core components of the artifact using the SOA style [36]. More precisely, the SOA services were designed as RESTful web services and implemented with the Eclipse Modeling Framework™ (EMF), a Java-based modeling framework that implements EMOF (Essential Meta-Object Facility). As shown in the UML component diagram of Fig. 1, the artifact is based on an Eclipse plugin, eaf.selection.criteria, representing the criteria assessor services.
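Although the prototype’s actual interfaces are not reproduced here, the following minimal sketch illustrates, under our own assumptions about names and paths, what a criterion-assessor contract could look like when exposed as a RESTful resource with JAX-RS.

// Hypothetical sketch of a criterion-assessor service contract, exposed as a
// RESTful resource with JAX-RS. Names, paths and the score range (0-3) are
// assumptions for illustration; the prototype's actual interfaces may differ.
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

@Path("/criteria/{criterionId}")
public interface CriterionAssessorResource {

    // Computes the raw metrics of one criterion (e.g., metamodel complexity)
    // for the EAF identified by eafId and returns them as JSON.
    @GET
    @Path("/metrics/{eafId}")
    @Produces(MediaType.APPLICATION_JSON)
    String computeMetrics(@PathParam("criterionId") String criterionId,
                          @PathParam("eafId") String eafId);

    // Maps the computed metrics onto the criterion's objective scale (0-3).
    @GET
    @Path("/score/{eafId}")
    @Produces(MediaType.TEXT_PLAIN)
    int score(@PathParam("criterionId") String criterionId,
              @PathParam("eafId") String eafId);
}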

The plugin consists of four sub-packages (ea.models, ea.metamodel, ea.taxonomy, ea.process) and fourteen components (SOA services). Each sub-package is designed as an SOA-based Eclipse plugin, and each component provides functions to calculate the metrics of the corresponding evaluation criterion. Note that the metamodel complexity component, ComplexityAssessor, uses the ArchiMate metamodel as a reference model to compute the complexity metrics (see Table 4). To do so, it uses EMF Refactor, an Eclipse open-source tool that supports metrics reporting, smell detection and model refactoring. The Usability Assessor component, on the other hand, depends on external libraries to compute the usability metrics. To assess the completeness of an EAF (i.e., whether the EAF supports the core EA layers and aspects), the Completeness Assessor component uses the components of the ea.taxonomy sub-package (i.e., EA Layers Assessor and EA Aspects Assessor). Finally, the Reference Models Assessor component provides functions to the SOA Assessor through its interface in order to evaluate the SOA criterion.
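The delegation just described, where completeness is assessed by reusing the taxonomy assessors, could be realized along the lines of the sketch below; the class and method names are assumptions made for illustration rather than the prototype’s actual code.

// Illustrative sketch of how the Completeness Assessor might delegate to the
// ea.taxonomy components (EA Layers Assessor and EA Aspects Assessor).
// All class and method names are assumptions for the example.
interface LayersAssessor  { boolean coversAllCoreLayers(String eafId); }   // ea.taxonomy
interface AspectsAssessor { boolean coversAllCoreAspects(String eafId); }  // ea.taxonomy

public class CompletenessAssessor {

    private final LayersAssessor layersAssessor;
    private final AspectsAssessor aspectsAssessor;

    public CompletenessAssessor(LayersAssessor layers, AspectsAssessor aspects) {
        this.layersAssessor = layers;
        this.aspectsAssessor = aspects;
    }

    // An EAF metamodel is considered complete when it covers all core
    // architecture layers (Table 2) and all architecture aspects (Table 3).
    public boolean isComplete(String eafId) {
        return layersAssessor.coversAllCoreLayers(eafId)
            && aspectsAssessor.coversAllCoreAspects(eafId);
    }
}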

3.4 Step 4—evaluation

The evaluation of the proposed artifact required conducting an experiment to assess the relevance, the usability and the correctness of the EAF evaluation criteria. More precisely, this experiment aimed to verify whether the evaluation criteria shown in Table 16 are:

  1. relevant when instantiated in the context of EAFs (i.e., the criteria’s instantiations are meaningful within the context of EAFs),

  2. usable when instantiated to evaluate EAFs, that is, the criteria are effective and do not require advanced technical skills in enterprise architecture, software design and architecture to be assessed in the context of EAFs, and

  3. correct (i.e., they allow architects to evaluate and classify EAFs effectively).

To conduct this experiment, we followed the Goal Question Metric (GQM) approach [45]. GQM defines three levels [45]: (i) the goal of the experiment (conceptual level), (ii) the set of questions used to characterize the way to attain the specific goal (operational level) and (iii) the set of metrics that provides the information needed to answer the questions (quantitative level). In addition, because the goal of this experiment is to assess the relevance, usability and correctness of the criteria, we needed participants with the appropriate expertise to evaluate the fourteen criteria.

The rest of this section is organized as follows. Section 3.4.1 describes the experiment setup: it presents the participant selection (i.e., the experts) and the set of selected EAFs used as experimental objects. Section 3.4.2 discusses the experimental design used to evaluate the relevance, usability and correctness of the criteria. Section 3.4.3 presents the experimental operation and execution. The results of the evaluation are presented in Sect. 3.4.4. Finally, Sect. 3.4.5 discusses the issues that might have affected the validity of the experiment.

3.4.1 Experimental subjects and EAFs

Participant Selection: Three senior enterprise architects volunteered to participate in the experiment and assess the relevance, usability and correctness of our fourteen EAF evaluation criteria. As a group, the participants had several years of experience in enterprise architecture, software architecture and business architecture, as well as solid BPM expertise. Table 17 summarizes the profile of the participants.

Experimental Enterprise Architecture Frameworks: To carry out the experiment, we first studied 12 EAFs. From these, we selected the six that are the most widely used in practice and cited in the literature, namely Zachman, TOGAF (The Open Group Architecture Framework), FEAF (Federal Enterprise Architecture Framework), EAP (Enterprise Architecture Planning), The Enterprise Architecture IT Project (Urbanization approach) and DoDAF (Department of Defense Architecture Framework). All the experts had experience with and strong knowledge of the experimental EAFs in their Current-1 versions (i.e., the latest and previous stable versions). Table 18 lists the experimental EAFs.

3.4.2 Experimental design

According to the GQM paradigm [45], the goal of this experiment is to evaluate the 14 criteria of the artifact (See Table 16) with the purpose of assessing their relevance, usability and correctness from the point of view of the experts. To achieve this goal, we encoded three questions:

  1. RQ1: Are the criteria relevant to evaluate EAFs?

  2. RQ2: Are the criteria usable to evaluate EAFs?

  3. RQ3: Do the criteria evaluate EAFs correctly?

These questions allowed us to gain empirical evidence on whether the identified criteria are relevant, usable and correct in the context of different EAFs.

The context of the experiment is determined by: (i) the six selected EAFs (see Table 18), (ii) the 14 criteria to be applied for evaluating EAFs (see Table 16), and (iii) the three experts that evaluate the criteria being applied to the EAFs (see Table 17).

To conduct the experiment, we presented the 14 criteria and six EAFs to the experts and asked them to judge whether each criterion (e.g., Reference models), as instantiated for a particular EAF (e.g., TOGAF), was perceived as usable, relevant and correct.

For each aspect to validate (i.e., relevance, usability and correctness), we had 14 variables representing the criteria in Table 16. When applied to an EAF, each variable can take two possible values: 0, meaning that the criterion was found not usable/irrelevant/incorrect in the context of the EAF, and 1, meaning that the criterion, as instantiated, was found usable/relevant/correct for the EAF in question.
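A minimal sketch of this encoding, under our own assumptions about types and method names, is shown below: each expert judgment is a binary value recorded per aspect, expert, EAF and criterion, and the share of positive judgments per aspect can then be tallied (14 criteria × 6 EAFs × 3 experts = 252 judgments per aspect).

// Minimal sketch of the binary encoding used in the experiment. For each aspect
// (relevance, usability, correctness), each expert assigns 0 or 1 to each of the
// 14 criteria in the context of each of the 6 EAFs. Types and method names are
// assumptions for illustration.
import java.util.HashMap;
import java.util.Map;

public class EvaluationSheet {

    public enum Aspect { RELEVANCE, USABILITY, CORRECTNESS }

    // Key: "ASPECT/expert/EAF/criterion" -> 0 (not satisfied) or 1 (satisfied).
    private final Map<String, Integer> judgments = new HashMap<>();

    public void record(Aspect aspect, String expert, String eaf, String criterion, int value) {
        if (value != 0 && value != 1) throw new IllegalArgumentException("binary value expected");
        judgments.put(aspect + "/" + expert + "/" + eaf + "/" + criterion, value);
    }

    // Share of positive judgments for one aspect, e.g. relevance across
    // 14 criteria x 6 EAFs x 3 experts = 252 judgments.
    public double positiveShare(Aspect aspect) {
        long total = judgments.keySet().stream()
                .filter(k -> k.startsWith(aspect.toString())).count();
        long positive = judgments.entrySet().stream()
                .filter(e -> e.getKey().startsWith(aspect.toString()) && e.getValue() == 1)
                .count();
        return total == 0 ? 0.0 : (double) positive / total;
    }
}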

The experiment had 42 hypotheses:

\(H1i_0\): The criterion Ci was found to be irrelevant; \(H1i_a = \lnot \, H1i_0\), for i = 1 to 14.

\(H2i_0\): The criterion Ci was found to be not usable; \(H2i_a = \lnot \, H2i_0\), for i = 1 to 14.

\(H3i_0\): The criterion Ci was found to be incorrect; \(H3i_a = \lnot \, H3i_0\), for i = 1 to 14.

3.4.3 Experiment operation and execution

The experiment was conducted in one session. Several documents and instruments were designed to introduce the participants to the context of our research project. The materials included: (1) training slides with an overview of the artifact and the proof-of-concept prototype, (2) the description of the criteria and (3) a questionnaire for gathering the data.

The experiment took place in a single room, and no interaction between subjects was allowed. First, the conductors briefly trained the subjects on the evaluation criteria and the artifact and answered their questions. Then, a 120-minute slot (with no strict time limit enforced) was given to the participants to evaluate the three aspects (i.e., relevance, usability and correctness) of each criterion in the context of each experimental EAF. To do so, the experts first had to compute the metrics of each criterion (e.g., Number of Classes, Number of Associations, Number of Generalization Hierarchies and Maximum DIT for the Metamodel Complexity criterion) in the context of each of the selected EAFs (e.g., TOGAF). To assess the Usability criterion (Table 15), experts were asked to compute the metrics (i.e., effectiveness, efficiency and satisfaction) based on their most recent experience with the selected EAFs. Questions that arose during the session were clarified by the conductors.
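To give an idea of how such metrics can be computed automatically when an EAF metamodel is available as an EMF Ecore model, the sketch below walks over an EPackage and derives the four Metamodel Complexity metrics; it is a simplified illustration under our own assumptions (in particular, the way generalization hierarchies are counted), not the prototype’s actual code.

// Illustrative sketch of computing the Metamodel Complexity metrics with the
// EMF Ecore API, for an EAF metamodel loaded as an EPackage. The metric
// operationalizations here are simplifying assumptions for the example.
import org.eclipse.emf.ecore.EClass;
import org.eclipse.emf.ecore.EPackage;

import java.util.List;
import java.util.stream.Collectors;

public final class MetamodelComplexityMetrics {

    public static void report(EPackage metamodel) {
        List<EClass> classes = metamodel.getEClassifiers().stream()
                .filter(EClass.class::isInstance)
                .map(EClass.class::cast)
                .collect(Collectors.toList());

        // Number of Classes (NoC).
        int noc = classes.size();

        // Number of Associations (NoA): EReferences owned by the classes.
        int noa = classes.stream().mapToInt(c -> c.getEReferences().size()).sum();

        // Number of Generalization Hierarchies (NGH), approximated here as the
        // number of root classes specialized by at least one other class.
        long ngh = classes.stream()
                .filter(c -> c.getESuperTypes().isEmpty())
                .filter(root -> classes.stream()
                        .anyMatch(c -> c.getEAllSuperTypes().contains(root)))
                .count();

        // Maximum Depth of Inheritance Tree (Max DIT).
        int maxDit = classes.stream()
                .mapToInt(MetamodelComplexityMetrics::dit).max().orElse(0);

        System.out.printf("NoC=%d NoA=%d NGH=%d MaxDIT=%d%n", noc, noa, ngh, maxDit);
    }

    private static int dit(EClass c) {
        return c.getESuperTypes().stream()
                .mapToInt(MetamodelComplexityMetrics::dit)
                .max().orElse(-1) + 1;
    }
}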

For each criterion of each EAF, the questionnaire was structured to enable experts, after computing a metric, to: (i) give it a rating between 0 and 3, and (ii) answer binary questions (i.e., Yes/No) related to the aspects to validate (i.e., relevance, usability and correctness). Additional space was available for the experts to add comments explaining their answers to each criterion of each EAF.

Evaluation of the relevance The goal of the first step of the experiment is to evaluate, from the experts’ perspective, the relevance of the 14 criteria in the context of each of the selected EAFs. A criterion is relevant for a particular EAF when (i) it is applicable (i.e., can be instantiated) and (ii) it is meaningful when instantiated.

For this evaluation, we presented to the experts the 14 questions (Is the criterion Ci relevant to evaluate EAFs?, for i = 1 to 14), which we encoded Q1 to Q14. Then, we asked them to answer the 14 questions in the context of each EAF. For each question (Q1 to Q14), a binary variable was used to enable participants to provide their perception of the relevance. As mentioned above, the binary variables have two possible values: 0, which means that the expert finds the criterion irrelevant, and 1, which means that the expert finds the criterion relevant when instantiated in the context of a particular EAF. To collect accurate information about their perception, we included an open-ended question to enable participants to add comments explaining why they think the criterion is relevant or not.

Evaluation of the usability The goal of the second step of the experiment is to evaluate the usability of the 14 criteria in the context of each of the selected EAFs. Through this evaluation, we wanted to know, for a particular EAF, whether or not the 14 criteria of our artifact require advanced technical skills in enterprise architecture, software design and architecture to be measured.

For this evaluation, we presented to the experts the 14 questions (Is the criterion Ci usable to evaluate EAFs?, for i = 1 to 14), which we encoded Q15 to Q28. Then, we asked them to answer the questions in the context of each EAF. For each question (Q15 to Q28), a binary variable was used to enable participants to provide their perception of the criterion’s usability. The binary variables have two possible values: 0, which means that the expert finds the criterion not usable, and 1, which means that the expert finds the criterion usable when instantiated in the context of a particular EAF. In addition, we included an open-ended question to enable participants to add comments explaining why they think the criterion is usable or not.

Evaluation of the correctness The goal of the third step of the experiment is to evaluate, from the experts’ perspective, the correctness of the 14 criteria in the context of the selected EAFs. A criterion is correct for a particular EAF when its associated metrics (i) are correct and (ii) allow architects to evaluate and classify EAFs effectively.

For this evaluation, we presented to the experts the 14 questions (Does the criterion Ci evaluate EAFs correctly?, for i = 1 to 14), which we encoded Q29 to Q42. Then, we asked them to answer the questions in the context of each EAF. For each question, a binary variable was used to enable participants to provide their perception of the criterion’s correctness. The binary variables have two possible values: 0, which means that the expert finds the criterion incorrect, and 1, which means that the expert finds the criterion correct when instantiated in the context of a particular EAF. In addition, we included an open-ended question to enable participants to add comments explaining why they consider the criterion correct or not.

3.4.4 Analysis of the results

Analysis of the relevance assessment

Table 19 depicts the assessments of the relevance aspect in the context of the experimental EAFs. The results show that, overall, the experts found the proposed criteria relevant. Indeed, Expert 1 and Expert 2 deemed all 14 identified criteria relevant. However, Expert 3 found the criterion Metamodel-Completeness (C3) irrelevant. He argued that C3 is based on two other criteria, Taxonomy-Architecture Layers (C1) and Taxonomy-Architecture Aspects (C2): enterprise architects could assess the completeness of the EAF metamodel from C1 and C2, since an EAF metamodel must implement all the specified layers (C1) and aspects (C2). In other words, Expert 3 perceived that it was not necessary to compute the metrics for the criterion Metamodel-Completeness (C3) because they could be inferred; thus, he concluded that C3 was irrelevant. While we agree with Expert 3 that the C3 metrics depend on C1 and C2, we argue that C3 will still be used by (i) EA tool providers when implementing EAF metamodels and (ii) EA architects who need to instantiate the metamodel to create architecture models. Going a step further, EA architects could also infer the SOA Models (C6) metrics from those of Reference Models (C5).

Table 16 Proposed evaluation criteria

These observations allowed us to reject the null hypotheses \(H1i_0\) for i in \(\left\{ 1,2,4,5,6,7,8,9,10,11,12,13,14\right\} \) and to accept their alternative hypotheses, meaning that all criteria, except C3, were perceived by the experts as relevant.

Analysis of the usability assessment

As shown in Table 20, the experts found that the criterion Metamodel-Complexity (C4) was not usable. The experts argued that its metrics are difficult to compute despite the use of measurement tools, including the proof-of-concept prototype. Experts pointed out that the metrics used to assess the metamodel complexity are even more complex to compute when EAF metamodels are described textually or in less formal language. It is worth noting, however, that Expert 3 found that the metrics were usable in the context of the TOGAF framework as its metamodel is well specified.

In addition, Expert 2 found that the criterion Usability (C14) was not easy to measure in the context of the six EAFs. A closer analysis of his comments revealed that he perceived the metrics used to assess the Usability criterion as too complex. He suggested using solely the satisfaction attribute and discarding the other two (i.e., effectiveness and efficiency).

Overall, we believe that the ‘poor performance’ on the criteria C4 and C14 was due to a combination of factors: (1) the metrics are not ‘intuitive’ and easy to compute, (2) measurement data were unavailable (a UML metamodel for C4 and statistical data for C14) and (3) the conductors’ explanations during the experiment lacked clarity.

Based on the results, we rejected the null hypotheses \(H2i_0\) for i in \(\left\{ 1,2,3,5,6,7,8,9,10,11,12,13\right\} \) and accepted their alternative hypotheses, meaning that all the criteria, except C4 and C14, were perceived by the experts as usable.

Table 17 Profile of the experts
Table 18 Selected enterprise architecture frameworks

Analysis of the correctness assessment

Table 21 depicts the assessments of the correctness aspect of the proposed criteria in the context of the experimental EAFs. The experts found that, among the 14 criteria, Taxonomy-Architecture Layers (C1), Metamodel-Completeness (C3), Metamodel-Complexity (C4), and Usability (C14) were not always correct.

First, Expert 1 and Expert 2 indicated that, although several EAFs combine the data and application layers into one, the metrics for the criteria C1 and C3 require a clear separation between data and application. As such, they assessed these two criteria as incorrect. Second, Expert 2 found the metrics for Metamodel-Complexity (C4) and Usability (C14) insufficiently accurate. A closer analysis of his comments showed that he proposed other metrics, which he found more accurate for C4 and C14. More precisely, he proposed using the object-oriented design coupling metrics from [46] to compute model complexity and the single, summated usability metric proposed in [47]. On the other hand, Expert 1 and Expert 3 found that the metrics used for C4 and C14 were correct but not easily applicable to other frameworks (values of 1 shown in bold in Table 21). However, unlike Expert 2, they did not propose other metrics.

Taking into account the experts’ comments, we realized that (1) the metrics for C1 and C3 must be refactored to allow architects to compute them when taxonomy layers are merged and (2) the metrics for C4 and C14 must be improved to allow precise and generic measurement of metamodel complexity and usability.

Based on these observations, we reject the null hypotheses \(H3i_0\) for i in \(\left\{ 2,5,6,7,8,9,10,11,12,13\right\} \) and accept their alternative hypotheses, meaning that criteria C2, C5, C6, C7, C8, C9, C10, C11, C12 and C13 were perceived by the experts as correct.

Figure 2 summarizes the results of the evaluation of the criteria from the point of view of the EA experts. Across the 756 assessments (14 criteria × 6 EAFs × 3 experts × 3 aspects to evaluate), the criteria were perceived as usable in 90.87% of cases, relevant in 97.62% of cases and correct in 90.48% of cases.
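As a consistency check (our own back-calculation, assuming each percentage is computed over the 252 judgments collected per aspect), these shares correspond to the following counts:

\[
\underbrace{14}_{\text{criteria}} \times \underbrace{6}_{\text{EAFs}} \times \underbrace{3}_{\text{experts}} = 252 \text{ judgments per aspect},
\qquad 252 \times 3 = 756 \text{ judgments in total},
\]
\[
\frac{229}{252} \approx 90.87\% \ \text{(usable)}, \qquad
\frac{246}{252} \approx 97.62\% \ \text{(relevant)}, \qquad
\frac{228}{252} \approx 90.48\% \ \text{(correct)}.
\]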

Table 19 Evaluation of the relevance criterion
Table 20 Evaluation of the usability criterion
Table 21 Evaluation of the correctness criterion
Fig. 2 Results of the evaluation

3.4.5 Threats to the validity

In this section, we explain the main issues that may have threatened the validity of the experiment. We consider threats to internal, external and construct validity as discussed in [48].

Internal validity Threats to internal validity compromise our confidence that a relationship between the independent and dependent variables can be confirmed. This is relevant when the study’s goal is to establish a causal relationship between variables. In the particular context of our experiment, threats to internal validity were mainly related to the participants’ experience and to possible information exchange between them, both of which might affect the evaluation of the EAF criteria. To mitigate the threat related to the profile of the experts, we defined a minimum skill set to be met by participants; the experts were selected based on their strong professional experience and knowledge of the selected EAFs. To mitigate the impact of information exchange, the experiment took place in a controlled environment in which the participants were not allowed to communicate with each other.

External validity Threats to external validity might compromise our confidence that the results of an experiment can be generalized. The primary external threat arises from the possibility that the selected EAFs are not representative of other EAFs. To address this issue, we analyzed 12 EAFs and selected the six that were the most used in industry and in the literature. Nonetheless, our results might be valid only for the experimental EAFs, and further replications are needed to improve the generalizability of the results.

Construct validity Construct validity refers to the extent to which the observations or measurement tools actually represent or measure the construct being investigated. In this paper, one possible threat to construct validity arises from the metrics used to compute the criteria, especially the metrics for metamodel complexity and EAF usability as raised by Expert 2. Therefore, conclusions obtained from our correctness evaluation might not be representative of other evaluation methods. To mitigate this concern, mature measurement techniques were used when available.

4 Conclusion and future work

Following the DSR approach, this study designed and tested an EAF evaluation artifact that identifies, defines and operationalizes a comprehensive set of 14 criteria. Overall, the results of the experiment show that the criteria were perceived as usable in 90.87% of the assessments, relevant in 97.62% and correct in 90.48%.

This study contributes to the EA literature in several ways. First, through our review of the literature on EAF evaluation criteria, we present a much-needed and timely overview of this topic. Indeed, given the growing number of EA studies and the disparity of the EA literature, it was important to conduct such a review to organize and summarize past contributions. Most importantly, our review of the literature allowed us to identify key gaps that could explain why EA practitioners still fail to identify the EAF that best addresses the needs of their organization. Our literature review is thus similar but complementary to other EA-related literature reviews [49,50,51,52] that have brought order and meaning to studies on other aspects of the EA approach (e.g., benefits, methodologies, IT alignment).

Second, through the development of our EAF evaluation artifact, we answer EA practitioners’ requests for a tool that is both theoretically sound and practical. Indeed, while previous studies did present evaluation criteria, our artifact is the first to present a comprehensive set of criteria that not only synthesizes already proposed criteria but also includes new criteria (e.g., SOA models and usability). It is important to mention here that the development of these new criteria was made possible by reviewing other research streams complementary to the one on EA evaluation criteria.

Finally, and most importantly, our artifact is the first to operationalize each of its criteria via a set of objective measures and scales. Indeed, in contrast to previous EAF evaluation studies and available EAF comparison matrices, our measures and scales do not rank EAFs based on the subjective assessments of key experts. Instead, they rank EAFs based on their objective and tangible characteristics. As such, by using our artifact, EA practitioners can now make their own objective assessment of candidate EAFs. They can thus make a decision that is informed by their own needs and context rather than by the subjective and out-of-context assessments of unknown experts. By giving EA practitioners this much-needed autonomy, our artifact goes one step further than the ones proposed in previous studies.

While we believe that our artifact covers the most important criteria for effectively selecting an EAF, for our tool to be functionally useful and usable, more research will be needed to (i) develop a web tool that supports enterprise architects in measuring the criteria, (ii) refine certain metrics by taking into account the comments of the experts, especially for the metamodel complexity and usability criteria, and (iii) extend the set of criteria to support the concepts of business architecture as described in [3].