
1 Introduction

Interoperating with externally developed black-box Web Service or Platform APIs is restricted by their conceptual interoperability constraints (COINs), defined as the characteristics controlling the exchange of data or functionality at the following conceptual classes: Syntax, Semantics, Structure, Dynamics, Context, and Quality [2]. Hence, to build a successful interoperation, software architects and analysts need to identify and fulfil these conceptual constraints of the external APIs. Otherwise, unexpected conceptual mismatches can prevent the whole interoperation or render its results meaningless, which in turn causes resolution expenses at later stages of projects [8]. Therefore, it is necessary to perform an effective conceptual interoperability analysis of the shared documents about a software API of interest in order to identify its conceptual constraints. This in turn offers a basis for analyzing interoperability on other levels, which are out of our research scope, such as the organizational level (e.g., privacy concerns), the managerial level (e.g., budget restrictions), and the technical level (e.g., network protocols).

Current analysis approaches rely on the manual investigation of shared API documents [9]. However, such manual reading and inspection of the natural language text in these documents to find constraints is an exhausting, time-consuming, and error-prone task [19]. In addition, it requires knowledge about the different conceptual constraints along with linguistic analysis skills.

In this paper, we elaborate on and extend our proposed conceptual interoperability analysis framework [2]. In particular, we automate the identification of COINs in the text of API documentation by employing machine learning (ML) techniques. Our goal is to assist software architects and analysts in performing effective and efficient conceptual interoperability analysis. We followed a systematic, empirically-based research methodology with two main parts. In the first part, we conducted a multiple-case study that yielded our first contribution, a ground truth dataset. This dataset is a community-reusable asset in the form of a repository of textual sentences that we collected from multiple API documents and manually labeled with a specific COIN class. In the second part, we contributed a classification model for the COINs in the ground truth dataset and evaluated it through experiments using different ML text-classification algorithms. Our experiments revealed promising results towards automating the identification of COINs in the text of API documents. We achieved up to \(70.4\,\%\) precision and \(70.2\,\%\) recall for identifying seven classes of constraints (i.e., Syntax, Semantics, Structure, Dynamics, Context, Quality, and Not-COIN). This increased to \(81.9\,\%\) precision and \(82.0\,\%\) recall for identifying two classes (i.e., COIN and Not-COIN). Finally, we developed a tool prototype that demonstrates the value of our ideas in serving software architects during their interoperability analysis task. Specifically, the tool allows architects to select sentences from API document webpages, and it checks and reports the existence of COINs along with their types. Such a classification service would enhance the interoperability analysis results, especially for inexperienced architects, as it helps in understanding the constraints' impact and how to satisfy them.

The rest of this paper is organized as follows. Section 2 introduces the background, Sect. 3 overviews related work, and Sect. 4 outlines our research methodology. Sections 5 and 6 detail the first and second parts of our research. Section 7 presents our tool support, and Sect. 8 concludes the paper.

2 Background

In this section, we present a brief introduction to conceptual interoperability constraints and to the machine learning techniques used in our research.

2.1 Conceptual Interoperability Constraints

The work presented in this paper is based on the Conceptual Interoperability Constraints (COIN) model [2], which focuses on the non-technical constraints of interoperable software systems and can be applied to different types of software systems (e.g., information systems, embedded systems, mobile systems, etc.). COINs are the conceptual characteristics that govern the interoperability of a software system with other systems. Therefore, a missing or wrong understanding of COINs may undermine the desired interoperability by leading to conceptual inconsistencies or meaningless results. There are six classes of COINs, which we summarize as follows: (1) Syntax COINs state the packaging of constraints (e.g., used terminology or modeling language). (2) Semantic COINs express meaning-related constraints (e.g., goals of methods). (3) Structure COINs depict the systems' elements, their relations, and the arrangements affecting the interoperation results (e.g., data distribution). (4) Dynamic COINs restrict the behavior of interoperating elements (e.g., synchronization features). (5) Context COINs pertain to the external settings of the interoperation (e.g., user and usage properties). (6) Quality COINs capture quality characteristics related to the exchanged data and services (e.g., interoperation response time).
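As an illustrative aid for the code sketches later in this paper, the six COIN classes can be captured as a plain enumeration. This is merely an assumed helper type for illustration, not part of the COIN model itself.

```java
/** Illustrative enumeration of the six COIN classes of the COIN model [2]. */
public enum CoinClass {
    SYNTAX,    // packaging of constraints, e.g., used terminology or modeling language
    SEMANTIC,  // meaning-related constraints, e.g., goals of methods
    STRUCTURE, // elements, relations, and arrangements, e.g., data distribution
    DYNAMIC,   // behavior of interoperating elements, e.g., synchronization features
    CONTEXT,   // external settings of the interoperation, e.g., user and usage properties
    QUALITY    // quality of exchanged data and services, e.g., interoperation response time
}
```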

2.2 Machine Learning for Text Classification

In order to enable the automatic detection of COINs in text, we employed ML text-classification algorithms (e.g., NaïveBayes [10] and Support Vector Machine [18]). The accuracy of such algorithms depends on the quality and the size of the dataset [4], which consists of sentences manually labeled with one of the predefined classification classes. The text classification process consists of:

- Building the classification model, in which all features of the sentences in the dataset are identified and modeled mathematically. In our research, we used two popular techniques for building our model (a minimal sketch follows this list): (1) Bag of Words (BOW) [6], which considers each word in a sentence as a feature, so that a document is represented as a matrix of weighted values; (2) N-Grams [16], which considers each sequence of N adjacent words in a sentence as a feature, where \(N > 0\).

- Evaluating the classification model, in which the manually labeled dataset is divided into training and testing sets. The training set is used for training the ML classification algorithm on the features captured in the model, while the testing set is used for evaluating the classification accuracy. For our research, we used k-fold cross-validation [11], in which our ground truth dataset (i.e., the COINs Corpus) is divided into k folds. Then, \((k-1)\) folds are used for training and one fold is used for testing. Finally, an average over k evaluation rounds is computed.
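As a minimal sketch of the model-building step with the Weka toolkit (which we use in Sect. 6), the following Java snippet turns labeled sentences into a BOW/N-gram feature matrix. The input file name and parameter choices are illustrative assumptions, not our actual experiment configuration.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class FeatureSketch {
    public static void main(String[] args) throws Exception {
        // Labeled sentences: one string attribute plus a nominal class attribute; the file name is an assumption.
        Instances sentences = DataSource.read("coins-corpus.arff");
        sentences.setClassIndex(sentences.numAttributes() - 1);

        // Each single word and each sequence of up to three adjacent words becomes a feature.
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setNGramMinSize(1);
        tokenizer.setNGramMaxSize(3);

        StringToWordVector bow = new StringToWordVector();
        bow.setTokenizer(tokenizer);
        bow.setLowerCaseTokens(true);
        bow.setInputFormat(sentences);

        // The result is a document-term matrix: one row per sentence, one column per feature.
        Instances featureMatrix = Filter.useFilter(sentences, bow);
        System.out.println(featureMatrix.numAttributes() + " features extracted");
    }
}
```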

3 Related Work

A number of previous works proposed automating the identification of some interoperability constraints from API documents. Wu et al. [19] targeted parameter dependency constraints, Pandita et al. [13] inferred formal specifications for method pre-/post-conditions, and Zhong et al. [20] recognized resource specifications. We complement these works and elaborate on the idea of Abukwaik et al. [2] of extracting a comprehensive set of conceptual interoperability constraints.

On a broader scope, other works proposed retrieving information to assist software architects in different tasks. Anvaari and Zimmermann [3] retrieved architectural knowledge from documents for architectural guidance purposes. Figueiredo et al. [7] and Lopez et al. [12] searched for architectural knowledge in emails, meeting notes, and wikis for proper documentation purposes. Although these are important achievements, they do not meet our goal of assisting architects in interoperability analysis tasks.

In general, our work and the aforementioned related works intersect in the utilization of natural language processing techniques to retrieve specific kinds of information from documents. However, they used rule-based and ontology-based retrieval approaches, while we explored ML classification algorithms, which are helpful for information retrieval in natural language text. In addition, our systematic research contributed a reusable ground truth dataset for all COIN types, which enables the replication of related research and the comparison of results.

4 Research Methodology

In this research, we systematically explored the potential of automating the extraction of COINs from API documents using ML techniques. Our research goal, formulated in terms of the GQM goal template [5], is: to support the conceptual interoperability analysis task for the purpose of improvement with respect to effectiveness and efficiency from the viewpoint of software architects and analysts in the context of analyzing text in API documentation within software integration projects. We translate this goal into the following research questions:

RQ1: What are the existing conceptual interoperability constraints, COINs, in the text of API documentation?

This question explores the current state of COINs in real API documents. It also aims at building the ground truth dataset (i.e., the COINs Corpus, a repository of sentences labeled with their COIN class). This forms a main building block towards the envisioned automatic extraction idea.

RQ2: How effective and efficient would it be to use ML techniques in automating the extraction of COINs from text in API documentations?

This question explores the actual benefits of utilizing ML in supporting software architects and analysts in analyzing the text. It aims at building a classification model that will be evaluated through well-known ML classification algorithms.

In order to achieve the stated goal and answer the aforementioned questions, we performed our research in two main parts as follows:

Research Part 1 (Multiple-case study). In this part, we systematically explored the state of COINs in six cases of API documentations. The result of this part is a ground truth dataset (i.e., COINs Corpus). We detail the study design and results in Sect. 5.

Research Part 2 (Experiments). In this part, we started by using the ground truth dataset resulting from the previous part to build the COIN Classification Model. Afterwards, we investigated the accuracy of different ML classification algorithms in identifying COINs in text using our model. We detail the process and results of this research part in Sect. 6.

Our systematic research provided us with traceability between the different activities and their results. Moreover, it enables future researchers to independently replicate our work and to compare the results.

5 Multiple-Case Study: Building the Ground Truth Dataset for COINs

In this section, we describe our multiple-case study design, execution, and results.

5.1 Study Design

Study Goal. We aim at answering the first research question RQ1 that we stated in Sect. 4. In order to do so, we needed to examine real-world API documentations to discover the state of conceptual interoperability constraints in them.

Research Method. We decided to perform a multiple-case study with literal replication of cases from different domains. Such a method aids in collecting significant evidence and drawing generalizable results.

Case Selection. For systematic selection of cases of API documentations, we considered the following selection criteria:

SC1: Mashup Score. This is a published statistical value (Footnote 1) for the popularity of a Web Service API in terms of how frequently it is integrated into new, larger APIs.

SC2: API Type. This can be either Web Service API or Platform API.

SC3: API Domain. This is the application domain for the considered API document (e.g., social blogging, audio, software development, etc.).

Analysis Unit. Our case study has a holistic design, which means that we have a single unit of analysis. This unit is “the sentences in API documents that include COIN instances”. To document and maintain the analyzed sentences, we designed a data extraction sheet that we implemented as an MS Excel sheet. This sheet consists of demographic fields (i.e., API name, date of retrieval, mashup score, API type, API domain, and no. of sentences) and analysis fields (i.e., case id, sentence id, sentence textual value, and the COIN class).

Study Protocol. Our multiple-case study protocol includes three main activities that are adapted from the process proposed by Runeson [17]. The study activities are case selection, case execution, and cross-case analysis, as we summarize in Fig. 1 below and describe in detail in the next subsection.

Fig. 1. Multiple-case study process

5.2 Study Execution and Results

Based on our predefined case selection criteria, in August 2015 we chose six API documentations: four API documents of the Web Service type (i.e., SoundCloud, GoogleMaps, Skype, and Instagram) and two of the Platform type (i.e., AppleWatch and Eclipse-Plugin Developer Guide). These cases cover different application domains (i.e., social micro-blogging, geographical location, telecommunication, social audio, and software development environment). With regard to the mashup score criterion, our four cases of Web Service APIs were chosen to cover a wide range of scores, starting from 30 for Skype and ending with 2582 for GoogleMaps. After selecting our cases, we executed each case as follows:

Data Preparation. We started this step by fetching the API documentation for the selected case from its online website. Then, we read the documents and determined the webpages that had textual content offering conceptual software descriptions and constraints (e.g., the Overview, Introduction, Developer Guide, API Reference, Summary, etc.). Subsequently, we processed the text in the chosen webpages by performing the following:

- Automatic Filtering. We implemented a simple PHP script using the Simple HTML DOM Parser library (Footnote 2) to filter out text noise (i.e., headers, images, tags, symbols, HTML code, and JavaScript code). We passed the URL of the chosen webpage as input to the script and got back a .txt file containing the textual content of the webpage as output (see the sketch after this list).

- Manual Filtering. The automatic filtering fell short in excluding specific types of noise (e.g., text and code mixtures, references like "see also", "for more information", "related topics", copyrights, etc.). These sentences could mislead the machine learning in our later research steps, so we removed them manually.
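Our actual filtering script was written in PHP on top of the Simple HTML DOM Parser. The following Java sketch, which uses the jsoup library instead, only conveys the idea of the automatic filtering step; the URL, the output file name, and the set of removed elements are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PageTextFilter {
    public static void main(String[] args) throws Exception {
        // Fetch the documentation webpage; the URL is a placeholder.
        Document page = Jsoup.connect("https://example.org/api/overview").get();

        // Drop non-textual noise: scripts, styles, headers, images, and code listings.
        page.select("script, style, header, img, pre, code").remove();

        // Keep only the running text and store it for the manual filtering and sentence splitting steps.
        Files.write(Paths.get("overview.txt"),
                page.body().text().getBytes(StandardCharsets.UTF_8));
    }
}
```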

Data Collection. In this step, we cut the content of the text file resulting from the previous step into single sentences within our designed data extraction sheet (.xls file) described in Subsect. 5.1. We completed all fields of the data sheet for each sentence except for the "COIN class" field, which we completed in the next step. Note that we maintained a data storage, in which we stored the original HTML webpages of the selected API documentations, their text files, and their Excel sheets. This enables later replication of our work by other researchers, as documentation changes frequently.

Data Analysis. We manually analyzed each collected sentence in the extraction sheet and carefully assigned it a COIN class. This classification was based on an interpretation criterion, namely the COIN model with its six classes (i.e., Syntax, Semantic, Structure, Dynamic, Context, and Quality). We added a seventh class for sentences with no COIN instance (i.e., the Not-COIN class). For example, a sentence like "A user is encapsulated by a read-only Person object." was classified as a "Structure COIN". In contrast, "You can also use our Sharing Kits for Windows, OS X, Android or iOS applications" was classified as "Not-COIN", as it did not express a conceptual constraint but rather technical information.

The result of this step was critical for our envisioned automatic COIN extraction idea. Hence, the data analysis was performed by two researchers, who independently classified all sentences for each case. Then, in multiple discussion sessions, the two researchers compared their classification decisions and resolved conflicts based on consensus.

Obviously, the case execution process consumed time and mental effort, especially in the data analysis step. Table 1 summarizes the distribution of our 2283 collected sentences among the cases along with the effort (in hours) that we spent in executing them. Noticeably, SoundCloud and Instagram have small documents and consequently the smallest shares of sentences included in our study (i.e., \(9.5\,\%\) and \(11\,\%\)). Meanwhile, the Eclipse documentation is the largest and consequently has the highest share of sentences (i.e., \(28.5\,\%\)).

Table 1. Case-share of sentences and execution effort

Cross-Case Analysis (Answering RQ1: What are the Types of Existing Conceptual Interoperability Constraints, COINs, in the Text of Current API Documentations?). After executing all cases, we arranged the incrementally classified sets of sentences from all cases (i.e., 2283 sentences) into one repository that we call the ground truth dataset or, in ML terms, the COINs Corpus. We developed two versions of this dataset, as follows:

Seven-COIN Corpus, in which each sentence belongs to one of seven classes (i.e., Not-COIN, Dynamic, Semantic, Syntax, Structure, Context, or Quality). Two-COIN Corpus, in which each sentence belongs to one of two classes rather than seven (i.e., COIN or Not-COIN). In fact, the Two-COIN Corpus is derived from the Seven-COIN Corpus by abstracting the six COIN classes into one class. Table 2 shows the difference between the two corpora with example sentences.

Table 2. Example of content in the Seven-COIN and Two-COIN Corpus

The aim of building these two versions of the corpus is to better investigate the performance of the ML algorithms in the later research experiments. We explain this in more detail in Sect. 6.
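The derivation of the Two-COIN labels from the Seven-COIN labels is a simple relabeling of each sentence. A minimal sketch, assuming the labels are stored as plain strings, is:

```java
public class TwoCoinDerivation {
    /** Collapses the six COIN classes into a single "COIN" label; "Not-COIN" stays as-is. */
    static String toTwoCoinLabel(String sevenCoinLabel) {
        return "Not-COIN".equals(sevenCoinLabel) ? "Not-COIN" : "COIN";
    }

    public static void main(String[] args) {
        System.out.println(toTwoCoinLabel("Structure")); // prints "COIN"
        System.out.println(toTwoCoinLabel("Not-COIN"));  // prints "Not-COIN"
    }
}
```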

COIN-Share in the Contributed Ground Truth Dataset. In Fig. 2, we illustrate the distribution of sentences among the COIN classes within the Seven-COIN Corpus (on the left) and the Two-COIN Corpus (on the right). It can be seen that the Not-COIN class, which expresses technical rather than conceptual constraints, dominates the other six classes (i.e., \(42\,\%\)). The Dynamic and Semantic classes have the second and third largest shares. Remarkably, the Structure, Syntax, Quality, and Context instances are very few, with similar shares ranging between \(1\,\%\) and \(5\,\%\) of the dataset.

Fig. 2. COIN-share in the ground truth dataset

COIN-Share in the Cases. On a finer level, we investigated the state of COINs in each case rather than in the whole ground truth dataset. We found that the content of each API document was focused on the Not-COIN, Dynamic, and Semantic classes, similar to the aggregated findings on the complete dataset shown in Fig. 2. For example, in the case of the AppleWatch documentation, \(40.8\,\%\) of the content belongs to Not-COIN, \(26.1\,\%\) to Dynamic, and \(25\,\%\) to Semantic. In addition, all cases devoted less than \(10\,\%\) of their content to the Structure, Syntax, Quality, and Context classes (e.g., Eclipse-Plugin gave them \(8.5\,\%\)).

5.3 Discussion

Technical-Oriented API Documentations. The Not-COIN class accounts for \(42\,\%\) of the total sentences in the investigated parts of the API documents, which were supposed to be conceptual (i.e., overview and introduction sections). A noteworthy example is the GoogleMaps case, which took the focus on technical information to an extreme (i.e., \(63\,\%\) of its content fell under the Not-COIN class, \(11.2\,\%\) under the Dynamic class, \(13.1\,\%\) under the Semantic class, and the rest was shared by the other classes). Accordingly, it is important to raise a flag about the lack of sufficient information about the conceptual aspects of interoperable software units or APIs (e.g., usage context, terminology definitions, quality attributes, etc.). This concern needs to be brought to the notice of researchers and practitioners who care about the usefulness and adequacy of content in API documentations, as it has a direct influence on the effectiveness of architects and analysts in conceptual interoperability analysis activities.

Considerable Presence of Dynamic and Semantic Constraints. Our study findings reveal that the Dynamic and Semantic classes have considerably large shares in current API documents (i.e., \(25\,\%\) and \(24\,\%\) of the dataset). This reflects a favorable awareness of the importance of properly and explicitly documenting API semantics (e.g., data meaning, service goal, conceptual input and output, etc.) and dynamics (e.g., interaction protocol, flow of data, pre- and post-conditions, etc.). Nevertheless, based on the tedious work we went through during our manual analysis of the six cases, we believe that it would be of great help for architects and analysts to have clear boundaries between these two classes of constraints within the verbose text. For example, it would be easier to skim the text if the API goal were separated from its interaction protocol rather than both being blended into long paragraphs. This would offer architects and analysts a better experience and would consequently enhance their analysis results.

COIN-Deficiency in Platform and Web Service API Documents. From our investigated cases, we perceived a convention of assigning insignificant shares to the Structure, Syntax, Quality, and Context classes. Interestingly, the cases varied with regard to which of these four classes they chose to slightly cover.

On one hand, the cases of Web Service APIs were the main contributors to the Context, Quality, and Syntax classes in the ground truth dataset. That is, the documents of GoogleMaps, SoundCloud, Skype, and Instagram provided \(82.5\,\%\) of the Syntax COINs, \(70.4\,\%\) of the Quality COINs, and \(92\,\%\) of the Context COINs. Such a contribution cannot be attributed to the nature of Web Service APIs, as Platform APIs also need to share these COINs explicitly. For example, it is critical for a FarmerWatch application to know the response time offered by the Notification service of the AppleWatch APIs.

On the other hand, the Platform API documents contributed \(56.1\,\%\) of the Structure COINs in the ground truth dataset, while the Web Service API documents contributed \(43.9\,\%\). Note that this is not due to the larger number of sentences that these two documents contributed to the dataset, but rather to the internal case share of Structure COINs: on average, the Platform API documents allocate about \(6\,\%\) of their content to structural constraints, while the Web Service API documents allocate about \(3.6\,\%\).

Observed Patterns for the Dominant Classes in the Ground Truth Dataset. From the considerable amount of sentences for the Not-COIN, Semantic, and Dynamic classes, we observed a number of patterns in terms of frequently occurring terms and sentence forms. We envision that using these patterns in combination with the BOW features in future experiments would enhance the results of automatic COIN identification. Below, we describe some of these patterns; a small sketch of how they could be encoded as features follows the list.

- Patterns of the Not-COIN Class. We observed the presence of "Technical Keywords", which are abbreviations of software technologies (e.g., XML, iOS, XPath, JavaScript, ASCII, etc.). With further analysis, we found that \(30.7\,\%\) of the Not-COIN instances contain technical keywords. Another pattern for this class is variables with a special format (e.g., "XML responses consist of zero or more <route> elements."). Also, sentences starting with specific terms (e.g., "for example", "for more information", "see", etc.) recurred in \(12.8\,\%\) of the Not-COIN instances.

- Patterns of the Dynamic Class. We found a number of recurrent terms related to actions and data/process flow that we gathered into a list called the "Action Verbs", which includes: create, use, request, access, lock, include, setup, run, start, call, redirect, and more. In fact, \(35.8\,\%\) of the sentences with Dynamic COINs contain one or more of these terms. Furthermore, \(24\,\%\) of the Dynamic COIN sentences contain a conditional statement expressing a pre- or post-condition. For example, the sentence "If a command name is specified, the help message for this command is displayed" has a Dynamic COIN that states a pre-condition.

- Patterns of the Semantic Class. We noticed repeated terms and organized them into "Input/Output Terms" (e.g., return, receive, display, response, send, result, etc.), which appear in \(18.8\,\%\) of the Semantic COIN sentences, and "Goal Terms" (e.g., allow, enable, let, grant, permit, facilitate, etc.), which appear in \(16.4\,\%\). For example, the sentence "A dynamic notification interface lets you provide a more enriched notification experience for the user" has a Semantic COIN stating a goal.
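To illustrate how such patterns could complement the BOW features in future experiments, the following sketch encodes a sentence with three Boolean pattern features. The keyword lists are only partial excerpts of our observations, and the encoding itself (simple substring matching) is an assumption rather than part of our evaluated model.

```java
import java.util.Arrays;
import java.util.List;

public class PatternFeatures {
    // Partial excerpts of the observed keyword lists; the full lists are larger.
    static final List<String> TECHNICAL_KEYWORDS = Arrays.asList("xml", "ios", "xpath", "javascript", "ascii");
    static final List<String> ACTION_VERBS = Arrays.asList("create", "request", "access", "lock", "redirect");
    static final List<String> GOAL_TERMS = Arrays.asList("allow", "enable", "let", "grant", "permit");

    /** Encodes one sentence as three Boolean pattern features to be used next to its BOW features. */
    static boolean[] encode(String sentence) {
        String s = sentence.toLowerCase();
        // Substring matching is crude (e.g., "lets" matches "let"); word-level matching would be more precise.
        return new boolean[] {
            TECHNICAL_KEYWORDS.stream().anyMatch(s::contains), // hints at Not-COIN
            ACTION_VERBS.stream().anyMatch(s::contains),       // hints at Dynamic
            GOAL_TERMS.stream().anyMatch(s::contains)          // hints at Semantic
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode(
            "A dynamic notification interface lets you provide a more enriched notification experience")));
    }
}
```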

5.4 Threats to Validity

Case Bias. To obtain significant results and draw generalizable conclusions, we included multiple cases for building the ground truth, which plays a prominent role in our research. We performed a literal replication over six API documents (i.e., SoundCloud, GoogleMaps, Skype, Instagram, AppleWatch, and Eclipse-Plugin Developer Guide) of two different types (Web Service and Platform APIs).

Completeness. Due to resource limitations (i.e., time and manpower), we were unable to analyze the large API documents completely. However, we were careful in selecting inclusive parts of such large documents. For example, out of the huge Eclipse API documentation, we covered the Plugin part.

Researcher Bias. To build our ground truth dataset in a way that guarantees accuracy and impartiality of the results, two researchers separately replicated the manual classification of the cases' sentences based on the COIN model as an interpretation criterion. In multiple discussion sessions, the researchers compared their classification decisions and resolved conflicts based on consensus.

6 Experiments: Automatic Identification of COINs Using Machine Learning

In this section, we detail the design, execution, and results of our experiments.

6.1 Experiments Design

Experiments Goal. This part of our research aims at answering the second research question RQ2 that we stated in Sect. 4. In order to do so, we needed to examine ML techniques to discover their potential in supporting architects and analysts in automatically identifying the COINs in the text of API documents.

Research Method. We built a classification model and ran multiple experiments employing different ML text-classification algorithms. This method enables comparing the algorithms' results and drawing solid conclusions about the advantages of ML in addressing the challenges of manual interoperability analysis.

Evaluation Method and Metrics. We used k-fold cross-validation, which we explained in the background section, with \(k = 10\). As evaluation metrics for classification accuracy, we used the following commonly used measures [14]:

Precision: for a given class, the ratio of sentences that the classification algorithm correctly assigns to that class to the total number of sentences it assigns to that class, whether correctly or incorrectly.

Recall: for a given class, the ratio of sentences that the classification algorithm correctly assigns to that class to the total number of sentences in the corpus that actually belong to that class.

F-Measure: the harmonic mean of precision and recall that is calculated as: \((2*Precision*Recall)/(Precision+Recall)\).
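In per-class terms, with \(TP\), \(FP\), and \(FN\) denoting the true positives, false positives, and false negatives of a given class, these standard measures read:

\[ Precision = \frac{TP}{TP+FP}, \qquad Recall = \frac{TP}{TP+FN}, \qquad F\text{-}Measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \]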

Experiments Protocol. Our experiments protocol includes three main activities: feature selection, feature modeling, and ML algorithms evaluation. We illustrate this protocol in Fig. 3 and describe it in detail in the next subsection. We ran this protocol twice, once for the Seven-COIN Corpus and once for the Two-COIN Corpus.

Fig. 3. Experiments process

6.2 Experiments Execution and Results

We performed all our executions on Weka v3.7.11 (Footnote 3), which is a suite of ML algorithms written in Java with result visualization capabilities. The execution started with processing the textual sentences in our contributed dataset (i.e., the COINs Corpus) using natural language processing (NLP) techniques. The processing included tokenizing sentences into words, lowering case, eliminating stop words (e.g., is, are, in, of, this, etc.), and stemming words to their root form (e.g., encapsulating and encapsulated are reduced to encapsulate).

Feature Selection. After processing the text, we identified the most representative features, or keywords, for the COIN classes within the COINs Corpus using the Bag-of-Words (BOW) and N-Gram approaches, which we explained in the background section. That is, each sentence was represented as a collection of words. Then, each single word and each combination of N adjacent words in the sentence were considered as features, where N ranged between 1 and 3. For example, in a sentence like "A user is encapsulated by a read-only Person object", the word "encapsulate" and the combination "read-only" were considered as two of its features. The output of this step was a set of features for the COINs Corpus.

Feature Modeling. In this stage, the whole COINs Corpus was transformed into a mathematical model. That is, it was represented as a matrix whose columns contained all features extracted in the previous phase, while each row represented a sentence of the corpus. Then, we weighted the matrix, where each cell [row, column] held the weight of a feature in a specific sentence. For weighting, we used the Term Frequency-Inverse Document Frequency (TF-IDF) [15], which is often used for text retrieval. The result was the COINs Feature Model (or the classification model), which is a reusable asset preserving knowledge about conceptual interoperability constraints in API documents.

ML Algorithms Evaluation. We selected a number of well-known ML text-classification algorithms (e.g., NaïveBayes variants, Support Vector Machine, Random Forest, K-Nearest Neighbor (KNN), and more). Then, we ran these algorithms on the classification model resulting from the modeling activity.
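A condensed Java sketch of this protocol on top of the Weka API is shown below. The ARFF file name, the stemmer, and the choice of NaïveBayesMultinomial as the example classifier are assumptions for illustration; the other algorithms listed above are evaluated analogously.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.stemmers.LovinsStemmer;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class CoinExperimentSketch {
    public static void main(String[] args) throws Exception {
        // COINs Corpus: one string attribute (the sentence) and a nominal class attribute (the COIN label).
        Instances corpus = DataSource.read("seven-coin-corpus.arff");
        corpus.setClassIndex(corpus.numAttributes() - 1);

        // Feature selection and modeling: lower-cased, stemmed 1- to 3-grams weighted with TF-IDF.
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setNGramMinSize(1);
        tokenizer.setNGramMaxSize(3);
        StringToWordVector features = new StringToWordVector();
        features.setTokenizer(tokenizer);
        features.setLowerCaseTokens(true);
        features.setStemmer(new LovinsStemmer());
        features.setTFTransform(true);
        features.setIDFTransform(true);

        // ML algorithm evaluation: 10-fold cross-validation of a multinomial Naive Bayes classifier.
        FilteredClassifier classifier = new FilteredClassifier();
        classifier.setFilter(features);
        classifier.setClassifier(new NaiveBayesMultinomial());
        Evaluation eval = new Evaluation(corpus);
        eval.crossValidateModel(classifier, corpus, 10, new Random(1));
        System.out.printf("Precision %.3f, Recall %.3f, F-measure %.3f%n",
                eval.weightedPrecision(), eval.weightedRecall(), eval.weightedFMeasure());
    }
}
```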

Table 3. COINs identification results using different ML algorithms

Evaluation Results (Answering RQ2: How Effective and Efficient Would it be to Use ML Techniques in Automating the Extraction of COINs from Text in API Documentations?).

Effectiveness of Identifying the COINs using ML Algorithms. Here we report the effectiveness results in terms of accuracy metrics in two cases:

- Seven-COIN Corpus Case. The evaluation results showed that the best accuracy in automatically identifying seven classes of interoperability constraints in text was achieved by the ComplementNaïveBayes algorithm (see Table 3). It achieved \(70.4\,\%\) precision, \(70.2\,\%\) recall, and \(70\,\%\) F-measure. In second place came the NaïveBayesMultinomialUpdateable algorithm, with about \(5\,\%\) lower accuracy than the former algorithm. The other algorithms achieved F-measures between \(59.0\,\%\) and \(62.8\,\%\). The worst results were from the KNN algorithms.

- Two-COIN Corpus Case. By applying the same algorithms to the Two-COIN Corpus, we obtained better results. In particular, the accuracy of the ComplementNaïveBayes algorithm increased by almost \(11\,\%\) compared to the results in the Seven-COIN case. That is, the precision increased to \(81.9\,\%\), the recall to \(82.0\,\%\), and the F-measure to \(81.9\,\%\). Similar to the previous case, NaïveBayesMultinomialUpdateable came in second place, and the 2-Nearest Neighbor algorithm had the worst results, as seen in Table 3. Note that we achieved an improvement in accuracy compared to our preliminary investigation results [1], in which we had an F-measure of \(62.2\,\%\) using the NaïveBayes algorithm.

Efficiency of Identifying the COINs Using ML Algorithms. Obviously, the machine beats human performance in terms of the time spent analyzing the text. As we mentioned earlier, analyzing the documents cost us about 44 working hours, while it took the machine far less time. For example, training and testing NaïveBayesMultinomialUpdateable took about 5 s on our complete corpus of 2283 sentences. This efficiency would improve further when using machines with a faster and more powerful CPU (we ran the experiments on a machine with an Intel Core i5-460M CPU at 2.5 GHz).

6.3 Discussion and Limitations

Towards Automatic Conceptual Interoperability Analysis. The achieved effectiveness in the automatic identification of constraints (e.g., \(81.9\,\%\) F-measure) is promising and shows the potential of our ML classification model in serving architects during their interoperability analysis tasks. We consider this accuracy high, as we compared the algorithms' results to our complete sentence-by-sentence manual analysis of the API documents, which we performed for the sake of building a robust corpus. However, in practice, sentences are not examined in such a heavy way, especially when projects are limited in time and manpower. Hence, our model and the results provided in this work are a step towards achieving a good level of automation for classic software engineering practices that are both error-prone and resource-consuming.

Larger Corpus, Better Accuracy Results. It is known in ML that the more classification classes the machine is to be trained to identify, the more training data it requires. This explains the higher accuracy we achieved using the Two-COIN Corpus compared to the Seven-COIN Corpus, even though both contain the same number of sentences. Therefore, we plan to enlarge our corpus to achieve better accuracy in identifying the seven COIN classes.

Unbalanced Amount of Instances for Each Class in the Corpus. As noted, the number of instances per COIN class is not balanced in the corpus. That is, the dominant classes (i.e., Not-COIN, Dynamic, and Semantic) contribute the majority of sentences in the dataset (i.e., \(91\,\%\)), while the other classes (i.e., Structure, Syntax, Quality, and Context) are smaller and share the remaining \(9\,\%\) of the corpus. This affects the classification accuracy for the classes with fewer instances. Therefore, in future work, we intend to increase the number of instances of these minor classes in the training data to achieve higher accuracy results.

7 Tool Support (A Prototype)

To bring our ideas to practical life, we designed a tool as a web browser plugin that aims at assisting software architects and analysts in their conceptual interoperability analysis task. The tool takes sentences from API documents, recognizes whether they contain any conceptual interoperability constraint, and reports their COIN classes within seconds. We implemented an easy-to-use prototype of the tool, in which the architect can highlight a sentence in a webpage of an API document to check whether it contains any COIN (see Fig. 4).

The tool encapsulates our contributed classification model and mirrors the efficiency and accuracy described in Subsect. 6.2. That is, the tool saves time and manual effort by automatically identifying and classifying the conceptual constraints in text within seconds. This functionality offers critical input for architects to understand the impact of the identified constraints and to satisfy them based on their class. Hence, the tool has the potential to improve the effectiveness of interoperability analysis, especially for inexperienced architects.

Fig. 4. Example of the tool identification for a Structure COIN in an API document

We implemented the prototype as a plugin for the Chrome web browser using the Java and JavaScript languages. The functionality is offered as a Web Service, and all communication is over the Simple Object Access Protocol (SOAP). The tool design includes: (1) a Front-End component, developed in JavaScript, that provides the graphical user interface; (2) a Back-End component, developed in Java with the Weka APIs, that is responsible for locating our service on the server, passing it the input sentence, and carrying back the response.
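As a rough sketch of the Back-End service contract (class, method, and endpoint names are illustrative assumptions, not the prototype's actual code), a minimal JAX-WS service exposing the classification over SOAP could look as follows:

```java
import javax.jws.WebMethod;
import javax.jws.WebService;
import javax.xml.ws.Endpoint;

/** Illustrative SOAP endpoint wrapping the COIN classification model; all names are assumptions. */
@WebService
public class CoinClassificationService {

    @WebMethod
    public String classifySentence(String sentence) {
        // In the prototype, the trained Weka classification model would be loaded once and queried here;
        // this stub only shows the service contract.
        return "Structure COIN"; // placeholder result
    }

    public static void main(String[] args) {
        // Publish the service so that the JavaScript front-end can call it over SOAP.
        Endpoint.publish("http://localhost:8080/coin", new CoinClassificationService());
    }
}
```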

8 Conclusion and Future Work

In this paper, we have presented our ideas about supporting software architects in performing seamless conceptual interoperability analysis. The contribution pursued by this work was to utilize ML algorithms for the effective and efficient identification of conceptual interoperability constraints in the text of API documents. Our systematic, empirically-based research included a multiple-case study that resulted in the ground truth dataset. Then, we built an ML classification model, which we evaluated in experiments using different ML algorithms. The results showed that we achieved up to \(70.0\,\%\) accuracy (F-measure) for identifying seven classes of interoperability constraints, which increased to \(81.9\,\%\) for two classes.

In the future, we plan to automate the manual filtering part of the data preparation. We will also analyze further API documents to advance the generalizability of our results. This would also enrich the ground truth dataset, allowing better training of the ML algorithms and accordingly better accuracy in identifying the conceptual interoperability constraints. With regard to the tool, we will extend it to generate full reports about all interoperability constraints in a webpage and to collect instant feedback from users about the automation results. In addition, we plan to empirically evaluate our ideas in industrial case studies.