Keywords

1 Introduction

Model-Based Development is continuously evolving [1]. In such a development, models play a major role and serve as the main design artifacts that developers aim at achieving. Due to the difficulties in developing high quality models developers aim at starting from an established point in which an existing model can be used for the planned application/system [2]. To facilitate that approach, repositories of models start emerging including for example [8, 21]. There are many challenges in establishing such repositories [7]. These include technical issues in terms of model representation, scalability, and heterogeneity and semantic issues of how to analyze and query such repositories. In this paper, we aim at addressing the challenge of querying model repositories, and in particular class models. Indeed, existing repositories provide searching capabilities, yet, these mainly refer to meta data (e.g., size and type) or to textual and keyword similarity. However, similarity between model artifacts requires consideration of both or at least semantic similarity and structural similarity [26]. In recent work the graph structure is also taken into account when searching model repositories [11]. Nevertheless, the capability provided neglect the semantic similarity. One of the main benefits of measuring semantics similarity between software artifacts (e.g., code, State Chart, Class Diagram...) is the ability to reuse them so to save time during the software development process [10, 14]. In this paper, we aim at addressing this gap by proposing an algorithm that searches for class models that match the developers intention. The algorithm is based on a greedy search approach where in each iteration the algorithm tries to find the most similar class to the next class of the query using similarity function that considers type similarity, structure similarity and label similarity. We also evaluated the performance of the algorithm by various information retrieval metrics and found the result promising.

The paper is organized as follows. Section 2 analyze related studies for searching and measuring similarities between models. Section 3 introduces the sought algorithm and Sect. 4 presents its evaluation. Finally, Sect. 5 concludes and sets plans for future research.

2 Related Work

Several approaches have addressed the challenge we introduced. In this section we review related studies with respect to the following:

  • Type of similarity refers to the way similarity is being calculated.

  • Query form refers to the considered query form (e.g., text, model, type).

  • Experimental settings and datasets refer to the way the suggested solutions were evaluated.

  • Examined metrics refers to the way similarity and related algorithms are being considered.

  • Performance refers to the values of the mentioned metrics.

Robles et al. suggest a new approach for representing knowledge of UML Class Diagrams (UCD) based on Information Retrieval (IR) techniques and calculating their semantic similarity using ontology [22] graphs. The type of similarity is calculated by a formula that considers semantic similarity between the concepts of the diagrams and the topological similarity between the types of elements in the diagrams. The queries are represented as UCD. To evaluate the proposed method, they used fifteen queries of different levels of complexity and performed the search on a corpus that contains sixty-five UCDs. To demonstrate the importance of using ontology similarity they performed their experiment twice, with and without ontology that is similar to the Boolean model in the IR area. The F-score indicated that using ontology leads to significantly better results of 0.83 compared to 0.36 on average.

Similar to [22], Al-Rhman et al. suggest different type of similarity equations, for example: lexical name similarity, attribute similarity, and operation similarity [4]. Al-Rhman et al. show how weighing each of the elements in the equation affects the total similarity between UCDs. They further suggest a greedy-based matching algorithm that finds the similarity of classes of two models in polynomial time [3]. They proposed several different metrics to measure similarity between UCDs. They examined which metric is the most accurate for UCDs in the same domain or from different domains. The first case contains five diagrams grouped up into ten pairs in the same domain. The results indicate that the lexical name similarity that measures the semantic similarity between classes gave the best results in terms of precision, recall, and accuracy. The second case contains four UCD formed into six pairs. In this case, the neighboring similarity that measures the similarity of two classes is superior in terms of precision, recall, and accuracy. Nikpforova et al. [16] suggest an algorithm that converts each class diagram to a vector and then refers to the similarity between them based on distance calculation. The experiment consists of three small similar UCDs.

Even though these methods show promising results, they rely on the fact that similar UCDs have to be in the same context. To overcome this problem, several works have been proposed to incorporate structural similarity. For UML Sequence Diagram, for example, Salami et al. [24] proposed a genetic algorithm that compares two sequence diagrams based on adjacency matrices. They calculate the Mean Average Precision (MAP) of 10 queries to determine how well their method performs for retrieving similar UML Sequence diagrams. The MAP was approximately 0.75 in all ten different repositories that they searched in.

Yuan et al. [26] categorized UML Class Diagrams into two aspects: intra-structure and inter-structure. Intra-structure represents the inner structure of classes like attributes and operations, whereas inter-structure represents the relationships between classes like association and composition. Based on this categorization, they converted UCDs into UML Class Graphs (UCG) and measured the similarity of the inter and intra-structure of the graphs. The results were promising based on the fact that the similarity of pairs of graphs was similar to the expert’s results. Despite that, they only applied the approach to two domains.

López et al. [11] propose a search engine called MAR for retrieving any EMF model [25] focusing on converting a structure of models into an inverted index [5]. The queries take the form of a model that describes the desired results. They convert each model and query to a bag of paths and use the Okapi BM25 measure [20] for ranking the most relevant models. The similarity function considers the number of times a path appears in both query and model graph and whether those paths are unique.

Another option to consider is the similarity flooding algorithm [13] that aims at finding pairs of elements that are similar in two data schemes or two data instances. Yet, in this work we are interested in paths considering also the labels of the vertices and edges. Nevertheless, such algorithm, can be used as another node similarity function.

In Table 1 we summarized the analysis of the methods we mentioned, along the facets we specified before. The methods measure the similarity between queries and models and assume that similar UCDs have much in common. This assumption may lead to poor performance in cases where users are not fully knowledgeable in the domain that they search for. In this study, we aim at addressing this problem by searching for similar sub-diagrams in UCD and measure the similarity between queries and sub-diagrams of the diagrams. We consider the semantic similarity between labels and attributes of the classes, the type similarity of classes and relationships (for example ‘Class’ and ‘Interface’), and consider the structural path between UCDs while other approaches consider only parts of these similarities.

Table 1. Related work summary

3 Greedy Search in UCD

The problem we are aiming to address is the search for relevant UCD fragments within UCDs given a specific query in the form of class diagram. For that purpose we devised a greedy search algorithm in UCDs that addresses the concerns of semantics. The algorithm adopts the similarity considerations appears in [18] and refers to label matching (exact match among labels), structure matching (in terms of links), semantic matching (in terms of labels semantic similarity and links’ semantics), and type matching. In the following we first set the ground for the algorithm and then elaborate on its design and execution.

3.1 Settings

Definition 1

A UML Class Diagram (UCD) is a diagram (D) that consists of classes (C), attributes (A), and relationships (R). D = (C, A, R).

  • C is a set of classes.

  • A is set of attributes \(A=\{A_1,A_2,...,A_k\}\) where \(A_i \) is a set of attributes of class \(c_i\). Each \(c_i\) has a \(A_i=\{a_1,a_2,...,a_j\}\).

  • R is a set of relationships. Each D has a \(R=\{r_1,r_2,...,r_k\}\). Where \(r_i=\{(c_i,l, c_j)|c_i,c_j \in C, l \in ET\} \). \(ET=\{ Association, Composite, Generalization, Aggregation, and \ Interface Realization\}\).

Definition 2

\(sim_{w}\) refers to the similarity of two words. This can be calculated using for example WordNet [15], Sematch [27], or google w2v [17].

Definition 3

\(sim_{l}\) refers the similarity of two classes labels.

$$\begin{aligned} {sim}_l{(}{list}_1,{list}_2)=\frac{{\varSigma _{w_1\epsilon {list_1}}(\arg {max}}_{w_2\epsilon {list_2}}{(}{sim}_w(w_1, w_2)))}{\arg {\max (length(}{list_1}),length({list_2}))} \end{aligned}$$
(1)

the lists are ordered set of words (of the labels of the classes).

Definition 4

\(sim_{att}\) refers to the similarity of two sets of attributes.

$$\begin{aligned} sim_{att}(att_1,att_2) = \sum _{list_1\epsilon {att_1}}\frac{\arg max_{list_2\in {att_2}}(sim_l(list_1,list_2))}{\arg max(length({att_1}),length({att_2}))} \end{aligned}$$
(2)

Definition 5

\(sim_{t}\) refers to a predefined similarity among classes types. Table 2 presents the similarity among UCD class types. The similarity matrix shown in Table 2 is based on similarities suggested in [22].

Table 2. Class type similarity.

Definition 6

\(sim_{c}\) refers to the similarity of two classes.

$$\begin{aligned} {sim}_c{(}c_1,c_2)=\frac{\mathrm{sim}_l{(}c_1,c_2)*0.75\ + \mathrm{sim}_{att}{(}att_{c_1},att_{c_2})*0.25 +{sim}_t(c_1,\ c_2)}{2} \end{aligned}$$
(3)

The search process should consider both the semantic similarity of the elements as well as their types. Thus, we weight these equally. For the semantic similarity of classes we consider the class label of higher importance with respect to its attributes as these are of secondary relevance and besides they may contain “noise” like generic attributes (e.g., id and status). Therefore we weighted \(sim_l\) higher then \(sim_{att}\).

Definition 7

\(sim_{r}\) refers to a predefined similarity among relationship types. Table 3 presents the similarity among UCD relationships. The similarity matrix shown in Table 3 is based on similarities suggested in [22].

Table 3. Relationship type similarity

Definition 8

class-relationship similarity is calculated as follow:

$$\begin{aligned} sim_{ne}(c_1, r_1, c_2, r_2)= \frac{2*\mathrm{sim}_c{(}c_1,c_2)\ +{sim}_r(r_1,\ r_2)}{3} \end{aligned}$$
(4)

This definition captures the similarity of a class along with it incoming relationship.

Definition 9

A query Q is represented as a UCD. Single path query has a single path of classes and relationships (see Fig. 1a). Multi paths query has multiple paths (see Fig. 1b) without cycles.

Fig. 1.
figure 1

Single path and Multi paths queries

Definition 10

Similarity of paths of the query with those of the diagram is calculated as follow (classRelationshipSim is a set of \({sim_{ne}}\)):

$$\begin{aligned} Path-Sim&\left( D^{1},D^{2},classRelationshipSim\right) = \\&sim_c\left( c_1^{D^{1}},c_1^{D^{2}}\right) + \sum _{i=1}^{\left| classRelationshipSim\right| }{classRelationshipSim_{i}} \end{aligned}$$

In the following we elaborate the algorithm for single path and multi-path queries. We differentiate between two types of class similarity. The first relies solely on the class information, we call this class-based similarity. The second also refers to the relationship connecting the class from the previous one, we call this class-relationship based similarity.

  • The function of class-based similarity returns the most similar class in the UCD to a class within the query as calculated by Definition 6.

    Input: \(D=\left( C,A,R\right) \), query diagram, similarity function. The similarity function is the one appears in Definition 2.

    Output: a class from D.

  • The function of class-relationship based similarity is responsible for retrieving up to K most similar neighbors’ classes that exceed a predefined threshold as determined by Definition 8. If there is no single class that its similarity exceed the threshold, the function returns all the neighbor’s classes.

    Input: \(D=\left( C,A,R\right) \), query class, parent class, similarity function, threshold, K.

    Output: list of classes and their similarities from D.

3.2 The Greedy Search Algorithm

To start the execution of the Greedy Search (GS) algorithm, the number of classes (K) the algorithm handles in each round (that refers to the next step of the query) and the threshold for the similarity need to be determined. The algorithm uses the class-relationship based similarity function to find the appropriate list of classes for its execution in each iteration. We also have to determine the UCD(s) in which the search should take place and similarClasses where the search should start (this task can be achieved by the class-based similarity function). For the query, we have the queryDiagram that need to be found in the set of UCDs. Additional required parameters include the queryID that refers to the id of the class in a queryDiagram and is initialized with ‘\(-1\)’ that represents the first class of the query where we want to start search from. isVisited is a list that contains all the classes that GS algorithm already iterates over in the UCD to allow GS algorithm handles diagrams with cycles. The output of the algorithm is a set of sub-diagrams from the UCD that are the most similar to the queryDiagram. Algorithm 1 describes the entire procedure.

The GS algorithm steps are as following: (line 1) Initialize results with an empty set; (lines 2–4) Stop GS if there are no classes in the similarClasses or there are no classes with a given ID in the queryDiagram; (line 5) iterate over all neighbors of the current class; (lines 6–8) skip already visited classes; (line 9) For each class in similarClasses we add it to isVisited list; (line 10) Get up to K most similarClasses by the class-relationship based function described above; (lines 11–12) If there are no classes in similarClasses or the similarity of the classes in the similarClasses is below the threshold the algorithm will run recursively for the selected similarClasses and queryID, without moving on the next class; (lines 12–14) If there are classes in similarClasses and the similarity of the classes in the similarClasses are above the threshold - the algorithm will run recursively for the selected similarClasses (up to K classes) and queryID, each time moving on to the next class – in other words, in order to allow the return of results that are transitive, the class in the query is promoted only when the similarity obtained for the class is higher than the threshold; (line 15) Merge the results that the algorithm retrieved with the results within the function arguments. The algorithm continues until it passes all the classes and relationships of the query. At the end of the algorithm, sub diagrams from the UCD that answer the query will be obtained. The sub diagrams will be sorted by their similarity ranking, as determined by Definition 10 normalized by the number of classes. The ranking follows the weighted similarity of the relationships and classes and is represented by a number between 0 and 1.

figure a

3.3 GS Execution Example

In the following we demonstrate the execution of GS algorithm on a UCD of a machine architecture shown in Fig. 2 with the query shown in Fig. 3a. We set the threshold to 0.65 and the number of lookup classes (K) to 2.

Fig. 2.
figure 2

The machine architecture UCD

Fig. 3.
figure 3

Query and expected output

  1. 1.

    In the first stage, the class-based similarity function for finding the most similar class in the UCD is used to find the first class of the query diagram. The following result demonstrates the output for the class “Machine” in the query diagram:

    figure b

    (for all other classes, the similarity was below 0.3)

  2. 2.

    Following the highest similarity of 0.809, the GS algorithm starts from the class “Machine”. In this stage, the algorithm keeps trying to find the K (K = 2 in our case) most similar adjacent classes to the selected class in the UCD to the next class of the query diagram, that is, “Sounds”. The results are the following:

    figure c
  3. 3.

    This stage splits into three paths of which the algorithm iterates over neighbors of the relevant classes [‘\(-2\)’], [‘\(-9\)’] and [‘\(-11\)’]. Here, we only show the results after iterating over class [‘\(-11\)’]:

    figure d

    Here again, since all the classes \(sim_{ne}\) value (Definition 8) is below the threshold, all the classes advance recursively to the next step without advancing to the next query class.

  4. 4.

    The GS algorithm terminates only when one of the following scenarios occur:

    1. (a)

      The GS algorithm finished iterating over the classes in the

      queryDiagram therefore, GS algorithm found one sub-diagram among all the potential sub-diagrams:

      figure e
    2. (b)

      There are no more classes to iterate over in the UCD. This happens due two reasons: (1) There are no neighbors; (2) The neighbors already visited by the algorithm. Therefore, all the sub-diagrams that end with class ‘\(-18\)’ are not relevant answers:

      figure f
  5. 5.

    Finally, the results paths are calculated for their similarity based on Definition 10 and are sorted accordingly.

    figure g

    Indeed, the expected outcome that appears in Fig. 3b is included within the result set of the GS algorithm. The final similarity of the expected path is calculated by Definition 10 as follow: \(\frac{0.809+0.624+0.497+0.670}{4} = 0.650\).

4 Evaluation

To evaluate the GS algorithm, we conducted an experiment to check the algorithm performance in various settings.

4.1 Settings

We considered nine domains from which we had UCDs: (1) Bank Management; (2) Library; (3) Machine; (4) Pizza store; (5) Restaurant (6) University; (7) Climate; (8) Tank; and (9) Store. Diagrams (1)–(6) were adjusted from GenMyModelFootnote 1, (7) and (8) were adjusted from [19], (9) was adjusted from [12]. The domains were chosen in order to check whether the algorithm can distinguish between domains that have similar semantic like Pizza store (5) and Store (9), and similar structure like Pizza store (5) and Library (3). Each UCD contains in average 9.44 classes and 9.55 relationships. Each class has in average 1.37 attributes and 2.04 neighbors. The label that might be a class name or an attributes name has 1.12 words in average.

We were also interested in exploring how the query complexity affects the algorithm performance. For that reason we defined three complexity categories based on two parameters:

  1. 1.

    Path: the number of paths from the first class.

  2. 2.

    N-hop: supporting hop relationship in the map according to the expected result. That is, the number of skips on the diagram relationships required to get the desired result.

The categories were the following:

  1. 1.

    Simple the query has one path and n-hop equals to 0

  2. 2.

    Medium the query has one path and n-hop is more then 0

  3. 3.

    Complex the query has more then 1 path

We set 20 queries that were classified into three categories of complexity: simple - 7 queries, medium - 7 queries, and complex - 6 queries.

Examples of a simple, medium, and complex queries are shown in Fig. 4. The names of the classes of the queries are matched to the class names in Fig. 2 to simplify the demonstration of how the parameters influence the difficulty of the categories. Figure 4a is a simple query because the expected result is \(Machine{-}{>}Memory\) therefore, the Path and N-hope results in zero. Figure 4b is a medium query because the expected output is \(Machine{-}{>}Processor{-}{>}CPU\) and as a result N-hope’s value is 1 and Path’s value remains 0. Finally Fig. 4c is a complex query due to the number of Paths in the expected output resulting in two, \(Machine{-}{>}Memory\ and\ Machine{-}{>}Processor{-}{>}CPU\).

Fig. 4.
figure 4

Three complexity query level

The experiment material and results can be found in an online appendix [6]. To evaluate the performance of the GS algorithm we used the following metrics:

  • Domain Mean Reciprocal rank (Domain MRR) indicates whether the right model was retrieved:

    $$\begin{aligned} MRR=\frac{1}{\left| Q\right| }\sum _{i=1}^{\left| Q\right| }\frac{1}{{ran k}_i} \end{aligned}$$
    (5)

    where Q is the set of the queries, and \(rank_i\) is the rank position of the first relevant domain result to the i-th query.

  • Diagram Similarity (D-sim) that measures the similarity between a query diagram and a sub-diagram from the set of UCDs as defined in Definition 10. The D-sim takes into account the highest D-sim score of each query and ignores cases in which no answer was found.

  • Recall@k that determines whether the expected result appears within the top k results.

  • The number of failures in which the search did not succeed.

We executed the GS algorithm with K = 2 and searched for a threshold that maximized the Domain MRR. Figure 5 shows how the Domain MRR metric is affected by the threshold. Following the examination, it seems that a value of 0.65 leads to the best results. Therefore, we set the threshold to 0.65 and measured the metrics for each complexity category.

4.2 Results

Table 4 presents the results of the experiment. The numbers in each cell indicate the average result after running each of the relevant set of queries. Examining the Recall@1, Recall@3, Recall@5, Domain MRR and similarity metrics. The results indicate that the algorithm’s performance decreases as the query complexity increases.

Table 4. Results
Fig. 5.
figure 5

‘th’ influencing on Domain MRR

4.3 Discussion

In this section, we discuss the results with respect to the parameters of the GS algorithm and the input characteristics.

In the experiment we set the number of neighbours (K) to 2. It might be beneficial to increase K in correlation with the actual neighbours of each class (In our case, it was 2.04 in average). This might increase the accuracy of the search and allow the exploration of additional paths.

The threshold parameter also affects the results as it determines the relevant classes from which the algorithm moves forward. It might be that the threshold should be changed based on the domain and the \({sim_w}\) function. The semantic of two classes (\(sim_l\)) depends on two parameters: similarity between two words (\(sim_w\)) and the length of a class’s label in terms of how many words the label contains. For \(sim_l\) and \(sim_{att}\) we noticed that it is affected by the number of words. In the case of long labels the similarity decreases.

4.4 Threats to Validity

The initial evaluation we performed should be taken with caution and should consider the following threats to validity:

  • Small UCD: The diagrams in the experiment are of limited size (9.44 classes and 9.55 relationships in average). We should explore the algorithm with much larger diagrams.

  • Self authoring: We as the authors of the paper, developed the queries, so some biases might exist. Nevertheless, our aim in this evaluation was to challenge the algorithm, so the queries we devised aimed to accomplish that.

  • Simple queries: Both the domains and the queries are quite small. There is a need to incorporate more complex domains and queries.

  • Lack of tuning: Based on informal experience, we set the algorithm parameters as constants. Yet, there is a need to perform sensitivity analysis to learn about the impact of these parameters.

  • Comparing to other works: Indeed, the results should be compared to other alternatives. Yet, such alternatives need to be adjusted for our UCDs repository. For example, converting UCDs into corresponding ontology graph and then run an ontology matching algorithm.

5 Summary

In this paper, we propose a greedy algorithm for searching UCDs. The algorithm takes into structure, semantic, and type similarity compering to other works shown in Table 1. The initial evaluation we preformed shows a promising results in terms of accuracy.

Nevertheless, in the future, we intend to improve the \(sim_{l}\) function using text mining techniques like RNN [23] and LSTM [9] for sentence similarity task. Moreover, we want to test how well the algorithm performs with other domains and examine alternatives for calculating the similarity of the classes and relationships. This includes the tuning of the algorithm’s parameters, either a-priory or during its execution.

Although we evaluate the algorithm on UCDs, we believe it can be used for searching other models. Thus, we also plan to test our conjecture regarding this matter.