1 Introduction

Software reuse can reduce development cost and time and thereby improve the software development process [1]. With the increasing complexity of software, reuse is no longer limited to code but is involved in every phase of the software life cycle, including design, testing and even maintenance [2, 3]. Software design has an enormous influence on the subsequent development process [4, 5], so the reuse of software design is promising. Class diagrams produced in the design phase clearly show the static structure of a system by modeling objects and the relationships between them [6]. Recently, the reuse of class diagrams has received increasing attention [7, 8]. The reuse architecture of class diagrams is shown in Fig. 1.

Fig. 1 The reuse architecture of class diagrams

As shown in Fig. 1, the reuse architecture of class diagrams contains four stages. Existing class diagrams are retrieved, adjusted and then applied to new projects. The newly developed class diagrams are finally added to the repository for future reuse. Among these stages, retrieval, which is based on a similarity measure, is the key. Existing work on similarity measures focuses on semantics [9]. However, a class diagram contains not only semantics but also structure [10]. Class diagrams for modeling a software system are generally created by a team of developers who may have different experience and knowledge backgrounds. It is therefore common that the class diagrams created are not exactly consistent, even within the same project.

Let us look at an example. Suppose that the query class diagram shown in Fig. 2a is given as input. With a semantics-based retrieval, the class diagrams containing either Fig. 2a or Fig. 2b would be retrieved from the reuse repository. The retrieved class diagrams may have different structures because of their different design concerns: Fig. 2a is a student-centered design and Fig. 2b is a lesson-centered design. However, it is possible that an application requires only the class diagrams containing Fig. 2a, together with their related artifacts. If the structural information of the query class diagram is taken into account, the class diagrams containing Fig. 2b would not appear in the retrieval list. Let us look at another example. For the query class diagram shown in Fig. 3, which models the composition of a computer, the reuse repository may not contain any class diagram that models the same project. As a result, a semantics-based retrieval would return no class diagrams. However, the repository may contain structurally similar class diagrams from different projects (e.g., the class diagram modeling a vehicle composition in Fig. 4), which can serve as a useful reference for constructing new, related class diagrams. Therefore, in addition to the semantics of class diagrams, retrieval also needs to consider their structures in order to support structural reuse. The key to structural retrieval is the structural similarity measure.

Fig. 2 UML class diagram examples

Fig. 3 A class diagram modeling a computer composition

Fig. 4 A class diagram modeling a vehicle composition

So far, while more attention has been paid to the semantic similarity measure of class diagrams, little work has been carried out on their structural similarity measure. In this paper, we concentrate on the structural similarity measure of class diagrams. For this purpose, we propose a graph model named UCG (UML class graph) to represent a class diagram. On the basis of the UCG model, we propose algorithms for the structural similarity measure of class diagrams. The main contributions of this paper are summarized as follows.

  1. We propose to consider the reuse of class diagrams from a structural perspective.

  2. We propose a structural similarity measure for structural reuse, in which a UCG represents a class diagram, an algorithm based on the UCG maximum common subgraph sequence (UMCSS) measures the inter-structure similarity, and the UCG edit distance measures the intra-structure similarity.

  3. We carry out an experiment to show the effectiveness of the proposed method.

The rest of this paper is organized as follows. The related work is presented in Sect. 2. Section 3 presents the generic procedure of model transformation, formally defining UML class diagram and UML class graph and providing the transformation rules. The structural similarity measure between UML class graphs is proposed in Sect. 4. Section 5 presents an experiment and analyzes the experimental results. Section 6 concludes this paper.

2 Related work

Since the reuse of software artifacts (e.g., code, components and design models) became valued, progress has mainly been made in semantic similarity [11,12,13,14,15,16,17,18,19,20]. In the most commonly used approach, a reusable artifact is described by a few features, each feature is assigned a value, and the similarity between artifacts is then calculated from the differences between features [11, 13, 16,17,18, 20]. The definition and assignment of features is generally a manual process that requires considerable domain knowledge, and the search for reusable artifacts is keyword based. In [21], a method called case-based reasoning is proposed, in which previous experiences are described as cases (problems and solutions) stored in a case library. Given a query condition, the most similar cases are retrieved and then adapted for reuse in a new project. With the development of the Semantic Web, more ontologies (e.g., WordNet) [22] have been developed and applied to fields such as knowledge engineering and information retrieval [23]. Ontology-based similarity measures are proposed in [24, 25], in which domain and application ontologies are combined to improve the accuracy of the semantic similarity measure [15]. In [15, 19, 20], a relationship is usually represented as a vector of end class and type, and the distance between vectors is used to measure the similarity between relationships; this is essentially a kind of semantic measure and is applicable only within the same project. A few methods have also been proposed for the structural similarity measure [19, 26,27,28,29,30]. In [19, 28], neighborhood information is used to measure the similarity between relationships. In [29], a sequence diagram is represented as a conceptual graph for the similarity measure, in which object names correspond to vertices and messages correspond to edges. The matching is then based on the labels of vertices and the names of edges, which falls into the semantic similarity category. In [30], a state machine diagram is represented as a digraph and the similarity measure is based on an adjacency matrix representation of different edges. In [27], a model query language is designed to rewrite a class diagram for structural matching, where a depth-first algorithm is applied to search for the maximum common part. When the number of relationships contained in the class diagrams is small, this approach works well because few common substructures exist among them. As the size of the class diagrams increases, there may be more than one common substructure, and using only the maximum one makes the calculated structural similarity inaccurate. In addition, a text-based representation is inappropriate for class diagrams because it does not represent their structure intuitively. A graphical and accurate approach is therefore desirable for the structural similarity measure between class diagrams.

The structure of a class diagram can be divided into two aspects: intra-structure and inter-structure. The intra-structure refers to the composition of each class, and the inter-structure consists of the relationships between classes. Both are considered in this paper. We use a graph [29, 30] to represent a class diagram for the structural similarity measure. The vertices and edges of a UCG are classified into different types, and the structural matching is based on the edge tags rather than on the vertices. A UMCSS-based algorithm is proposed for the inter-structure similarity measure, and the UCG edit distance is proposed for the intra-structure similarity measure. The feature vector methods [11, 13, 16,17,18, 20, 24, 25] and the vertex label methods [29, 30] focus on semantics rather than on the actual structure. In contrast to these semantics-based methods, the method proposed in this paper disregards the semantics (end classes) and matches only the tags of edges. It is therefore structural matching in nature and can be applied to structural reuse both within a domain and across domains. Compared with the model query language method of [27], our method considers further common substructures in addition to the maximum common substructure, which can improve the accuracy, especially for the similarity measure between large class diagrams. Additionally, the graphical representation of a class diagram’s structure is more intuitive than a textual one.

3 Model transformation

The OMG (Object Management Group) defines a standard DTD (Document Type Definition) for UML model files, and a UML model is described in an XMI (XML Metadata Interchange) document based on this DTD standard [31]. The structural similarity measure between class diagrams can be treated as a model matching problem. There are two strategies for model matching. The first is to design algorithms directly on the model, and the second is to transform the model into another model and then design algorithms on the new model. Here we choose the latter. In this paper, a graph called UCG is proposed to represent a UML class diagram (denoted as UCD) for the structural similarity measure. The procedure is described in Fig. 5.

Fig. 5 Procedure of using UCG to measure the structural similarity between UCD

This process consists of three steps. The first step, parsing XMI, obtains all elements of the class diagram. Any XML parser based on SAX (Simple API for XML) can be used to parse the XMI model file and obtain the elements (i.e., classes, attributes, operations and relationships) [32]. The elements obtained by parsing are the basis for formalizing the class diagram. To transform a UCD into a UCG, transformation rules need to be defined such that the structural information of the UCD is fully reflected in the UCG. On this basis, the structural similarity between UCDs is converted to the structural similarity between UCGs. Finally, algorithms are proposed for the structural similarity measure.
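The parsing step can be illustrated with a minimal sketch based on Python's standard `xml.sax` module. The element names (`UML:Class`, `UML:Attribute`, `UML:Operation`, `UML:Parameter`) and the sample document are assumptions made for illustration; real XMI exports differ between tools and XMI versions, and relationships are omitted here for brevity.

```python
import xml.sax


class ClassDiagramHandler(xml.sax.ContentHandler):
    """Collects classes, attributes, operations and parameters from an XMI stream.

    The element names below are assumed for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.classes = {}   # class name -> {"attributes": [...], "operations": {op: [params]}}
        self._cls = None    # class currently being read
        self._op = None     # operation currently being read

    def startElement(self, name, attrs):
        if name == "UML:Class":
            self._cls = attrs.get("name")
            self.classes[self._cls] = {"attributes": [], "operations": {}}
        elif name == "UML:Attribute" and self._cls:
            self.classes[self._cls]["attributes"].append(attrs.get("name"))
        elif name == "UML:Operation" and self._cls:
            self._op = attrs.get("name")
            self.classes[self._cls]["operations"][self._op] = []
        elif name == "UML:Parameter" and self._op:
            self.classes[self._cls]["operations"][self._op].append(attrs.get("name"))

    def endElement(self, name):
        if name == "UML:Operation":
            self._op = None
        elif name == "UML:Class":
            self._cls = None


# Hypothetical, heavily simplified XMI fragment.
SAMPLE = """<XMI xmlns:UML="org.omg.xmi.namespace.UML">
  <UML:Class name="Teacher">
    <UML:Attribute name="ID"/>
    <UML:Attribute name="name"/>
    <UML:Operation name="teach">
      <UML:Parameter name="class"/>
    </UML:Operation>
  </UML:Class>
</XMI>"""


def parse_xmi(text):
    """Parse an XMI string and return the collected class elements."""
    handler = ClassDiagramHandler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.classes


print(parse_xmi(SAMPLE))
```

Because SAX is event based, the handler only needs to remember which class and operation it is currently inside; elements it does not recognize are simply skipped.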

UCD and UCG are formally defined, and then, the transformation rules from UCD to UCG are summarized in the following subsections.

3.1 UML class diagram

A UML class diagram is used to model the static structure of a system and consists of classes and relationships between classes [6]. As an abstract representation of a set of objects with the same properties, a class is composed of attributes and operations, as shown in Fig. 6. Relationships between classes are mainly classified into six categories: association, generalization, dependency, aggregation, composition and realization. The example in Fig. 7 is a fragment of a class diagram from an education domain. It contains two classes named “Teacher” and “Professor” and one generalization relationship, indicating that class “Professor” inherits from class “Teacher.”

Fig. 6 A class composition

Fig. 7 An example of UML class diagram

Definition 1

We use a 5-tuple to formally define an UML class diagram and have UCD = (C, A, O, P, R).

  1. C is a set of classes, where C = {c1, c2, c3, …, ck} and ci is a class;

  2. A is a set of attribute sets, where A = {A1, A2, …, Ak}, Ai is the set of attributes contained in class ci, Ai = {ai1, ai2, …, aim}, and aij is the jth attribute of class ci;

  3. O is a set of operation sets, where O = {O1, O2, …, Ok}, Oi is the set of operations contained in class ci, Oi = {oi1, oi2, oi3, …, oin}, and oij is the jth operation of class ci;

  4. P is a set of all the parameters, where P = {P1, P2, …, Pk}, Pi is the set of parameters contained in all the operations of class ci, Pi = {Pi1, Pi2, …, Pin}, Pij is the set of parameters contained in operation oij, Pij = {p1ij, p2ij, p3ij, …, ptij}, and ptij is the tth parameter of operation oij;

  5. R is a set of relationships, where R = {rij | 1 ≤ i, j ≤ |C| and i ≠ j}, rij = (ci, tx, cj) is a relationship between classes ci and cj, tx ∈ T is the type of rij, and T = {t1, t2, t3, t4, t5, t6} is the set of relationship types. Here t1, t2, t3, t4, t5 and t6 correspond to association, generalization, aggregation, composition, dependency and realization, respectively.

For the class diagram in Fig. 7, the two classes “Teacher” and “Professor” are denoted as c1 and c2, respectively; for class “Teacher,” attribute “ID” is denoted as a11, attribute “name” is denoted as a12, operation “teach” is denoted as o11, and parameter “class” is denoted as p111; similarly, the attributes “degree” and “title” of class “Professor” are denoted as a21 and a22, respectively; the generalization relationship between classes “Teacher” and “Professor” is then denoted as r21 = (c2, t2, c1).
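As an illustration of Definition 1, the 5-tuple can be sketched as a small Python data structure and filled with the Fig. 7 example. The class name `UCD`, its field names and the helper `fig7_example` are hypothetical; the text lists no operations for “Professor”, so its operation list is left empty here as an assumption.

```python
from dataclasses import dataclass, field

# Relationship types t1..t6 in the order given by Definition 1.
RELATIONSHIP_TYPES = {
    "t1": "association", "t2": "generalization", "t3": "aggregation",
    "t4": "composition", "t5": "dependency", "t6": "realization",
}


@dataclass
class UCD:
    """Illustrative container for UCD = (C, A, O, P, R)."""
    classes: list = field(default_factory=list)         # C
    attributes: dict = field(default_factory=dict)      # A: class -> attribute names
    operations: dict = field(default_factory=dict)      # O: class -> operation names
    parameters: dict = field(default_factory=dict)      # P: (class, operation) -> parameter names
    relationships: list = field(default_factory=list)   # R: (source class, type, target class)


def fig7_example():
    """Build the Teacher/Professor fragment described in the text."""
    ucd = UCD()
    ucd.classes = ["Teacher", "Professor"]                      # c1, c2
    ucd.attributes = {"Teacher": ["ID", "name"],                # a11, a12
                      "Professor": ["degree", "title"]}         # a21, a22
    ucd.operations = {"Teacher": ["teach"], "Professor": []}    # o11; none listed for c2 (assumed)
    ucd.parameters = {("Teacher", "teach"): ["class"]}          # p111
    ucd.relationships = [("Professor", "t2", "Teacher")]        # r21 = (c2, t2, c1)
    return ucd


ucd = fig7_example()
source, rel_type, target = ucd.relationships[0]
print(source, RELATIONSHIP_TYPES[rel_type], target)
```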

3.2 UML class graph

A graph is an ordered pair (V, E), where V is a set of vertices and E ⊆ V × V is a set of edges, each connecting two vertices [33]. As a powerful modeling tool, graphs are applied in many fields, ranging from computer networks to biomedical science [34]. A core issue in graph applications is model matching [35]. The structure of a UCD is similar to a graph: the classes of a UCD correspond to the vertices of a graph and the relationships of a UCD correspond to its edges. A graph is therefore chosen to represent a UCD for the structural similarity measure. In this section, we propose the UCG to represent a UCD. Unlike a general digraph, a UCG consists of various types of vertices and edges that correspond to the different elements of a UCD.

Definition 2

An UML class graph is defined as UCG = (V, E, L).

  1. V denotes all vertices of a UCG, where V = CV ∪ AV ∪ OV ∪ PV.

    • CV is a set of class vertices and CV = {cv1, cv2, …, cvk}, where cvi is the ith class vertex.

    • AV is a set of sets of attribute vertices and AV = {AV1, AV2, …, AVk}, where AVi = {avi1, avi2, …, avim} is the set of attribute vertices connected to class vertex cvi and avij is the jth attribute vertex.

    • OV is a set of sets of operation vertices and OV = {OV1, OV2, …, OVk}, where OVi = {ovi1, ovi2, …, ovin} is the set of operation vertices connected to class vertex cvi and ovij is the jth operation vertex.

    • PV is a set of all parameter vertices and PV = {PV1, PV2, …, PVk}, where PVi = {PVi1, PVi2, …, PVin} is the set of parameter vertices connected to all operation vertices of class vertex cvi, PVij = {pv1ij, pv2ij, …, pvfij} is the set of parameter vertices connected to operation vertex ovij, and pvtij is the tth parameter vertex.

  2. E denotes all edges of a UCG, where E = AE ∪ OE ∪ PE ∪ RE.

    • AE ⊆ CV × AV is a set of attribute edge sets and AE = {AE1, AE2, …, AEk}, where AEi = {aei1, aei2, …, aeim} denotes the set of attribute edges connected to class vertex cvi and aeij = (cvi, avij) is an attribute edge from cvi to avij.

    • OE ⊆ CV × OV is a set of operation edge sets and OE = {OE1, OE2, …, OEk}, where OEi = {oei1, oei2, …, oein} denotes the set of operation edges connected to class vertex cvi and oeij = (cvi, ovij) is an operation edge from cvi to ovij.

    • PE ⊆ OV × PV is a set of parameter edges and PE = {PE1, PE2, …, PEk}, where PEi = {PEi1, PEi2, …, PEin}, PEij = {pe1ij, pe2ij, …, pefij}, and petij = (ovij, pvtij) is a parameter edge from ovij to pvtij.

    • RE ⊆ CV × CV is a set of relationship edges and RE = {reij | 1 ≤ i, j ≤ |CV| and i ≠ j}, where reij = (cvi, ex, cvj) is a relationship edge from cvi to cvj, ex ∈ ET is the tag of reij and ET = {e1, e2, e3, e4, e5, e6} is the set of relationship edge tags.

  3. L is a label function, which denotes the label of a vertex, L = LC + LA + LO + LP. LC(cvi), LA(avij), LO(ovij) and LP(pvtij) denote the label of class vertex cvi, attribute vertex avij, operation vertex ovij and parameter vertex pvtij, respectively.

In a general digraph, vertices are distinguished by their labels and all edges are regarded as identical except for their weights. The vertices and edges of a UCG, however, are of different types (as defined above). Each type of element plays a different role in an object that is composed of several types of elements. These types of vertices and edges are denoted by the different tags in Table 1 to distinguish them from each other.

Table 1 Element tags of UCG

In the real world, the elements that make up an object are usually of multiple types rather than a single type, so modeling tools like UCG have a wide range of applications. Let us look at an application example of UCG in network topology design. In Fig. 8, a higher bandwidth is designed between two key nodes as the backbone, say e1, and a relatively low bandwidth is assigned between a key node and a general node, say ea and eo. A class vertex is a key node, and attribute vertices and operation vertices are general nodes, which differ from each other and are marked with different colors. In addition, different bandwidths are drawn as edges with different line weights. The same idea can be applied to highway construction planning, where higher-quality roads should be built between key cities and the standards between other cities are less demanding.

Fig. 8 An UCG application case

3.3 Transformation rules

Transformation rules from UCD to UCG are proposed in this section. Here the UCG is used for measuring the structural similarity rather than for a complete matching, so we do not consider the multiplicity of relationships. The visibility (e.g., public, private and protected) of attributes and operations is also ignored in this paper. In the following, we present the detailed transformation rules.

  • Rule 1: class → class vertex

    Class ci in an UCD is transformed into a class vertex cvi in an UCG and the name of class ci becomes the label LC(cvi) of cvi.

  • Rule 2: attribute → attribute vertex and attribute edge

    Attribute aij of class ci in an UCD is transformed to an attribute vertex avij in an UCG and the name of aij becomes the label LA(avij) of avij. Then an attribute edge aeij between cvi and avij is created and the direction is from cvi to avij. The type of attribute aij is assigned to the tag ea of attribute edge with a mark (e.g., ta1, ta2, …, tan).

  • Rule 3: operation (parameter) → operation vertex and operation edge (parameter vertex and parameter edge)

    Operation oij of class ci in a UCD is transformed into an operation vertex ovij in a UCG. Then an operation edge oeij between cvi and ovij is created, directed from cvi to ovij. The name of oij becomes the label LO(ovij) of the operation vertex ovij, and the return type of operation oij is assigned to the tag eo of operation edge oeij with a mark (e.g., rt1, rt2, …, rtn). Unlike an attribute, an operation may contain parameters. A parameter is defined by both name and type and can be handled in a similar way to an attribute, except that a parameter edge is created between the operation vertex and the parameter vertex. So, parameter ptij in a UCD is transformed into a parameter vertex pvtij in a UCG. Then a parameter edge petij between ovij and pvtij is created, directed from ovij to pvtij. The name of parameter ptij becomes the label LP(pvtij) of parameter vertex pvtij, and the type of parameter ptij is assigned to the tag ep of parameter edge petij with a mark (e.g., tp1, tp2, …, tpn).

  • Rule 4: relationship → relationship edge

    Relationship rij between class ci and cj in an UCD is transformed into a relationship edge reij between class vertex cvi and cvj in an UCG. Regarding the direction and tags of relationship edge, Fig. 9 presents the details.

    Fig. 9 The direction setting of relationship edges

With the transformation rules, the UCD in Fig. 7 is converted into an UCG in Fig. 10. Here different types of vertices are denoted with different colors for distinguishing each other.

Fig. 10 UCG transformation sample

All the elements from an UCD can be transformed into corresponding vertices and edges of an UCG based on the above transformation rules. The structure of an UCD is represented as the structure of an UCG. The following is a summary of the model transformation.

for UCD = (C, A, O, P, R)

$$\begin{aligned} & {\forall c_{i} \in C\left( {1 \le \, i \, \le n} \right) \Rightarrow \exists cv_{i} \in CV + L^{C} \left( {cv_{i} } \right)} \hfill \\& {\forall a_{ij} \in A_{i} \left( {1 \le \, i \, \le n} \right) \Rightarrow \exists av_{ij} \in AV_{i} + L^{A} \left( {av_{ij} } \right) \, }\\&\quad{+ ae_{ij} \left( {e_{a} } \right) \in AE_{i} } \hfill \\& {\forall o_{ij} \in O_{i} \left( {1 \le \, i \, \le n} \right) \Rightarrow \exists ov_{ij} \in OV_{i} + L^{O} \left( {ov_{ij} } \right) \, }\\&\quad{+ oe_{ij} \left( {e_{o} } \right) \in OE_{i} } \hfill \\& {\forall p_{ij}^{f} \in P_{ij} \left( {1 \le \, i \, \le n, \, 1 \le \, j \, \le |O_{i} |} \right) \Rightarrow \exists pv_{ij}^{f} \in PV_{ij} }\\&\quad{+ L^{P} \left( {pv_{ij}^{f} } \right) \, + pe_{ij}^{f} \left( {e_{p} } \right) \in PE_{ij} } \hfill \\& {\forall r_{ij} \left( {t_{m} } \right) \in R\left( {1 \le \, i,j \, \le n} \right) \Rightarrow \exists re_{ij} \left( {e_{m} } \right) \in RE} \end{aligned}$$

Then,

$$\begin{array}{*{20}l} {AV = \left\{ {AV_{1} ,AV_{2} , \ldots ,AV_{n} } \right\}} \hfill \\ {OV = \left\{ {OV_{1} ,OV_{2} , \ldots ,OV_{n} } \right\}} \hfill \\ {PV = \left\{ {PV_{1} ,PV_{2} , \ldots ,PV_{n} } \right\}\;\text{and}\;PV_{i} = \left\{ {PV_{i1} ,PV_{i2} , \ldots ,PV_{in} } \right\}} \hfill \\ \end{array}$$

and

$$\begin{array}{*{20}l} {AE = \left\{ {AE_{1} , AE_{2} , \ldots ,AE_{n} } \right\}} \hfill \\ {OE = \left\{ {OE_{1} ,OE_{2} , \ldots ,OE_{n} } \right\}} \hfill \\ {PE = \left\{ {PE_{1} , PE_{2} , \ldots ,PE_{n} } \right\}\;\text{and}\;PE_{i} = \left\{ {PE_{i1} , PE_{i2} , \ldots ,PE_{in} } \right\}} \hfill \\ \end{array}$$

So,

$$\begin{array}{*{20}c} {CV \cup AV \cup OV \cup PV \Rightarrow V} \\ {AE \cup OE \cup PE \cup RE \Rightarrow E} \\ \end{array}$$

and

$$L^{C} + L^{A} + L^{O} + L^{P} \Rightarrow L$$

Finally,

$$\left( {V, \, E, \, L} \right) \Rightarrow UCG$$
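The four rules can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `ucd_to_ucg`, the input layout, the vertex identifiers and the concrete type names (`int`, `String`, `void`) are hypothetical, the raw type name is used directly as the edge tag instead of a mark such as ta1 or rt1, and the direction of a relationship edge is taken from the order of the input tuple because the settings of Fig. 9 are not reproduced in the text.

```python
def ucd_to_ucg(classes, relationships):
    """Transform a class-diagram description into UCG vertices and edges.

    classes: {class name: {"attributes": [(name, type)],
                           "operations": [(name, return type, [(param name, param type)])]}}
    relationships: [(source class, tag, target class)]
    Returns (vertices, edges) with
      vertices: (kind, vertex id, label)
      edges:    (kind, source id, target id, tag)
    """
    vertices, edges = [], []
    cv_of = {}
    for i, (cname, body) in enumerate(classes.items(), start=1):
        cv = f"cv{i}"
        cv_of[cname] = cv
        vertices.append(("class", cv, cname))                     # Rule 1
        for j, (aname, atype) in enumerate(body.get("attributes", []), start=1):
            av = f"av{i}_{j}"
            vertices.append(("attribute", av, aname))             # Rule 2
            edges.append(("attribute", cv, av, atype))
        for j, (oname, rtype, params) in enumerate(body.get("operations", []), start=1):
            ov = f"ov{i}_{j}"
            vertices.append(("operation", ov, oname))             # Rule 3
            edges.append(("operation", cv, ov, rtype))
            for t, (pname, ptype) in enumerate(params, start=1):
                pv = f"pv{i}_{j}_{t}"
                vertices.append(("parameter", pv, pname))
                edges.append(("parameter", ov, pv, ptype))
    for source, tag, target in relationships:                     # Rule 4
        edges.append(("relationship", cv_of[source], cv_of[target], tag))
    return vertices, edges


# Fig. 7 fragment; the attribute, return and parameter types are assumed.
FIG7 = {
    "Teacher": {"attributes": [("ID", "int"), ("name", "String")],
                "operations": [("teach", "void", [("class", "String")])]},
    "Professor": {"attributes": [("degree", "String"), ("title", "String")],
                  "operations": []},
}

vertices, edges = ucd_to_ucg(FIG7, [("Professor", "e2", "Teacher")])
print(len(vertices), "vertices,", len(edges), "edges")
```

Keeping the vertex label separate from the edge tag mirrors the definition of UCG: labels carry the semantics, while tags carry the structural information used later for matching.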

4 Structural similarity measure

The inter-structure of a UCG can be thought of as the structure that remains after deleting the attribute vertices (edges), operation vertices (edges) and parameter vertices (edges); it corresponds to the main frame of a class diagram. The inter-structure plays a decisive role in the structural similarity measure. The intra-structure of a UCG is formed by the elements (i.e., attribute vertices, operation vertices and parameter vertices) connected to a class vertex and corresponds to the composition of a class in a UCD.

The structural similarity measure quantifies the structural difference. The similarity value is limited to [0, 1], where 0 means completely different and 1 means identical. Because a UCG consists of different types of vertices and edges, structures can be matched and compared only among elements of the same type. The correspondences are as follows: a class vertex corresponds to a class vertex, an attribute vertex (edge) to an attribute vertex (edge), an operation vertex (edge) to an operation vertex (edge), a parameter vertex (edge) to a parameter vertex (edge), and a relationship edge to a relationship edge. The structural matching is based on the tags of edges instead of vertices: the same tag indicates the same structure, and different tags indicate different structures. The structural similarity measure between UCGs is defined as follows.

$$Sim\left( {g_{1} , \, g_{2} } \right) = \, \theta *simInter\left( {g_{1} , \, g_{2} } \right) + \left( {1 - \theta } \right)*simIntra\left( {g_{1} , \, g_{2} } \right)$$
(1)

Here simInter and simIntra denote the similarity of inter-structure and the intra-structure, respectively, and θ is the weighting factor (θ is limited to [0, 1] and usually close to 0.9).
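A short sketch of Eq. (1) shows how the weighting factor lets the inter-structure dominate the overall value. The function name and the sample similarity values are hypothetical; only the formula and the typical value θ = 0.9 come from the text.

```python
def structural_similarity(sim_inter, sim_intra, theta=0.9):
    """Combine inter- and intra-structure similarity as in Eq. (1)."""
    for value in (sim_inter, sim_intra, theta):
        if not 0.0 <= value <= 1.0:
            raise ValueError("similarities and theta must lie in [0, 1]")
    return theta * sim_inter + (1 - theta) * sim_intra


# Hypothetical inputs: a strong inter-structure match and a weaker intra-structure match.
print(round(structural_similarity(0.8, 0.5), 4))
```

With θ = 0.9, a change of 0.1 in the inter-structure similarity moves the result by 0.09, whereas the same change in the intra-structure similarity moves it by only 0.01.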

4.1 Preliminary knowledge

Maximum Common Subgraph (denoted as MCS) and Edit Distance (denoted as ED) are frequently used methods for graph matching [36, 37]. The UCG maximum common subgraph and the UCG edit distance are first introduced in this section and then applied to the inter-structure similarity measure and the intra-structure similarity measure, respectively.

4.1.1 UCG maximum common subgraph

Here the UCG Maximum Common Subgraph is taken from the inter-structure of UCGs and is applied only to the inter-structure similarity measure. It is obtained on the basis of the tags of relationship edges instead of class vertices. In the following, the UCG Maximum Common Subgraph is defined first, and then the UCG Maximum Common Subgraph List and the UCG Maximum Common Subgraph Tree are introduced.

Definition 3 (UCG Maximum Common Subgraph)

Let ucg1 and ucg2 be two UCGs. A UCG g with g ⊆ ucg1 and g ⊆ ucg2 is called a UCG Maximum Common Subgraph (denoted as UMCS) of ucg1 and ucg2 if there is no UCG g′ with g′ ⊆ ucg1, g′ ⊆ ucg2 and |g′| > |g|, where |g| denotes the number of relationship edges in g.

Here, the size of an UMCS can be measured by the number of relationship edges existing in UMCS. The number of UMCS may be more than one, especially for UCG with larger size. It is assumed that g1, g2, …, gm are UMCS between ucg1 and ucg2. Then, these UMCS constitute a list called UMCS List (denoted as UMCSL) and we have UMCSL1 = {UMCS11, UMCS12, UMCS13, …, UMCS1m}, where gi is denoted as UMCS1i. Based on each UMCS1i existing in UMCSL1, we can obtain UMCSL2 between (ucg1–UMCS1i) and (ucg2–UMCS1i). That is, UMCSL2 = {UMCS211, UMCS212, …, UMCS2m1, UMCS2m2, …, UMCS2mn}. This process is repeated until there is not any UMCS between the remainders of ucg1 and ucg2. All these UMCSL are inserted into an UMCS Tree shown in Fig. 11. UMCS Tree is initialized as a root node and it is empty.

Fig. 11 UMCS tree

4.1.2 UCG edit distance

The basic idea of graph edit distance comes from string edit distance [38]: it measures the minimum cost of the edit operations needed to transform one graph into another. The edit distance between two graphs g1 and g2 is defined as follows.

$$\text{GED}\left( g_{1}, g_{2} \right) = \min_{1 \le j \le m} \sum_{i = 1}^{k} \text{cost}\left( e_{i} \right), \quad e_{1}, \ldots, e_{k} \in p_{j}\left( g_{1}, g_{2} \right)$$
(2)

Here, cost (ei) denotes the cost of edit operation ei and pj (g1, g2) denotes an edit path for transforming g1 into g2. There may be multiple edit paths for transforming g1 to g2 and the edit distance is to find the path whose edit cost is the least. A standard set of edit operations generally includes insertion, deletion and substitution of both vertices and edges. In this paper, UCG edit distance is proposed and applied to the intra-structure similarity measure, in which only two operations are allowed: insertion and deletion. The label of vertex is ignored when the edit distance is calculated. The reason is that we are talking about structure, not semantics. The edit operations of UCG are summarized in Table 2.

Table 2 UCG editing operations

On the basis of Table 2, we define the UCG edit distance as follows.

$$\text{UCGED}\left( g_{1}, g_{2} \right) = x_{1} * \text{IC}_{1} + x_{2} * \text{IC}_{2} + x_{3} * \text{IC}_{3} + y_{1} * \text{DC}_{1} + y_{2} * \text{DC}_{2} + y_{3} * \text{DC}_{3}$$
(3)

Here, x1, x2, x3, y1, y2 and y3 are coefficients that count how many times the corresponding edit operation is applied. Note that insertion and deletion operations applied to the same kind of object are assigned the same edit cost, that is, IC1 = DC1, IC2 = DC2 and IC3 = DC3. The formula above can then be restated as follows.

$$\text{UCGED}\left( g_{1}, g_{2} \right) = \left( x_{1} + y_{1} \right) * \text{IC}_{1} + \left( x_{2} + y_{2} \right) * \text{IC}_{2} + \left( x_{3} + y_{3} \right) * \text{IC}_{3}$$
(4)

Let us look at an example shown in Fig. 12, where the UCG in Fig. 12a is matched to UCG in Fig. 12b. We calculate the edit distance from UCG in Fig. 12a to UCG in Fig. 12b based on the formula (4).

Fig. 12 UCG edit distance case

Obviously, after deleting an operation vertex ov11 and its corresponding operation edge oe11, inserting an attribute vertex av12 and its attribute edge ae12 to cv1, and adding two operation vertices ov21 and ov22 and their corresponding operation edges oe21 and oe22 to cv2, the UCG in Fig. 12a becomes the UCG in Fig. 12b in the structure. The edit path is shown from Step (1) to Step (4) in Fig. 13, where UCG edit distance is UCGED (a, b) = IC1 + 3IC2.

Fig. 13 Editing path from UCG in Fig. 12a to UCG in Fig. 12b
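The worked example can be reproduced with a small sketch of formula (4). Several points are assumptions rather than statements from the text: the mapping of IC1, IC2 and IC3 to attribute, operation and parameter elements is inferred from the result IC1 + 3IC2; the class vertices of the two graphs are taken as already matched; elements are compared as multisets of edge tags; and the tags (`ta1`, `rt1`, …) and the numeric costs are hypothetical stand-ins for the contents of Fig. 12 and Table 2.

```python
from collections import Counter


def multiset_diff(tags_a, tags_b):
    """Return (deletions, insertions) needed to turn multiset tags_a into tags_b."""
    a, b = Counter(tags_a), Counter(tags_b)
    deletions = sum((a - b).values())    # elements only in the source
    insertions = sum((b - a).values())   # elements only in the target
    return deletions, insertions


def ucg_edit_distance(g1, g2, costs=(1.0, 1.0, 1.0)):
    """Insert/delete-only edit distance over the intra-structure.

    g1, g2: {class vertex: {"attr": [tags], "op": [tags], "param": [tags]}}
    costs:  (IC1, IC2, IC3), assumed to apply to attribute, operation and
            parameter elements, with insertion cost equal to deletion cost.
    """
    total = 0.0
    for cv in set(g1) | set(g2):
        for kind, cost in zip(("attr", "op", "param"), costs):
            deleted, inserted = multiset_diff(g1.get(cv, {}).get(kind, []),
                                              g2.get(cv, {}).get(kind, []))
            total += (deleted + inserted) * cost
    return total


# Assumed reconstruction of Fig. 12a and Fig. 12b.
GRAPH_A = {"cv1": {"attr": ["ta1"], "op": ["rt1"]},
           "cv2": {"attr": ["ta1"], "op": []}}
GRAPH_B = {"cv1": {"attr": ["ta1", "ta2"], "op": []},
           "cv2": {"attr": ["ta1"], "op": ["rt1", "rt2"]}}

# With IC1 = 2 and IC2 = 3 the result is IC1 + 3 * IC2.
print(ucg_edit_distance(GRAPH_A, GRAPH_B, costs=(2.0, 3.0, 5.0)))
```

Because only insertion and deletion are allowed and vertex labels are ignored, the distance per class vertex reduces to the size of the symmetric difference between the two tag multisets, weighted by the cost of each element kind.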

4.2 Similarity measure

Similarity is based on the common parts of the objects being matched. Let us see an example. Two UCGs g1 and g2, transformed from UML class diagrams in an education domain, are shown in Fig. 14; they have similar structures. To save space, we show only the inter-structures of g1 and g2 and omit the labels of the vertices. Note that identical tags of class vertices in g1 and g2 (e.g., cv1, cv2, …, cv6) do not mean that these vertices are identical. Again to save space, the intra-structures are not drawn; instead, the distributions of attribute vertices (edges) and operation (parameter) vertices (edges) connected to each class vertex in g1 and g2 are given in Tables 3 and 4, respectively. In this section, the inter-structure similarity and the intra-structure similarity are discussed in turn.

Fig. 14 UCG examples for the structural similarity measure

Table 3 Distribution of attribute vertices and operation (parameter) vertices in g1
Table 4 Distribution of attribute vertices and operation (parameter) vertices in g2

4.2.1 Inter-structure similarity

The UMCS Tree provides a way of using common parts to measure the inter-structure similarity. Each path from the root to a leaf node constitutes a UMCS Sequence (denoted as UMCSS). A preorder traversal of the UMCS Tree yields all UMCSS. We have UMCSSi = {UMCS1j, UMCS2jp, …, UMCSwjp….k}, where |UMCS1j| ≥ |UMCS2jp| ≥ … ≥ |UMCSwjp….k|. The UMCSSi with the largest total number of relationship edges is then chosen to measure the inter-structure similarity between two matched UCGs, as defined in Eqs. (5) and (6). There may be more than one such UMCSSi.

$$simInter(ucg_{1}, ucg_{2}) = \frac{\max\left( \left| \text{UMCSS}_{1} \right|, \left| \text{UMCSS}_{2} \right|, \ldots, \left| \text{UMCSS}_{n} \right| \right)}{\min\left( \left| ucg_{1} \right|, \left| ucg_{2} \right| \right)}$$
(5)
$$\left| {\text{UMCSS}_{i} } \right| = \sum\nolimits_{{\text{UMCS} \in \text{UMCSS}_{i} }} {\left| {\text{UMCS}} \right|}$$
(6)
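Given a UMCS tree, Eqs. (5) and (6) amount to finding the root-to-leaf path with the largest total number of relationship edges. The sketch below assumes a hypothetical tree encoding in which every node is a pair (number of relationship edges of the UMCS, list of child nodes) and the empty root has size 0; the sample tree and edge counts are invented for illustration.

```python
def umcss_sizes(node, prefix=0):
    """Yield |UMCSS| (Eq. 6) for every root-to-leaf path of a UMCS tree."""
    size, children = node
    total = prefix + size
    if not children:
        yield total
    else:
        for child in children:
            yield from umcss_sizes(child, total)


def sim_inter(tree, n_edges_g1, n_edges_g2):
    """Inter-structure similarity (Eq. 5).

    n_edges_g1, n_edges_g2: numbers of relationship edges of the two UCGs.
    """
    smaller = min(n_edges_g1, n_edges_g2)
    if smaller == 0:
        return 0.0   # assumed convention for graphs without relationship edges
    return max(umcss_sizes(tree)) / smaller


# Hypothetical UMCS tree: two UMCS of size 3 at level 1, smaller ones below.
TREE = (0, [
    (3, [(2, []), (1, [(1, [])])]),
    (3, [(1, [])]),
])

print(sorted(umcss_sizes(TREE)))   # path sums of all UMCSS
print(sim_inter(TREE, 6, 7))
```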

Now, an important task is to create the UMCS Tree. The algorithm of creating UMCS tree is described in Algorithm 1.

Algorithm 1 (creation of the UMCS tree)

The UMCS Tree t is initialized as a root node whose content is NULL. In Step 1, mcsl is used to store the UMCSL between g1 and g2. The construction of the UMCS tree repeatedly obtains a UMCSL and inserts it into the UMCS tree (Step 1 to Step 7) until no UMCSL remains (Step 10). This process is recursive. It can be seen from Algorithm 1 that, to create the UMCS tree, we first need to obtain the UMCSL; we propose Algorithm 2 for this purpose.

figure b

Algorithm 2 performs a depth-first search. Here, S is a state that stores the common subgraph between g1 and g2 under construction, i.e., a fragment of a UMCS to be formed. Because there may be more than one UMCS, mcsl is used to store all of them. S and mcsl are initialized as empty (Step 1 and Step 2). Then a relationship edge reij from g1 is considered for addition to S: it is necessary to check whether the common subgraph represented by the current state S can be extended by adding reij. If the extension succeeds, the new state replaces the old one. If the current partial solution is larger than the stored solution, it becomes the new stored solution and is inserted into mcsl (Step 4 to Step 11). The functions saveCurrentMCS, clearMCSL and insertMCSL save a UMCS to mcsl, clear mcsl and insert a UMCS into mcsl, respectively. If the current partial solution is equal in size to the stored solution and is not yet contained in mcsl, it is appended to mcsl as another UMCS (Step 12 to Step 13), and the search for the next UMCS continues. In Step 17, backState(S) restores the previous state of S.
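The following Python sketch shows an edge-based backtracking search in the spirit of Algorithm 2; it is not the algorithm itself. It assumes that a UCG is reduced to a list of (source, target, tag) triples, that a common subgraph must stay connected, and that fresh copies of the state take the place of backState(S). The names `find_umcsl` and `_extend_mapping` are hypothetical, and the search is exponential, so it is only meant for small graphs.

```python
def _extend_mapping(e1, e2, vmap):
    """Return vmap extended by matching e1 to e2, or None if the match is inconsistent."""
    (u1, v1, tag1), (u2, v2, tag2) = e1, e2
    if tag1 != tag2:
        return None
    new = dict(vmap)
    for a, b in ((u1, u2), (v1, v2)):
        if a in new:
            if new[a] != b:                 # vertex already mapped elsewhere
                return None
        elif b in new.values():             # target vertex already taken
            return None
        else:
            new[a] = b
    return new


def find_umcsl(g1, g2):
    """Return all maximum connected common edge subgraphs as sets of (edge1, edge2)."""
    mcsl, best_size, seen = [], 0, set()

    def search(pairs, vmap):
        nonlocal mcsl, best_size
        state = frozenset(pairs)
        if state in seen:                   # same partial solution, other order
            return
        seen.add(state)
        if len(pairs) > best_size:          # larger: clear mcsl, store the new solution
            mcsl, best_size = [state], len(pairs)
        elif pairs and len(pairs) == best_size:
            mcsl.append(state)              # equal size: keep as another UMCS
        used1 = {p[0] for p in pairs}
        used2 = {p[1] for p in pairs}
        for e1 in g1:
            if e1 in used1:
                continue
            if pairs and e1[0] not in vmap and e1[1] not in vmap:
                continue                    # edge would not touch the current subgraph
            for e2 in g2:
                if e2 in used2:
                    continue
                new = _extend_mapping(e1, e2, vmap)
                if new is not None:
                    search(pairs + [(e1, e2)], new)

    search([], {})
    return mcsl


g1 = [("A", "B", "assoc"), ("B", "C", "gen"), ("D", "E", "agg")]
g2 = [("X", "Y", "assoc"), ("Y", "Z", "gen"), ("P", "Q", "agg")]
for umcs in find_umcsl(g1, g2):
    print(sorted(umcs))
```

In this toy input the two connected edges A→B and B→C form the only maximum common subgraph; the isolated edge D→E matches P→Q but cannot be attached to it.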

It is well known that obtaining the MCS of two graphs is an NP-hard problem, but the actual computation time is still acceptable in many applications, because the graphs encountered in practice usually differ from the worst cases found in general graphs. For a UCG, the characteristics of nodes and edges can often be used to reduce the search time dramatically [39]. Figure 15 gives the best and worst cases that may occur in the inter-structure similarity measure.

Fig. 15 The inter-structure similarity cases

In the best case, shown in Fig. 15a, each relationship edge of G1 matches exactly one relationship edge of G2, and the UMCS is easily obtained. In the worst case, shown in Fig. 15b, all relationship edges in both G1 and G2 have the same tag. The UCG then degenerates into a general digraph, and obtaining the UMCS becomes an NP-hard problem. Such a worst case is very unlikely to occur, because a UCG is transformed from a UCD and it is implausible that all relationships of a UCD are the same. Moreover, the average number of class vertices of a UCG is generally not more than 30 [40], so a UCG is not a large graph and the worst-case running time remains tolerable. The basic idea of obtaining the UMCS in this paper mainly comes from McGregor [36]; the difference is that our search for the UMCS starts from edges instead of vertices.
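The gap between the two cases can be made concrete by counting the edge pairs that carry the same tag, since only such pairs are candidates for matching. The Python sketch below is illustrative only: the function name `candidate_pairs` and the tags are made up, and the count is a rough indicator of the search space rather than a complexity bound.

```python
from collections import Counter


def candidate_pairs(tags_g1, tags_g2):
    """Count (edge of G1, edge of G2) pairs whose tags are equal."""
    c1, c2 = Counter(tags_g1), Counter(tags_g2)
    return sum(c1[tag] * c2[tag] for tag in c1)


# Best case: every tag occurs once per graph, so each edge has one candidate.
print(candidate_pairs(["assoc", "gen", "agg"], ["assoc", "gen", "agg"]))  # 3
# Worst case: all tags are equal, so every edge can match every edge.
print(candidate_pairs(["assoc"] * 3, ["assoc"] * 3))                      # 9
```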

We now calculate the inter-structure similarity between g1 and g2 in Fig. 14 with the proposed algorithm. A UMCS tree is needed; it is initialized as a root node that does not contain any vertices or edges. The specific process is as follows:

(1) Obtaining UMCSL1 between g1 and g2

Two UMCS between g1 and g2 can be obtained, which are shown in Fig. 16 as (a) UMCS11 and (b) UMCS12, circled with a dotted rectangle and a dotted ellipse, respectively. We have UMCSL1 = {UMCS11, UMCS12}. All elements of UMCSL1 are inserted into the UMCS tree.

Fig. 16 UMCSL1

(2) Searching UMCSL2 between the remainders of g1 and g2

The remainders g1 − UMCS11 and g2 − UMCS11, as well as g1 − UMCS12 and g2 − UMCS12, are shown in Fig. 17.

Fig. 17 The remainders of g1 and g2

The vertices marked by dotted lines, such as cv1 and cv5 in Fig. 17a, have become part of the existing UMCS. A relationship edge exists only if the class vertices at both of its ends exist. Clearly, no complete relationship edge is left in g1 − UMCS11, although a few unmatched relationship edges remain in g2 − UMCS11, as shown in Fig. 17b. So, no UMCS exists between g1 − UMCS11 and g2 − UMCS11. A UMCS between g1 − UMCS12 and g2 − UMCS12 is easily found; it is circled with a dotted rectangle and denoted as UMCS221 in Fig. 18. That is, UMCSL2 = {UMCS221}. The search then stops because no relationship edge is left in the remainder g1 − UMCS12 − UMCS221. As shown in Fig. 19, the element of UMCSL2 is also inserted into the UMCS tree.

Fig. 18 UMCS221

Fig. 19 UMCS tree

Two paths exist in the UMCS tree: UMCSS1 = {UMCS11} and UMCSS2 = {UMCS12, UMCS221}, where |UMCSS2| > |UMCSS1|. That is, the inter-structure similarity between g1 and g2 is measured by UMCSS2. Using formulas (5) and (6), the inter-structure similarity is calculated as follows.

$$SimInter\left( g_{1}, g_{2} \right) = \frac{\left| \text{UMCS}_{2}^{1} \right| + \left| \text{UMCS}_{21}^{2} \right|}{\min\left( \left| g_{1} \right|, \left| g_{2} \right| \right)} = (3 + 1)/5 = 0.80$$

The corresponding class vertex matching pairs in the inter-structure similarity are listed in Table 5.

Table 5 Class vertices matching pairs in the inter-structure similarity

Here, the relationship edges re21 and re31 of g1 carry the same tag. So, matching pairs 2 and 3 can alternatively be adjusted to match g1.cv2 with g2.cv7 and g1.cv3 with g2.cv4.

4.2.2 Intra-structure similarity

Frequently, more than one UMCSS yields the same inter-structure similarity value. For example, as shown in Fig. 20, there are umcss1 and umcss2 between ucg1 and ucg2 with |umcss1| = |umcss2|, so the same value is obtained whichever is used to calculate the inter-structure similarity. In this case, the intra-structure similarity decides which of umcss1 and umcss2 is chosen as the final answer.

Fig. 20 MCSS cases

In this paper, we apply the UCG edit distance discussed in Sect. 4.1.2 to the intra-structure similarity measure. The intra-structure similarity builds on the inter-structure similarity and is captured from three aspects: attribute vertex (edge), operation vertex (edge) and parameter vertex (edge). To limit its value to [0, 1], the intra-structure similarity is defined as follows.

$$\begin{aligned} SimIntra\left( {g_{1} ,g_{1}^{\prime } } \right) & = \alpha *\left( {1 - \frac{{\left( {x_{1} + y_{1} } \right)*\text{IC}_{1} }}{{\sum_{{mcsg_{i} \in g_{1} ,mcsg_{j} \in g_{1}^{\prime } }} \sum_{{\text{AV}_{i} \in mcsg_{i} , \text{AV}_{j} \in mcsg_{j} }} \max \left( {\left| {\text{AV}_{i} } \right|,\left| {\text{AV}_{j} } \right|} \right)}}} \right) \\ & \quad + \,\beta *\left( {1 - \frac{{\left( {x_{2} + y_{2} } \right)*\text{IC}_{2} }}{{\sum_{{mcsg_{i} \in g_{1} ,mcsg_{j} \in g_{1}^{\prime } }} \sum_{{\text{OV}_{i} \in mcsg_{i} , \text{OV}_{j} \in mcsg_{j} }} \max \left( {\left| {\text{OV}_{i} } \right|,\left| {\text{OV}_{j} } \right|} \right)}}} \right) \\ & \quad + \,\gamma *\left( {1 - \frac{{\left( {x_{3} + y_{3} } \right)*\text{IC}_{3} }}{{\sum_{{mcsg_{i} \in g_{1} ,mcsg_{j} \in g_{1}^{\prime } }} \sum_{{\text{OV}_{i} \in mcsg_{i} , \text{OV}_{j} \in mcsg_{j} }} \sum_{{\text{PV}_{ik} \in \text{OV}_{i} ,\text{PV}_{jw} \in \text{OV}_{j} }} \max \left( {\left| {\text{PV}_{ik} } \right|,\left| {\text{PV}_{jw} } \right|} \right)}}} \right) \\ \end{aligned}$$
(7)

Here, g1 and g1′ are a matching pair in UMCSSi, and they are from ucg1 and ucg2, respectively. The parameters α, β and γ are weighting factors (α + β + γ = 1) that identify the weight of each part in the intra-structure similarity. Generally, α is close to β and both are larger than γ. They are determined by the importance of the attributes, operations and parameters contained in a class. The edit cost of every edit operation is set to 1, i.e., IC1 = 1, IC2 = 1 and IC3 = 1. That is, the edit distance is measured only by the number of times the specified edit operations are applied.
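The structure of formula (7) can be sketched in Python as three terms of the same shape. In this sketch, x and y are the edit-operation counts of formula (7), whose definitions come from the UCG edit distance in Sect. 4.1.2 and are not repeated here. The function names `part_score` and `sim_intra`, the handling of an empty denominator, and all input numbers are assumptions made for illustration; they are not values from Tables 3 and 4.

```python
def part_score(x, y, cost, matched_counts):
    """One bracketed term of formula (7).

    matched_counts lists, for every matched pair, the two vertex counts
    whose maximum enters the denominator.
    """
    denominator = sum(max(a, b) for a, b in matched_counts)
    if denominator == 0:      # assumption: nothing to compare counts as identical
        return 1.0
    return 1.0 - (x + y) * cost / denominator


def sim_intra(alpha, beta, gamma, attr, oper, param):
    """Weighted sum of the attribute, operation and parameter terms."""
    return (alpha * part_score(*attr)
            + beta * part_score(*oper)
            + gamma * part_score(*param))


# Made-up inputs: (x, y, edit cost, vertex counts of the matched pairs).
attr = (1, 1, 1, [(4, 5), (5, 3)])    # 1 - 2/10 = 0.8
oper = (1, 0, 1, [(2, 3), (2, 1)])    # 1 - 1/5  = 0.8
param = (0, 0, 1, [(2, 2), (1, 2)])   # 1 - 0/4  = 1.0
print(sim_intra(0.4, 0.5, 0.1, attr, oper, param))
```

With the weights 0.4, 0.5 and 0.1, these inputs give a value of about 0.82.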

In the following, we use formula (7) to calculate the intra-structure similarity of UMCSS2 in Fig. 19 and obtain the following result.

$$SimIntra\left( g_{1}, g_{2} \right) = 0.4*0.8065 + 0.5*0.8571 + 0.1*0.8500 = 0.8362$$

Here, α, β and γ are set to 0.4, 0.5 and 0.1, respectively. When matching pairs 2 and 3 are adjusted as described above, another intra-structure similarity value, 0.7895, is obtained. The matching with the larger similarity value, 0.8362, is therefore accepted. The final structural similarity value between g1 and g2 is:

$$Sim\left( {g_{1} ,g_{2} } \right) = 0.90*0.8000 + 0.10*0.8362 = 0.8036$$

Here, the weighting factor θ is set to be 0.9.
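The arithmetic of this worked example can be re-checked with a few lines of Python. The sketch takes the weight of the intra-structure part to be 1 − θ, which matches the factor 0.10 used above; the function names `sim_intra_from_scores` and `sim_total` are illustrative.

```python
def sim_intra_from_scores(alpha, beta, gamma, attr_score, oper_score, param_score):
    """Weighted sum of the three bracketed terms of formula (7)."""
    return alpha * attr_score + beta * oper_score + gamma * param_score


def sim_total(theta, inter, intra):
    """Combine both parts; the intra weight is assumed to be 1 - theta."""
    return theta * inter + (1 - theta) * intra


intra = sim_intra_from_scores(0.4, 0.5, 0.1, 0.8065, 0.8571, 0.8500)
total = sim_total(0.9, 0.80, 0.8362)
print(f"{intra:.5f} {total:.5f}")  # approximately 0.83615 and 0.80362
```

Both values agree with the figures reported above after rounding to four decimal places.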

5 Experiment

In this section, we design an experiment to evaluate the proposed approach. A prototype system was implemented in Java and run on a computer (Intel i5 CPU at 2.5 GHz, 8 GB RAM) under Windows 7. Microsoft SQL Server 2008 is used to store the UML class diagrams for the experiment. The experiment is intended to show that:

(1) our proposed approach is suitable for UML class diagrams with various sizes,

(2) our proposed approach is not limited by the modeling field, and

(3) our proposed approach is more accurate than other methods.

5.1 Experimental Data

The class diagrams used in the experiment come from projects developed by software companies and are divided into two parts: query class diagrams and target class diagrams. We calculate the structural similarity values between the query class diagrams and the target class diagrams. The class diagrams used in the experiment are described in Table 6.

Table 6 The description of class diagrams used in the experiment

All query class diagrams are from the same domain, “Education,” and are classified into two categories based on size. The sizes of the query class diagrams in the first category, denoted as QC1, vary from 10 to 15, and those in the second category, denoted as QC2, vary from 20 to 25. Each category contains 5 query class diagrams. The target class diagrams are partitioned from two different perspectives. Viewed from the modeling field, they are divided into two categories of 15 class diagrams each. In the first category, denoted as TFC1, all target class diagrams are from “Education” and describe the same or similar projects as the query class diagrams. In the second category, denoted as TFC2, the target class diagrams are from “Company,” a modeling field that is completely different from that of the first category, although the diagrams are still similar in structure. Viewed from size, the target class diagrams are again divided into two categories of 15 class diagrams each: the sizes in the first category, denoted as TSC1, vary from 10 to 15, and the sizes in the second category, denoted as TSC2, vary from 20 to 25.

5.2 Results analysis

In the experiment, we applied three structure (relationship) similarity measure methods: semantics-based relationship matching (Semantics for short), model query language-based pattern matching (Query Language for short) and our proposed approach (MCSS for short). The first two methods are described in [15, 27]. Each query class diagram is matched to all target class diagrams, and all structural similarities are calculated by these three methods. In MCSS, the weighting factors θ, α, β and γ are set to 0.9, 0.4, 0.5 and 0.1, respectively. In the semantics-based method, the weights of the relationship type and the end class are both set to 0.5 when a relationship is matched.

To assess the three methods, we also invited five experts, all software engineers with rich experience in software design. The experts were asked to compare the query class diagrams with the target class diagrams and to answer the same question for each comparison: “how structurally similar are these two class diagrams?”. For each comparison, an expert provided a value in [0, 1] indicating the degree of structural similarity of the two class diagrams, where 0 means that the two compared models are completely different and 1 means that they are identical. With 10 query models in two categories and 30 target models, each expert made 300 comparisons. Finally, we compared the results obtained by the three methods with the results given by the experts. To avoid listing large amounts of data, the similarity values obtained by matching a set of query class diagrams to a target class diagram are averaged.
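The averaging step, and one possible way to quantify closeness to the expert ratings, can be sketched in Python as follows. The text does not state which closeness measure is used, so the mean absolute difference below is an assumption, as are the function names `average_per_target` and `mean_abs_gap` and every number in the example.

```python
from statistics import mean


def average_per_target(scores):
    """scores[q][t] is the similarity of query q to target t; average over queries."""
    return [mean(column) for column in zip(*scores)]


def mean_abs_gap(method_avg, expert_avg):
    """Assumed closeness measure: mean absolute difference per target."""
    return mean(abs(m - e) for m, e in zip(method_avg, expert_avg))


# Made-up data: 2 query diagrams matched against 3 target diagrams.
method_scores = [[0.8, 0.6, 0.4],
                 [0.6, 0.4, 0.2]]
expert_avg = [0.75, 0.50, 0.25]       # hypothetical expert ratings per target

method_avg = average_per_target(method_scores)
print([round(v, 2) for v in method_avg])
print(round(mean_abs_gap(method_avg, expert_avg), 4))
```

Under this measure, a smaller gap means that a method agrees more closely with the experts.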

For query class diagrams and target class diagrams from the same modeling field (Figs. 21 and 22), the results obtained by the three methods are close, except for individual values. This is to be expected: because the query and target class diagrams describe the same or similar projects, most structural similarity values are high (≥ 0.5) and only a few are low (≤ 0.3). In particular, Fig. 21 shows that the structural similarity values are almost the same, which can be explained by the small size of the query class diagrams: in the same modeling field, there are no common substructures beyond the maximum common substructure.

Fig. 21 Structural similarity between QC1 and TFC1

Fig. 22 Structural similarity between QC2 and TFC1

Figs. 23 and 24 show, however, that the results of the three methods differ significantly when the modeling fields differ. The results of the semantics method are significantly smaller than those of the other two methods. The reason is that the semantics method considers both the relationship type and the end classes when a relationship is matched: the low semantic similarity between class names from different modeling domains leads to low similarity values, and most structural similarity values obtained by the semantics method are low (≤ 0.5). The semantics method is therefore severely affected by the modeling field, although it gives almost the same results as the query language method when the query and target class diagrams are from the same domain, regardless of the size of the class diagrams being matched.

Fig. 23 Structural similarity between QC1 and TFC2

Fig. 24 Structural similarity between QC2 and TFC2

The query language method, however, is affected by the size of the class diagrams being matched. When the matched class diagrams are small and close in size, Fig. 25 shows that the query language method and MCSS give almost the same results. For large class diagrams, however, Fig. 26 shows that the results of the two methods differ significantly, and for some matched pairs the values obtained with MCSS are higher than those obtained with the query language method. The reason is that MCSS considers further common substructures between the matched class diagrams in addition to the maximum common substructure, which is all the query language method considers. The results of the semantics-based method are not shown here because that method is affected by the modeling domain rather than by the size of the class diagrams.

Fig. 25 Structural similarity between QC1 and TSC1

Fig. 26 Structural similarity between QC2 and TSC2

The above experimental results indicate that our proposed algorithm is applicable to UML class diagrams of the sizes and modeling fields tested. As shown in Figs. 27 and 28, for both categories of query class diagrams, the results obtained by MCSS are closer to the results given by the experts than those of the other two methods.

Fig. 27 Structural similarity between QC1 and (TFC1 + TFC2)

Fig. 28 Structural similarity between QC2 and (TFC1 + TFC2)

6 Conclusions

In software reuse, the reuse of UML class diagrams produced in the design phase has become a major concern. Existing work on the reuse of class diagrams mainly focuses on semantic reuse, and structural reuse has received little attention. This paper proposes reusing class diagrams from another perspective, namely, structure. The core of structural reuse is the structural similarity measure. We propose to represent a UML class diagram as a UML class graph for the purpose of structural similarity measure. The structure is considered from two aspects: inter-structure and intra-structure. A UMCSS-based algorithm is proposed for the inter-structure similarity, and the UCG edit distance is proposed and applied to the intra-structure similarity. The experimental results show that the proposed method is effective and that its results are closer to those given by experts. Note that we do not claim this approach to be a paradigm in conceptual modeling; it is only one of the ways available for conceptual modeling.

In our future work, we will investigate several issues. First, improving the efficiency of the similarity measure is an important concern. Because a UML class diagram consists of various relationships, filtering on some feature values may reduce the number of comparisons. Second, we will consider other methods, e.g., unit structural matching: a UML class graph can be split into unit structures, and the final structural similarity can be obtained by merging the unit structure similarities. Third, transforming UML class diagrams into other data models (e.g., an XML model) may be a possible way to measure structural similarity. Finally, to improve the matching accuracy, we will consider combining the structural similarity and the semantic similarity for reuse.