
1 Introduction

Recent advances in science and technology, including computer and information technologies, have made a dramatic growth of experimental data volumes, together with the associated storage and processing problems, one of the main trends in modern science. Clearly, further successful development of research projects is possible only if the scientific community learns to process and analyze extremely large amounts of data and to extract new knowledge from them. Formalization of unstructured data is one way to address the problem of big data processing; to this end we may apply formal ontologies, a modern paradigm of computational resources that describe knowledge about the world and about subject domains.

Many Russian and foreign researchers have investigated the application of ontologies to the processing and analysis of big data. Nevertheless, the growth of unstructured information flows and the need to improve the quality of its analysis and processing in information systems require the development of new methods for effective processing of big data from various domains. The accumulation of shared ontologies is seen as a mechanism for unlimited accumulation of knowledge about the world. At present, the problem of comparing and matching ontologies at the level of alignment, i.e. finding semantic correspondences between the elements of two independently developed ontologies, has not yet been solved. The ontology alignment problem is to find a structure and admissible parameters that provide optimal values of one or more quality criteria.

The purpose of this paper is to analyze the use of the ontological approach for big data processing and to develop a method for integrating heterogeneous ontological models based on an evolutionary approach.

In this paper we propose a combined method for ontology alignment based on the semantic similarity of concepts and multi-objective optimization of the similarity weights. The modification of this method is the application of a swarm intelligence algorithm for finding the weighting factors. The main advantages of the proposed approach are the identification of key concepts and the elimination of the subjectivity of their descriptions and of the dependence on the ontology developers' point of view. The generalized concept comparison operation, together with the parsing and sorting algorithms, improves the quality of the ontology alignment procedure and thereby enables the interaction of heterogeneous information systems. The fundamental difference of the proposed approach is that it obtains optimal weights on the basis of which the optimal alignment of ontologies is carried out. The performed computations validate the effectiveness of the proposed method.

2 The Problem of Unstructured Information Integration

The concept «Big Data» refers to data sets of such extremely large volume and complexity that standard tools cannot capture, store, manage, and process them within a time acceptable in practice. Big data is characterized by parameters such as [1]:

  • volume: big data doesn’t sample; it just observes and tracks what happens;

  • velocity: big data is often available in real-time;

  • variety: big data draws from text, images, audio, video; plus it completes missing pieces through data fusion;

  • validity: the property of being genuine, a true reflection of attitudes, behavior, or characteristics.

Unstructured information is information that either does not have a predetermined data structure or is not organized in a given order [1]. An ontology is a formal explicit description of the classes (concepts) in a domain, the properties and attributes of each concept (slots), and the restrictions imposed on slots (facets) [3]. Developing the domain structure (ontology) is the first step in bringing unstructured information to a structured form. Each individual domain covers only a subset of the unstructured data set, so for the best possible coverage of the data and, consequently, a more complete analysis, it is necessary to identify the maximum possible number of distinct domains to be analyzed [4].

Nowadays heterogeneous information systems accumulate a considerable amount of knowledge. When these systems are integrated, the problem of classifying and structurally representing knowledge from different domains arises [5]. The different contexts of ontologies created by different communities are reflected in different approaches to specifying concepts, which has become one of the causes of heterogeneity. As a result, concepts whose semantics coincide in context may be described by different ontologies in different ways: in structure, constraints, and level of detail.

The linguistic approach to ontology integration involves the creation of a formal upper-level ontology whose interaction with other ontologies is implemented on the basis of linguistic relations. Linguistic relations of such ontological models, such as synonym_of (synonym), hyponym_of (hyponym), overlap_of (overlap), and others, allow the mapping of terms to be implemented formally [6]. The disadvantage of this approach is that linguistic relationships do not always adequately reflect semantics because of the ambiguity of linguistic variables.

The ontology integration approach based on a shared vocabulary allows building an integrated model of different domains of knowledge, because almost any notion of the vocabulary can be associated with any other term. However, in this case the integration of ontologies is typically performed with some restrictions and hints from the ontology developer. This approach is implemented by reviewing the two ontologies, finding synonyms in them, resolving conflicts, and creating a third ontology.

Integration of heterogeneous ontologies may also be performed based on the alignment of instances: the semantic relationships between two classes of heterogeneous ontologies are merged on the basis of the intersection of their instance sets. Typically, ontology classes are described by multiple instances, which allows the semantics of a class to be defined more precisely. Therefore, merging ontologies on the basis of instances is more effective [7].

The main disadvantage of the majority of unstructured data fusion methods is the need to engage an expert to confirm the correctness of the detected similarities and differences of semantic concepts. Thus, the ontological approach provides a new level of information integration. For a semantically correct interconnection of heterogeneous information systems, it is necessary to compare ontologies and to find their differences and similarities. This problem is solved by techniques for measuring the semantic similarity of ontology concepts.

3 Ontology Integration Based on Semantic Similarity

An approach is suggested for integrating unstructured data based on comparing concepts, their attributes, and the relationships between concepts at the level of ontology alignment [8]. Each concept of the domain ontology is defined as a unit of knowledge and is identified by a name and a type. We define a concept as [8]

$$ C_{i} = (N_{i} ,T_{i} ), $$
(1)

where

  • \( N_{i} \) – a unique name (identifier) of the i-th concept;

  • \( T_{i} \) – a type of the i-th concept.

Let \( C = \{ C_{i} |i = 1,2, \ldots ,n\} \) be a set of concepts and \( {\text{R}} = \left\{ {{\text{R}}_{1} ,{\text{R}}_{2} ,{\text{R}}_{3} } \right\} \) a set of relations between concepts, where:

  • \( R_{1} \) – relation of inheritance (the «class-subclass» relation), \( R_{1} (C_{1} ,C_{2} ) \), where \( C_{1} \) is a superclass of \( C_{2} \);

  • \( R_{2} \) – relation of aggregation (the «whole-part» relation), \( R_{2} (C_{1} ,A^{\prime } ) \): the attributes of concept \( C_{1} \) are included in the set of attributes of all concepts \( A^{\prime } \);

  • \( R_{3} \) – relation of association (a semantic relation), which is transitive.

Let us consider the following expression of a formal ontology [9]:

$$ {\text{OHT}} = (C,P,R,A), $$
(2)

where

  • C – the set of concepts (or classes) for a specific domain;

  • P – the set of concept attributes (properties). A property is a component of the relation p(c,v,f), where c ∈ C is an ontology concept, v is a property value associated with c, and f defines restrictions on the facets of v. Restrictions include type, cardinality, and range;

  • R = {r | r ⊆ C × C × \( R_{t} \)} – the set of binary relations between concepts in C. The possible relation arities are 1:1, 1:many, and many:many. The basic set of relations is: synonym_of, kind_of, part_of, instance_of, property_of;

  • A – the set of axioms. An axiom is a rule that specifies a cause-and-effect relationship.
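
To make the formalization (1)-(2) concrete, the following sketch shows one possible in-memory representation; the class names, field layout, and the tiny example ontology are illustrative assumptions of ours, not part of the cited model:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Concept:
    name: str    # N_i: unique name (identifier) of the concept
    ctype: str   # T_i: type of the concept

@dataclass
class Ontology:
    concepts: set     # C: set of domain concepts
    properties: dict  # P: concept -> set of attribute (slot) names
    relations: set    # R: triples (c1, c2, relation_type)
    axioms: list      # A: axioms (rules), kept abstract here

# Example: a tiny ontology with a single kind_of (inheritance) relation.
person = Concept("Person", "class")
student = Concept("Student", "class")
onto = Ontology(
    concepts={person, student},
    properties={person: {"name"}, student: {"name", "university"}},
    relations={(student, person, "kind_of")},
    axioms=[],
)
```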

The problem of heterogeneous ontology combination is formulated as follows: given two regular ontologies, create a third regular ontology that contains the concepts of the input ontologies, as well as additional restrictions and relationships if they are required. Building a mapping of ontology \( O_{1} \) onto ontology \( O_{2} \) means finding, for each concept of ontology \( O_{1} \), a similar concept of ontology \( O_{2} \).

Different ontologies may have overlapping sets of attributes, relations, and concepts. The resulting ontology is developed on the basis of the multiple source ontologies and maintains their specifications in such a way that it includes all possible relations between concepts and contains no equivalent (duplicate) concepts; thus the mappings of identical concepts coincide. The resulting ontology defines the correspondences between concepts and the interpretation rules that allow their interaction to be established successfully. The purpose of the integration of unstructured data is to maintain the compliance of the set of ontologies with the defined set of semantic relations. A semantic relation defined on ontology O is taken as a z-predicate set on O′; if a semantic relation z holds in ontology O, we write z(O).

Initially, heterogeneous ontologies are not associated with each other, so we need to find semantically similar elements of the ontologies. For the numerical evaluation of the semantic similarity of ontology concepts, an approach based on the studies of A.F. Tuzovskiy was chosen [10]. In this method the similarity measure consists of three components: attributive, taxonomic, and relational measures. The method has been adapted for calculating the semantic similarity of two heterogeneous ontologies; our modification is the application of a particle swarm algorithm for finding the weights. The lexical component is calculated as the ratio of the intersection of the sets of words (synonyms) to their union.

Let \( S^{T} \left( {c_{i} , c_{j} } \right), S^{R} \left( {c_{i} , c_{j} } \right),S^{A} \left( {c_{i} , c_{j} } \right) \) be the semantic similarity measures of two concepts based on, respectively, their lexical terms, their relations, and their attribute values. The weights t, r, α allow controlling how these components are combined into the overall semantic similarity of two concepts.

To estimate the lexical similarity \( S^{T} \left( {c_{i} , c_{j} } \right) \) of two concepts, the sets of concept terms \( {\text{PL}}_{p} \left( {c_{i} } \right) \) and \( {\text{PL}}_{p} \left( {c_{j} } \right) \) are compared and their common and distinct components are found; in line with the verbal definition above, the measure is the ratio of the intersection of the term sets to their union [8]:

$$ S^{T} \left( {c_{i} ,c_{j} } \right) = \frac{{\left| {PL_{p} \left( {c_{i} } \right) \cap PL_{p} \left( {c_{j} } \right)} \right|}}{{\left| {PL_{p} \left( {c_{i} } \right) \cup PL_{p} \left( {c_{j} } \right)} \right|}} $$
(3)

where \( PL_{p} \left( {c_{i} } \right) = \left\{ {L_{i} \in L|P_{c} \left( {c_{i} } \right) = L_{i} } \right\} \) is the set of lexical terms of concept \( c_{i} \).
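
A minimal sketch of this lexical measure, assuming the term sets of the two concepts have already been extracted (the function name and the representation of term sets as Python sets are ours):

```python
def lexical_similarity(terms_i: set, terms_j: set) -> float:
    """S^T(c_i, c_j): ratio of shared lexical terms (synonyms)
    to all terms of the two concepts, Eq. (3)."""
    if not terms_i and not terms_j:
        return 0.0  # convention for two empty term sets
    return len(terms_i & terms_j) / len(terms_i | terms_j)

# Example: two of three distinct terms are shared.
print(lexical_similarity({"car", "auto"}, {"auto", "car", "vehicle"}))  # 0.666...
```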

To estimate relation similarity, it is assumed that if two concepts have a similar relation to a third concept, they are more similar than two concepts having different relations. Let us assume that [8]

$$ {\text{C}}_{r} \left( {c_{i} } \right) = \left\{ {c_{j} \in C|R_{1} \left( {c_{i} ,c_{j} } \right) \vee R_{2} \left( {c_{i} ,c_{j} } \right) \vee R_{3} \left( {c_{i} ,c_{j} } \right) \vee c_{j} = c_{i} } \right\} $$
(4)

− is the set containing the concepts related to \( c_{i} \) by relations \( R_{1} \), \( R_{2} \), \( R_{3} \), together with \( c_{i} \) itself.

Define the association relation of concepts as [8]

$$ R_{A} (c_{j} ) = \left\{ {c_{i} :c_{i} \in C_{r} (c_{j} )} \right\}. $$
(5)

Calculate the sum of the lexical similarity values over the sets of concepts \( R_{A} \left( {c_{i} } \right) \) and \( R_{A} \left( {c_{j} } \right) \):

$$ S_{RA} \left( {R_{A} \left( {c_{i} } \right),R_{A} \left( {c_{j} } \right)} \right) = \mathop \sum\nolimits_{{c_{i} \in R_{A} \left( {c_{i} } \right), c_{j} \in R_{A} \left( {c_{j} } \right)}} S^{T} \left( {c_{i} ,c_{j} } \right) $$
(6)

The relation similarity measure \( S^{R} \left( {c_{i} , c_{j} } \right) \) allows evaluating the similarity of two concepts based on the similarity of the concepts in the sets \( R_{A} \left( {c_{i} } \right) \) and \( R_{A} \left( {c_{j} } \right) \) [8].
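
A sketch of the relational measure of Eqs. (4)-(6), reusing lexical_similarity from the previous sketch; here terms_of maps each concept to its term set, and the normalization by the number of compared pairs is our assumption, introduced so that \( S^{R} \) stays within [0, 1]:

```python
def related_concepts(c, relations):
    """C_r(c): concepts linked to c by R1, R2 or R3, plus c itself, Eq. (4)."""
    nbrs = {c}
    for c1, c2, _rtype in relations:  # relations are (c1, c2, type) triples
        if c1 == c:
            nbrs.add(c2)
        elif c2 == c:
            nbrs.add(c1)
    return nbrs

def relational_similarity(ci, cj, relations, terms_of):
    """S^R(c_i, c_j): lexical similarity summed over the neighbourhoods
    R_A(c_i) and R_A(c_j), Eqs. (5)-(6), averaged over all compared pairs."""
    ra_i = related_concepts(ci, relations)
    ra_j = related_concepts(cj, relations)
    total = sum(lexical_similarity(terms_of[a], terms_of[b])
                for a in ra_i for b in ra_j)
    return total / (len(ra_i) * len(ra_j))  # normalization: our assumption
```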

Now let us compare the attributes of two concepts. The set of attributes pertaining to concept \( c_{i} \) is

$$ A^{Ci} = \left\{ {A_{k}^{Ci} , k \in \left[ {1 \ldots n_{1} } \right]} \right\}, $$
(7)

where \( n_{1} \) is the number of attributes of concept \( c_{i} \). Similarly, the set of attributes of concept \( c_{j} \) is

$$ A^{Cj} = \left\{ {A_{k}^{Cj} , k \in \left[ {1 \ldots n_{2} } \right]} \right\}, $$
(8)

where \( n_{2} \) is the number of attributes of concept \( c_{j} \).

The attributive similarity measure \( S^{A} \left( {c_{i} , c_{j} } \right) \) of concepts \( c_{i} \) and \( c_{j} \) is calculated by matching their common attributes \( A^{{C_{i}}} \cap A^{{C_{j}}} \). The measure \( S^{A} \left( {c_{i} , c_{j} } \right) \) satisfies the axioms of independence and resolvability and is defined by the expression

$$ S^{A} \left( {c_{i} , c_{j} } \right) = \frac{{\left| {A^{{C_{i}}} \cap A^{{C_{j}}} } \right|}}{{\left| {A^{{C_{i}}} \cup A^{{C_{j}}} } \right|}}, $$
(9)

where \( A^{{C_{i}}} \) is the set of attributes of concept \( c_{i} \), and \( A^{{C_{j}}} \) is the set of attributes of concept \( c_{j} \).
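
Equation (9) is again an intersection-over-union ratio, this time over attribute sets; a direct sketch (the handling of two attribute-free concepts is our convention):

```python
def attributive_similarity(attrs_i: set, attrs_j: set) -> float:
    """S^A(c_i, c_j): share of common attributes, Eq. (9)."""
    if not attrs_i and not attrs_j:
        return 0.0  # convention for two attribute-free concepts
    return len(attrs_i & attrs_j) / len(attrs_i | attrs_j)
```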

The similarity measure \( S\left( {c_{i} , c_{j} } \right) \) of concept \( c_{i} \) of ontology O and concept \( c_{j} \) of ontology O′ is defined as

$$ S\left( {c_{i} , c_{j} } \right) = t \cdot S^{T} \left( {c_{i} ,c_{j} } \right) + r \cdot S^{R} \left( {c_{i} ,c_{j} } \right) + \alpha \cdot S^{A} \left( {c_{i} ,c_{j} } \right) $$
(10)

where t, r, α are the coefficients defining the importance of the similarity measures \( S^{T} \left( {c_{i} , c_{j} } \right), S^{R} \left( {c_{i} , c_{j} } \right),S^{A} \left( {c_{i} , c_{j} } \right) \), respectively,

$$ t,r,\alpha \in \left[ {0;1} \right], t + r + \alpha = 1, S\left( {c_{i} ,c_{j} } \right) \in \left[ {0;1} \right]. $$
(11)
$$ \left\{ {\begin{array}{*{20}c} {S\left( {c_{i} ,c_{j} } \right) = 1,\quad if\, concepts\, are\, equivalent,} \\ {S\left( {c_{i} ,c_{j} } \right) = 0,\quad if\, concepts \,are \,different.} \\ \end{array} } \right. $$
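
To illustrate how (10) and (11) combine the three components, the following sketch forms the convex combination; the particular weight values here are placeholders, since the optimal weights are found by the swarm algorithm of the next section:

```python
def combined_similarity(s_t, s_r, s_a, t=0.4, r=0.3, alpha=0.3):
    """S(c_i, c_j) = t*S^T + r*S^R + alpha*S^A, Eq. (10),
    subject to t + r + alpha = 1 and all weights in [0, 1], Eq. (11)."""
    assert abs(t + r + alpha - 1.0) < 1e-9, "weights must sum to 1"
    return t * s_t + r * s_r + alpha * s_a

# Example: moderately similar lexically, weakly by relations and attributes.
print(combined_similarity(0.8, 0.4, 0.5))  # 0.59
```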

Heterogeneous ontology integration problems belong to the class of NP-hard optimization problems and can be solved by evolutionary algorithms.

4 Multi-objective Optimization of Similarity Weights Calculation

Consider the modified swarm intelligence method for ontology alignment using a multi-objective optimization approach [11]. The algorithm for optimizing the similarity weights by particle swarm intelligence is depicted in Fig. 1 [12].

Fig. 1. Algorithm for optimization of similarity weights by particle swarm intelligence

In this work we propose to apply a PSO calculation based on multi-objective optimization. Multi-objective (multi-criteria) optimization is the process of simultaneously optimizing two or more conflicting objective functions over a given domain. The multi-criteria optimization task is formulated as follows [13]:

$$ \mathop {\hbox{min} }\limits_{{\vec{x}}} \left\{ {f_{1} \left( {\vec{x}} \right),f_{2} \left( {\vec{x}} \right), \ldots ,f_{k} \left( {\vec{x}} \right)} \right\}, \quad \vec{x} \in S $$
(12)

where \( f_{i} :R^{n} \to R \), \( i = 1, \ldots ,k \) \( \left( {k \ge 2} \right) \), are the objective functions. The solution vectors \( \vec{x} = \left( {x_{1} ,x_{2} , \ldots ,x_{n} } \right)^{T} \) belong to a non-empty feasible set \( S \).

The multi-objective optimization task is to find a vector of decision variables that satisfies the imposed constraints and optimizes a vector function whose elements correspond to the objective functions. These functions form a mathematical description of the quality criteria [14].
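
Because the objectives in (12) conflict, candidate solutions are compared by Pareto dominance rather than by a single scalar; a minimal sketch of the dominance test used later for selecting pbest (the function name is ours):

```python
def dominates(f_x, f_y):
    """True if objective vector f_x Pareto-dominates f_y (minimization):
    f_x is no worse in every objective and strictly better in at least one."""
    return (all(a <= b for a, b in zip(f_x, f_y))
            and any(a < b for a, b in zip(f_x, f_y)))
```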

Consider a data set in which the rows are different similarity coefficients and the columns are the relations between two different ontologies. To subsequently combine these similarity coefficients into one metric, optimal weights must be obtained. The proposed approach makes it possible to find the set of weights that meets the similarity criteria and thus yields an optimal alignment. In the course of the swarm intelligence evaluation, the generalized function \( f_{\text{integ}} \) is calculated:

$$ f_{\text{integ}} \left( {O1_{i} ,O2_{i} } \right) = \mathop \sum \limits_{k = 1}^{7} w_{k} \times F_{k} \left( {salign_{ij} } \right), $$
(13)

where \( \mathop{\sum}_{k = 1}^{7} w_{k} = 1. \)

If \( f_{integ} \left( {O1_{i} ,O2_{i} } \right) \) exceeds a threshold, then \( salign_{ij} \) is a valid alignment. In this way all valid alignments are determined. Subsequently, the objective functions are calculated using these valid alignments and the reference alignments.
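
A sketch of the validity test built on (13): the seven similarity scores of a candidate correspondence are folded into \( f_{\text{integ}} \) using the particle's weights and compared with a threshold (0.5, following the parameter settings below); the function name is ours:

```python
def is_valid_alignment(scores, weights, threshold=0.5):
    """Eq. (13): f_integ = sum_k w_k * F_k(salign_ij), weights summing to 1.
    The correspondence salign_ij is accepted as valid if f_integ > threshold."""
    f_integ = sum(w * s for w, s in zip(weights, scores))
    return f_integ > threshold

# Example: seven similarity measures with uniform weights.
print(is_valid_alignment([0.9, 0.7, 0.8, 0.6, 0.5, 0.4, 0.9], [1 / 7] * 7))  # True
```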

The method consists of the following stages.

  1. (A) Initialization. The population is called a swarm, and it is composed of m particles (candidate solutions). Each particle has n positions, or cells, comprising n weighting coefficients corresponding to n different similarity measures.

    Initially, a value from 0 to 1 is selected randomly for each cell of every particle. Once the initial swarm has been selected, the corresponding fitness values are calculated. The initial velocity of each cell of every particle is zero. The inputs to the proposed method are a swarm of 50 particles and the weighting factors c1 and c2. The threshold value is chosen to be 0.5. The algorithm is performed for 30 iterations.

  2. (B) Objective function. The proposed approach works with two objective functions: the accuracy (precision) and the recall of the search. Accuracy is the proportion of correct correspondences among those found in the resulting alignment. Recall is the proportion of correspondences from the given reference alignment that are actually found. The criterion «accuracy» is calculated by the following formula:

    $$ P = \frac{{\left| {A \cap R} \right|}}{{\left| A \right|}} $$
    (14)

    The criterion of «recall» (quantitative parameter of the results of information retrieval, which is determined by dividing the amount granted as a result of the search of relevant concepts to the total number of relevant concepts present in the ontological model) is calculated by the following formula

    $$ R = \frac{{\left| {A \cap R} \right|}}{{\left| R \right|}} $$
    (15)

    As the proposed multi-objective particle swarm optimization is implemented as a minimization problem, the first objective is computed as (1 - precision) and the second objective as (1 - recall); a sketch of these objectives is given after this list.

  3. (C) Next generation swarm is produced by evaluating the position and velocity. Each cell (position) represents a weight (the normalized value of the cell) for the corresponding similarity measure. The cells of a particle contain values from 0 to 1, and the initial velocity of each cell is zero. Using the information obtained at the previous step, the position and velocity of each particle are updated. Each particle keeps track of the best position it has reached, called pbest; in the multi-criteria setting, the position whose objective vector dominates the others is selected as pbest. The best position among all particles is called the global best, or gbest. When a particle moves to a new position, its position and velocity change in accordance with Eqs. (16) and (17) [13]:

    $$ v_{ij} \left( {t + 1} \right) = w \times v_{ij} \left( t \right) + c_{1} \cdot r_{1} \cdot \left( {pbest_{ij} \left( t \right) - x_{ij} \left( t \right)} \right) + c_{2} \cdot r_{2} \cdot \left( {gbest_{ij} \left( t \right) - x_{ij} \left( t \right)} \right) $$
    (16)
    $$ x_{ij} \left({t + 1} \right) = x_{ij} \left(t \right) + v_{ij} \left({t + 1} \right) $$
    (17)

    where t is the iteration (time) index and the index ij denotes the j-th cell of the i-th particle. The velocity \( v_{ij} \left( {t + 1} \right) \) is calculated from the previous velocity \( v_{ij} \left( t \right) \), pbest, and gbest. The new position \( x_{ij} \left( {t + 1} \right) \) is then obtained by adding the new velocity to the current position \( x_{ij} \left( t \right) \). The coefficients c1 and c2 are set to 2; r1 and r2 are random values from the range [0, 1].
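
As referenced in stage (B), a minimal sketch of the two minimized objectives, computed from the set of valid alignments found (A) and the reference alignment set (R); the function name is an assumption of ours:

```python
def alignment_objectives(found: set, reference: set):
    """Objectives minimized by the PSO: (1 - precision, 1 - recall), where
    precision = |A ∩ R| / |A| and recall = |A ∩ R| / |R|, Eqs. (14)-(15)."""
    common = len(found & reference)
    precision = common / len(found) if found else 0.0
    recall = common / len(reference) if reference else 0.0
    return (1.0 - precision, 1.0 - recall)
```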

After applying non-dominated sorting and crowding-distance sorting to the archive, a local search is conducted to obtain a better approximation of the weights with respect to the optimal alignment. In the local search algorithm, the best particle replaces the worst particle of the new generation.
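
The loop below is a compact sketch of the whole method under the stated settings (a swarm of 50 particles, 30 iterations, c1 = c2 = 2, cells in [0, 1], zero initial velocities). The inertia weight w = 0.7, the normalization of cell values into weights, and the random selection of gbest from the non-dominated archive are our simplifying assumptions; crowding-distance sorting and the local search that replaces the worst particle are omitted for brevity:

```python
import random

def dominates(a, b):
    """a Pareto-dominates b (minimization), restated for self-containment."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pso_weight_search(n_weights, evaluate_objectives, swarm_size=50,
                      iterations=30, c1=2.0, c2=2.0, w=0.7):
    """Multi-objective PSO over similarity weights, Eqs. (16)-(17).
    evaluate_objectives(weights) should return (1 - precision, 1 - recall)."""
    def normalize(x):  # our assumption: rescale cells so the weights sum to 1
        s = sum(x)
        return [v / s for v in x] if s > 0 else x

    # Stage (A): cells are random values in [0, 1], velocities are zero.
    pos = [[random.random() for _ in range(n_weights)] for _ in range(swarm_size)]
    vel = [[0.0] * n_weights for _ in range(swarm_size)]
    pbest = [p[:] for p in pos]
    pbest_f = [evaluate_objectives(normalize(p)) for p in pos]
    archive = [(p[:], f) for p, f in zip(pos, pbest_f)]

    for _ in range(iterations):
        # Keep only non-dominated (weights, objectives) pairs in the archive.
        archive = [(p, f) for p, f in archive
                   if not any(dominates(g, f) for _, g in archive)]
        gbest = random.choice(archive)[0]  # simplified gbest selection
        # Stage (C): update the velocity and position of every cell.
        for i in range(swarm_size):
            for j in range(n_weights):
                r1, r2 = random.random(), random.random()
                vel[i][j] = (w * vel[i][j]
                             + c1 * r1 * (pbest[i][j] - pos[i][j])      # Eq. (16)
                             + c2 * r2 * (gbest[j] - pos[i][j]))
                pos[i][j] = min(1.0, max(0.0, pos[i][j] + vel[i][j]))   # Eq. (17)
            f = evaluate_objectives(normalize(pos[i]))
            if dominates(f, pbest_f[i]):  # pbest is replaced only if dominated
                pbest[i], pbest_f[i] = pos[i][:], f
            archive.append((pos[i][:], f))
    return archive  # approximation of the Pareto-optimal weight vectors
```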

5 Experimental Research

Experimental research performed with different numbers of ontology entities has shown that the algorithm has polynomial time complexity \( O(n^{2}) \). The time complexity of the algorithm is shown in Fig. 2.

Fig. 2. Diagram of the time complexity of the algorithm

We compared the suggested approach with single-objective optimization by accuracy and by recall (Table 1) [7].

Table 1. Results of experiments

The efficiency of the suggested approach for the criterion «accuracy» is 0.81428 (high). With respect to the F-measure, the table likewise shows that our method outperforms the single-objective versions. Therefore, the proposed method is effective.

6 Conclusion

The ontological approach to big data processing has been considered. The suggested approach to integrating unstructured data is based on comparing concepts, their attributes, and the relationships between concepts at the level of ontology alignment. Each concept of a domain ontology is defined as a unit of knowledge identified by a name and a type. The purpose of unstructured data integration is to maintain the compliance of a set of ontologies with a defined set of semantic relations. Heterogeneous ontology integration problems belong to the class of NP-hard optimization problems and can be solved by evolutionary algorithms; in this work we applied a PSO calculation based on multi-objective optimization. Experimental research performed with different numbers of ontology entities has shown that the algorithm has polynomial time complexity.

The main advantages of the proposed approach are the identification of key concepts and the elimination of the subjectivity of their descriptions and of the dependence on the ontology developers' point of view. The generalized concept comparison operation, together with the parsing and sorting algorithms, improves the quality of the ontology alignment procedure and thereby enables the interaction of heterogeneous information systems. The fundamental difference of the proposed approach is that it obtains optimal weights on the basis of which the optimal alignment of ontologies is carried out. The performed computations validate the effectiveness of the proposed method.