
1 Introduction

Presently, the strongly connected structure of graph data and its dynamic nature place an increasingly heavy load on storage technology [1]. The succession of problems that accompanies the processing of massive amounts of data has prompted the development of large-scale knowledge graphs [2], whose powerful semantic processing capabilities have made new solutions possible [3]. Building on the structural characteristics of dynamic knowledge fragments, this study exploits the advantages of a graph database in handling strongly relational data and fuses loose data fragments to establish a strongly linked knowledge graph network.

The core technology of distributed data storage is graph partitioning, which divides the graph into several subgraphs and then assigns these subgraphs to different computing nodes [4]. Many researchers have studied and improved partitioning methods. Combining the characteristics of the resource description framework (RDF) [5] graph data structure in the Semantic Web with parallel computing in a distributed environment, EAGRE [6] compresses the RDF data graph into an entity graph and uses the METIS algorithm to partition it. However, because current distributed storage of graph data is mostly based on static horizontal segmentation of files, traversing the graph during a query may require a large number of jump accesses between cluster partitions, which degrades query performance.

This study aims to minimize jump accesses between partitions, achieve a storage layout that enables fast queries, improve overall retrieval efficiency, and support real-time distributed storage for a dynamic graph database. As a demonstration, we constructed an urban safety knowledge graph, taking into consideration the characteristics of the graph data and the load-balancing requirements of distributed clusters. We applied the Metis+ algorithm to optimize the distributed storage process and the segmentation and storage of the original graph data. We then applied a balance strategy to store the real-time inflow of knowledge fragments. Finally, the effectiveness of the storage and retrieval algorithm was verified by experiments.

2 Proposed Method

This paper applies the Metis+ algorithm to the constructed urban safety knowledge graph. The approach is mainly divided into three phases: (1) mapping of the RDF data graph to the graph database; (2) distributed partitioning of the graph data; and (3) distributed storage of dynamic knowledge fragments. The distributed partitioning of the graph data is further divided into two parts: coarsening combined with the heavy edge matching (HEM) [7] edge fusion algorithm, and segmentation combined with the weighted leveled nested dissection (LND) algorithm.

2.1 Graph Coarsening Combined with the HEM Edge Fusion Algorithm

Suppose there are k partitions in the Neo4j [8] distributed cluster. The storage capacity of each partition is M, and thus the total cluster capacity is kM. P = {P(1), P(2), …, P(k)} denotes the set of current partition storage states, and |P(i)| denotes the total number of nodes in the partition with index i (1 ≤ i ≤ k).
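As a concrete illustration only (the variable names and values below are our own assumptions, not taken from the paper), this cluster state can be kept in a few simple Python structures:

```python
# Illustrative cluster state: k partitions, each with storage capacity M.
k, M = 4, 1_000_000          # example values, not from the paper
partition_sizes = [0] * k    # |P(1)|, ..., |P(k)|: current node count per partition
total_capacity = k * M       # total cluster capacity kM
```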

When the graph Gi = (Vi, Ei) is coarsened to the next-level graph Gi+1 = (Vi+1, Ei+1), matching is performed by preferentially selecting edges with larger weights, which removes more edge weight from the coarsened graph. The method finds a maximum-weight matching; that is, for each vertex U it finds, among all adjacent unmatched vertices, the vertex V that maximizes the weight of the edge euv. The complexity of this method is O(|E|). Figure 1 shows an example in which the label graph GL is initialized to the weighted undirected graph GL0.

Fig. 1. Conversion of the label graph GL to the weighted undirected graph GL0

In Fig. 1, the left part shows a partial label graph GL, and the right part shows the corresponding weighted undirected graph GL0. The total number of instance nodes under a label is the weight W(Vi) of the node Vi in the weighted undirected graph, and the total number of out-degree and in-degree instance links between the labels Vi and Vj is the weight W(eij) of the edge eij. The steps of the HEM edge fusion algorithm for the graph GL0 = (V0, E0) are as follows: (1) a fusion operation is performed on the vertices: find max(W(eij)) over all edges of GL0 and merge the vertices Vi and Vj into a new vertex Vij; (2) when the vertices are merged, the vertex weight is updated as W(Vij) = W(Vi) + W(Vj); (3) the edge eij connecting the vertices Vi and Vj is deleted; (4) given a threshold θ, the vertex fusion operation is performed iteratively until |Vm| < θ.
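As a rough illustration of these steps, the following Python sketch performs HEM-style edge fusion on a small weighted undirected graph; the dictionary-based representation and the function name are our own assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): heavy-edge fusion on a
# weighted undirected graph stored as plain dictionaries.
def hem_fuse(node_w, edge_w, theta):
    """Repeatedly merge the endpoints of the heaviest edge until
    fewer than `theta` vertices remain."""
    while len(node_w) >= theta and edge_w:
        # (1) pick the heaviest edge e_ij = argmax W(e_ij)
        (vi, vj), _ = max(edge_w.items(), key=lambda kv: kv[1])
        vij = f"{vi}+{vj}"                       # label of the merged vertex
        # (2) merged vertex weight: W(Vij) = W(Vi) + W(Vj)
        node_w[vij] = node_w.pop(vi) + node_w.pop(vj)
        # (3) delete e_ij and redirect the remaining edges to Vij,
        #     summing the weights of parallel edges created by the merge
        new_edges = {}
        for (a, b), w in edge_w.items():
            if {a, b} == {vi, vj}:
                continue                         # drop the fused edge
            a = vij if a in (vi, vj) else a
            b = vij if b in (vi, vj) else b
            key = tuple(sorted((a, b)))
            new_edges[key] = new_edges.get(key, 0) + w
        edge_w = new_edges
    return node_w, edge_w

# toy example: four labels with instance counts and instance-link counts
nodes = {"A": 10, "B": 7, "C": 5, "D": 3}
edges = {("A", "B"): 12, ("B", "C"): 4, ("C", "D"): 6, ("A", "D"): 2}
print(hem_fuse(nodes, edges, theta=3))
# -> ({'C': 5, 'D': 3, ...} fused down to {'A+B': 17, 'C+D': 8}, {('A+B', 'C+D'): 6})
```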

2.2 Multi-objective Weighted LND Segmentation Algorithm

After the vertex coarsening process, the original graph GL0 = (V0, E0) has been coarsened to GLm = (Vm, Em) in m steps. Because the leveled nested dissection (LND) partitioning algorithm does not consider vertex weights, this study aims, when performing a k-way partitioning of the graph GLm, to divide the total vertex weight of the graph into k partitions as evenly as possible while keeping the sum of the boundary edge weights within each sub-area as large as possible.

Definition 3-1.

(average vertex weight, Average_W) represents the vertex weight ideally assigned to each partition. It is calculated as follows:

$$ Average\_W = \Big[\sum\limits_{i = 0}^{n} W(V_{m\_i})\Big] \Big/ k $$
(1)

Definition 3-2.

(sum of vertex weights, Sum_W(Vi)) represents the sum of all vertex weights from 0 to label i (i ≥ 0). It is calculated as follows:

$$ Sum\_W(V_{i}) = \sum\limits_{k = 0}^{i} W(V_{k}) $$
(2)

Definition 3-3.

(sum of the maximum edge weights, Sum_BorderW(ei,i+1)): To obtain the minimum cut, when the next-hop label i + 1 is selected from the label i, the edge with the largest weight should be accumulated first. It is calculated as follows:

$$ Sum\_BorderW(e_{i,i+1}) = \max\limits_{j}\big(Border\_W(e_{i,i+1\_j})\big) $$
(3)

Definition 3-4.

(multi-objective optimization formula, APP(i, i + 1)): Minimizes the difference between the accumulated vertex weight and the average vertex weight. It is calculated as follows:

$$ APP(i, i+1) = \min\Big\{\big\{Sum\_W[Sum\_BorderW(e_{i,i+1})] + Sum\_W(V_{i})\big\} - Average\_W\Big\} $$
(4)
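The following Python sketch, based on our reading of Definitions 3-1 to 3-4, greedily grows each partition on the coarsened graph by always following the heaviest boundary edge and stopping once the accumulated vertex weight stops approaching Average_W; the data structures, seeding, and tie-breaking choices are illustrative assumptions rather than the authors' exact procedure.

```python
# Illustrative sketch (our own assumptions, not the authors' code): grow each of
# the k partitions on the coarsened graph GLm along the heaviest boundary edge
# (Definition 3-3), stopping when adding the next vertex would move the running
# vertex-weight sum away from Average_W (Definitions 3-1, 3-2 and 3-4).
def weighted_lnd_partition(node_w, adj, k):
    average_w = sum(node_w.values()) / k          # Definition 3-1
    unassigned = set(node_w)
    partitions = []
    for p in range(k):
        if not unassigned:
            break
        v = min(unassigned)                       # deterministic seed vertex
        unassigned.discard(v)
        part, sum_w = [v], node_w[v]              # running Sum_W (Definition 3-2)
        if p == k - 1:                            # last partition takes the rest
            part += sorted(unassigned)
            unassigned.clear()
        else:
            while unassigned:
                # heaviest edge weight from the growing partition to each
                # still-unassigned neighbour (boundary edges)
                border = {}
                for x in part:
                    for u, w in adj.get(x, {}).items():
                        if u in unassigned:
                            border[u] = max(border.get(u, 0), w)
                if not border:
                    break
                u = max(border, key=border.get)   # Definition 3-3
                # stop once adding u moves Sum_W away from Average_W (Def. 3-4)
                if abs(sum_w + node_w[u] - average_w) >= abs(sum_w - average_w):
                    break
                part.append(u)
                sum_w += node_w[u]
                unassigned.discard(u)
        partitions.append(part)
    return partitions

# toy coarsened graph: vertex weights and a symmetric weighted adjacency
nodes = {"A": 10, "B": 7, "C": 5, "D": 3}
adj = {"A": {"B": 12, "D": 2}, "B": {"A": 12, "C": 4},
       "C": {"B": 4, "D": 6}, "D": {"C": 6, "A": 2}}
print(weighted_lnd_partition(nodes, adj, k=2))    # -> [['A'], ['B', 'C', 'D']]
```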

2.3 Dynamic Knowledge Fragmentation Storage Strategy

A dynamically arriving knowledge fragment is mapped to its corresponding label set, which is checked against the k partitions of the distributed cluster. If a partition already holds a corresponding label, the knowledge fragment is stored in that partition; otherwise, the following balance strategy is applied.

Definition 3-5.

(Balance strategy): The partition with the minimum |P(i)| is chosen. If several partitions satisfy this condition, one of them is selected at random. The partition index is returned according to the following formula:

$$ Index = random\big(\{\, i \mid |P(i)| = \min\limits_{1 \le j \le k} |P(j)| \,\}\big) $$
(5)
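A minimal sketch of this routing logic is given below, assuming the per-partition label sets and node counts |P(i)| are available in memory; the function and variable names are illustrative and do not correspond to Neo4j APIs.

```python
import random

# Illustrative sketch: route an incoming knowledge fragment to a partition.
# partition_labels[i] is the label set held by partition i, and
# partition_sizes[i] is |P(i)|, the current node count of partition i.
def route_fragment(fragment_labels, partition_labels, partition_sizes):
    # store with the first partition that already holds one of the labels
    # (the paper does not specify a tie-breaking rule for multiple matches)
    for i, labels in enumerate(partition_labels):
        if fragment_labels & labels:
            return i
    # otherwise apply the balance strategy (Definition 3-5, Eq. 5):
    # pick uniformly at random among the least-loaded partitions
    smallest = min(partition_sizes)
    candidates = [i for i, size in enumerate(partition_sizes) if size == smallest]
    return random.choice(candidates)

# usage: 3 partitions, an incoming fragment labelled {"Weather", "Disease"}
labels = [{"Traffic"}, {"AirQuality"}, {"RiskHazard"}]
sizes = [120, 95, 95]
print(route_fragment({"Weather", "Disease"}, labels, sizes))  # -> 1 or 2
```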

3 Experimental Settings and Results

The experiment used the open-domain dataset provided by OpenKG.CN, restricted to the resource sets related to an urban area. The basic parameters of the dataset are described in Table 1, and the search examples used in the experiment are shown in Table 2. The dataset covers urban air quality, meteorology, related diseases, traffic safety, risk hazards, and many other aspects. The experiments show that |Vm| < 100 is a suitable criterion for ending the coarsening process, so the threshold θ = 100 is selected for the edge fusion algorithm. After training on the dataset, the key relationship contribution coefficients are set to α = 0.7 and β = 0.3.

Table 1. Basic parameters of urban datasets
Table 2. Search examples

In a comparative experiment, the Metis+ algorithm was compared with file horizontal segmentation and the Metis algorithm on keyword set scales Q1–Q5 with data sizes of 3 GB and 14 GB. To eliminate the influence of accidental factors, each example was retrieved 10 times, and the highest and lowest values were removed before computing the average retrieval time.
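For reference, the averaging described above (dropping the highest and lowest of the ten measurements) amounts to a simple trimmed mean; the timing values below are made up purely for illustration.

```python
# Trimmed average of 10 retrieval times: drop the max and min, average the rest.
times_ms = [412, 398, 405, 430, 401, 399, 407, 395, 420, 403]  # hypothetical values
trimmed = sorted(times_ms)[1:-1]
print(sum(trimmed) / len(trimmed))   # average retrieval response time
```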

It can be seen from Figs. 2 and 3 that, for the same dataset size and keyword set size, the average search response time of the Metis+ algorithm is the lowest. When the data size is 3 GB, the average retrieval response time of the Metis algorithm is 1.12 times that of Metis+, and that of file horizontal segmentation is 1.47 times that of Metis+. For a data size of 14 GB, the average retrieval response time of the Metis algorithm is 1.7 times that of Metis+, and that of file horizontal segmentation is nearly 2.5 times that of Metis+.

Fig. 2. Comparison of average retrieval response time for the 3 GB data scale

Fig. 3. Comparison of average retrieval response time for the 14 GB data scale

For the same dataset size, the average search response time of the Metis algorithm for Q5 is nearly 1.3 times that for Q1, that of file horizontal segmentation nearly triples, and that of the Metis+ algorithm used in this study only doubles. The experiment shows that Metis+ exhibits a slower rise in average retrieval response time as the keyword set size gradually increases.

4 Conclusion

The proposed algorithm not only assigns nodes of the same type and closely related nodes to the same partition as far as possible, but also minimizes jump accesses between partitions during retrieval and thus improves retrieval efficiency. In future research, we intend to improve the algorithm by taking semantic correlations into account during the distributed storage stage of the graph data.