
1 Introduction

The Semantic Web is an effort to make knowledge on the Web both human-understandable and machine-readable [1]. Semantic Web data is normally structured in a triple format called the Resource Description Framework (RDF). With the emergence of the RDF/S, OWL, and SPARQL standards, the number of large KBs such as YAGO, DBpedia, and Freebase is growing rapidly. Although these KBs suffer from issues such as incompleteness and inconsistency, they already contain millions of facts, which raises new opportunities for the data mining community. In recent years, researchers have been developing methods and tools for mining hidden patterns from Semantic Web data that promise more potential for Semantic Web applications [2]. In this regard, association rule mining is one of the most common Data Mining (DM) techniques for extracting frequent patterns.

There are several methods for mining associations from large RDF-style KBs. Most existing methods rely on Inductive Logic Programming (ILP), which usually requires counterexamples. AMIE [3, 4] is a multi-threaded approach in which the KB is kept and indexed in memory; high memory usage is one of its drawbacks. The method is also restricted to a complete ontology structure. To compute Support and Confidence values, it only considers knowledge at the instance level and removes rdf:type relations from the datasets. A recent statistical approach for mining association rules in RDF data is [5]. It automatically generates rules of the three forms \(s_{i} \Rightarrow s_{j}\), \(p_{i} \Rightarrow p_{j}\), and \(o_{i} \Rightarrow o_{j}\), and does not require counterexamples. However, it discovers correlations among subjects, predicates, or objects independently of one another. Additionally, its rule quality factors (Support and Confidence) only assess instance-level data.

In comparison with [5], we propose a statistical approach that automatically mines rules from RDF data by adapting association rule mining to the RDF data model. The rules reveal common behavioural patterns associated with knowledge at both the instance level and the schema level. Consider the RDF triples shown in Table 1. From these triples, our approach generates the following rule:

$$\begin{aligned} {\{Person\}}{:}\, (instrument,\, {Guitar}) \Rightarrow (occupation,\, {Songwriter}) \end{aligned}$$
Table 1. RDF triples from DBpedia

The above rule shows that persons who play a musical instrument such as the guitar are most likely also songwriters. Mining such regularities helps us gain a better understanding of Semantic Web data. In order to elaborate the semantics of the rules, our approach considers rdf:type and rdfs:subClassOf relations in the ontology. As seen in Table 1, both John Lennon and George Harrison are guitarists and songwriters. Consider Fig. 2, a small fragment of the DBpedia ontology. George Harrison is an instance of the Musical Artist class while John Lennon belongs to the Person class. Given the class hierarchy in the ontology, if the Musical Artist class is a subclass of the Artist class and the Artist class is a subclass of the Person class, then George Harrison belongs to the Person class as well; John Lennon, however, is not an instance of the Musical Artist class. In the context of Semantic Web data, it is not reasonable to interpret the discovered rules without considering such relationships between the instance level and the schema level. As far as we know, the approaches proposed in [3–5] do not cover these issues when mining Semantic Web data.

Under this motivation, in this paper we propose a novel approach called SWARM (Semantic Web Association Rule Mining) to automatically mine and generate semantically-enriched rules from RDF data. The main contribution of this paper is threefold:

  1. SWARM automatically mines association rules from RDF data without the need for domain experts.

  2. SWARM measures the quality of rules (Support and Confidence) by utilizing knowledge not only at the instance level but also at the schema level.

  3. SWARM reveals common behavioural patterns associated with knowledge at the instance level and the schema level.

The remainder of this paper is organized as follows. Section 2 gives a general overview of related work. In Sect. 3, the SWARM approach is introduced in detail; both the framework architecture and the algorithms are presented there. Section 4 shows the experimental results. Finally, the conclusion and future work are presented in Sect. 5.

2 Related Work

In the following, we discuss state-of-the-art approaches in the context of Semantic Web data mining.

Logical Rule Mining. Most related research on mining Semantic Web data relies on ILP techniques. ALEPH [6] is an ILP system implemented in Prolog. WARMeR [7] uses a declarative language to mine association rules corresponding to conjunctive queries over a relational database. Galárraga et al. [3] proposed a multi-threaded approach called AMIE for mining association rules from RDF-style KBs, and later extended AMIE to AMIE\(^{+}\) [4] using pruning and query rewriting techniques. Similar to AMIE, [8] proposed another approach for extracting Horn rules. The methods proposed in [3, 4, 8] measure the quality of the discovered rules using instance-level data only. However, the instances in a rule might belong to different classes of the ontology. To express a broader meaning of the rules, SWARM considers instance-level data along with rdf:type and rdfs:subClassOf relations at the schema level.

Association Rule Mining. Association rule mining was originally proposed for the shopping basket problem [9]. It reflects high correlation between multiple objects and extracts interesting relationships between data [10]. Nebot and Berlanga [11] proposed a rule mining approach over RDF-based medical data, where transactions are generated using mining patterns expressed as SPARQL queries. This approach relies heavily on domain experts; namely, the user must have background knowledge of the vocabularies used in the ontology. In comparison to [11], the SWARM approach does not require domain experts.

Abedjan and Naumann [5, 12] developed an approach to identify schema and value dependencies between RDF data using six different configurations. Any part of a Subject-Predicate-Object (SPO) statement can be considered as a context, which is used for grouping one of the two remaining parts of the statement as the target of mining. This approach mines rules of the three forms \(s_{i} \Rightarrow s_{j}\), \(p_{i} \Rightarrow p_{j}\), and \(o_{i} \Rightarrow o_{j}\). The discovered rules show correlations among subjects, predicates, or objects independently. In comparison with this approach, SWARM generates common behavioural patterns associated with knowledge at the instance level and the schema level.

3 The SWARM Approach

In this section, we describe the SWARM approach in detail, along with the relevant definitions. The overall framework is shown in Fig. 1. The main goal of the SWARM approach is to tie the instance level to the schema level in order to attach more semantics to the rules. SWARM generates Semantic Association Rules from RDF data.

Fig. 1. The SWARM framework

The RDF triples are automatically processed by the Pre-processing Module, which consists of two sub-modules: Semantic Item Generation and Common Behaviour Set Generation. The Mining Module receives Common Behaviour Sets and generates Semantic Association Rules. The SWARM approach evaluates the importance of rules by using rdf:type and rdfs:subClassOf relations in the ontology. The proposed rule quality factors (Support and Confidence) consider knowledge not only at the instance level but also at the schema level.

3.1 Pre-processing Module

The concept of association rule mining was first introduced in [9]. Let \( I=\{i_{1}, i_{2},...,i_{n}\} \) be a set of items and \( D=\{t_{1},t_{2},...,t_{m}\} \) be a set of transactions. Each transaction contains a subset of items in I. An association rule represents a frequent pattern of the occurrence of some items in transactions. In addition, association rules reveal behavioural patterns of some particular entities. For example, in the traditional shopping basket problem, the rule \(\{\text {butter},\text {bread}\} \Rightarrow \{\text {milk}\}\) shows a behavioural pattern of customers. Namely, if a customer buys butter and bread together, she is likely to buy milk as well.

Traditional association rule mining algorithms are suited to homogeneous repositories, where items and transactions play significant roles in the mining process [13]. However, most Semantic Web data are not transactional, and there are no explicit items or transactions. To generate association rules in the context of Semantic Web data, we need to model such notions.

As mentioned earlier, Semantic Web data are normally structured in triple format. The assertion of a triple (i.e., subject, predicate, object) indicates a meaningful relationship between entities (subject and object) provided by the predicate. A triple can also be considered as the description of one particular behaviour of entities.

For example, consider triple t1 in Table 1: (John Lennon, instrument, Guitar). If we consider the subject of t1 (i.e., John Lennon) as the entity, the other two elements of the triple (i.e., instrument, Guitar) can be considered the description of a particular behaviour of John Lennon. Based on this concept, the SWARM approach targets exploring behavioural patterns among entities. Under this motivation, we define the Semantic Item and the Common Behaviour Set to summarize common behaviours of entities.

3.2 Semantic Item Generation

Consider the example in the previous paragraph together with triple t4 in Table 1: (George Harrison, instrument, Guitar). The two subjects of t1 and t4, {John Lennon, George Harrison}, have a common activity, i.e., (instrument, Guitar). Namely, playing the guitar is a common behaviour taken by a group of entities, i.e., {John Lennon, George Harrison}. In the SWARM approach, such combinations are represented as Semantic Items.

Definition 1

(Semantic Item). A Semantic Item si is a 2-tuple, i.e., \( si = (es, pa)\). es is the Element Set of si; it contains a list of subjects, i.e., \(\{s_{1},s_{2},...,s_{n}\}\). pa is the Pair of si; corresponding with the content of es, pa is a predicate-object combination, i.e., (p, o).

According to Definition 1, the triples in a triple store can be converted to a set of Semantic Items, i.e., \(SI=\{si_{1}, si_{2}, ..., si_{n}\}\). Each Semantic Item contains a Pair, which can be considered a common behaviour taken by the entities in its Element Set.

Example 1

Consider the triples shown in Table 1. Table 2 shows some of the Semantic Items generated according to Definition 1. For example, the Element Set of \(si_{2}\), {John Lennon, George Harrison}, contains all subjects that share (occupation, Songwriter) as a Pair.

Table 2. Semantic Items
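Definition 1 can be sketched in a few lines of Python. This is a minimal illustration under our own reading of the paper; the function and variable names are ours, as the authors do not prescribe an implementation:

```python
from collections import defaultdict

def generate_semantic_items(triples):
    """Group triples by their (predicate, object) Pair: each Semantic Item
    is (Element Set of subjects, Pair), per Definition 1."""
    grouped = defaultdict(set)
    for s, p, o in triples:
        grouped[(p, o)].add(s)
    return [(frozenset(subjects), pair) for pair, subjects in grouped.items()]

# Triples t1-t4 from Table 1 (abbreviated)
triples = [
    ("John Lennon", "instrument", "Guitar"),
    ("John Lennon", "occupation", "Songwriter"),
    ("George Harrison", "instrument", "Guitar"),
    ("George Harrison", "occupation", "Songwriter"),
]
items = generate_semantic_items(triples)
```

Running this on the four triples above yields two Semantic Items, each pairing the Element Set {John Lennon, George Harrison} with one shared behaviour, matching Table 2.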

3.3 Common Behaviour Set Generation

As introduced in the previous subsection, a Semantic Item indicates a common behaviour (i.e., the Pair) taken by a group of entities in the Element Set. We define a Common Behaviour Set that represents all common activities taken by similar groups of entities in the Element Sets.

Definition 2

(Common Behaviour Set). A Common Behaviour Set cbs contains a set of Semantic Items with similar Element Sets, i.e., \(\{(es, pa)_{1},(es, pa)_{2},... ,(es, pa)_{n}\}\). Items can be aggregated into the same cbs if the similarity degree of their Element Sets is greater than or equal to the Similarity Threshold SimTh. The similarity degree of Element Sets is calculated using Eq. 1.

$$\begin{aligned} sim(es_{a}, es_{b},..., es_{m})= \frac{|es_{a} \cap es_{b} \cap ... \cap es_{m}|}{|es_{a} \cup es_{b} \cup ... \cup es_{m}| } \end{aligned}$$
(1)

According to Definition 2, a cbs is a set of Semantic Items aggregated through the similarity of the entities in their Element Sets. Namely, a cbs shows a collection of commonly co-occurring activities taken by the entities in the Element Sets.

Example 2

Table 3 shows the Common Behaviour Sets generated from the Semantic Items in Table 2. Semantic Items \(si_{1}\), \(si_{2}\) and Semantic Items \(si_{3}\), \(si_{4}\) generate Common Behaviour Sets \(cbs_{1}\) and \(cbs_{2}\), respectively, when the similarity degree among their Element Sets is greater than or equal to a \(\textit{SimTh}\) of \( 50\,\% \).

Table 3. Common Behaviour Sets
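Definition 2 and Eq. 1 can be sketched as follows. The greedy aggregation strategy is our own assumption for illustration only; the paper does not specify how Semantic Items are grouped:

```python
def similarity(element_sets):
    """Eq. 1: similarity degree of m Element Sets
    (size of their intersection over the size of their union)."""
    inter = set.intersection(*element_sets)
    union = set.union(*element_sets)
    return len(inter) / len(union)

def generate_cbs(items, sim_th):
    """Greedily aggregate Semantic Items whose Element Sets have a
    similarity degree >= sim_th into Common Behaviour Sets."""
    cbs_list = []
    for es, pa in items:
        for cbs in cbs_list:
            if similarity([set(e) for e, _ in cbs] + [set(es)]) >= sim_th:
                cbs.append((es, pa))
                break
        else:
            cbs_list.append([(es, pa)])
    return cbs_list

items = [
    ({"John Lennon", "George Harrison"}, ("instrument", "Guitar")),
    ({"John Lennon", "George Harrison"}, ("occupation", "Songwriter")),
    ({"Ringo Starr"}, ("instrument", "Drums")),
]
cbs_list = generate_cbs(items, sim_th=0.5)
```

With SimTh = 0.5, the first two items share identical Element Sets (similarity 1.0) and fall into one Common Behaviour Set, while the third item starts a new one.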

3.4 Mining Module

To generate Semantic Association Rules, we need a notion of frequency. As discussed in the previous subsection, each Common Behaviour Set cbs is a unique set that reveals the common occurrence of some activities taken by the entities (subjects) in its Element Sets; in fact, it is a particular form of transaction in the context of Semantic Web data. Under this motivation, we first generate Semantic Association Rules from Common Behaviour Sets and then evaluate the quality of the rules (Support and Confidence measures) by exploiting knowledge encoded in the ontology.

Definition 3

(Semantic Association Rule). A Semantic Association Rule r is composed of two disjoint sets of Pairs, \(pa_{ant}\) and \(pa_{con}\), where \(pa_{ant}\) is called the Pairs of Antecedent and \(pa_{con}\) is called the Pairs of Consequent. \(pa_{ant}\) is a set containing some of the Pairs in \(cbs_{j}\), i.e., {\(pa_{1}, ... , pa_{n}\)}. \(pa_{con}\) is the set containing the remaining Pairs in \(cbs_{j}\), i.e., {\(pa_{n+1} ,...,pa_{m}\)}. Rule r contains a common Rule's Element Set res, which is the union of the Element Sets in \(cbs_{j}\), i.e., {\(es_{1}\cup {...}\cup es_{m}\)}. Each Element Set \(es_{i}\) is a set of instances, i.e., \(es_{i}=\{ins_{1},ins_{2},...,ins_{k}\}\). We denote a rule r with its antecedent and consequent by the implication

$$\begin{aligned} res{:}\, pa_{ant} \Longrightarrow pa_{con} \end{aligned}$$

where res is a common Rule’s Element Set containing \(\bigcup \limits _{si_i \in cbs_j}si_i.es\) and \({pa_{ant}}\), \({pa_{con}} \in cbs_{j}\) and \({pa_{ant}} \cap {pa_{con}}=\emptyset \).

Table 4 shows two examples of rules generated from the Common Behaviour Sets in Table 3. For example, rule \(r_{1}\) contains a common Rule's Element Set res generated by the union of the Element Sets in \(cbs_{1}\), i.e., {\( John \; Lennon \), \( George \; Harrison \)}. The antecedent and the consequent of \(r_{1}\) hold the Pairs (instrument, Guitar) and (occupation, Songwriter), respectively.
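Definition 3 can be sketched as an enumeration of antecedent/consequent splits of a Common Behaviour Set. Again, the names are ours and the enumeration order is an assumption; the definition itself only fixes what a valid split is:

```python
from itertools import combinations

def generate_rules(cbs):
    """Enumerate Semantic Association Rules from one Common Behaviour Set:
    every non-empty, non-total split of its Pairs into antecedent and
    consequent, sharing the union of the Element Sets as res."""
    res = frozenset().union(*(es for es, _ in cbs))
    pairs = [pa for _, pa in cbs]
    rules = []
    for k in range(1, len(pairs)):
        for ant in combinations(pairs, k):
            con = tuple(p for p in pairs if p not in ant)
            rules.append((res, ant, con))  # res: ant => con
    return rules

cbs = [
    ({"John Lennon", "George Harrison"}, ("instrument", "Guitar")),
    ({"John Lennon", "George Harrison"}, ("occupation", "Songwriter")),
]
rules = generate_rules(cbs)
```

For \(cbs_{1}\) this produces the rule of Table 4 and its converse, each carrying the shared Rule's Element Set {John Lennon, George Harrison}.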

Our goal is to measure the quality of rules by using knowledge at the instance level and the schema level. rdf:type is an RDF property that ties an instance to a class in the ontology. In traditional association rule mining, all instances typically belong to one type (class) of actor, i.e., shopping customers; particular activities are always performed by customers. Here, however, the instances in a Rule's Element Set may belong to different types/classes in the ontology. Consider the instances in the Rule's Element Set of rule \(r_{1}\), i.e., John Lennon and George Harrison. Figure 2 shows a small fragment of the DBpedia ontology. In this ontology, George Harrison belongs to the Musical Artist class, while John Lennon is an instance of the Person class. As discussed in the introduction, in the context of Semantic Web data it does not make sense to measure the quality of rules by only considering knowledge at the instance level. This observation leads us to assess rdf:type and rdfs:subClassOf relations at the schema level. In this paper, we focus on interpreting rules over a single ontological structure, e.g., the DBpedia ontology. Furthermore, we assume that each instance belongs to a single class in the ontology.

Table 4. Semantic Association Rules
Fig. 2. A fragment of the DBpedia ontology

Figure 3 shows three different hierarchical structures of an ontology. As previously mentioned, in Fig. 3(a), if Class \(c_{1}\) is a subclass of Class \(c_{3}\) through the intermediate Class \( c_{2}\) (\(c_{1} \subseteq c_{3}\)), then Instance \(I_{a}\) belongs to \(c_{3}\) as well. However, in Fig. 3(b), Class \(c_{1}\) and Class \(c_{5}\) are not in the same hierarchy (\(c_{1} \not \subseteq c_{5}\)). Even if we considered Class \(c_{3}\) as the lowest common class of \(c_{1}\) and \(c_{5}\), we would reduce their semantics, because upper-level classes provide more general descriptions than lower-level classes, which are more specific. Therefore, in case the classes are not in the same hierarchy, we simply consider the Lowest Level Class (LLC) for each instance in a Rule's Element Set. For example, in Fig. 3(b), \(I_{a}\) and \(I_{b}\) belong to \(c_{1}\) and \(c_{5}\), respectively. In Fig. 3(c), at the LLC, \(I_{a}\) and \(I_{b}\) belong to \(c_{2}\), while \(I_{c}\) is an instance of \(c_{9}\). Consider again the instances in the Rule's Element Set of \(r_{1}\). Based on our assumption that each instance belongs to a single class, the LLC for both John Lennon and George Harrison is the Person class.

Fig. 3. Examples of different hierarchical structures of an ontology
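One plausible reading of the LLC selection described above can be sketched as follows. This is our own interpretation for illustration; the paper does not give an algorithm, and we assume each class has at most one direct superclass:

```python
def lowest_level_class(instance, types, subclass_of):
    """Return the Lowest Level Class (LLC) of an instance: among its
    asserted classes, pick one that is not a superclass of another
    asserted class. `subclass_of` maps a class to its direct superclass."""
    def superclasses(c):
        seen = set()
        while c in subclass_of:
            c = subclass_of[c]
            seen.add(c)
        return seen
    classes = types[instance]
    return next(c for c in classes
                if not any(c in superclasses(d) for d in classes if d != c))

# Hypothetical type assertions following the Fig. 2 hierarchy
types = {"George Harrison": {"MusicalArtist", "Artist", "Person"}}
subclass_of = {"MusicalArtist": "Artist", "Artist": "Person"}
llc = lowest_level_class("George Harrison", types, subclass_of)
```

Here the asserted superclasses Artist and Person are ruled out, leaving MusicalArtist as the LLC.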

Support. Consider the Semantic Association Rule r in the form of \( res: {pa_{ant}}\Longrightarrow {pa_{con}}\). The support Sup (r) is defined as:

$$\begin{aligned} Sup (r)=\frac{|\bigcup \limits _{ins_j \in c_i \wedge ins_j \in res_{k} \wedge ins_j.pa_{ant_{k}}}c_i|}{|\bigcup \limits _{ins_j \in c_i \wedge ins_j \in res_{k}}c_i|} \end{aligned}$$
(2)

The numerator of the support fraction is the total number of instances of Class \(c_{i} \) that contain \({pa_{ant}}\) as Pairs. The denominator is the total number of instances of \(c_{i}\).
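Equation 2 can be sketched as follows, under our simplifying assumption that each instance has a single class (as stated above). The lookup tables `types` and `has_pairs` are hypothetical helpers of our own, not structures from the paper:

```python
def support(res, types, has_pairs, ant):
    """Eq. 2 (sketch): instances of the classes of the Rule's Element Set
    that hold all antecedent Pairs, over all instances of those classes.
    `types` maps each instance to its (single) class; `has_pairs` maps
    each instance to the set of Pairs it exhibits."""
    classes = {types[i] for i in res}
    members = [i for i in types if types[i] in classes]
    matching = [i for i in members
                if all(p in has_pairs.get(i, set()) for p in ant)]
    return len(matching) / len(members)

# Example 4 setting: six Person instances, two of which play the guitar
types = {name: "Person" for name in
         ["John Lennon", "George Harrison", "p3", "p4", "p5", "p6"]}
has_pairs = {
    "John Lennon": {("instrument", "Guitar"), ("occupation", "Songwriter")},
    "George Harrison": {("instrument", "Guitar"), ("occupation", "Songwriter")},
}
sup = support({"John Lennon", "George Harrison"}, types, has_pairs,
              [("instrument", "Guitar")])  # 2 / 6
```

This reproduces the \(Sup.=0.33\) of Example 4.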

Example 3

Regarding the three hierarchical structures shown in Fig. 3, the rules generated from Schemas (a), (b), and (c) are \( r_{a}=\{I_{a}, I_{b}\}\): \(pa_{ant}\) \(\Rightarrow \) \(pa_{con}\), \( r_{b}=\{I_{a}, I_{b}\}\): \(pa_{ant}\) \(\Rightarrow \) \(pa_{con}\), and \( r_{c}=\{I_{a}, I_{b}, I_{c}\}\): \(pa_{ant}\) \(\Rightarrow \) \(pa_{con}\), respectively. The supports of the rules are calculated by the following fractions:

$$\begin{aligned} Sup (r_{a})&=\frac{|c_{3} \; \cap \; pa_{ant}|}{|c_{3}|}\\ Sup (r_{b})&=\frac{|(c_{1}\cup c_{3}) \; \cap \; pa_{ant}|}{|c_{1}\cup c_{3}|}\\ Sup (r_{c})&=\frac{|(c_{2}\cup c_{9}) \; \cap \; pa_{ant}|}{|c_{2}\cup c_{9}|} \end{aligned}$$

Example 4

The support of rule \(r_{1}\) in Table 4 can be calculated by the following fraction. The numerator of the support fraction is the total number of instances belonging to the Person class that contain (instrument, Guitar) as a Pair. Based on the ontology shown in Fig. 2, there are only two such instances. The denominator of the fraction is the total number of instances belonging to the Person class, which is six in this example (\(Sup.=0.33\)).

$$\begin{aligned} Sup (r_{1})=\frac{|Person \; \cap \; instrument \; Guitar|}{| Person|} \end{aligned}$$

Confidence. Consider the Semantic Association Rule r in the form of \( res: {pa_{ant}}\Longrightarrow {pa_{con}}\). The confidence Conf (r) is defined as:

$$\begin{aligned} Conf (r)=\frac{|\bigcup \limits _{ins_j \in c_i \wedge ins_j \in res_{k} \wedge ins_j.pa_{ant_{k}}\wedge ins_j.pa_{con_{k}}}c_i|}{|\bigcup \limits _{ins_j \in c_i \wedge ins_j \in res_{k} \wedge ins_j.pa_{ant_{k}}}c_i|} \end{aligned}$$
(3)

The numerator of the confidence fraction is the total number of instances of Class \(c_{i} \) that contain \({pa_{ant}}\) and \({pa_{con}}\) as Pairs. The denominator of the fraction is the total number of instances of \(c_{i} \) that contain \({pa_{ant}}\) as Pairs.
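Equation 3 can be sketched in the same style, again with hypothetical `types` and `has_pairs` lookup tables of our own and the single-class-per-instance assumption:

```python
def confidence(res, types, has_pairs, ant, con):
    """Eq. 3 (sketch): among instances of the rule's classes holding all
    antecedent Pairs, the fraction also holding all consequent Pairs."""
    classes = {types[i] for i in res}
    members = [i for i in types if types[i] in classes]
    with_ant = [i for i in members
                if all(p in has_pairs.get(i, set()) for p in ant)]
    with_both = [i for i in with_ant
                 if all(p in has_pairs.get(i, set()) for p in con)]
    return len(with_both) / len(with_ant)

# Same six Person instances as in the support example
types = {name: "Person" for name in
         ["John Lennon", "George Harrison", "p3", "p4", "p5", "p6"]}
has_pairs = {
    "John Lennon": {("instrument", "Guitar"), ("occupation", "Songwriter")},
    "George Harrison": {("instrument", "Guitar"), ("occupation", "Songwriter")},
}
conf = confidence({"John Lennon", "George Harrison"}, types, has_pairs,
                  [("instrument", "Guitar")], [("occupation", "Songwriter")])
```

Both guitar-playing instances are also songwriters, reproducing the \(Conf.=1.0\) of Example 5.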

Example 5

The numerator of the confidence fraction of rule \(r_{1}\) is the total number of instances belonging to the Person class that contain (instrument, Guitar) and (occupation, Songwriter) as Pairs. The denominator of the fraction is the total number of instances belonging to the Person class that contain (instrument, Guitar) as a Pair (\(Conf.=1.0\)). The rule shows that persons who play the guitar most likely also work as songwriters. Note that at least 50 % of the instances in the Rule's Element Set satisfy the rule.

$$\begin{aligned} Conf (r_{1})=\frac{|Person \; \cap \; instrument \; Guitar \; \cap \; occupation \; Songwriter|}{|Person \; \cap \; instrument \; Guitar|} \end{aligned}$$

Example 6

Rule \(r_{2}\) in Table 4 shows that people who have been President of the USA are most likely members of the Democratic Party (\(Sup.=0.5\), \(Conf.=0.66\)).

4 Experiments

4.1 Overview

Dataset. As a proof of concept, we ran SWARM on DBpedia (3.8). The DBpedia datasets provide the A-Box and T-Box in two separate files: Ontology Infobox Properties and Ontology Infobox Types. Ontology Infobox Properties provides instance-level data, and Ontology Infobox Types contains triples of the form (subject, rdf:type, ClassName), where ClassName declares the classes of each subject in the DBpedia ontology. For example, \(Anton\; Drexler\) belongs to the Politician, Person, and Agent classes. In this paper, we filtered Ontology Infobox Types based on the Person class and its subclasses. In the DBpedia ontology, the Person class has 26 subclasses. Using the triples filtered from Ontology Infobox Types, we extracted about 50,000 triples from Ontology Infobox Properties. We also removed triples with literals (numbers and strings) from the resulting dataset, since literal values such as birthdate information are less interesting for rule mining.

Goal. The main goal of this research is to automatically tie the instance level to the schema level to attach more semantics to the rules. To the best of our knowledge, this issue has not yet been considered by existing methods. In comparison to [5, 12], which consider predicate and object correlations independently, our approach automatically mines common behavioural patterns associated with knowledge at both the instance level and the schema level.

Table 5. Semantic Association Rules from Person class (SimTh = 60 %)
Table 6. Semantic Association Rules from Person class (SimTh = 80 %)

Evaluations. Table 5 presents some Semantic Association Rules of the Person class generated with SimTh = 60 %. For example, rule \(r_{1}\) shows that scientists who are known for the theory of natural selection were most likely awarded the Copley and Royal Medals. Note that in this table, at least 60 % of the instances of the Element Sets satisfy the rules. Rule \(r_{2}\) illustrates that politicians who are residents of Amsterdam and work in the Labour Party of the Netherlands are most likely residents of the Netherlands. Rule \(r_{3}\) shows that saints who are venerated in Lutheranism, a major branch of Protestant Christianity, are most likely venerated in the Anglican Communion as well. Rule \(r_{4}\) shows that people who were Kings of Macedon most likely followed the Ancient Greek religion.

Table 6 shows some rules generated with SimTh = 80 %. Based on the DBpedia ontology, we identified some inconsistent patterns. Rule \(r_{1}\) states that members of the British Royal family who were born in Ribeira Palace and whose parents are Luisa of Guzman and John IV of Portugal are most likely buried in the Royal Pantheon of the House of Braganza. Although the instances of the Rule's Element Set of \(r_{1}\) satisfy the rule, none of them belongs to the British Royal family; in fact, they are members of the Portuguese Royal family. Rule \( r_{2} \) suffers from the same issue as \(r_{1}\). In the DBpedia ontology, all of these royals belong to the BritishRoyalty and PolishKing classes; the ontology does not define any other classes for these instances. Such inconsistencies between ontology definitions and the underlying data lead to ambiguous interpretations. Sometimes the ontology has been created independently, before actual data usage. In the case of the DBpedia project, revising the existing class definitions might help obtain a better understanding of the data.

We observe that the generated rules tend to have low support rates. The intuition behind this is that the denominator of the support fraction contains the total number of instances of particular classes; in real-world KBs, the number of instances is very large, which leads to low support rates. Figure 4 shows the number of strong Semantic Association Rules for different minimum Similarity Thresholds. The SimTh has a direct effect on the number of generated rules: as seen in Fig. 4, the number of rules with confidence between 0.6 and 0.8 decreases as the SimTh increases. SWARM discovers a large number of rules with confidence 1.0 because of the filtering mechanism used to generate Common Behaviour Sets. Note that this approach is implemented in Java (Eclipse) and was run on a 3.20 GHz Intel Core i5 processor with 16 GB of memory. The time complexity of the SWARM algorithm is O(\(n^{2}\)), including the time for generating the Semantic Items and Common Behaviour Sets and for mining Semantic Association Rules using instance-level and schema-level knowledge.

Fig. 4. Number of strong Semantic Association Rules with different minimum Similarity Thresholds

5 Conclusion

In this paper, we proposed an approach to automatically mine Semantic Association Rules from Semantic Web data by utilizing knowledge at the instance level and the schema level. We believe that this type of learning will become important in the future of Semantic Web data mining, especially for re-engineering ontology definitions. In comparison with the existing methods [5, 12], which evaluate the quality of rules using only instance-level data, the SWARM approach takes advantage of rdf:type and rdfs:subClassOf relations to interpret the association rules. Future work will test the approach on other classes of the DBpedia ontology. In order to produce more precise rules, we also aim to extend the approach so that each instance of a Rule's Element Set may belong to multiple classes in the ontology.