1 Introduction

Information extraction is the task of automatically extracting knowledge from text and constructing knowledge bases. A knowledge base usually contains entities such as Tom Brady, Winnipeg, or Manchester United FC. It also contains information about these entities, such as the fact that Tom Brady plays football, Winnipeg is a city in Canada, or Manchester United FC is a sports team. These facts are essentially linguistic in nature and referred to commonly as nouns (or categorical instances) or relations (which are ordered pairs of nouns). Examples of nouns are Winnipeg, Warsaw, New Delhi which belong to the category of City. Examples of relations which are ordered pairs of nouns are: (Winnipeg, Canada), (Warsaw, Poland), (New Delhi, India), belonging to the category of CityInCountry. Knowledge bases are utilized in many applications such as machine translation, question answering, and semantic search [28]. The growth of the web has spawned knowledge bases from web corpora where the construction of these bases is performed in a either semi-supervised or unsupervised manner. These methods require minimal or no human intervention and can recursively discover new relations, attributes, and instances in a fully automated, scalable manner. The focus of some of the current systems such as YAGO3 [17, 35, 36], KnowItAll [11] and TextRunner [1] is on machine reading. Other systems such as NELL [4, 19] successfully learn facts over a prolonged period of time and are called never ending learners.

Typical web corpora include Wikipedia, DBpedia, WordNet [18], NELL [19], and Google’s Knowledge Vault [9]. In this work, inspired by NELL, we focus on a small subset of the problem where facts are learned using machine learning techniques in an iterative fashion rather than machine reading or the extraction process itself. In other words, we are given facts that have already been extracted and new facts are learnt. Content for such learning can be thought of as a triple (subject, predicate, object). For example, a relational fact can be represented as:

$$\begin{aligned} \left\langle \text{ Tom } \text{ Brady } \right\rangle \left\langle \text{ actually } \text{ led } \right\rangle \left\langle \text{ Patriots } \right\rangle \text { to a win} \end{aligned}$$

where “Tom Brady” is a subject (noun), “actually led” is a predicate, and “Patriots” is an object (noun). This representation permits the learning of a relational fact (Tom Brady, Patriots) as belonging to a category (Athlete-team) using the predicate information. Unary relations (nouns) capture memberships in a semantic type. On the other hand, binary relations (a pair of nouns) capture semantics between entities. The predicate is referred to as a contextual pattern which is an arbitrary phrase such as “actually led” providing a context for the relation. Hence, we have two main facts: relations and contextual patterns. A matrix is then formed based on the co-occurrence of relations and their contextual patterns. Using this co-occurrence matrix, if a relation \(r_1\) (e.g., Tom Brady, Patriots) regularly co-occurs in context (e.g., “actually led”) along with a relation \(r_2\) (e.g., Peyton Manning, Colts) and if \(r_1\) belongs to category C (e.g., Athlete_Team), then it is likely that \(r_2\) might also belong to category Athlete_Team. In this paper, we address three typical issues that arise from this form of learning: i) the number of training examples are few, i.e., relations and their known categories, ii) a relation may belong to more than one category depending on its contextual patterns, and iii) new relations end up being miscategorized (also known as concept drift [7, 12]) in the process of never ending learning.

To address the first issue, semi-supervised approaches are preferred where a small number of labeled examples are used as seed (training) examples and a large number of unlabeled instances are learnt in an iterative fashion. In every iteration, a few examples from the unlabeled instances are “promoted” as trusted, thus growing the knowledge base. This process is also known as bootstrapping. Significant improvements in accuracy for learning categories of nouns and relations have been demonstrated using bootstrapping methods highlighted in [3, 4, 19, 39]. The second issue translates to the challenge of using the co-occurrence information and scoring the degree of belonging where a relation can belong to more than one category based on its context. In CPL [4], simultaneous training of many extractors for learning relations using mutual exclusion and type-checking relationships were employed. In CBS [39], a probabilistic score for every candidate noun fact belonging to a category was used. In TPL [31], a scoring mechanism based on tolerance approximation spaces [33] derived from rough set theory [24] was used. The foundation for TPL is a tolerance relation which is symmetrical and reflexive, but need not be transitive, and this property makes it possible to represent overlapping classes and well-suited for constructing knowledge bases of linguistic entities. In the proposed research with FRL, instead of a crisp co-occurrence matrix used by the above three methods, our method uses a fuzzy (graded) co-occurrence matrix. The motivation for using a hybrid rough set method was to take advantage of the strengths of both fuzzy and rough set methods. Fuzzy rough model for categorizating nouns which permits the modeling of the degree of belonging in a more powerful manner was successfully used in [2]. In this paper, a scoring mechanism is used based on fuzzy rough sets [10], where fuzzy sets [42] are first applied to the co-occurrence information. Rough set operators are then applied to the graded co-occurrence information to obtain the overall score. Such a technique was successfully employed in [8] for web query expansion for document retrieval.

The third issue of the concept drift problem is common to all iterative semi-supervised learning, where the limited number of initial labeled examples tends to be insufficient to reliably constrain the learning process, thereby causing the prediction of new relational facts to become less accurate with each iteration. CPL relies on three types of constraints to handle concept drift. In CBS, Bayesian learning is used to learn several categories for nouns simultaneously using a coupled Bayesian sets algorithm with mutual exclusion constraints. With the tolerance rough set-based learner (TPL) algorithm [29], no external constraints were required for learning nouns and relations. In the proposed FRL method, mutual exclusion constraints are defined to handle concept drift.

In this paper, we present a new hybrid fuzzy rough model inspired by [6, 8] to learn relations. We introduce a semi-supervised learning algorithm (FRL) with ontology derived from a subset of examples from NELL. The proposed FRL algorithm is experimentally compared with CPL and TPL algorithms. Experimental results demonstrate that FRL performs better than TPL based on the ranking method and outperforms CPL and TPL using the promotion-based method. The contributions of this paper are: (i) a formal fuzzy rough model for relations, (ii) a semi-supervised learning algorithm, (iii) experimental comparison with other machine learning algorithms: TPL and CPL, and (iv) a novel application of fuzzy rough sets with some insights into the strengths and weakness of integration of these technologies for categorization of linguistic entities.

The paper is organized as follows: in Sect. 2, we discuss research related to structured (document) and unstructured text (linguistic) categorization pertinent to this paper. In this section, we describe TPL, CBS, and CPL in some detail. In addition, we also discuss the fuzzy rough sets model for documents. In Sect. 3, we present the proposed fuzzy rough sets framework for nouns and relations. In Sect. 4, we describe our proposed FRL algorithm and experiments followed by a trace of FRL with examples. In Sect. 5, we discuss the experimental setup and analysis of the results. Here, we illustrate the problem of concept drift with FRL. In Sect. 6, we discuss scalability and complexity of FRL. We conclude the paper in Sect. 7.

2 Related works

In this section, we briefly introduce CPL and CBS methods since they are used as benchmarks for comparison. Tolerance rough set method is discussed in considerable detail, and the TPL framework is introduced to give the reader an insight into the lower and upper approximation framework. We also discuss the fuzzy rough set method which was first proposed as a model for document representation and retrieval and uses a graded (fuzzy) thesaurus.

2.1 CPL and CBS methods

CPL is a bootstrapping algorithm based on logistic regression that uses mutual exclusion and type-checking constraints [4]. Nouns and relations are first filtered to enforce mutual exclusion and type-checking constraints. Next, for each category, CPL ranks the linguistic entities using the number of promoted patterns that they co-occur with, so that candidates that occur with more patterns are ranked higher.

CBS is based on Bayesian sets [13] and uses the co-occurrence statistics between noun phrases and contextual patterns. CBS calculates a probabilistic score by using those co-occurrence statistics for every candidate category. In CBS, the learned functions can be considered as classifiers that enforce mutual exclusion constraints using positive examples of one category as negative examples for other ones to learn high-precision instances for all categories defined in an initial ontology. In CBS, instances are first filtered to enforce mutual exclusion. Then, the top ranked ones are promoted as trusted instances for that category. The promoted instances are used as seeds in subsequent iterations in a semi-supervised iterative manner. CBS learns several categories simultaneously. Readers can refer to [39] for more details.

2.2 Tolerance rough sets method

Rough set theory consists of an approximation space characterized by an equivalence relation [24]. Learning in rough set theory is typically accomplished using two operators: lower and upper approximation. However, it has been demonstrated that a tolerance relation [25, 33] is more appropriate for document and named entity classification rather than an equivalence relation [16, 27, 40]. This is because a tolerance relation is symmetrical and reflexive, but need not be transitive, and this property makes it possible to represent overlapping classes which is ideal for text categorization. Tolerance rough sets became the basis for document representation and clustering starting from 2000 [14, 16, 20,21,22, 38]. In [32], a tolerance-based semi-supervised two-class ensemble classifier for documents was proposed. In [37], a weighting scheme for the tolerance rough sets model based on neural networks was introduced. A lexicon-based document representation (LBDR) with tolerance rough sets was introduced by Virginia et al. [40] and subsequently elaborated in [41]. In [27, 29,30,31], tolerance rough sets were used for the first time in categorizing nouns and relational facts.

A tolerance approximation space \({\mathcal {A}}\)  [33] is denoted by \({\mathcal {A}} = (U, I, \nu )\) where U is the universe of objects, I defines the tolerance class of an object in the universe. In other words, it defines the neighborhood of each object, and \(\nu \) measures the degree of inclusion between two sets. These objects can be documents, nouns, relations or contextual patterns. We now briefly show how these parameters are used with the lower (\({\mathcal {L}}_{{\mathcal {A}}}\)) and upper approximators (\({\mathcal {U}}_{{\mathcal {A}}}\)) in the similarity scoring mechanism for nouns by the tolerant pattern learner (TPL) algorithm.

2.2.1 Tolerant pattern learner (TPL) framework

TPL learns by separating the approximation space for nouns into four zones numbered one through four shown in Fig. 1. Each zone contributes toward a different degree of similarity for an instance \(n_{i}\) to be promoted as trusted. These instances are shown as white dots. Every trusted noun instance \(n_{i}\) of a given category \(\mathrm{cat}\) is associated with three descriptor sets: \(C(n_{i})\), \({\mathcal {U}}_{{\mathcal {A}}}(n_{i})\) and \({\mathcal {L}}_{{\mathcal {A}}}(n_{i})\) where \(C(n_{i})\) is a function that is used to determine all the co-occurring contexts for noun \(n_{i}\). The fourth zone (not named) represents one that does not contribute to the similarity determination and may be considered as a negative zone.

Fig. 1
figure 1

Four zones of recognition for contexts emerging from approximations of \(n_{i}\) [29]

The lower and upper approximators are calculated as follows using the inclusion function \(\nu \) and uncertainty function I which defines the neighborhood using a threshold of \(\theta \):

$$\begin{aligned} {\mathcal {L}}_{{\mathcal {A}}}(n_{i})&= \{c_j \in \mathcal {C} : \nu (I_{\theta }(c_j),C(n_{i})) = 1\}\\ {\mathcal {U}}_{{\mathcal {A}}}(n_{i})&= \{c_j \in \mathcal {C} : \nu (I_{\theta }(c_j),C(n_{i})) > 0\} \end{aligned}$$

A micro-score for the candidate noun phrase \(n_{j}\) against the trusted instance \(n_{i}\) of the category \(\mathrm{cat}\) is calculated as follows:

$$\begin{aligned} \hbox {micro}(n_i,n_j)= & {} \omega (C(n_i),C(n_j)) \alpha \\&+\,\omega ({\mathcal {U}}_{{\mathcal {A}}}(n_i),C(n_j)) \beta + \omega ({\mathcal {L}}_{{\mathcal {A}}}(n_i),C(n_j)) \gamma \end{aligned}$$

where \(\omega \) is an overlap function and the parameters \(\alpha \), \(\beta \), \( \gamma \) are weights that reflect the degrees of similarity contribution of objects belonging to specific zones.

2.3 Fuzzy rough sets method

Combining rough and fuzzy sets as an efficient soft computing strategy for machine learning was explored in [23]. In fuzzy set theory [42], the relational counterpart generates soft similarity classes which permits partial overlap, even when the fuzzy relation is reflexive, symmetrical and \(\mathcal {T}\)-transitive, i.e., a so-called fuzzy \(\mathcal {T}\)-equivalence relation [8]. This is the property that lies at the heart of fuzzy rough set models and lends itself to text categorization applications [5, 10]. Fuzzy rough set method was first proposed as a model for structured text representation (document) and retrieval [34]. A fuzzy rough set is a pair \(\left( A_1,A_2 \right) \in \left( X,R \right) \) where A is a fuzzy set in X such that \(R \downarrow A = A_1\) and \(R \uparrow A = A_2\). R is a fuzzy relation in X [6] where \(\downarrow \) represents the lower approximator and \(\uparrow \) represents the upper approximator. We now briefly describe the fuzzy rough set framework for documents.

In [8], a thesaurus which consists of web pages (as documents) is generated using a query of two terms \(t_1\) and \(t_2\). Let \(D_{t_1}\) and \(D_{t_2}\) be the number of pages that contain \(t_1\) and \(t_2\) terms, respectively. Then, the following measure is used to construct an initial thesaurus:

$$\begin{aligned} \frac{| {D_{t_1}}\cap {D_{t_2}}|}{\min |{D_{t_1}}|,|{D_{t_2}}|} \end{aligned}$$
(1)

This initial thesaurus is then normalized by using an S-function which generates the fuzzified thesaurus. Working with fuzzy sets, it is often assumed that the relation characterizing the approximation space is transitive. Therefore, the t-norm \(\mathcal {T}\) is used to construct a transitive fuzzy thesaurus where the documents with membership value of 0.5 or above are considered.

A document retrieval is then viewed as a query expansion problem where the operators of rough sets (upper and lower) are combined with a graded (fuzzy) thesaurus. The thesaurus is a fuzzy relation, and the query is a fuzzy set. Specifically, a query is expanded by adding all the related terms (upper approximation) and then pruned using the lower approximation in what is termed as a tight upper approximation. The intuition behind this operation was to counter the disadvantage of using transitive fuzzy thesaurus where certain terms are added to the original query with a high degree of similarity that have no relevance to the original query.

3 Proposed fuzzy rough sets framework for relations

One can view a knowledge base as a thesaurus consisting of linguistic entities such as categorical nouns, relations, and their contextual patterns. In this section, we introduce the model that is used by FRL. For the sake of completeness, we have included definitions for nouns introduced in [2].

3.1 Formal model

  • \(\mathcal {N} = \{n_1,n_2,\ldots ,n_M\}\) is the universe of nouns.

  • \(\mathcal {C} = \{c_1,c_2,\ldots ,c_P\}\) is the universe of categories.

  • \({\mathcal {R}} = \{r_1,r_2,\ldots ,r_Q\}\) is the universe of contextual patterns.

  • \({\mathcal {H}} = \{h_{ij} = (n_i,n_j) \in {\mathcal {N}^2}: \exists {r_k} \in {\mathcal {R}} | f_{\mathcal {H}} (h_{ij},r_k)> 0 \}\) is the universe of relations described via the relational co-occurrence function \(f_{\mathcal {H}} (h_{ij},r_k)\) where \(f_{\mathcal {H}} (h_{ij},r_k) = \{k \in \mathbb {N} : h_{ij} \ \text {occurs } k \text { times within the context }r_k\}\).

  • \(\mathcal {TN} = \{n_{1},n_{2},\ldots ,n_{Y}\}\) is a set of trusted nouns such that \(\mathcal {TN} \subset \mathcal {N}\) and index \(Y<M\).

  • \(\mathcal {TR} = \{h_{ab},h_{bc},\ldots ,h_{cd}\}\) is a set of trusted relations such that \(\mathcal {TR} \subset {\mathcal {H}}\) and index \({cd}<{ij}\).

We define the following cross mapping functions which form the basis for the co-occurrence matrix.

  • \(C: \mathcal {N} \rightarrow \mathbb {P}(\mathcal {C})\) denotes a mapping of each noun to its set of co-occurring noun contexts such that \(C(n_i) = \{c_j : f_{C}(n_i,c_j)>0\}\) where \(f_{\mathcal {C}}(n_i,c_j) = \kappa \in \mathbb {N} \) denoting that \(n_i\) occurs \(\kappa \) times within context \(c_j\).

  • \(R: {\mathcal {H}} \rightarrow \mathbb {P}({\mathcal {R}})\) denote mapping of each relation to its set of co-occurring relational contexts: \(R(h_{ij}) = \{r_k : f_{R}(h_{ij},r_k)>0\}\)

Note that the function \(f_{R}(h_{ij},r_k)\) is used in the thesaurus construction process.

3.2 Relation thesaurus construction

Based on the definitions introduced in Sect. 3.1, let \({\mathcal {H}}\) be a set of relations and \({\mathcal {R}}\) be a set of co-occurring contextual relational patterns. Let \(\mathcal {TR}\) be a set of trusted relations such that \(\mathcal {TR} \subset {\mathcal {H}}\). Similar to the procedure given in [2, 8], the first step toward creation of a fuzzy thesaurus is by normalizing the co-occurrence information using \(\vartheta \):

$$\begin{aligned} \vartheta (h_{ij},r_k) = \frac{{f_{R} (h_{ij} ,r_k)}}{{f_{R} (h_{ij},r_k), \forall k:1\ldots Q}} \end{aligned}$$
(2)

The next step involves fuzzifying the co-occurrence function with the S-function where \(\alpha = 0.001\) and \(\beta = 0.02\) are constants determined experimentally.

$$\begin{aligned} S(\vartheta ;\alpha ,\beta ) = {\left\{ \begin{array}{ll} 1 &{} \quad \text {if} \ \vartheta \ \ge \ \beta \\ \frac{\vartheta - \alpha }{\beta - \alpha } &{} \quad \text {if} \ 0.005 \ \le \ \vartheta \ <\ \beta \\ 0, &{} \quad \text {otherwise} \end{array}\right. } \end{aligned}$$
(3)

3.3 Lower and upper approximation of the relation thesaurus

The initial thesaurus I for relations is defined as an approximation space \(I = ({\mathcal {H}},{\mathcal {R}},\hbox {CO}_F)\), where \({\mathcal {H}}\) denotes the universe of relations, \({\mathcal {R}}\) denotes the co-occurring contextual patterns. The fuzzy relation \(\hbox {CO}_F\) is defined as a fuzzy set in \({\mathcal {H}} \times {\mathcal {R}}\). Let \(\mathcal {H_F}, \mathcal {TR_F}\) represent the fuzzy sets of relations and trusted relations, respectively. The upper and lower approximations of the fuzzy set \(\mathcal {H_F}\) in I is denoted by \(\mathcal {H_F}\uparrow \hbox {CO}_F\) and \(\mathcal {H_F} \downarrow \hbox {CO}_F\). The notation for upper and lower approximation operators follows the style from [8, 15, 26]. The upper approximation is defined as:

$$\begin{aligned} \begin{aligned} \mathcal {H_F}\uparrow \hbox {CO}_F = \mathop {\text{ sup }}\limits _{{h_{ij} \in {\mathcal {H}}},\ {h_{xy} \in \mathcal {TR}}} \left( \hbox {CO}_F(R(h_{ij}),h_{xy}),\mathcal {H_F}(h_{xy}): \hbox {CO}_{\mathcal {F}}({h_{ij}}) \ge \hbox {CO}_{\mathcal {F}}({h_{xy}})\right) \end{aligned} \end{aligned}$$
(4)

The upper approximation will make it possible to select candidate relations \(h_{ij}\) having a membership co-occurrence value either equal to or more than that of the trusted relations \(h_{xy}\). The lower approximation is defined as:

$$\begin{aligned} \mathcal {H_F} \downarrow \hbox {CO}_F= & {} \mathop {\text{ inf }}\limits _{h_{ij} \in {\mathcal {H}},\ {h_{xy} \in \mathcal {TR}}} (\hbox {CO}_F(R(h_{ij}),h_{xy}),\mathcal {H_F}(h_{xy}) \nonumber \\&\quad {:}((h_{ij},h_{xy}) | R({h_{xy}}) \cap R({h_{ij}}) \ne \emptyset )) \end{aligned}$$
(5)

Here, the lower approximation will make it possible to select candidate relations \(h_{ij}\) as trusted, when there is at least one common context with a trusted noun \(h_{xy}\). The micro-score for a candidate relation \(h_{ij}\) from \({\mathcal {H}}\) is calculated as:

$$\begin{aligned} \hbox {micro}(h_{ij}) = \omega _1\left( \mathcal {H_F}\uparrow \hbox {CO}_F\right) + \omega _2\left( \mathcal {H_F} \downarrow \hbox {CO}_F\right) \end{aligned}$$
(6)

where \(\omega _1\) and \(\omega _2\) are application dependent and in this experiment are set to 50% and 5%, respectively.

3.4 Tight upper approximation for relations

In [8], the benefit of using tight upper approximation was discussed where a term y, will only be added to a query, if all the terms related to y are related to at least one keyword of the query. A similar problem occurs with learning relations, where candidate relations with high membership degrees are promoted and added to the set of trusted relations, irrespective of whether the candidate relation is related to the trusted relation. To counter this problem, we take the upper approximation of all candidate relations \(h_{ij}\) in \({\mathcal {H}}\) followed by lower approximation for the remaining candidate relations which results in the following tight upper approximation operation:

$$\begin{aligned} \hbox {CO}_F \downarrow \uparrow {{\mathcal {H}}_F} (h_{ij}) = \hbox {CO}_F \downarrow (\hbox {CO}_F \uparrow {{\mathcal {H}}_F} (h_{ij})) \end{aligned}$$
(7)

We then calculate \(\hbox {micro}(h_{ij})\) given in Eq. 6 for each candidate relation. We sort those relations identified as trusted based on their scores, and the top five relations are promoted as trusted for the next iteration.

4 FRL learning framework

The high-level flow for learning with FRL is shown in Fig. 2. The input to FRL is an ontology of trusted relations (seeds), unlabeled relational facts, and co-occurrence information. Steps 2 and 3 in Fig. 2 implement the fuzzy rough model in determining the similarity score for each candidate relation. The process consists of building a fuzzy thesaurus, applying the upper and lower approximator operators (resulting in tight upper approximation). Steps 4 and 5 form the semi-supervised learning component for promoting instances as trusted and growing the knowledge base. Since learning is done for each category at a time, step 6 shows the iterative nature of this process.

Fig. 2
figure 2

Semi-supervised learning using FRL

4.1 FRL algorithm

FRL is an iterative algorithm which is designated to run indefinitely.

figure c

Examples of input to Algorithm 1 are trusted relations \(\mathcal {TR}\) per category such as (e.g., Lionel Messi_FC Barcelona, Michael Jordan_Chicago Bulls), categories (e.g., Athlete_Team) and a large corpus of unlabeled relations \({\mathcal {H}}\). Another input is a large co-occurrence relational matrix \(\mathrm{CO}\) representing relations and their co-occurring contextual patterns. The matrix consisted of approximately 6.5 million relations and 11 million contextual patterns. The output of the FRL algorithm is a set of trusted relations categorized by their respective categories from the ontology. We use a score-based ranking. For the category \(\mathrm{cat}\), after calculating the score for every candidate of \({\mathcal {H}}\), we rank the candidates by their micro-scores normalized by the number of trusted instances of \(\mathrm{cat}\). Finally, we promote the top new candidates as trusted. After every iteration, FRL learns new trusted instances and grows its knowledge base to make decisions in the subsequent iterations.

4.2 Illustration

In this section, we trace the steps of FRL starting with the examples of categories, trusted relations and contextual patterns which are input to the proposed FRL. Note that a relation is a pair of nouns. The initial seeds are identical to the ones used in TPL and CBS algorithms.

FRL Algorithm-Input

Example: Categories

{Athlete_Team, CEO_Company, City_Country, City_State, Coach_Team, Company_City, Stadium_City, State_Capital, State_Country, Team_vs_Team}

Example: Trusted relations

{FC Barcelona\(\Vert \)Lionel Messi, Chelsea\(\Vert \)Didier Drogba, AC Milan\(\Vert \)Ronaldinho, Chelsea\(\Vert \)Frank Lampard, Bastian Schweinsteiger\(\Vert \)Bayern Munich}

Example: Contextual patterns

{attacking midfielder, forward, have handed, player, prodigy}

Algo. Step 2

In this trace, we consider category Athlete_Team. We will be focusing on one trusted relation FC Barcelona \(\Vert \)Lionel Messi, and two candidate relations England\(\Vert \)Steven Gerrard and Tiger Woods\(\Vert \)United States.

Algo. Step 3 [Trusted Relations \(\mathcal {TR}\)]

We start the trace with 5 trusted relations which are a pair of nouns separated by\(\Vert \)given as follows:

  • FC Barcelona\(\Vert \)Lionel Messi, Chelsea\(\Vert \)Didier Drogba, AC Milan\(\Vert \)Ronaldinho, Chelsea\(\Vert \)Frank Lampard, Bastian Schweinsteiger\(\Vert \)Bayern Munich

We start the trace with the following 5 initial contextual patterns for the trusted relations given as follows:

  • attacking midfielder, forward, have handed, player, prodigy

Algo. Step 4

We must first fuzzify the co-occurrence information for the trusted relations. The next three steps (4.1 to 4.3) illustrate how the thesaurus is constructed.

Algo. Step 4.1 [Co-occurrence frequency \(f_{R}\)].

Table 1 shows the frequencies of the trusted relations and their co-occurring contextual patterns.

Table 1 Trusted relations along with their commonly co-occurring contextual patterns

Algo. Step 4.2 [Normalized frequency \(\vartheta \)].

Based on the co-occurrence information given in Table 1, we normalize the table by calculating the normalized frequency \(\vartheta \) for each relation. Table 2 shows the normalized result for each trusted relation.

Table 2 Normalized frequency \(\vartheta \) for each relation

Algo. Step 4-revisited In this step, each iteration starts with only one trusted relation. Hence, we start the trace with FC Barcelona\(\Vert \)Lionel Messi listed in Table 3. This in turns leads to step 4.3.

Table 3 Normalized frequency \(\vartheta \) for FC Barcelona\(\Vert \)Lionel Messi

Algo. Step 4.3 [Calculating membership value S based on Eq. 3].

We use Eq. 3 for all contextual patterns for one trusted relation FC Barcelona\(\Vert \)Lionel Messi given in Table 3. Note that the S-function values are all one since \(\vartheta \ \ge \ \beta \) where \(\beta =0.02\) as shown in Table 4.

Table 4 S-function value for FC Barcelona\(\Vert \)Lionel Messi

Algo. Step 5 The above process is repeated for all relations in \(\mathcal {TR}\) (trusted relations) which results in \(\mathcal {\hbox {CO}_F}\) (fuzzified co-occurrence).

Once we have the membership values for each of the trusted relations, we start the process for each candidate relation. A candidate relation is a new relation that must be classified. We repeat steps Algo. Step 1 to Algo. Step 5 for each candidate relation and calculate their memberships along with the commonly co-occurring contextual patterns.

Algo. Step 6 Calculate Upper Approximation.

It is the set of all the candidate relations having a membership co-occurrence value either equal to or more than that of the trusted relations. The threshold \(\varepsilon _u\) for this dataset was set to 50%. The following members constitute a sample belonging to the upper approximation set. These relations are possible candidates for promotion:

{Steven Gerrard\(\Vert \)England, Tiger Woods\(\Vert \)United States, Ronaldo\(\Vert \)Manchester United, Michael Jordan\(\Vert \)Chicago Bulls, Kobe Bryant\(\Vert \)LA Lakers,...}

Algo. Step 7 Calculate Weight \(\omega _1\).

Here, we give the illustration in terms of two candidate relations (Steven Gerrard\(\Vert \)England and Tiger Woods\(\Vert \)United States) from the upper approximation. Note \(\omega _1\) is determined based on criteria set in Eq. 4. Table 5 shows the S-function values for the candidate relation Steven Gerrard\(\Vert \)England. The contextual patterns listed are determined as common with the trusted patterns.

Table 5 S-function value for Steven Gerrard\(\Vert \)England

The weight \(\omega _1\) for Steven Gerrard\(\Vert \)England is calculated as the sum of the memberships of the above patterns. Thus, \(\omega _1 = 4\). Table 6 gives the S-function values for the candidate relation Tiger Woods\(\Vert \)USA.

Table 6 S-function value for Tiger Woods\(\Vert \)USA

For the second candidate relation Tiger Woods\(\Vert \)USA, \(\omega _1 = 3\).

Algo. Step 9 Calculate lower approximation. Here we apply the lower approximation operator on the set of relations in the upper approximation. First, we find all the contextual patterns for all trusted relations. Nineteen contextual patterns for all trusted relations were found by FRL. We only list 14 of them. The remaining 5 contextual patterns are derived from Table 1. A candidate relation is required to have at least one common contextual pattern with the trusted seed relation. The threshold \(\varepsilon _l\) for this set will be 5% of 19. In other words, a candidate not having at least one common contextual pattern will automatically be eliminated once we apply the lower approximation operator. The following is a list of the 14 contextual patterns:

  • “about the signing of”, “along with”, “are missing”, “as striker”, “ Goal in”, “ace a”, “added to”, “and England footballer”, “and England midfielder”, “ believes”, “have confirmed-midfielder”, “prodigy” , “winger”

Algo. Step 10: Calculate Weight \(\omega _2\) for the two candidate relations shown in two steps. \(\omega _2\) is calculated as the sum of the total number of common contextual patterns between a candidate relation and trusted relations. Note, \(\omega _2\) is determined based on criteria set in Eq. 5.

Algo. Step 10.1: Determining common contextual patterns for candidate relation Steven Gerrard\(\Vert \)England. The following is a list of all contextual patterns. The bold-faced patterns are common.

  • attacking midfielder, forward, have handed, player, prodigy, about the signing of, along with, are missing, as striker, Goal in, ace a, added to, and England footballer, and England midfielder, believes, have confirmed-midfielder, winger

Hence, for candidate relation Steven Gerrard\(\Vert \)England, \(\omega _2 =12\).

Algo. Step 10.2: Determining common contextual patterns for candidate relation Tiger Woods\(\Vert \)USA.

  • attacking midfielder, forward, have handed, player, prodigy, about the signing of, along with, are missing, as striker, Goal in, ace a, added to, and England footballer, and England midfielder, believes, have confirmed-midfielder, winger

Hence, for candidate relation Tiger Woods\(\Vert \)USA, \(\omega _2 = 6 \).

Algo. Step 11 The \( \hbox {micro}_{\mathrm{cat}}(h_{ij})\) is calculated as the sum of the two weights \(\omega _1\) and \(\omega _2\).

  • Steven Gerrard\(\Vert \)England, \(\hbox {micro}_{\mathrm{cat}}(h_{ij}) = 4 + 12 = 16 \)

  • Tiger Woods\(\Vert \)USA, \(\hbox {micro}_{\mathrm{cat}}(h_{ij}) = 3 + 6 = 9 \)

From the above illustration, since Steven Gerrard\(\Vert \)England is ranked much higher, it will be promoted as trusted and added to the trusted set of relations to be used as seed for subsequent iterations.

In summary, the normalization and S-function operations determine membership values for contextual patterns of both trusted and candidate relations. This is the fuzzification process based on fuzzy set theory and gives us the graded co-occurrence matrix. The approximation operations based on rough sets are then applied to the fuzzified co-occurrence information. The upper approximation operator removes all unrelated candidates from the set. The lower approximation operator further prunes this set. The micro-score obtained by weighting these two rough set operations determines the criteria for promotion of a candidate relation to the trusted set.

5 Experiments

Throughout our experiments, we used the same ontology as in TPL [29] and CBS [39] experiments. We used the same 11 categories as the input ontology: Athlete_Team, CEO_Company, City_Country, City_State, Coach_Team, Company_City, Stadium _City, State_Capital, State_Country, Team_vs_Team. We initialized each relational category with 6 seed instances and ran the experiment for 10 iterations. In every iteration, the top 5 new relations in every category were promoted as trusted relations for subsequent iterations.

To facilitate comparison of our FRL algorithm with the other algorithms, we used Precision@N at each iteration. In each iteration, Precision@N is calculated as the ratio of the correct instances to the N-ranked ones. Since the data was not labeled, the correctness of an instance was judged manually.

It took us approximately 76 minutes for each iteration using a Windows 10 machine with 3.40 GHz Intel i7 processor. Table 7 shows the ranking-based Precision@30 results for each category for iterations 1, 5 and 10 for TPL and FRL algorithms. Bold-faced values indicate average values for all categories at the end of iteration 10.

Table 7 Precision@30 of TPL [29] and FRL for ranking-based method

To evaluate promotion-based results, we use the same steps implemented by CPL [4] and TPL [29]. We sampled X pairs from all the promoted pairs and calculated the Precision@30. Table 8 gives the promotion-based results for all three algorithms. In both ranking and promotion-based methods, correctness results were verified manually. Bold-faced values indicate average values for all categories at the end of iteration 10.

Table 8 Precision@30 of TPL [29], CPL [4] and FRL for promotion-based method

5.1 Analysis of results

Based on the results in Table 7, the average precision after 10 iterations for the proposed FRL algorithm is significantly better than TPL using a ranking-based method. For this dataset, one can observe that FRL is able to handle concept drift better than TPL in all categories. It is important to note that FRL enforces mutual exclusion constraint, whereas TPL was able to maintain high precision with no externally defined constraints. From Table 8, it can be seen that FRL performs better than TPL and CPL. It is noteworthy that CPL enforces 3 forms of constraints during the learning process [4]. Here FRL was able to do better than TPL in terms of concept drift only in two categories, Athlete_Team and Team_vs_Team. With CPL, the only result that was available was after the tenth iteration.

5.2 Handling concept drift in FRL

The concept drift problem is illustrated with examples from Table 9. This table is a snapshot of promoted relation instances for all categories after the 10th iteration. Misclassified relations are shown in bold. Notice that relation Israelites_Joshua has been incorrectly promoted to category Athlete_Team. When the dataset is limited, after N iterations, candidates from different categories are misidentified as trusted. The most egregious case in this table is for the category State_Capital City which includes several misclassifications. For example, India_Mumbai was promoted to State_Capital City category when in fact, it should have been promoted to City_Country category. In Table 9, for category City_Country, there are no misclassifications even after 10 iterations. The reason for these scenarios is that, in our dataset, there were sufficient instances for category City_Country and insufficient instances for category State_Capital City. Hence, concept drift is also an outcome of the sample size.

Table 9 Promoted relation instances after 10th iteration for all categories

To overcome this problem, FRL enforces mutual exclusion by calculating a mutex score for newly trusted relation with the top ranked relation for that iteration using Eq. 8.

$$\begin{aligned} \hbox {mutex} = \frac{\omega _{h_{uv}}}{\omega _{h_{ij}}} \times 100 \ \end{aligned}$$
(8)

where \(\omega _{h_{ij}}\) and \(\omega _{h_{uv}}\) are the calculated micro-scores for relations to be parsed and top ranked trusted relations, respectively.

Example 1

The initial seed for the category State_Capital included the following set of relations:

  • Manitoba\(\Vert \)Winnipeg, Alabama\(\Vert \)Montgomery, California \(\Vert \)Sacramento

  • Florida\(\Vert \)Tallahassee, Georgia\(\Vert \)Atlanta, Minnesota \(\Vert \)Saint Paul

Note, that all samples are equally qualified as trusted instances for  category City_State, but are mutually exclusive to State_Capital and hence if promoted in any other category as trusted were ignored.

Table 10 Relations potentially belonging to more than one category

In Table 10, we give some examples of relations that had the potential to be promoted as trusted in two separate categories. However, these relations were ignored as their mutex score was higher for the one category and hence excluded from the others. Categories shown in bold had a higher mutex score; therefore, the associated relation was mutually exclusive to that category.

5.3 FRL for nouns

To complete the discussion of both linguistic entities (nouns and relations), we have included the results from [2]. Here, the dataset includes 68,919 noun phrase instances and 59,325 contextual patterns for nouns. The experimental set up was exactly the same as in TPL [29] and CBS [39] experiments. We used the same 11 categories as the input ontology: Company, Disease, KitchenItem, Person, PhysicsTerm, Plant, Profession, Sociopolitics, Website, Vegetable, Sport.

Table 11 Precision@30 of TPL [29], CBS [39] and FRL [2] for nouns

We initialized each category with 5–6 seed instances and ran the experiment for 10 iterations. In every iteration, the top 5 new noun phrases for every category were promoted as trusted nouns for subsequent iterations. Table 11 shows the result for all categories for Precision@30. One can observe that FRL algorithm performs better than CBS but not as good as TPL in terms of average precision value over all 11 categories and 10 iterations. In the case of nouns, the FRL algorithm does not enforce any mutual exclusion constraints. This was due to lack of richness of the dataset. For example, the category vegetables did not have sufficient seeds, and as a result, all three algorithms fared poorly with this category. Here, parameter values \(\alpha \) and \(\beta \) for S-function were identical to that of the relations. Bold-faced values indicate average values for all categories.

6 Complexity and scalability issues

The complexity of FRL is affected by several individual parameters. In FRL, step 1 has a linear time and space complexity \(O({\mathcal {H}})\), where \({\mathcal {H}}\) is the number of relations. Step 2 has a complexity \(O(\#_{\mathrm{cat}})\) where each \(\#_{\mathrm{cat}}\) represents number for categories. Steps 3, 4 and 8 require \(O(|\mathrm{CO}|)\) to form fuzzy sets where \(\mathrm{CO}\) represents mapping for each relation in \({\mathcal {R}}\) with the co-occurring context, and hence, the complexity is O(|R|) where R is the number of contexts for each relation. Steps 5 to 7 and 9 to 11 are a linear pass, and repeat steps 4 and 8 have a complexity of O(|R|), respectively.

Considering that most of the factors remain consistent through all of the iterations, the most expensive or time consuming part of FRL is to calculate the fuzzy thesaurus as well as the upper and lower approximations. Thus, the time complexity of the algorithm is \(O({{\mathcal {H}}})\) for each candidate relation. For space complexity, we preprocess to filter out low cardinality co-occurring values and hence the complexity is \(O({\mathcal {|H|}} \times {\mathcal {|R|}})\) based on the number of co-occurring nonzero values.

In terms of scalability, FRL was tested on single thread CPU in a non-continuous (unchanged dataset) environment. Most of the tasks such as calculating fuzzy memberships and upper and lower approximations for each category can be processed in parallel with GPU computing. However, for a continuous ever-changing dataset environment, some sort of preprocessing of initial data needs to be done. Also, scalability depends on (i) learning nouns vs. relations (noun pairs), (ii) the richness of the examples, (iii) the number of categories, and (iv) quality of linguistic entities within each categories. The last factor has a direct bearing on the number of constraints that need to be defined to handle concept drift.

7 Conclusion

We have proposed a novel semi-supervised learning algorithm (FRL) based on fuzzy rough sets for labeling relations using contextual pattern information. We introduced a formal framework of the fuzzy rough set for linguistic entities. The proposed FRL algorithm was experimentally compared with a tolerance rough set-based learner (TPL) and the coupled pattern learner (CPL). The choice of methods for comparative study was motivated by the fact that fuzzy rough sets and tolerance rough sets permit overlapping (or soft similarity) of classes. CPL and CBS (for nouns) methods were used as benchmarks for this work, and hence, the same datasets were used in all the experiments. Experimental results demonstrate that for this dataset, FRL (for relations) performs better than TPL based on the ranking method and outperforms CPL and TPL using the promotion-based method. In terms of representation, with the tolerance rough set model, a crisp thesaurus is constructed using a tolerance value \(\varepsilon \). Determination of the optimal value of \(\varepsilon \) is a challenge. With the fuzzy rough set model, the challenge was to determine the optimal parameters \(\alpha \) and \(\beta \) for the S-function to construct a graded thesaurus. For handling concept drift, this work also reveals that it was necessary to define mutual exclusion constraints for FRL. CPL defines several constraints, and TPL requires no constraints. This could be important since constraints can add to the computational overhead as the number of categories, and the size datasets grow in an ever learning environment typically associated with web corpora. As future work, we plan to explore the capabilities of FRL and TPL over more categories and larger datasets for categorizing nouns and relations.