1 Introduction

The goal of this research is to determine which knowledge extraction method (for semantic relations) is the most accurate for a given database. In this case, the chosen database was Reuters-21578, a text categorization test collection (Lewis 1997). This collection has been widely used in natural language processing research, more specifically in text classification work (Joachims 1998). Since each newswire has quite a short text and Reuters-21578 is a well-known database, a subset of it was chosen for this research. The selected extraction methods were those that, according to the state-of-the-art survey made in (Rodríguez et al. 2015), proved to be among the top three in terms of quantity and quality of the extracted knowledge pieces.

Knowledge extraction is any technique that allows the analysis of unstructured sources of information (for instance, text in natural language) using an automated process to extract the embedded knowledge and present it in a structured form that can be manipulated by an automated reasoning process (for instance, a production rule or a subgraph of a semantic network). The output of this kind of process is called a piece of knowledge (Rancan et al. 2007). If knowledge extraction is viewed as an algebraic transformation, it can be formulated as follows:

$$ \text{knowledge\_extraction}(\text{information\_structures}) = \text{piece\_of\_knowledge} $$
(1)
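To make formula (1) concrete, the following toy sketch (our own illustration, unrelated to the systems compared below) represents a piece of knowledge as a (subject; relation; object) triple and extracts it with a single hard-coded pattern:

# A piece of knowledge is represented as a (subject, relation, object) triple.
Triple = tuple[str, str, str]

def knowledge_extraction(text: str) -> list[Triple]:
    """Toy instance of formula (1): map unstructured text to pieces of knowledge.
    This illustrative version only recognizes the pattern 'X said Y'."""
    pieces = []
    for sentence in text.split(". "):
        if " said " in sentence:
            subject, obj = sentence.split(" said ", 1)
            pieces.append((subject, "said", obj.rstrip(".")))
    return pieces

print(knowledge_extraction("McLean Industries said it agreed to transfer its South American service."))
# [('McLean Industries', 'said', 'it agreed to transfer its South American service')]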

Since Banko et al. (2007) presented a method of knowledge extraction for the Web, many other knowledge extraction methods for the Web have been presented. The paradigm that encompasses this type of self-supervised method is called Open Information Extraction.

Open Information Extraction (OIE) is a paradigm that facilitates domain-independent discovery of relations extracted from text and readily scales to the diversity and size of the Web corpus. The sole input to an OIE system is a corpus, and its output is a set of extracted relations. An OIE system makes a single pass over its corpus, guaranteeing scalability with the size of the corpus (Banko et al. 2007).

2 Related Work

A state-of-the-art survey was made (Rodríguez et al. 2015) over a set of nine relevant semantic relation extraction methods, all of which work according to the Open Information Extraction paradigm. In that previous work the quality of each method's output was compared, trying to understand which one performs better extractions than the others. The analyzed methods were: KnowItAll (Etzioni et al. 2005), TEXTRUNNER (Banko et al. 2007), WOE (Wu and Weld 2010), SRL-Lund (Christensen et al. 2011), ReVerb (Fader et al. 2011), OLLIE (Schmitz et al. 2012), ClausIE (Del Corro and Gemulla 2013), ReNoun (Yahya et al. 2014) and TRIPLEX (Mirrezaei et al. 2015). Our comparison work is summarized in Table 1, a double-entry table where each cell must be understood as a comparison between two methods: the method indicated in the column against the method indicated in the row. The intersection cell shows the method that achieved a higher quantity and quality of extracted pieces of knowledge (approximately), regardless of the measure used in the article. The references from which each comparison was taken are also given.

Table 1. Summary of comparisons between methods

We can draw the following preliminary conclusions:

  1. The best studied method, in terms of quantity and quality of knowledge pieces extracted, is ClausIE.

  2. Since TRIPLEX in combination with OLLIE is only slightly better than OLLIE alone, we would expect that ClausIE exceeds TRIPLEX+OLLIE in precision.

  3. If we consider again the quantity and quality of knowledge pieces extracted, after ClausIE the next methods are OLLIE, ReVerb and WOE, in that order.

3 Experiment

The goal of this experiment is to obtain a reliable estimation of which of these three methods, ReVerb, OLLIE or ClausIE (the top three methods according to our state-of-the-art survey), has the best precision, recall and F-measure for a given database. Precision, recall and F-measure will be calculated using the following formulas:

$$ \text{precision} = \frac{\text{amount of relevant extracted knowledge pieces}}{\text{amount of extracted knowledge pieces}} $$
(2)
$$ \text{recall} = \frac{\text{amount of relevant extracted knowledge pieces}}{\text{amount of handmade relation extractions} + \text{new extracted pieces}} $$
(3)
$$ F_{\beta} = \frac{(1 + \beta^{2}) \cdot \text{precision} \cdot \text{recall}}{\beta^{2} \cdot \text{precision} + \text{recall}} $$
(4)

The “new extracted pieces” in formula (3) are the relevant extracted knowledge pieces which are not in the handmade set. In formula (4), the selected value for β is 1; for simplicity, this F-measure will be called the F1-measure, or just F1.
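For reference, formulas (2), (3) and (4) translate directly into code; the following minimal Python sketch (the parameter names are ours, standing for the counts described above) renders them as functions:

def precision(relevant_extracted: int, extracted: int) -> float:
    # Formula (2): relevant extracted knowledge pieces over all extracted pieces.
    return relevant_extracted / extracted

def recall(relevant_extracted: int, handmade: int, new_pieces: int) -> float:
    # Formula (3): the denominator is the handmade extractions plus the valid
    # automatic extractions that were not in the handmade set ("new").
    return relevant_extracted / (handmade + new_pieces)

def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    # Formula (4); with beta = 1 this is the F1-measure used in this paper.
    return (1 + beta**2) * p * r / (beta**2 * p + r)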

To calculate the confidence level and the associated margin of error for a given number of samples, the following sample size determination formula (see Hamburg 1979) will be used:

$$ n = \frac{N \cdot Z^{2} \cdot p \cdot (1 - p)}{(N - 1) \cdot e^{2} + Z^{2} \cdot p \cdot (1 - p)} $$
(5)

Where:

  • N: total number of newswire articles in Reuters-21578 (21578)

  • Z: the deviation from the mean accepted to achieve the desired level of confidence

  • p: the proportion we expect to find (for an unknown population, 50 % is usually taken)

  • e: the maximum permissible margin of error

The research goal is to obtain the highest possible confidence level with a maximum margin of error of 10 %. According to formula (5), the confidence level achieved with the selected subset of 55 newswires (see Sect. 3.2) is 86 %, which is enough to establish which of the three evaluated methods would be the preferred one to extract semantic relations from the Reuters-21578 database, within the established error margin.
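As a sanity check of formula (5), the following sketch (using SciPy's normal quantile function; the 55-article subset is the one described in Sect. 3.2) reproduces both the 86 % figure and the 95 % target discussed in Sect. 5:

import math
from scipy.stats import norm

def sample_size(N: int, confidence: float, p: float = 0.5, e: float = 0.10) -> int:
    """Formula (5): sample size for a finite population of N elements."""
    z = norm.ppf(1 - (1 - confidence) / 2)  # two-tailed z-score for the confidence level
    n = (N * z**2 * p * (1 - p)) / ((N - 1) * e**2 + z**2 * p * (1 - p))
    return math.ceil(n)

N = 21578  # total newswire articles in Reuters-21578
print(sample_size(N, confidence=0.86))  # 55 articles give 86 % confidence with e = 10 %
print(sample_size(N, confidence=0.95))  # 96 articles would give 95 % confidence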

3.1 Manual Extraction

The first part of our experiment was to perform the semantic relation extraction manually for each newswire of the selected subset. During this part of the experiment we were assisted by several senior students at the Information Engineering Bachelor's degree level, to whom the semantic relation extraction procedure was explained. Finally, the authors revised the students' extraction work. To show an example of these handmade extractions, consider the newswire with id 44:

“…McLean Industries Inc’s United States Lines Inc subsidiary said it has agreed in principle to transfer its South American service by arranging for the transfer of certain charters and assets to Crowley Mariotime Corp’s American Transport Lines Inc subsidiary. U.S. Lines said negotiations on the contract are expected to be completed within the next week. Terms and conditions of the contract would be subject to approval of various regulatory bodies, including the U.S. Bankruptcy Court…”

The following semantic relations were obtained manually:

  • (McLean Industries Inc; is subsidiary of; United States Lines Inc.)

  • (McLean Industries Inc; said; it has agreed in principle to transfer its South American service by arranging for the transfer of certain charters and assets to Crowley Mariotime Corp’s American Transport Lines Inc. subsidiary)

  • (McLean Industries Inc; has agreed to transfer; its South American service by arranging for the transfer of certain charters and assets to Crowley Mariotime Corp’s American Transport Lines Inc. subsidiary)

  • (U.S. Lines; said; negotiations on the contract are expected to be completed within the next week)

  • (negotiations on the contract; are expected to be completed; within the next week)

  • (Terms and conditions of the contract; would be; subject to approval of various regulatory bodies)
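For bookkeeping purposes, handmade extractions such as these can simply be stored as triples keyed by newswire id; the sketch below (an illustrative data structure of our own, not a prescribed format) shows part of the gold set for newswire 44:

# Handmade (gold) extractions keyed by newswire id; each piece of knowledge
# is a (subject, relation, object) triple, as in the list above.
gold_extractions: dict[int, list[tuple[str, str, str]]] = {
    44: [
        ("McLean Industries Inc", "is subsidiary of", "United States Lines Inc."),
        ("negotiations on the contract", "are expected to be completed",
         "within the next week"),
        ("Terms and conditions of the contract", "would be",
         "subject to approval of various regulatory bodies"),
        # ... remaining handmade triples for this newswire
    ],
}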

3.2 Verification

The next step was to run the methods over the same 55 Reuters articles and validate each automatic extraction by hand. A scale of three values was used: right, invalid and more-or-less-right. This last value was used for borderline extractions, where it was difficult to decide whether the extraction was right or not. An extraction marked as more-or-less-right was not taken into consideration when computing the precision and recall; in this way a penalization for a more-or-less-right extraction was avoided, or a double penalization if we think of the F1-measure. This value (more-or-less-right) was also used to avoid counting twice two right extractions that were very similar to each other, extractions where the only difference was in the scope of the second entity (typically in ClausIE). Among the automatic extractions made by ClausIE over the newswire with id 44, the following two are of interest:

  • (it; has agreed; to transfer its South American service)

  • (it; has agreed; in principle to transfer its South American service)

Both extractions were correct, and both made reference to the same sentence. In this particular case the first was marked as right and the second was marked as more-or-less-right. A second consideration, before marking an automatic extraction as right, was to identify whether there was a manual extraction matching the automatic one; in other words, we checked that both extractions made reference to the same sentence and to the same relation, regardless of minor details. Continuing with the same example, the following manual extraction:

(McLean Industries Inc; has agreed to transfer; its South American service by arranging for the transfer of certain charters and assets to Crowley Mariotime Corp’s American Transport Lines Inc subsidiary)

was considered equivalent to the following automatic extraction:

(it; has agreed; to transfer its South American service)

Even though, in this case, there were differences in both entities and in the relation as well, the manual version and the automatic version were considered semantically equal. Then, for cases where a valid automatic extraction was identified but there was no match with any handmade extraction, it was marked as “new”. Thus, at the moment of calculating the recall, the amount of valid relations was computed as all the handmade extractions plus all the automatic extractions marked as “new” (for a given method).
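To make the bookkeeping concrete, here is a minimal sketch (the label names mirror the categories above; the counts are toy values, not our experimental data) of how the per-method tallies feed formulas (2) and (3):

from collections import Counter

# Hand-assigned validation labels for one method's automatic extractions:
# "right", "invalid", "more-or-less-right", plus "new" for valid extractions
# with no matching handmade extraction.
labels = ["right", "new", "invalid", "more-or-less-right", "right"]  # toy data
counts = Counter(labels)

relevant = counts["right"] + counts["new"]              # valid automatic extractions
extracted = len(labels) - counts["more-or-less-right"]  # more-or-less-right excluded
handmade = 6                                            # handmade extractions (Sect. 3.1)

p = relevant / extracted                   # formula (2)
r = relevant / (handmade + counts["new"])  # formula (3)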

4 Results

Table 2 shows, for each method, a summary of the total amount of automatic extractions made, the right ones, and the total number of valid semantic relations, calculated in the way just described. Precision, recall and F1-measure were calculated from the values in Table 2 using formulas (2), (3) and (4). The obtained results are shown in Table 3.

Table 2. Summary of relations and extractions for each method.
Table 3. Precision, Recall and F1-measure for each method.

5 Conclusions and Future Research

According to the results summarized in Table 1, ClausIE should have obtained the highest precision, followed by OLLIE and then by ReVerb, but the obtained results contradict these assumptions. What we see is that ReVerb is the method with the highest precision, followed by ClausIE and finally by OLLIE. If we look at the obtained recall, however, this value is consistent with the expected results: ClausIE has a better recall than OLLIE, and OLLIE a better recall than ReVerb. ReVerb extracts fewer semantic relations than ClausIE or OLLIE (see in Table 2: 301 against 638 and 545), but its percentage of valid extractions is higher. Finally, the F1-measure shows that ClausIE outperforms both ReVerb and OLLIE. The ReVerb F1-measure is slightly better than OLLIE's, but they have almost the same value; the difference is only 0.013.

The next step in our research is to increase the number of evaluated newswires to 96 in order to reach a confidence level of 95 % and thus establish which of the three methods is the best one for extracting the semantic relations in the Reuters-21578 database.