
1 Introduction

Ambiguity in the search query’s keywords is one of the main problems that degrade search efficiency. Using different terminological variants for the same concept (synonymy) creates a many-to-one ambiguity, whereas using the same terminology for different concepts (polysemy) creates a one-to-many ambiguity [1, 2]. The problem becomes harder still for a highly ambiguous language such as Arabic [3, 4]. For example, the optional vowelization of modern Arabic text increases the polysemy of its written words [5, 6].

Traditionally, search engines are characterized by trading a high recall for a low precision. This is caused mainly by their sensitivity to the query keywords and their misinterpretation of synonymous and polysemous terminology [7]. In other words, irrelevant pages are retrieved alongside the relevant ones, which directly lowers Precision, while relevant pages expressed in different vocabulary are missed, which lowers Recall. A recommended solution is semantic search, which relies on ontological resources for semantic indexing instead of the lexical indexing commonly used by traditional search systems. Semantic search thus aims to resolve the semantic ambiguity by retrieving the pages that refer to the semantic interpretation of the search query, i.e. a particular concept, instead of the pages that merely mention its keywords [8, 9].

This research proposes an enhanced semantic VSM-based search approach for Arabic information retrieval applications and the like. In the proposed approach, we built a concept-space that is used to construct the VSM index. This model enabled us to represent Arabic documents as semantic vectors in which the most representative concepts receive the highest weights. This representation allows a semantic classification of the search space, and hence semantic retrieval capability, reflected in the Precision and Recall values. Evaluating the retrieval effectiveness of the concept-space index showed a noticeable improvement in Precision and Recall compared to the traditional syntactic term-space baseline.

The rest of the paper is organized as follows: Sect. 2 describes the main aspects of the proposed model. Section 3 presents the architecture of the proposed model and its implementation aspects. A system is implemented, and the experimental results are discussed in detail in Sect. 4. Finally, the paper is concluded in the last section. The algorithms developed to implement the proposed system are listed in the article’s appendix.

2 The Proposed Approach

This research proposes an enhanced semantic VSM-based approach for Arabic information retrieval applications and the like. The VSM is a conventional information retrieval model that has demonstrated its ability to represent documents in a computer-interpretable form [10].

In the proposed model, we built a rich VSM index of concepts, the concept-space (CS), that enables the representation of documents as semantic vectors in which the most relevant concepts are given the highest weights. This semantic representation allows a semantic classification of the documents, so the semantic search facilities can be obtained. The construction of the CS is derived from the semantic relationships obtainable from the UWN. The UWN provides a corresponding list of meanings and shows how they are semantically associated [11]. Fortunately, the UWN supports the Arabic language and its dialects as well. As a proof of concept, a system is implemented on the Arabic Wikipedia. The system’s semantic retrieval effectiveness is evaluated in terms of Precision and Recall, and shows noticeable improvements in performance compared to syntactic term-space based systems.

The key contributions of the study, and how it is distinguished from the traditional VSM, are highlighted in the next sections.

2.1 A Novel Indexing Approach

In the Semantic Web, terms are used to explain conceptsFootnote 1 and their relations [8, 9]. Consequently, concepts sharing some terms in their definitions will share many perspectives of their meanings. This can be exploited when concepts are identified and used as a semantic index of the VSM. For performance evaluation purposes, we produced three indices: Term-Space (TS), Semantic Term-Space (STS), and Concept-Space (CS). As specified by Definition 1, each entry of the TS considers all inflected forms of the term. For clarification, see the TS block of Fig. 1.

Fig. 1
figure 1

TS, STS, and CS entries example

Each entry of the STS dictionary, on the other hand, is the UWN semantic expansion of the corresponding TS entry. In other words, each term in the TS is represented by its semantic expansionsFootnote 2, as specified by Definition 2Footnote 3 and the STS block of Fig. 1.
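To illustrate Definition 2, the following Python sketch builds a toy STS from a TS. The `uwn_expand` function is a hypothetical stand-in for a real UWN query, and `build_sts` is our illustrative name, not the paper’s implementation:

```python
def uwn_expand(term):
    """Stand-in for a UWN lookup; returns a set of expansion strings."""
    toy_uwn = {
        "beast": {"monster", "brute", "savage"},
        "car": {"auto", "automobile", "vehicle"},
    }
    return toy_uwn.get(term, set())

def build_sts(term_space):
    """Definition 2: STi = Semantic_Expansion(Ti) ∪ Ti."""
    return {term: set(forms) | uwn_expand(term)
            for term, forms in term_space.items()}

# Each TS entry is a term with its inflected forms (Definition 1).
ts = {"beast": {"beast", "beasts"}, "car": {"car", "cars"}}
sts = build_sts(ts)
```

In the real system the expansions come from UWN queries over the Arabic vocabulary; the toy dictionary above only mirrors the shape of the data.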

However, the generation of the STS index revealed some drawbacks that need to be addressed. It might produce duplicated entries that are directly or indirectly linked, see Fig. 2. This duplication causes a kind of semantic redundancy that should be resolved to improve the retrieval performance.

Fig. 2
figure 2

The semantic expansion redundancy in the STS. The semantic expansions presented in the figure are captured from the UWN

For example, as presented in Fig. 2, the terms:

‘BeastFootnote 4/وحش’, ‘SavageFootnote 5/’, ‘BruteFootnote 6/ ’,‘BehemothFootnote 7/ ’, and ‘DemonFootnote 8/

are all semantically expanded entries in the STS index. However, these expansions contain some direct and indirect redundancies, which appear wherever terms share their synonyms with other terms or even with the synonyms of other terms. For example, ‘Beast’ is a synonym of both ‘Savage’ and ‘Brute’ and is also a term itself; ‘Wildcat’ is a synonym shared by both ‘Savage’ and ‘Brute’; and ‘Monster’, a synonym of ‘Beast’, is shared by both ‘Behemoth’ and ‘Demon’.

To overcome these deficiencies of the STS, we introduce a novel indexing approach that builds a CS index capable of improving the search performance. As stated in Definition 3, an entry of the CS dictionary is a concept. In our study, a concept is defined by: a main keyword identifying the concept, all of the terms enclosed by the concept’s meaning, all of these terms’ morphological derivations, and all of their UWN semantic expansions. See the CS block of Fig. 1.
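The grouping step can be sketched as a minimal union-find pass that merges STS entries sharing any expansion into one concept, removing the Fig. 2 style redundancy. This is an illustrative reconstruction under our own assumptions; the paper’s actual conceptualization algorithm is the one listed in its appendix.

```python
def build_cs(sts):
    """Collapse an STS (term -> expansion set) into concepts by merging
    entries whose expansion sets overlap (union-find grouping)."""
    parent = {t: t for t in sts}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path compression
            t = parent[t]
        return t

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Link any two STS entries whose expansions overlap.
    owner = {}
    for term, expansions in sts.items():
        for e in expansions:
            if e in owner:
                union(owner[e], term)
            else:
                owner[e] = term

    # Each group of linked entries becomes one concept (Definition 3).
    concepts = {}
    for term, expansions in sts.items():
        concepts.setdefault(find(term), set()).update(expansions)
    return concepts

# 'savage' and 'brute' share 'beast' and 'wildcat', so they merge.
sts = {
    "savage": {"savage", "beast", "wildcat"},
    "brute":  {"brute", "beast", "wildcat"},
    "car":    {"car", "auto"},
}
cs = build_cs(sts)
```

Overlap-based merging is our design choice for the sketch; the system may also use the concept’s main keyword and morphological derivations when deciding what belongs to one concept.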

The generation of each index, and the movement from one indexing type to the next, more advanced, type, are depicted in the ‘Morphological Indexing’ and ‘Semantic and Conceptual Indexing’ phases of the system architecture presented in Fig. 7.

As the VSM has proved its capability to represent documents in a computer-interpretable form, we tried to improve its performance by replacing its traditional TS index with the new CS index, see Fig. 3. The CS index enabled us to represent documents as semantic vectors in which the highest weights are assigned to the most representative concepts. A vector is accurately directed if its highest weight is assigned to the document’s fundamental concept, Fig. 3. Thus the semantic search facilities, reflected in the Precision and Recall values, can be gained [4].

Fig. 3
figure 3

The enhancement of the traditional VSM to be conceptual VSM

2.2 The Significance Level of a Concept (SLC)

In the literature, the retrieval capability of indices is usually evaluated using the following measures: Document Frequency (df), Term Frequency (tf), and Weight (w). The bigger the value of df, tf, or w for an index entry, the more relevant documents are found. However, the traditional method of simply counting the occurrences of a term or its semantic expansions across a document is neither fair nor efficient: it can be deceptive, since documents are considered relevant based on the absolute frequency of terms. Therefore, the calculation of the frequency must be controlled by other factors that consider the degree of relevance instead of the absolute frequency. In fact, directly incrementing df, tf, and w for each occurrence of the term itself, its semantic expansions, or its conceptualization terms may produce inaccurate results, since the expansions vary in their levels of relevance. This motivated us to introduce a processing stage that calculates the significance level of the term/concept as a more accurate alternative to the traditional term frequency, which positively impacts the recognition of relevant documents.

In the VSM, the weight of a term t in a document d refers to the term’s capability of distinguishing the document. Traditionally, the weight w of t in d is defined in terms of its frequency tf, the number of times that t occurs in d. However, when the semantic conceptual model is adopted, this weight equation is no longer accurate, and three factors affecting the calculation of the decisive weight should be taken into consideration.

Definition 1 Term-Space (TS)

The TS is defined as the set of all distinct terms belonging to the knowledge sourceFootnote 9 as follows:

TS = {T1, T2, …, Ti, …, Tk}, where:

Ti is the set of inflected forms of Term #i in the TS, i.e. Ti = {ti1, ti2, …, tij, …, tim}

k = # terms in the TS

m = # inflected forms of Term #i

Definition 2 Semantic Term-Space (STS)

Given TS; the STS is defined as follows:

STS = { ST1, ST2, …, STi, …, STk }, where:

STi = Semantic_Expansion (Ti) ∪ Ti Footnote 10

Semantic_Expansion (Ti) = {Si, Ui, Pi, Hi, Ii}

Thus,

STi = {ei1, …, eix, …, ein}Footnote 11

eix = the inflectional/semantic expansion #x of term Ti

n = the total count of Ti semantic expansions

Definition 3 Concept-Space (CS)

Given STS, the CS is generated from groups of concepts, each of which is represented by interrelated semantic terms of the STS as follows:

CS = {C1, C2,… ,Cq}, where:

Ci = {STi1, STi2, …, STij, …, STic}Footnote 12

Thus,

Ci = {e11, …, e1x, …, e1n, …, ec1, …, ecx, …, ecn}

q = # concepts extracted from the STS

Ci = the concept #i, which is defined by a set of semantic terms from STS

c = # semantic terms used to define the concept #i

First, a term needs to be matched against all of its inflected forms that occur in d, rather than only the input or the normalized form. Therefore, not all matches should have the same weight.

Second, the semantic expansions of terms can be classified into five different relation types: Synonyms, SubClasses and HasInstances (specialization expansions), and SuperClasses and InstancesOf (generalization expansions). It is evident that the Synonym expansions are semantically closer to the original term than either its generalized or specialized expansions. Moreover, the SubClass and SuperClass expansions are weightier than those of HasInstances or InstancesOf, since the former represent classes that encompass a set of related concepts while the latter represent instances referring to specific related elements. This means that matches over different semantic relations have different distances, which directly affect the weights.

Third, the available semantic expansions of each semantic relation type have different confidences, because not all expansions are relevant to the original term to the same degree. So an additional confidence factor should be used when calculating the weight.

The factors above prevent the term frequency from serving as an accurate incidence indicator for the semantic expansions in the semantic search process. Hence, we introduce a new incidence indicator of the term t, and consequently of the concept c defined in terms of t, in document d. This indicator is based on the significance level of the term in the document instead of its frequency count. The new measurement, the significance level, is computed from the association strength of each expansion of the term. The association strength depends on the distance of the semantic relation and the confidence of the semantic expansion. The distance of the semantic relation is a constantFootnote 13 that is heuristically determined by the semantic closeness of the expansion, as declared before. The value of the confidence, on the other hand, is directly determined by the UWN specifications.

Figure 4 shows the semantic closeness of the expansions. As depicted in this figure, the set of Synonyms matches the original term exactly; therefore, we set the multiplying coefficient of synonyms to 1. The Sub/Super classes, on the other hand, represent some degree of generalization, so they are not found by an exact match with the original term. However, they are more closely related than instances, since they indicate a general perspective of the meaning, whereas the instances represent certain items that may share only one or more characteristics of the original term’s meaning. Therefore, we set the multiplying coefficient of the sub/super classes to 0.75 and that of the has-instance/instance-of relations to 0.5. More illustrative examples can be found in [14]. Formally, let AS(e) denote the Association Strength of the expansion e. As presented in Eq. (1), it is defined in terms of the confidence of the expansion e and the distance of its semantic relation type. The new incidence indicator, which denotes the Significance Level of a Term ST in document d, is defined by SLT(ST,d) as presented in Eq. (2), where n is the number of all expansions e of term ST.Footnote 14 The tf(ex) factor represents the frequency of occurrences of expansion ex in document d.

Fig. 4
figure 4

Synonyms, sub-classes, super-classes, has-instance, and instance-of relationships

$${\text{AS}}\left( {\text{e}} \right) = {\text{confidence}}\left( {\text{e}} \right)\,*\,{\text{distance}}\left( {\text{e}} \right)$$
(1)
$${\text{SLT}}\left( {{\text{ST}},{\text{d}}} \right) = \mathop \sum \limits_{{{\text{x}} = 1}}^{\text{n}} {\text{AS}}\left( {e_{x} } \right)\,*\,{\text{tf}}\left( {e_{x} } \right)$$
(2)
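Equations (1) and (2) can be transcribed directly, assuming each expansion is carried as a (word, confidence, relation) triple. The 1 / 0.75 / 0.5 distances are the coefficients chosen above; in the real system the confidences come from the UWN, and the relation names below are our own labels:

```python
# Relation distances chosen in the text: 1 for synonyms, 0.75 for
# sub/super classes, 0.5 for has-instance/instance-of.
DISTANCE = {
    "synonym": 1.0,
    "subclass": 0.75, "superclass": 0.75,
    "has_instance": 0.5, "instance_of": 0.5,
}

def association_strength(confidence, relation):
    """Eq. (1): AS(e) = confidence(e) * distance(e)."""
    return confidence * DISTANCE[relation]

def slt(expansions, doc_tokens):
    """Eq. (2): SLT(ST, d) = sum over x of AS(e_x) * tf(e_x)."""
    return sum(association_strength(conf, rel) * doc_tokens.count(word)
               for word, conf, rel in expansions)

expansions = [("beast", 1.0, "synonym"), ("monster", 0.8, "superclass")]
doc = ["the", "beast", "and", "the", "monster", "beast"]
level = slt(expansions, doc)  # 1.0*1.0*2 + 0.8*0.75*1 = 2.6
```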

Let SIDF(ST) denote the Semantic Inverse Document Frequency of the term ST, defined by Eq. 3.Footnote 15,Footnote 16 Thus, the Semantic Weight of term ST in document d is defined by SW(ST,d) as presented in Eq. 4. In terms of concepts, the Significance Level of a Concept (SLC) is defined by Eq. 5, where c is the count of the semantic terms used to define the concept C.

$$\text{SIDF}\left( {\text{ST}} \right) = \log \frac{{\left| {\text{D}} \right|}}{{\left| {\left\{ {e_{x} \in {\text{d }} |\,e_{x} \in {\text{ST}}} \right\}} \right|}}$$
(3)
$$\text{SW}\left( {\text{ST,d}} \right) = {\text{SLT}}\left( {{\text{ST}},{\text{d}}} \right)\,*\,\text{SIDF}\left( {\text{ST}} \right)$$
(4)
$${\text{SLC}}\left( {{\text{C}},{\text{d}}} \right) = \mathop \sum \limits_{{{\text{i}} = 1}}^{\text{c}} {\text{SLT}}\left( {{\text{STi}},{\text{d}}} \right)$$
(5)

Finally, let SIDF(C) denote the Semantic Inverse Document Frequency of the concept C, as defined by Eq. 6.Footnote 17 The Semantic Weight of each concept in the new Concept-Space, SW(C,d), is defined by Eq. 7.

$$\text{SIDF}\left( {\text{C}} \right) = \log \frac{{\left| {\text{D}} \right|}}{{\left| {\left\{ {e_{x} \in {\text{d }} |\, e_{x} \in {\text{C}}} \right\}} \right|}}$$
(6)
$$\text{SW}\left( {\text{C,d}} \right) = {\text{SLC}}\left( {{\text{C}},{\text{d}}} \right)\,*\,\text{SIDF}\left( {\text{C}} \right)$$
(7)
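A minimal sketch of Eqs. (3)–(7), under our reading of the set expressions: the document frequency counts the documents containing any expansion of the term or concept, and the SLT values are assumed precomputed by Eq. (2). The function names are ours:

```python
import math

def sidf(expansion_words, documents):
    """Eqs. (3)/(6): log(|D| / number of docs containing an expansion)."""
    df = sum(1 for d in documents if expansion_words & set(d))
    return math.log(len(documents) / df) if df else 0.0

def slc(concept_term_slts):
    """Eq. (5): SLC(C, d) = sum of SLT(ST_i, d) over the c terms of C."""
    return sum(concept_term_slts)

def sw(significance, idf):
    """Eqs. (4)/(7): semantic weight = significance level * SIDF."""
    return significance * idf

docs = [["beast", "roars"], ["cars", "drive"], ["monster", "movie"]]
concept_words = {"beast", "monster"}          # expansions of a concept
slt_values = [2.6, 0.6]                       # SLT of the concept's terms
weight = sw(slc(slt_values), sidf(concept_words, docs))
```

Here two of the three toy documents contain an expansion, so the SIDF factor is log(3/2), damping concepts that appear everywhere, exactly as inverse document frequency does in the classical VSM.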

2.3 Semantic Distance Between Query and CS

As far as the semantics of concepts are concerned, we need to accurately match each expanded word of the input query against each concept in the CS, see Fig. 5. As stated before, not all matches of each semantic expansion have the same matching strength.

Fig. 5
figure 5

Matching between the query word expansions and the concept expansions

For example, the first synonym of the first query word, Qw1s1, may match one of the super-classes of a concept x in the CS, Cxpi. Qw1s1 itself may also match one of the synonyms of another concept y, Cysj. As we justified earlier, the second match is stronger than the first. Therefore, for accuracy reasons, each case has to be handled separately; otherwise, unexpected results will be produced, for instance weak matches taking the same weights as stronger ones. To that end, we derived formulas (8) and (9) for constructing the entries of this sensitivity-aware matching vector of the query.

Let n denote the count of expansions matched between the query and the concepts in the space, and mx denote the frequency of each match. The Significance Level of the concept C in query Q is defined by Eq. 8, where CASx is the Association Strength between the concept C and the match mx, while QwASx is the Association Strength between the query word and the same match.

Thus, the Semantic Weight of the concept C in the query Q is defined by Eq. 9. Accordingly, the Semantic Distance between query Q and document di is calculated from the semantic weights of each concept Cj in both the query Q and the document di by Eq. 10, where q is the count of concepts in the space.

To sum up, we have described how to calculate the weights of concepts in documents to construct the CS index, and how to compare the concepts in an input query with these concepts to accurately determine the relevant documents.

$$\text{SL}\left( {\text{Q},\text{C}} \right) = \mathop \sum \limits_{{{\text{x}} = 1}}^{\text{n}} \text{m}_{\text{x}} \,*\,\text{CAS}_{\text{x}}\,* \,\text{QwAS}_{x}$$
(8)
$$\text{SW}\left( {\text{Q},\text{C}} \right) = \text{SL}\left( {\text{Q},\text{C}} \right)\,*\,\text{SIDF}\left( {\text{C}} \right)$$
(9)
$${\text{SDist}}\left( {{\text{d}}_{\text{i}} ,{\text{Q}}} \right) = \frac{{\mathop \sum \nolimits_{{{\text{j}} = 1}}^{\text{q}} {\text{SW}}\left( {{\text{Q}},{\text{C}}_{\text{j}} } \right)\,*\,{\text{SW}}\left( {{\text{d}}_{\text{i}} ,{\text{C}}_{\text{j}} } \right)}}{{\sqrt {\mathop \sum \nolimits_{{{\text{j}} = 1}}^{\text{q}} {\text{SW}}\left( {{\text{Q}},{\text{C}}_{\text{j}} } \right)^{2} } \,*\,\sqrt {\mathop \sum \nolimits_{{{\text{j}} = 1}}^{\text{q}} {\text{SW}}\left( {{\text{d}}_{\text{i}} ,{\text{C}}_{\text{j}} } \right)^{2} } }}$$
(10)
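Equations (8)–(10) can be sketched as follows, with toy match tuples and concept-weight vectors standing in for the real SW(Q,C) and SW(d,C) values; Eq. (10) is the cosine of the two concept-weight vectors:

```python
import math

def sl_query(matches):
    """Eq. (8): SL(Q,C) = sum of m_x * CAS_x * QwAS_x over n matches."""
    return sum(m * cas * qwas for m, cas, qwas in matches)

def sdist(query_weights, doc_weights):
    """Eq. (10): cosine similarity of the two concept-weight vectors."""
    dot = sum(q * d for q, d in zip(query_weights, doc_weights))
    nq = math.sqrt(sum(q * q for q in query_weights))
    nd = math.sqrt(sum(d * d for d in doc_weights))
    return dot / (nq * nd) if nq and nd else 0.0

# Two matches: (frequency m_x, concept-side CAS_x, query-side QwAS_x).
sl = sl_query([(2, 1.0, 0.75), (1, 0.5, 1.0)])  # 2*0.75 + 0.5 = 2.0
score = sdist([0.9, 0.1, 0.0], [0.8, 0.2, 0.0])
```

Multiplying the two association strengths penalizes a match twice, once for its distance from the query word and once for its distance from the concept, which is what makes a synonym-to-synonym match outrank a synonym-to-superclass match.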

3 System Architecture

This section describes the architecture of the proposed system and its components. The overall architecture is portrayed in Fig. 6, and each process is presented in detail in the phases of Fig. 7. The generation particulars of the Inflectional, Semantic, and Conceptual vectors are presented more formally by the algorithms in the Appendix.

Fig. 6
figure 6

Overall system architecture

Fig. 7
figure 7

Detailed system sub-processing

4 Experimental Analysis

4.1 Experimental Setup

The knowledge source is a set of documents extracted from a full dump of the Arabic Wikipedia.Footnote 18 We conducted five experiments to test the effectiveness of the features proposed by the current research; they are described in Table 1. The experimental results are divided into three interrelated dimensions: document representation, document retrieval, and document ranking. The objective is to measure the following (Fig. 8):

Table 1 The experiments description and the results summary
Fig. 8
figure 8

Experimental analysis and evaluation

1-The indexing efficiency: when conceptual indexing is adopted, documents are retrieved according to their semantic concepts instead of their lexical terms. This dimension of evaluation aims to measure the efficiency of representing documents according to their central concept(s). As presented before, the weight of an index entryFootnote 19 with regard to a document indicates its capability of distinguishing this document in the space. Therefore, this evaluation is judged by the Document Frequency (df) and Weight (w) averages. The higher the df and w of an index entry, the more relevant the retrieved documents, which increases the relevance accuracy. Besides the relevance accuracy, another dimension is needed for measuring the capability of the retrieval process itself.

2-The retrieval capability: the document retrieval method retrieves not only the documents that are syntactically or even semantically relevant, but also the documents that are conceptually (ontologically) relevant. This evaluation aims to measure the impact of the indexing method on the performance of the retrieval process. It is judged by Precision, Recall, and F-Measure.

However, the retrieved documents are not ranked equally well, and the efficiency of the ranking process affects the overall performance of the search system. Besides measuring the accuracy of the retrieval process, another dimension is therefore needed to measure the accuracy of the document ranking.

3-The accurate ranking: this evaluation aims to measure the accuracy of assigning the weight that exactly represents the association strength of each index entry with a document in the space; the incidence indicator factor therefore directly affects the capability of accurate ranking. This evaluation is semi-automatic: it calculates the average distance between the ranking order that results from each experiment and the ranking order judged by a human expert. The smaller the distance average, the closer the rank to the standard.

4.2 Experiments, Results, and Evaluation

To demonstrate the variations in retrieval capability under each indexing method, we measured their performance in terms of the values of Document Frequency (df), Term Frequency (tf), and Weight (w). Table 1 summarizes the results presented in the subsequent sections.

4.2.1 The Conceptualization Levels

From the AWDS, a term-space (TS) of 360,486 terms is extracted after excluding the Named Entities. The TS is then semantically expanded using the UWN to construct the STS as defined in Definition 2. The two conceptualization levels, presented in Algorithm 3, are then applied to the STS to generate the CS as presented in Definition 3. As a result, the size of the STS is shrunk by 38% to construct the CS version.

This increases the representation power of each item in the space, since the average item weight is increased as shown in Table 1. Note that the weight averages of V3 and V5 are lower than those of V2 and V4, respectively; we explain these results in the discussion of ranking accuracy. However, it is noticeable that the weights of the STS index are higher than those of the inflectional TS index, and the weights of the CS are greater than those of the STS. The important observation is that the obtained results demonstrate the efficiency of the CS in distinguishing documents according to the weights of the corresponding concepts.

4.2.2 The Retrieval Capability

The F-Measure is a very common measurement of the performance of retrieval systems. It is based on the harmonic mean of the Recall and Precision scores, which we used to evaluate the retrieval accuracy of the proposed system. The F-Measure is defined by Eq. 11.

$${\text{F-Measure}} = 2\,*\,\frac{{{\text{Precision}}\,*\,{\text{Recall}}}}{{{\text{Precision}} + {\text{Recall}}}}$$
(11)
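Equation (11) in code, with a guard for the degenerate case where both scores are zero:

```python
def f_measure(precision, recall):
    """Eq. (11): harmonic mean of Precision and Recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f = f_measure(0.8, 0.5)  # harmonic mean, between the two scores
```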

4.2.3 The Ranking Accuracy

The ranking accuracy of the proposed system is evaluated by measuring the correctness of assigning the weight that precisely characterizes the Association Strength of each index entry with the documents in the space. This represents how the incidence indicator factor affects the ability to rank accurately.

The weight averages of experiments V2 and V4 are greater than those of V3 and V5, respectively. Nevertheless, the ranking results show that experiment V3 is better than V2, while V5 is the best. This is due to the extra error ratios caused by using the Sf and CF indicators instead of SLT and SLC. These extra ratios are reduced by using the SLT and SLC indicators in the V3 and V5 experiments, which is directly reflected in the ranking efficiency of these experiments. Still, experiment V4 gives a better ranking order than V3, since it is based on the conceptual indexing CS, although it suffers from the extra error ratio caused by the CF indicator.

The ranking accuracy is calculated using the Distance Average (DA), Eq. 12, between the experimental order and the standard order delivered by a human specialist.

$$\text{DA}\left( \text{V} \right) = \frac{{\mathop \sum \nolimits_{{{\text{i}} = 1 }}^{\text{n}} \left| { \frac{1}{{{\text{SRank}}_{\text{i}} }} - \frac{1}{{{\text{ERank}}_{\text{i}} }}} \right| }}{\text{n}}$$
(12)

Where n is the count of the documents retrieved for the user’s query, SRanki is the standard rank of document i, and ERanki is the experimental rank of document i in experiment V.
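Equation (12) in code, comparing an experimental ranking against the expert’s standard ranking; working on reciprocal ranks makes disagreements near the top of the list count more than disagreements near the bottom:

```python
def distance_average(standard_ranks, experimental_ranks):
    """Eq. (12): mean of |1/SRank_i - 1/ERank_i| over n retrieved docs."""
    n = len(standard_ranks)
    return sum(abs(1 / s - 1 / e)
               for s, e in zip(standard_ranks, experimental_ranks)) / n

# Three retrieved documents; the experiment swaps ranks 2 and 3.
da = distance_average([1, 2, 3], [1, 3, 2])
```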

The closest ranking order is obtained in experiment V5, which confirms its parameters’ capability to rank the retrieved results more accurately.

5 Conclusion and Future Work

This study sheds light on the difficulty of searching the Arabic language semantically, which may be attributed to the sophistication of the Arabic language itself. However, this should not discourage further efforts towards the best possible solutions that enable Arabic language users to benefit from new electronic technologies.

As a step along that long pathway, we proposed an Arabic semantic search system based on the Vector Space Model. The VSM is one of the most common information retrieval models for textual documents due to its ability to represent documents in a computer-interpretable form. However, as it is syntactically indexed, its sensitivity to keywords reduces its retrieval efficiency. To improve its effectiveness, the proposed system extracts a concept-space index, using the Universal WordNet (UWN), to be used as a semantic index of the VSM search system. The proposed system enables a conceptual representation of the document space, which in turn permits the semantic classification of the documents and thus obtains the semantic search benefits. Moreover, we introduced a new incidence indicator that calculates the significance level of a concept in a document instead of the traditional term frequency. Furthermore, we introduced a new formula for calculating the semantic weight of the concept, to be used in determining the semantic distance between two vectors. The system’s experimental results showed an enhancement of the F-Measure value under the conceptual indexing compared to the standard syntactic baseline.

As future work, we plan to address problems such as ambiguity by discriminating meanings contextually. We may also refine the processing of multiword expression expansions, which should improve the results noticeably. Moreover, improving the Arabic knowledge representation in the UWN will help to overcome its limitations that directly affect the search results. Another open research area is to solve the problems of Arabic morphological analysis to prevent the consequent errors in the indexing process and, hence, in the construction of the search dictionary. We may also try to use the Google Translation API with the UWN to find results for terms that have results in languages other than Arabic.