
1 Introduction

The growth of data published on the web in fields such as e-learning, social networks, and e-commerce shows no sign of slowing down, according to recent studies [1, 2]. This expanding data universe makes it difficult to benefit from web content. Furthermore, predicting user responses to options for recommendation purposes has become a major challenge for a wide class of web applications. Recommending a resource is usually achieved through information filtering. There are two major approaches to information filtering [1]: collaborative filtering and content-based filtering. A Collaborative Filtering (CF) system chooses items based on the correlation between people with similar preferences, while a Content-Based Filtering (CBF) system selects items based on the correlation between the content of the items and the user’s preferences.

Despite the demonstrated effectiveness of CBF in many cases, some drawbacks make it inappropriate in its current form for others. Indeed, CBF requires analyzing the content of a document, which is computationally expensive and even impossible for multimedia items that do not contain descriptive text [3]. Furthermore, CBF has difficulty handling the new user problem: at the beginning, a new user has no preference values, so it is very hard to issue any recommendation to him.

In this paper, we propose to address these issues by enhancing CBF systems with semantics derived from the Web of Data. In the Web of Data, the World Wide Web is viewed as a global database built by creating links between data, known as Linked Data. Linked data that describe people are commonly expressed with the FOAF (Friend of a Friend) vocabulary. The proposed approach is based on the Vector Space Model of CBF [4], enhanced by a semantic level extracted from the Web of Data, leading to a new model that we refer to as the Semantic Vector Space Model (SVSM).

Following this introduction, CBF based on the Vector Space Model is described in Sect. 2. Section 3 presents key features of the Web of Data. Section 4 reviews related work on recommender systems in the Web of Data context. In Sect. 5, we describe the proposed approach, SCBF, and Sect. 6 reports on the experimental study and the results obtained. Finally, Sect. 7 gives the conclusion and future work.

2 Content Based Filtering (CBF)

Information filtering deals with the delivery of information that is interesting and useful to a user given his profile and preferences. An information filtering system assists users by filtering the data source and delivering relevant information to them. When the delivered information comes in the form of suggestions, such an information filtering system is called a recommender system. A CBF technique, also referred to as cognitive filtering [1], recommends items based on a correlation between the content of the items and a user profile. The content of each item is represented as a set of descriptors or terms, classically the words that occur in a document. The user profile is represented by a set of terms built up by analyzing the content of items seen by the user. Typically, a content-based filtering system selects relevant items based on the correlation between the content of the items and the user’s preferences.

One of the most important approaches is the Vector Space Model (VSM), or term vector model [5]. In the vector space model, a document D (item) is represented as an m-dimensional vector, where each dimension corresponds to a distinct term [6]. The term frequency (tf) is a numerical statistic that measures the importance of a term with regard to a document in a collection or corpus:

$$ {\text{tf}}_{\text{vi}} = \frac{{{\text{n}}_{\text{vi}} }}{\text{N}} $$
(1)

where \( n_{vi} \) is the number of times term \( t_{i} \) appears in the vector v, which models the taste of the user, and N is the total number of terms in the vector v.

To measure the extent to which documents contain a given term \( t_{i} \), we calculate the inverse document frequency (idf):

$$ {\text{idf}}_{\text{i}} = { \log }\left( {\frac{\text{D}}{{{\text{n}}_{\text{j}} }}} \right) $$
(2)

where \( D \) is the total number of documents and \( n_{j} \) is the number of documents \( d_{j} \) containing term \( t_{i} \).

From tf and idf we can calculate the weight (W), or tf-idf, which can be used to create a profile of an item such as a document or an object:

$$ {\text{W}}_{\text{i}} = {\text{tf}}_{\text{vi}} \, *\,{\text{idf}}_{\text{i}} $$
(3)
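Equations (1) to (3) can be illustrated with a minimal Python sketch. The toy corpus, the function names, and the choice of the natural logarithm are assumptions made for illustration only; the text above does not fix the base of the logarithm.

```python
import math
from collections import Counter


def term_frequency(terms):
    """Eq. (1): tf_vi = n_vi / N for every term of the vector v."""
    total = len(terms)
    counts = Counter(terms)
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequency(documents):
    """Eq. (2): idf_i = log(D / n_j), where n_j counts the documents containing t_i."""
    num_docs = len(documents)
    # set(doc) ensures each document is counted at most once per term
    doc_counts = Counter(term for doc in documents for term in set(doc))
    return {term: math.log(num_docs / n) for term, n in doc_counts.items()}


def tfidf_weights(terms, idf):
    """Eq. (3): W_i = tf_vi * idf_i."""
    tf = term_frequency(terms)
    return {term: tf[term] * idf.get(term, 0.0) for term in tf}


# Hypothetical toy corpus: each document is a list of terms
documents = [
    ["action", "science", "fiction", "action"],
    ["romance", "drama"],
    ["action", "drama"],
]
idf = inverse_document_frequency(documents)
weights = tfidf_weights(documents[0], idf)
```

In this toy corpus, a term that occurs in only one document (such as "science") receives a larger idf than a term shared by several documents (such as "action"), which is the behavior Eq. (2) is meant to capture.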

A content-based filtering system selects relevant items based on the correlation between the content of the items and the user’s preferences [3]. However, this technique also has disadvantages: it requires analyzing the content of the document, which is expensive and even impossible for multimedia items [7], and it suffers from the new user (no preferences) problem. At the beginning, a new user does not have any preference values, which makes it impossible to give him any recommendation. To address these problems, we propose to enhance CBF using the Web of Data.

3 Web of Data

Typically, a data set published on the web contains knowledge about a particular domain, such as books, music, encyclopedic data, or companies, to name just a few. When these data sets are interconnected, i.e. linked to each other, the World Wide Web becomes a global database, termed the Web of Data by Tim Berners-Lee.

The most important concepts related to the Web of Data are Linked Open Data (LOD), the Friend of a Friend (FOAF) vocabulary, and the Resource Description Framework (RDF).

3.1 Linked Open Data

The term Linked Open Data refers to a set of best practices for publishing and connecting structured data on the web using international standards of the World Wide Web Consortium. The LOD cloud is considered a network or collection of data silos.

The diagram in Fig. 1 is maintained by Richard Cyganiak and Anja Jentzsch (http://lod-cloud.net/).

Fig. 1. Linked data cloud diagram [8].

The core of this diagram is DBpedia, a community effort to extract structured information from Wikipedia and to make this information available on the Web.

3.2 FOAF Vocabulary

The FOAF project began as an “experimental linked information project.” Dan Brickley and Libby Miller are responsible for its inception, and Edd Dumbill and Leigh Dodds (http://www.foaf-project.org) notably contributed to its success. FOAF makes it possible to describe people, their interests, their achievements, their activities, and their relationships with other people [8]. Table 1 presents the FOAF classes and properties.

Table 1. Classes and properties of the FOAF vocabulary.

3.3 Resource Description Framework (RDF)

The Resource Description Framework (RDF) provides a common data model for Linked Data [8] and is particularly suited for representing data on the Web. Linked Data uses RDF as its data model and represents it in one of several syntaxes. There is also a standard query language, SPARQL. A single RDF statement describes two things and a relationship between them. Technically, this is called an Entity-Attribute-Value (EAV) data model.
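An RDF statement of this kind is usually written as a (subject, predicate, object) triple. The following minimal Python sketch represents triples as plain tuples and filters them by pattern, loosely mimicking what a SPARQL basic pattern does. The `ex:` identifiers, the data, and the `match` helper are hypothetical and serve only as an illustration; they are not part of any RDF library.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str    # the thing being described
    predicate: str  # the relationship or attribute
    obj: str        # the related thing or the value


# Hypothetical statements about two people, using FOAF-style predicates
triples = [
    Triple("ex:alice", "foaf:knows", "ex:bob"),
    Triple("ex:alice", "foaf:age", "30"),
    Triple("ex:bob", "foaf:interest", "ex:science_fiction"),
]


def match(
    data: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples matching the pattern; None acts as a wildcard."""
    return [
        t
        for t in data
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# All statements whose subject is ex:alice
alice_facts = match(triples, subject="ex:alice")
```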

4 Related Work

Few recommender systems based on the Web of Data have been developed to date. Table 2 reviews some recent approaches, gives a short description of each method, and indicates the Web of Data concepts used.

Table 2. Recent recommender systems enhanced by web of data.

From Table 2, we can observe that most of the proposed approaches are dedicated to a specific domain, for example movies or music, and use either the FOAF vocabulary or the linked data cloud. Almost half of these methods are based on collaborative filtering.

Our work is motivated by the observation that combining the FOAF vocabulary with the linked data cloud has the potential to further improve the ability of CBF to produce suitable recommendations. The FOAF vocabulary helps in solving the new user problem, and the linked data extracted from the cloud provide a semantic description of non-structured items.

5 Proposed Semantic Content Based Filtering (SCBF)

CBF selects items based on the correlation between the content of the items and the user’s preferences. As mentioned above, the problem with CBF is that it requires analyzing the content of the items, which is expensive or impossible for multimedia items. To solve this issue along with the new user problem, we describe in this section how Web of Data technologies can be used to enhance CBF systems. We refer to the proposed Web of Data based variant of CBF as Semantic Content Based Filtering (SCBF). In SCBF, we suggest integrating the following technologies:

  • FOAF Vocabulary: when a new user connects, his FOAF description is compared with the other users’ FOAF descriptions. The comparison is based on the proposed formula:

    $$ {\text{D}}_{\text{FOAF}} \left( {{\text{u}},{\text{v}}} \right) = 1 + { \log }\left( {\frac{{1 + {\text{K}}}}{\text{P}}} \right) $$
    (4)

    where \( D_{FOAF} \left( {u,v} \right) \) is the FOAF distance between users u and v, \( K = L + S \), S is the number of similar FOAF properties between users u and v, L is the number of links between u and v, and P is the total number of FOAF properties describing the target user u.

    The following are some important properties of the class Person [8]:

    • Based near - A location that something is based near, for some broadly human notion of near (The based near relationship relates two “spatial things”).

    • Age - The age in years of some person.

    • Gender - The gender of this person (typically but not necessarily ‘male’ or ‘female’).

    • Title - Title (Mr, Mrs, Ms, Dr. etc.).

    • Knows - A person known by this person (indicating some level of reciprocated interaction between the parties).

    • Maker - An agent that made this thing.

    • Member - Indicates a member of a Group.

    • Interest - A page about a topic of interest to this person.

    • Topic_interest - A thing of interest to this person.

  • Linked Data Cloud

    The vector space model is a representation often used for text items. In this model, an item i is represented as an m-dimensional vector, where each dimension corresponds to a distinct term. However, this technique is too limited for unstructured and even semi-structured items.

    To fix this problem, we propose in SCBF to extend the m-dimensional vector with n other textual or semantic attributes extracted from the linked data cloud. The representation of the item therefore includes (m + n) attributes and is expressed as an (m + n)-dimensional vector that we refer to as the Semantic Vector Space Model (SVSM). The example below illustrates the proposed SVSM.

In the Movielens dataset, the movie “No escape” is represented by the following textual attributes:

| Id | Title | Release date | Genre |
| --- | --- | --- | --- |
| 1416 | No escape | 1994-01-01 | Action, science fiction |

For the same movie, we can extract further information from DBpedia, such as that given in Table 3.

Table 3. Textual and semantic attributes describing the movie “no escape”.

To weight these semantic attributes, we propose a new version of tf, denoted by \( \widetilde{tf} \) and defined as follows:

$$ \widetilde{\text{tf}} \left( {{\text{v}},{\text{i}}} \right) = \frac{{{\text{NS}}_{\text{vi}} }}{\text{T}} $$
(5)

where \( {\text{NS}}_{\text{vi}} \) is the number of times triplet \( t_{i} \) appears in the semantic segment of the vector v and T is the total number of triplets in the semantic segment of the vector v. The corresponding inverse document frequency is:

$$ \widetilde{\text{idf}}_{\text{i}} = { \log }\left( {\frac{\text{Tt}}{{{\text{n}}_{\text{j}} }}} \right) $$
(6)

where \( Tt \) is the total number of triplets and \( n_{j} \) is the number of documents \( d_{j} \) such that triplet \( t_{i} \in d_{j} \).

Therefore, the semantic weight is given by:

$$ \widetilde{W}_{i} = \widetilde{\text{tf}} \left( {{\text{v}},{\text{i}}} \right) \, *\,\widetilde{\text{idf}}_{\text{i}} $$
(7)

The global weight Wg of the item is then defined as follows:

$$ {\text{Wg}}_{\text{i}} = {\text{W}}_{\text{i}} \, *\,\widetilde{W}_{i} $$
(8)
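The semantic weighting of Eqs. (5) to (8) can be sketched in Python as follows. Several points are assumptions made for illustration: a triplet is reduced to a hypothetical (predicate, object) pair, Tt is read as the total number of triplets over all items, the logarithm is natural, and the attribute values are invented placeholders rather than data taken from DBpedia. Because the text does not specify how a textual weight is paired with a semantic weight, the global weight is computed from two values supplied by the caller.

```python
import math
from collections import Counter


def semantic_tf(triplets):
    """Eq. (5): NS_vi / T over the semantic segment of one item vector."""
    total = len(triplets)
    return {t: count / total for t, count in Counter(triplets).items()}


def semantic_idf(items):
    """Eq. (6): log(Tt / n_j), with Tt read as the triplet count over all items."""
    total_triplets = sum(len(item) for item in items)
    # n_j: number of items whose semantic segment contains the triplet
    containing = Counter(t for item in items for t in set(item))
    return {t: math.log(total_triplets / n) for t, n in containing.items()}


def semantic_weights(triplets, idf):
    """Eq. (7): semantic weight = semantic tf * semantic idf."""
    tf = semantic_tf(triplets)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}


def global_weight(textual_weight, semantic_weight):
    """Eq. (8): Wg_i = W_i * semantic W_i."""
    return textual_weight * semantic_weight


# Hypothetical semantic segments of two items (placeholder values)
items = [
    [("director", "Director A"), ("country", "Country X")],
    [("director", "Director B"), ("country", "Country X")],
]
idf = semantic_idf(items)
w_sem = semantic_weights(items[0], idf)
```

In this toy data, the triplet shared by both items (the country) receives a lower semantic weight than the triplet that is specific to the first item (its director).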

Based on the above description, the proposed SCBF approach suggests the recommender system architecture shown in Fig. 2.

Fig. 2. General architecture of SCBF.

In the case of a new user (the feedback is empty), his DFOAF is calculated against the other users; we then recommend the set of items liked by the user who has the maximum DFOAF.

The space of attributes that describe the items is enriched with semantic and textual attributes extracted from the linked data cloud, which gives further descriptions of the items.
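The new user step can be sketched in Python using Eq. (4). The way S and L are counted is not detailed in the text, so the sketch makes two assumptions: S counts the properties that have the same value in both descriptions, and L counts `knows` links between the two users in either direction. The user records, property names, and item identifiers are hypothetical.

```python
import math


def foaf_score(target, other):
    """Eq. (4): D_FOAF(u, v) = 1 + log((1 + K) / P), with K = L + S."""
    properties = target["properties"]
    if not properties:
        raise ValueError("the target user needs at least one FOAF property (P > 0)")
    # S (assumption): properties with the same value in both descriptions
    similar = sum(
        1 for prop, value in properties.items()
        if other["properties"].get(prop) == value
    )
    # L (assumption): 'knows' links between u and v, counted in both directions
    links = int(other["id"] in target["knows"]) + int(target["id"] in other["knows"])
    k = links + similar
    return 1 + math.log((1 + k) / len(properties))


def recommend_for_new_user(target, others, liked_items):
    """Recommend the items liked by the user with the maximum D_FOAF."""
    best = max(others, key=lambda v: foaf_score(target, v))
    return liked_items.get(best["id"], [])


# Hypothetical FOAF descriptions
new_user = {
    "id": "u0",
    "properties": {"title": "Dr", "based_near": "Paris", "interest": "cinema"},
    "knows": {"u2"},
}
others = [
    {"id": "u1", "properties": {"title": "Mr", "based_near": "Paris"}, "knows": set()},
    {"id": "u2", "properties": {"title": "Dr", "interest": "cinema"}, "knows": {"u0"}},
]
liked_items = {"u1": ["m10"], "u2": ["m20", "m30"]}

recommended = recommend_for_new_user(new_user, others, liked_items)
```

Here user u2 shares two properties with the new user and is linked to him in both directions, so u2 obtains the larger score and the items liked by u2 are recommended.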

The proposed SCBF engine can be outlined by the following algorithm.

[Figure a: outline of the SCBF algorithm]

6 Experiments

In the Movielens dataset, all movies are characterized by the following attributes: Id, Title, Release date, and Genre. Using a SPARQL query based on the federation (realized with FedX) [17] between DBpedia and the Linked Movie DataBase (LDMDB), we can extract more information about these movies, such as Director, Country, Actor, and Abstract. The attribute shared between Movielens and the federated query is the movie title. The following SPARQL query extracts this additional information about the Movielens movies:

[Figure b: federated SPARQL query over DBpedia and LDMDB]

To measure the effectiveness of our approach, we calculated the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE) using the following formulas:

$$ {\text{MAE = }}\frac{{\mathop \sum \nolimits_{\text{u,i}} \left| {{\text{p}}_{\text{u,i}} - {\text{n}}_{\text{u,i}} } \right|}}{\text{n}} $$
(9)
$$ {\text{RMSE = }}\sqrt {\frac{1}{\text{n}}\mathop \sum \limits_{\text{u,i}} \left( {{\text{p}}_{\text{u,i}} - {\text{n}}_{\text{u,i}} } \right)^{2} } $$
(10)

where \( n_{u,i} \) is the rating given by user u to item i, \( p_{u,i} \) is the predicted rating, and n is the total number of predicted ratings.
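Both metrics can be computed with a few lines of Python. The sketch below uses the standard definitions, with RMSE taken as the square root of the mean squared difference between predicted and actual ratings. The rating values are invented for illustration and are not results from the experiments.

```python
import math


def mean_absolute_error(predicted, actual):
    """MAE: mean of |p_ui - n_ui| over all predicted ratings."""
    n = len(predicted)
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / n


def root_mean_square_error(predicted, actual):
    """RMSE: square root of the mean of (p_ui - n_ui)^2."""
    n = len(predicted)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


# Hypothetical predicted and actual ratings for four (user, item) pairs
predicted = [4.0, 3.0, 5.0, 2.0]
actual = [5.0, 3.0, 4.0, 4.0]

mae = mean_absolute_error(predicted, actual)
rmse = root_mean_square_error(predicted, actual)
```

Because the differences are squared before averaging, RMSE penalizes large individual errors more strongly than MAE, so RMSE is never smaller than MAE on the same data.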

The MAE and RMSE values of SCBF are compared with those of the state-of-the-art techniques described in [18]. The results are shown in Table 4, where we can observe that the proposed approach yields the lowest error values (Fig. 3).

Table 4. Comparative results.

Fig. 3. Experimental results of MAE and RMSE.

7 Conclusion

In this work, we described a new approach to content-based recommendation using the Web of Data, supported mainly by two technologies: the FOAF vocabulary and the Linked Data cloud. The challenge was to use CBF while reducing the impact of the new user problem and the difficulty of analyzing unstructured items. Promising preliminary results have been obtained. As future work, we plan to test and evaluate the proposed approach with other metrics such as recall and precision, and to apply the new user solution to Collaborative Filtering (CF) algorithms to reduce the impact of cold start issues.