1 Introduction

Recommender systems (RS) are at the forefront of decision support systems within e-commerce applications [1, 2]. Among the best-known examples are the ones used by well-established e-commerce retailers such as Amazon or Netflix. Here, users receive personalized suggestions for items of the product catalog. But both content-based and collaborative filtering systems suffer from certain shortcomings, such as rating sparsity or limited content analysis [1]. To address the problem of data sparsity, the Linked Open Data (LOD) movement gave rise to Linked Data recommender systems (LDRS). These systems tackle drawbacks of traditional approaches by enriching existing recommender systems with information from public data sources. But even though current LDRS show promising results, they do not yet take full advantage of the potential that LOD offers. The paper addresses this research gap through:

  • Identification of central problems of different types of recommender systems (Sect. 2).

  • Overview of current challenges for RS in e-commerce applications and how they can be addressed with the help of Linked Data technologies (Sect. 3).

  • Technical description of the SKOSRec prototype that implements these ideas (Sect. 4).

  • Evaluation of the system’s key components (Sect. 5).

  • Discussion of the main findings and of the practical implications of the approach (Sect. 6).

2 Related Work

2.1 Recommender Systems

Since the first systems were presented in the 1990s, recommender systems have been an established research field. The most common recommendation algorithms apply some form of collaborative filtering, where users are referred to preferred items of like-minded peers [3–5]. In contrast, content-based systems derive recommendations from feature information for items in the user profile [1, 6, 7, 9, 10] and thereby detect similar items. Both content-based and collaborative filtering systems suffer from certain shortcomings. Even though the two paradigms can be combined in hybrid systems [7], many of the problems remain.

One of the main issues on the operational side is the data sparsity problem, where user preferences are rare. It mostly occurs when new users or items are added to the system (cold start problem), but it can also arise when the amount of feedback information is simply not enough to derive meaningful recommendations. Especially in content-based systems, users can receive unfitting recommendations due to incomplete or ambiguous item feature information (limited content analysis) [1].

2.2 Linked Data Recommender Systems

Recently, researchers have started to utilize Linked Data information sources to address the problem of insufficient item feature information. The LOD cloud comprises data on almost any kind of subject and offers general purpose information (e.g. DBpedia) as well as data from special domains [8]. LOD resources are usually identified through URIs. Thus, the LOD cloud provides less ambiguous information than text-based item descriptions [9]. Experiments on historic data showed that LDRS are at least competitive with classical systems and sometimes even outperform them in terms of accuracy [9–12, 21].

But even though LDRS achieve considerable results, they do not yet take full advantage of the LOD cloud. Current approaches require a considerable amount of pre-processing, such as the selection and extraction of item features. Once a set of item features has been selected, the recommendation model is ‘hard wired’ into the system and cannot be adapted to changing user or business demands.

2.3 Query-Based Recommender Systems

Due to the fact that most LDRS and non-LDRS are not capable of customizations, there have been efforts to enable systems to produce query-based recommendations. In the field of non-LDRS this is achieved through enhancing relational databases with recommendation functionalities [13]. For instance, the REQUEST system integrates the personalization process with OLAP-like queries, such that the selection of items/users can be based on certain conditions and aggregations. Thus, recommendation models are adaptable to different requirements at runtime [14, 15]. But as information on user preferences is usually sparse, this information becomes even sparser when only certain items or users are selected. This often leads to unreliable recommendation results [15].

The issue of data sparsity could be addressed through content-based approaches that enhance item feature information with LOD resources. To date, there are only a few systems that consider user preferences in conjunction with Linked Data technologies, such as SPARQL queries [16–19]. With these systems, expressive user preferences can be formulated. But we argue that the potential of LOD is not yet fully exploited for recommendation tasks. Query-based LDRS either follow a fixed workflow of similarity detection and SPARQL graph pattern matching [16, 17] or face long execution times when processing a large number of triple statements [18].

Therefore, the main goal of the paper is the development of a system that flexibly integrates user preferences with SPARQL elements at reasonable computational cost and that provides novel and meaningful recommendations in update-intensive environments, where local databases do not provide sufficient data.

3 LOD for Flexible Recommendations in e-Commerce

The following aspects give an overview of current challenges of RS applications in the e-commerce sector and describe how they can be addressed with the help of Linked Open Data:

  • Comprehensiveness: As stated in the previous sections, e-commerce RS have to deal with the issues of data sparsity and low profile overlap among users. Especially small sites do not have a customer base that is big enough to provide enough ratings [2]. That is why the integration of Linked Data sources into recommender systems could help to overcome existing limitations on the data side. The LOD cloud comprises billions of triple statements ranging from general purpose data to information sources from domains, such as media or geography. These datasets can be of value in multimedia retailing or online travel agencies [8].

  • Adaptability: RS of e-commerce sites usually do not offer functionalities where customers can restrict recommendation results to specific criteria. But to achieve deeply personalized results, it would be desirable to apply pre- or postfiltering on the product catalog [2, 13]. For instance, think of a customer of a media streaming site who, in spite of his/her purchasing history, wants to provide the information that he/she is strongly interested in European movies. Moreover, in areas like tourism, user preferences depend on many factors, such as context, travel companions or travel destination preferences [20]. Not only consumers could profit from customization functionalities, but also marketing professionals and administrators [2, 15]. For instance, marketing campaigns could be tailored to special holiday occasions of the year to promote long-tail items. These aspects require data-rich applications that can be accessed with expressive queries.

  • On-the-fly recommendations: Current RS usually rely on pre-computed recommendation results. But the aspect of adaptability is strongly tied to the aspect of just-in-time recommendations. As customer and business requirements cannot be foreseen, recommendation models should be configurable at runtime, such that a user can select the right data when it is actually needed [21]. To enable flexible recommendation results from Linked Open Data repositories, efficient strategies for processing large numbers of triple statements have to be identified.

4 The SKOS Recommender

In this section we present SKOSRec, a system prototype that addresses the previously identified challenges of RS in e-commerce applications.

4.1 Scalable On-the-Fly Recommendations

Most LDRS identify similar items through their features. But considering the large amount of information, using all known features of a resource leads to poor scalability and long processing times. Thus, in the context of LDRS it was proposed to select certain item features (properties) for the recommendation model [9, 10, 22–24]. But due to the large amount of information in the LOD cloud, the selection process can be error-prone and time-consuming. We therefore propose to perform similarity computation on URI annotations that are part of commonly used vocabularies, such as the Simple Knowledge Organization System (SKOS). SKOS vocabularies have become a de facto standard for subject annotations, since a majority of Linked Data sets are annotated with SKOS concepts. We implemented a system called SKOS Recommender that uses its own SPARQL-like query language (SKOSRec) to produce flexible on-the-fly recommendations. For identifying similar items the system relies on SKOS concepts, but it can be extended to other URI resources from the LOD cloud. The system uses Apache Jena and can be applied to local as well as remote SPARQL endpoints. The following list summarizes the general workflow of the SKOSRec system.

  1. Parsing: Parse the SKOSRec query.

  2. Compiling: Decompose the query into the preferred input resource r (e.g. a movie) and a SPARQL graph pattern P.

  3. Resource retrieval: Retrieve relevant resources from SKOS annotations of r in conjunction with P.

  4. Similar resources: Score and rank the resources according to their conditional similarity with the input resource r.

  5. Recommendation: Output the final recommendation results.

In the following, we will now rigorously define keywords in italics by using the notation for SPARQL semantics that was introduced by [25].

Definition 1

(SKOS annotations). Let AG be the annotation graph of an RDF dataset D (\(AG \subset D\)), where resources are directly linked to concepts c of a SKOS system via a predefined property (e.g. dct:subject). All nodes of the AG are IRIs and the annotations of an input resource r are defined as follows:

$$\begin{aligned} annot(r) = \{ c \in AG \; | \; \exists \, \langle r, subject, c \rangle \} \end{aligned} \tag{1}$$

Upon retrieval of input resource annotations, similarity calculation does not have to be performed on the whole item space.

Definition 2

(Relevant resources). The mapping \(\varOmega \) of relevant resources and their annotations is obtained by retrieving all resources \(P_r\) that share at least one SKOS concept with resource r. In addition to that, relevant resources are joined with a SPARQL graph pattern P, so that resources are excluded when certain user requirements are defined.

$$\begin{aligned} P_r = (r,subject,?c) \; AND \; (?x,subject,?c) \end{aligned} \tag{2}$$
$$\begin{aligned} \varOmega = \lbrace \mu (?x) | \mu \in \llbracket P \rrbracket _{D} \rbrace \bowtie \llbracket P_r \rrbracket _{AG} \end{aligned} \tag{3}$$
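
As a rough illustration of Definitions 1 and 2, the following Python sketch evaluates annot(r) and the mapping \(\varOmega \) over a tiny in-memory set of triples. The triples, the resource names and the helper names (`annot`, `relevant_resources`) are hypothetical. The graph pattern P is represented simply by the set of resources that match it, which is an assumption of this sketch; a real deployment would issue the corresponding SPARQL queries against an endpoint instead.

```python
# Toy annotation graph AG: (resource, "subject", concept) triples.
# All identifiers below are made up for illustration.
AG = {
    ("m:Alien", "subject", "c:SciFi"),
    ("m:Alien", "subject", "c:Horror"),
    ("m:Blade_Runner", "subject", "c:SciFi"),
    ("m:Blade_Runner", "subject", "c:Noir"),
    ("m:The_Thing", "subject", "c:SciFi"),
    ("m:The_Thing", "subject", "c:Horror"),
    ("m:Casablanca", "subject", "c:Noir"),
}


def annot(r, graph=AG):
    """Eq. 1: all concepts c for which a triple <r, subject, c> exists."""
    return {c for (s, p, c) in graph if s == r and p == "subject"}


def relevant_resources(r, matches_p, graph=AG):
    """Eqs. 2 and 3: resources sharing a concept with r, joined with P.

    `matches_p` stands in for the bindings of ?x produced by the graph
    pattern P. Eq. 2 does not exclude r itself, so neither does this
    function; here P is assumed to take care of that.
    Returns a dict {resource: set of concepts shared with r}.
    """
    concepts_of_r = annot(r, graph)
    shared = {}
    for (x, p, c) in graph:
        if p == "subject" and c in concepts_of_r and x in matches_p:
            shared.setdefault(x, set()).add(c)
    return shared


# Restrict the item space to two (hypothetical) resources matching P.
omega = relevant_resources("m:Alien", {"m:Blade_Runner", "m:The_Thing"})
```

In this toy setting, `omega` maps `m:Blade_Runner` to its single shared concept and `m:The_Thing` to both concepts of the input resource, while `m:Casablanca` is never considered because it shares no concept with `m:Alien`.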

After querying all relevant resources and their annotations, similarity values can be calculated. They are based on the Information Content (IC) of the shared features of two resources. This idea was introduced by [23], but is expanded to the case when the item space is restricted to match a user defined graph pattern.

Definition 3

(Conditional similarity). Let annot(r) be the set of SKOS features of resource r and annot(q) the set of SKOS features of resource q with \(q \in \lbrace \mu (?x) | \mu \in \varOmega \rbrace \). Then their similarity can be derived from the IC of their shared concepts \(C = annot(r) \cap annot(q)\):

$$\begin{aligned} sim(r,q) = IC_{cond}(C) \end{aligned} \tag{4}$$

Definition 4

(Conditional Information Content). The IC of a set of SKOS annotations is defined through the sum of the IC of each concept \(c \in \lbrace \mu (?c) | \mu \in \varOmega \rbrace \), where freq(c) is the frequency of c among all relevant resources and n is the maximum frequency among these resources.

$$\begin{aligned} IC_{cond}(C) = - \sum _{c \in C} \log \left( \frac{freq(c)}{n}\right) \end{aligned} \tag{5}$$
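
Definitions 3 and 4 can be made concrete with a small worked example. The sketch below is not the SKOSRec implementation; it assumes that the mapping \(\varOmega \) is already available as a Python dictionary from relevant resources to the concepts they share with the input resource, and all resource and concept names are hypothetical.

```python
import math
from collections import Counter


def conditional_ic(concepts, omega):
    """Eq. 5: sum of -log(freq(c) / n) over the given concepts.

    `omega` maps each relevant resource to the concepts it shares with
    the input resource (an assumed in-memory form of Definition 2).
    """
    freq = Counter(c for shared in omega.values() for c in shared)
    n = max(freq.values())  # maximum concept frequency among relevant resources
    return -sum(math.log(freq[c] / n) for c in concepts)


def rank(omega):
    """Eq. 4: score every q by IC_cond of its shared concepts, best first."""
    scores = {q: conditional_ic(shared, omega) for q, shared in omega.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical mapping for an input movie r annotated with SciFi and Horror.
omega = {
    "m:Blade_Runner": {"c:SciFi"},
    "m:The_Thing": {"c:SciFi", "c:Horror"},
    "m:Solaris": {"c:SciFi"},
    "m:Halloween": {"c:Horror"},
}
ranking = rank(omega)
```

Here `c:SciFi` occurs three times and `c:Horror` twice, so n = 3. Under Eq. 5 the most frequent concept contributes an IC of zero, and only the rarer `c:Horror` annotation adds to the score, namely \(-\log (2/3) \approx 0.405\). Restricting the item space through P changes these frequencies, which is why the similarity is called conditional.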

The retrieval of relevant resources and concept annotations can lead to long processing times, especially in cases when concepts are frequently used in a dataset. Hence, the number of records from SPARQL endpoints should be reduced. By knowing the length of the top-n recommendation list, it can be calculated which resources can be omitted without influencing the final ranking. This is the case when the maximum potential score for a certain number of shared features is smaller than the minimum potential score for a higher number of shared features. By this means, it is determined how many annotations have to be shared at least with an input resource (cut value) (see Eqs. 6 and 7).

$$\begin{aligned} \varOmega _{cut} = \lbrace \mu _{cut}(?x) | \mu _{cut} \in F_{count(?c)}(\varOmega ) > cut \rbrace \end{aligned} \tag{6}$$
$$\begin{aligned} \varOmega _{reduced} = \varOmega \bowtie \varOmega _{cut} \end{aligned} \tag{7}$$
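
The pruning step can be sketched as follows. This is one possible reading of the description above, not the prototype's actual procedure: it assumes that the IC value of each concept in annot(r) and the number of shared concepts per resource are already known (for example from inexpensive count queries), and the function names and numbers are hypothetical.

```python
def max_score(ics, j):
    """Largest possible score of a resource sharing exactly j concepts."""
    return sum(sorted(ics, reverse=True)[:j])


def min_score(ics, m):
    """Smallest possible score of a resource sharing exactly m concepts."""
    return sum(sorted(ics)[:m])


def choose_cut(ics, shared_counts, top_n):
    """Largest cut such that resources sharing <= cut concepts cannot
    enter the top-n list.

    A cut is accepted if some higher count m exists whose minimum score
    beats the maximum score for `cut` shared concepts, and at least
    top_n resources share m or more concepts.
    """
    best = 0  # 0 means: nothing can be pruned safely
    for cut in range(1, len(ics)):
        for m in range(cut + 1, len(ics) + 1):
            enough = sum(1 for k in shared_counts.values() if k >= m) >= top_n
            if enough and min_score(ics, m) > max_score(ics, cut):
                best = cut
                break
    return best


def reduce_omega(omega, cut):
    """Eqs. 6 and 7: keep only resources sharing more than `cut` concepts."""
    return {x: shared for x, shared in omega.items() if len(shared) > cut}


# Hypothetical IC values of the three concepts annotating r, and the
# number of those concepts that each candidate resource shares.
ics = [0.2, 0.3, 0.4]
shared_counts = {"a": 3, "b": 2, "c": 2, "d": 1, "e": 1}
cut = choose_cut(ics, shared_counts, top_n=2)
```

With these numbers a resource sharing one concept scores at most 0.4, whereas any resource sharing two concepts scores at least 0.5, and three such resources exist. For a top-2 list, resources `d` and `e` can therefore be dropped before their annotations are retrieved.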

4.2 Expressive SPARQL Integration

In the course of this paper, we are only able to give a short overview of the SKOSRec query language. Central to the idea of customizable on-the-fly recommendations is that both item similarity computation and querying of LOD resources can be flexibly integrated in a single query language. Even though there already exist some language extensions that combine SPARQL with imprecise parts [16, 17], they do not take full advantage of the expressiveness of the RDF data model. Hence, we propose the SKOSRec query language that extends elements of the SPARQL 1.1 syntax (see underlined clauses in Listing 1) [26]. It enables flexible and powerful combinations of graph pattern matching and subquerying with recommendation results. The ‘RECOMMEND’ operator triggers the process of similarity calculation based on the input resource and potential user-defined graph patterns, whereas the ‘AGG’ construct ensures that certain resources are excluded from the result set.

[Listing 1: SKOSRec query syntax]

The central contributions of the new language are summarized below.

  • Recommendations for an input profile: Whereas recommendations can be generated from both user and item data [16], we argue that the integration of local customer information and LOD resources is not feasible. An e-commerce retailer might avoid such a solution because of privacy concerns and additional costs and would rather prefer to obtain recommendations from outsourced repositories.

  • Graph pattern matching for preference information: The SKOSRec language allows expression of preferences for variables that are contained in graph patterns. Thus, users can formulate vague preferences.

  • Subquerying with recommendation results: In some areas it might be helpful to reuse recommendation results as a SPARQL-like subquery. Thus, triple stores can be powerfully navigated.

5 Experiments

We conducted several experiments to evaluate the viability of our approach. The goal of the evaluation was to find out whether it is possible to get meaningful recommendation results with highly expressive queries from existing LOD repositories at reasonable computational cost. For this purpose, we issued SKOSRec queries from different target domains (movies, music, books and travel) (Table 1) to a local Virtuoso server containing the DBpedia 3.9 dataset.

Table 1. Overview of the target domains

5.1 Scalable On-the-Fly Recommendations

The effectiveness of the optimization approach presented in Sect. 4.1 (Eqs. 6 and 7) was examined on four different datasets, each comprising 100 randomly selected DBpedia resources from the target domains. The performance test was carried out on an Intel Core i5 2500, clocked at 3.30 GHz with 8 GB of RAM. Evaluation results showed that, even though our approach imposes overhead that slightly increases computational cost for smaller datasets (e.g. Fig. 4), it considerably reduces processing times as the number of triple statements grows in bigger datasets (Table 2, Figs. 1, 2 and 3).

Fig. 1. Movie domain

Fig. 2. Book domain

Fig. 3. Music domain

Fig. 4. Travel domain

Table 2. Results of the performance test

5.2 Expressive SPARQL Integration

As former research on LDRS has already shown that the application of SKOS annotations leads to good precision and recall values in comparison to standard RS algorithms [9, 10], we followed an exploratory approach. We investigated whether it is possible to formulate highly expressive SKOSRec queries that produce meaningful recommendation results from the DBpedia dataset. We issued advanced queries to showcase the viability of our language in several usage scenarios of the target domains.

Conditional Recommendations. This query template generates highly individual or business relevant recommendations. In the example depicted below, a marketer wants to obtain query results that are personalized and promote Christmas movies at the same time.

figure b

Aggregation-Based Recommendations: Roll Up. When user preferences can be derived from their sublevel entities, the roll-up template might improve recommendations. Think of a travel agency intending to suggest city trips. Two customers would receive similar trip recommendations once they have been to the same cities even though they have visited different points of interest (POI). The example shows that it can be reasonable to instantiate the process of similarity calculation on sublevel entities to better fit recommendations to customer needs.

figure c

Aggregation-Based Recommendations: Drill Down. Sometimes customers find it hard to concretize their preferences. They might have a vague understanding of what they like, but could not tell why. In this example a user knows that somehow he/she likes movies directed by Quentin Tarantino. A drill-down query would find the most similar films to those that were directed by him and aggregate the results, such that related directors and their movies would be recommended.

figure d

Cross-Domain Recommendations. Even though standard collaborative filtering algorithms sometimes generate recommendations that are from a different domain than the items in a user profile, marketers cannot directly control the outputs. In contrast, the SKOSRec language enables explicit cross-domain querying. The following example shows that suggestions for novels (e.g. Beat novels) can be obtained by examining the user preference for a music group (e.g. The Beatles).

figure e

6 Conclusion

This paper demonstrated how Linked Open Data technologies can be utilized for highly flexible on-the-fly recommendations in e-commerce applications. Former LDRS calculated user preference predictions offline and thus prevented customizations and frequent updates of data sources as well as recommendation models. Although there have been efforts to enable user restrictions at runtime in query-based recommender systems, most of them either do not scale to large item spaces or do not handle data sparsity issues that arise when restricting the set of potential products to certain criteria. Moreover, existing query-based LDRS do not yet take full advantage of the expressiveness that RDF and SPARQL graph patterns offer.

The SKOSRec prototype addresses this research gap by offering a powerful combination of similar resource retrieval and SPARQL graph pattern matchings from Linked Open Data repositories. Thus, individual and/or business preferences can be flexibly integrated. With the SKOSRec query language at hand, e-commerce retailers could define various recommendation workflows that can be adapted to specific usage contexts. For instance, the marketing department could use campaign templates and end users could enter their preferences through a user interface.