1 Introduction

The use of recommender systems in academia has been on the rise. Students and researchers use them in digital libraries to find relevant theses, articles, studies, datasets and other documents. Because recommendation is an important feature of digital libraries, quite a few recommender systems have been developed for academic use. Systems such as those presented in [1, 3, 5] were developed specifically for academic digital libraries and repositories to help researchers find relevant publications. Such recommender systems can also be found in academic social networks like Mendeley [6]. In Slovenia, research on recommending Slovenian-language documents for academic purposes is very scarce, mainly because a structured dataset of documents was lacking. This has improved since the introduction of the Slovenian Open-Access Infrastructure [4], which provides a large structured dataset of approximately 200,000 documents. As a part of the infrastructure, a hybrid recommender system has been developed with the aim of improving the visibility of research in Slovenia and encouraging researchers from all Slovenian universities to collaborate. This work presents the architecture of the hybrid recommender system included in the Slovenian Open-Access Infrastructure and some observations on the digital libraries that use it.

2 Slovenian Open-Access Infrastructure

The Slovenian Open-Access Infrastructure was established in 2013 and has since provided researchers, students and the public with access to the publications of Slovenian educational and research institutions. The infrastructure consists of a national web portal, institutional repositories for each of the four Slovenian universities, a repository for research institutions and a repository for colleges and higher education institutions. Metadata from other digital archives are also aggregated within the infrastructure. By type, the infrastructure contains diploma, master’s and doctoral theses, journal and conference articles, proceedings, datasets, scientific and technical reports, books, lecture materials and videos of lectures. Because the great majority of publications are in Slovenian, an extensive full-text corpus of the Slovenian language across different research domains was created. It is currently the largest corpus of texts in the Slovenian language [2].

3 Method

We use a cascade approach in our hybrid recommender system (Fig. 1), with content-based filtering acting as the primary recommendation technique and collaborative filtering as its cascading re-ranking method. Documents are represented by their titles, keywords, abstracts, typologies and year of publication. Tf-idf weights form the basis for calculating BM25 similarity values for each document pair, which together form a document similarity index. New documents are added to the system daily and are processed periodically. Finally, the user activity data and the calculated similarities between documents are combined to rank the documents into a list that is presented to the end-user. The ranking process is where the hybridization occurs, applying content-based filtering and collaborative filtering in cascade.
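The pairwise similarity index described above can be illustrated with a minimal sketch. It is not the system's implementation: the whitespace tokenizer, the parameter values `K1` and `B`, the idf variant and the function name `build_similarity_index` are all assumptions made for illustration, and one document's terms are simply used as the query against every other document.

```python
import math
from collections import Counter

# Assumed, commonly used BM25 parameters; the paper does not state its values.
K1, B = 1.5, 0.75


def tokenize(text):
    """Naive whitespace tokenizer (an assumption; no stemming or stop words)."""
    return text.lower().split()


def build_similarity_index(docs):
    """Map (source_id, target_id) -> BM25 score of target for source's terms.

    `docs` maps a document id to its text (e.g. title + keywords + abstract).
    """
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokens)
    avg_len = (sum(len(t) for t in tokens.values()) / n_docs) or 1.0

    # Document frequency and a smoothed, always-positive idf per term.
    df = Counter(term for t in tokens.values() for term in set(t))
    idf = {term: math.log(1 + (n_docs - f + 0.5) / (f + 0.5))
           for term, f in df.items()}

    index = {}
    for src, query in tokens.items():
        for dst, target in tokens.items():
            if src == dst:
                continue
            tf = Counter(target)
            norm = K1 * (1 - B + B * len(target) / avg_len)
            index[(src, dst)] = sum(
                idf[term] * tf[term] * (K1 + 1) / (tf[term] + norm)
                for term in set(query) if term in tf
            )
    return index
```

Note that BM25 computed this way is not symmetric, so the sketch stores a score for each ordered pair; documents that share no terms receive a score of zero.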

Fig. 1. Architectural diagram of the hybrid recommender system.

In our content-based filtering method, we use a collection of metadata that describes each document: title, keywords, abstract, document typology [8], issue year, authors, repository and language. The method combines two scores to return an initial ranking of the documents. A BM25 score, used as a relevance measure between two documents \(d_A\) and \(d_B\), is multiplied by a Jaro-Winkler [7] distance score \(d_{jw}\) computed on their typologies \(t_{d_A}\) and \(t_{d_B}\), which acts as a document typology similarity (Eq. 1).

$$\begin{aligned} Score_{CBF}=BM25(d_A, d_B)\cdot d_{jw}(t_{d_A}, t_{d_B}) \end{aligned}$$
(1)
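As a hedged illustration of Eq. 1, the sketch below implements the standard Jaro-Winkler similarity in plain Python and multiplies it by a precomputed BM25 value. It assumes the Jaro-Winkler term is used in its similarity form (1.0 for identical strings), which matches its role as a typology similarity; the function names and the typology strings in the usage note are illustrative, not taken from the system.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a common prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def score_cbf(bm25_value, typology_a, typology_b):
    """Eq. 1: BM25 relevance scaled by typology similarity."""
    return bm25_value * jaro_winkler(typology_a, typology_b)
```

For example, `score_cbf(2.0, "2.11", "2.11")` leaves the BM25 value unchanged, while a pair with typology strings such as `"2.11"` and `"2.09"` is scaled down, so documents of the same or a closely related type rank higher.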

In our collaborative filtering method, we use the user activity \(a_d\) recorded for a document. Actions include views and downloads; their counts are stored for each document and updated regularly as users use the digital libraries. A feedback value \(f(a_d)\) is calculated as the sum of the values of all actions on the document. A similar feedback value \(f(a_r)\) is calculated as the sum of the values of all actions on each clicked recommended document. The final score for this method is the sum of the feedback values \(f(a_d)\) and \(f(a_r)\), each multiplied by its respective download-to-view ratio, \(h_d = downloads(d)/views(d)\) and \(h_r = downloads(d_r)/views(d_r)\), as shown in Eq. 2.

$$\begin{aligned} Score_{CF}=f(a_d) \cdot \frac{downloads(d)}{views(d)} + f(a_r) \cdot \frac{downloads(d_r)}{views(d_r)} \end{aligned}$$
(2)
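Eq. 2 can be worked through with a small sketch. The action values in `ACTION_VALUES`, the `Activity` container and the function names are hypothetical, since the paper does not give the value assigned to each action; the second argument stands for the activity recorded on the clicked recommended document \(d_r\).

```python
from dataclasses import dataclass

# Hypothetical value per action type; the paper does not specify these.
ACTION_VALUES = {"view": 1.0, "download": 2.0}


@dataclass
class Activity:
    """Stored action counts for one document."""
    views: int = 0
    downloads: int = 0


def feedback(activity):
    """f(a): sum of the values of all recorded actions."""
    return (ACTION_VALUES["view"] * activity.views
            + ACTION_VALUES["download"] * activity.downloads)


def download_view_ratio(activity):
    """h: downloads per view, defined as 0 when there are no views."""
    return activity.downloads / activity.views if activity.views else 0.0


def score_cf(doc_activity, rec_activity):
    """Eq. 2: f(a_d) * h_d + f(a_r) * h_r."""
    return (feedback(doc_activity) * download_view_ratio(doc_activity)
            + feedback(rec_activity) * download_view_ratio(rec_activity))
```

With these assumed values, a document with 100 views and 25 downloads contributes 150 × 0.25 = 37.5, and recommendation-driven activity of 10 views and 5 downloads contributes 20 × 0.5 = 10, giving a score of 47.5.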

The hybrid recommender system is implemented in two phases. The content-based method is first used to obtain an initial relevant set of documents which can be recommended. At this stage, an additional exponential temporal decay is applied to increase the ranks of recently published documents. The resulting set of ranked documents is then re-ranked using the feedback values of user actions obtained with our collaborative filtering method.
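The two phases can be sketched as follows, under several assumptions: the decay rate, the shortlist size, the dictionary keys and the function names are illustrative, and phase 2 is modelled as a plain re-sort of the shortlist by the collaborative score, which is only one possible reading of the re-ranking step.

```python
import math


def temporal_decay(year, current_year, rate=0.1):
    """Exponential decay in document age; `rate` is an assumed parameter."""
    return math.exp(-rate * max(current_year - year, 0))


def recommend(candidates, current_year, shortlist=50, top_k=10):
    """Cascade: content-based shortlist first, then collaborative re-ranking.

    Each candidate is a dict with keys "id", "cbf", "cf" and "year".
    """
    # Phase 1: rank by the content-based score, favouring recent documents.
    phase1 = sorted(
        candidates,
        key=lambda c: c["cbf"] * temporal_decay(c["year"], current_year),
        reverse=True,
    )[:shortlist]
    # Phase 2: re-rank the shortlist by the collaborative score. Python's
    # sort is stable, so ties keep their phase-1 order.
    phase2 = sorted(phase1, key=lambda c: c["cf"], reverse=True)
    return [c["id"] for c in phase2[:top_k]]
```

Because only the shortlist is re-ranked, a document with heavy user activity but low content similarity never reaches the final list, which is the defining property of a cascade hybrid.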

4 Observations and Conclusions

The goal of the recommender system was to provide recommendations in repositories across the national open-access infrastructure and to encourage collaboration among researchers from different Slovenian universities. We investigated which types of documents are recommended most often. The result ties into the logic of the recommender system, which is configured to recommend documents of similar types, and it reflects which types of documents are most popular among our users.

Fig. 2. Recommendations per year (left: group 1; right: group 2).

We found that two groups of documents emerged as the most recommended. The first group consists of undergraduate theses, followed by master’s theses and doctoral dissertations. The second group consists of scientific articles, review articles, professional articles and other reviews. An increase in recommendations over the years can also be observed for both groups in Fig. 2. This is due to the natural accumulation of new documents in our digital repositories, which averages approximately 13,000 per year. We conclude that recommendations in digital libraries have a positive effect on students and researchers looking to broaden their research or acquire different views on the same topic. A unified framework is to be developed in the future in order to perform a more extensive evaluation of our recommender system’s contribution to knowledge exchange.