1 Introduction

The importance of research data is widely recognized across all scientific fields, as such data constitute a fundamental building block of science. Recently, a great deal of attention has been dedicated to the nature of research data [31] and to how to describe, share, cite, and reuse them in order to enable reproducibility in science and to ease the creation of advanced services based on them. In this context, the linked open data (LOD) paradigm [26, 61] has rapidly become the de facto standard for publishing and enriching data. It opens public data up in machine-readable formats ready for consumption and reuse, and enriches them through semantic connections that enable new possibilities for knowledge creation and discovery.

Several scientific fields have started to expose research data as LOD on the Web. Relevant examples include applied life science research [57, 63], social sciences [95], linguistics [45] and cultural heritage with the EuropeanaFootnote 1 LOD publishing effort [65], and the Library of Congress Linked Data ServiceFootnote 2, which “provides access to commonly found standards and vocabularies promulgated by the Library of Congress”. Publishing houses are increasingly investing effort and money into exposing scientific publication metadata as LOD and into connecting publications with the underlying raw data. For instance, (i) Springer started an LOD projectFootnote 3 for making the data about conference proceedings available and enriching their metadata with available data in the LOD cloud; (ii) Elsevier launched the “Linked Data Repository”Footnote 4 with the aim to “store and retrieve content enhancements and other forms of semantic metadata about both Elsevier content”; and, (iii) in 2012 the Nature Publishing Group released a platformFootnote 5, which gives access to millions of LOD triples comprising “bibliographic metadata for all articles (and their references) from Nature Publishing Group and Palgrave Macmillan titles”. Moreover, in recent years more than 100 data journals—whose aim is making research data effectively discoverable and reusable through data publications—have been proposed [39].

Paradoxically, in the field of information retrieval (IR), where experimental evaluation based on shared data collections and experiments has always been central to the advancement of the field [42, 58], the LOD paradigm has not been adopted yet and no models or common ontologies for data sharing have been proposed. Thus, despite the importance of data to IR, the field lacks agreed-upon ways of exposing, enriching, and reusing experimental data as LOD within the research community. This impairs the reproducibility and generalization of IR experiments, which is rapidly becoming a key issue in the field. In 2011, the ACM International Conference on Information and Knowledge Management (CIKM) hosted the DESIRE workshop [10] on data infrastructures for supporting IR evaluation with a specific focus on reproducibility. Since 2015, the European Conference in IR (ECIR) series has included a dedicated paper track on reproducibility, and the 2015 edition of the International ACM SIGIR Conference on Research and Development in IR dedicated a specific workshop to the topic: the SIGIR Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR) [14]. It is time to explore the possibility of (semi-) automatically maintaining and enriching the experimental data and of providing advanced services on top of them, as has been done in other scientific fields.

Therefore, the main objectives of this paper are to:

  • define an RDF model of the scientific IR data with the aim of enhancing their discoverability and easing their connections with the scientific production related to and based on them;

  • provide a methodology for automatically enriching the data by exploiting relevant external entities from the LOD cloud.

In particular, as far as the first objective is concerned, we define an RDF model [91, 92] for representing experimental data and exposing them as LOD on the Web. This will enable a seamless integration of datasets produced by different experimental evaluation initiatives as well as the standardization of terms and concepts used to label data across research groups [51].

The second objective builds upon the proposed RDF model: we automatically identify topics in the scientific literature, exploit the scientific IR data, and connect the resulting dataset with other datasets in the LOD cloud. This increases the number of access points to the data as well as their potential interpretability and reusability.

A positive side effect of pursuing the second objective is the possibility of tackling the inherent complexity and heterogeneity of IR experimental data, which make it difficult to find collaborators with an interest in a given topic or task, or to find all the experimental collections for a given topic. Identifying, measuring, and representing expertise has the potential to encourage interaction and collaboration—and ultimately knowledge creation—by constructing a web of connections between experts and the knowledge that they create. These connections allow individuals to access knowledge beyond their tightly knit networks, where all members tend to have access to the same information. Additionally, expertise development is accelerated by providing valuable insight to outsiders and novice members of a community. In this way, experimental data can be linked with underlying publications and associated people through extracted topics. The combination of experimental data with information extracted from the related scientific narrative and semantic metadata enables a more meaningful interaction with them.

This paper is organised as follows. Section 2 discusses the workflow entailed by evaluation activities, presents an overview of the main challenges and existing solutions for modelling and managing experimental data in IR and describes state-of-the-art expert profiling and finding methodologies. Section 3 presents a concrete use case of our approach describing an RDF graph of experimental data enriched with expertise topics, experts’ profiles, and links to external datasets. Section 4 tackles the first objective of this paper by presenting the parts of the Distributed Information Retrieval Evaluation Campaign Tool (DIRECT) conceptual model related to scientific production, experiments, semantic enrichment and expert profiling and the RDF model we defined. Section 5 describes the LOD-DIRECT system and the IR data available as LOD on the Web. Section 6 builds on the presented RDF model and tackles the second objective of the paper by defining the enrichment process of experimental data based on the publications related to evaluation campaigns and background knowledge available on the LOD cloud. In Sect. 7, we discuss several experiments for assessing the effectiveness and the semantic grounding of expertise topics, expert finding, and expert profiling. We conclude this paper by presenting some final remarks and future work in Sect. 8.

2 Background

2.1 Experimental evaluation in IR

IR is concerned with complex systems delivering a variety of key applications to industry and society: Web search engines [43], (bio)medical search [74], enterprise search [36], intellectual property and patent search [70], expertise retrieval systems [23], and many others.

Therefore, designing and developing these faceted and complex systems are quite challenging activities and, since the inception of the field, have been accompanied by thorough experimental evaluation methodologies, to be able to measure the achieved performance, assess the impact of alternatives and new ideas, and ensure the levels of effectiveness needed to meet user expectations.

Experimental evaluation is a demanding activity in terms of effort and required resources that benefits from using shared datasets, which allow for repeatability of the experiments and comparison among state-of-the-art approaches [52, 53, 67]. Therefore, over the last 20 years, experimental evaluation has been carried out in large-scale evaluation campaigns at the international level, such as the Text REtrieval Conference (TREC)Footnote 6 [60] in the USA, the Conference and Labs of the Evaluation Forum (CLEF)Footnote 7 [48] in Europe, or the NII Testbeds and Community for Information access Research (NTCIR)Footnote 8 in Japan and Asia.

Over the years, these evaluation activities have provided sizable results and improvements in various key areas, such as indexing techniques, relevance feedback, multilingual search, results merging, and so on. For example, before CLEF started in 2000, the best bilingual information access systems performed about 45–50 % as well as the corresponding best monolingual systems [59], and this was further limited to resource-rich languages such as English, French, and German. After 10 years of CLEF, the best bilingual systems went up to about 85–95 % of the best monolingual ones [2, 50] for most language pairs. Over the years, these initiatives have produced massive amounts of scientific data, comprising shared datasets, experimental results, performance measures, descriptive statistics and statistical analyses about them, which provided the foundations for the subsequent scientific and technological development. Consequently, experimental data as well as evaluation campaigns have a high scientific and economic value. TREC estimated that for every $1 that the National Institute of Standards and Technology (NIST) and its partners have invested in TREC, at least $3.35 to $5.07 in benefits accrued to researchers and industry, which, for an overall investment in TREC of around 30 million dollars over 20 years, means producing between 90 and 150 million dollars of benefits [81].

Experimental evaluation [58, 79] is a very strong and long-lived tradition in the IR field and dates back to the late 1950s/early 1960s. It is based on the Cranfield methodology [42], which makes use of shared experimental collections to create comparable experiments and evaluate the performance of different information access systems. An experimental collection is a triple composed of: (i) a set of documents, also called a collection of documents, which is representative of the domain of interest in terms of both kinds of documents and number of documents; (ii) a set of topics, which simulate actual user information needs and are used by IR systems to produce the actual queries to be answered; and, (iii) the ground truth or set of relevance judgements, i.e. a set of ‘correct’ answers where, for each topic, the documents relevant to that topic are determined.

Fig. 1 The typical IR experimental evaluation workflow and the data produced

In Fig. 1, we can see the main phases of the IR experimental evaluation workflow, which call for increased attention to the knowledge process entailed by evaluation campaigns. Indeed, the complexity of the tasks and the interactions to be studied and evaluated produce valuable scientific data, which need to be properly managed, curated, enriched, and accessed. Further, the information and knowledge derived from these data need to be appropriately treated and managed, as do the cooperation, communication, discussion, and exchange of ideas among researchers in the field. In this perspective, the information space entailed by evaluation campaigns can be considered in the light of the data, information, knowledge, wisdom (DIKW) hierarchy [1, 55, 82, 96] and used as a model to organize the produced information resources [47].

The first step regards the creation of the experimental collection and is composed of the acquisition and preparation of the documents and the creation of topics from which a set of queries is generated. In the second step, the participants in the evaluation campaign have everything they need to run experiments and test their systems. An experiment is the output of an IR system, which usually consists of a set of ranked lists of documents—one list per topic. Both the experimental collections and the experiments can be regarded as data, since they are raw and basic elements, which have little meaning by themselves and no significance beyond their existence.

In the third step, the gathered experiments are used by the campaign organizers to create the ground truth, typically adopting some appropriate sampling technique to select a subset of the dataset for each topic. In this phase, assessors decide whether or not an object is relevant for a given topic. Relevance judgements are raw data belonging to the experimental collection, but at the same time they represent human-added information connecting documents to topics of an experiment. The triple of documents, topics, and relevance judgements is then used to compute performance measures for each experiment. Both relevance judgements and performance measures can be considered as information, since they associate meaning with the data through some kind of relational connection and are the result of computations on and processing of the data.

Afterwards, measurements are used to produce descriptive statistics and conduct statistical tests about the behaviour of one or more systems, which represents knowledge, as these tests are built upon the performance measurements and used to make decisions and take further action on future scientific work.

Finally, the last step is scientific production, where both participants and organizers prepare reports about the campaign and the experiments, the techniques they used, and their findings. This phase usually continues after the conclusion of the campaign, as the investigation of the experimental results requires a deeper understanding and further analyses, which may lead to the production of conference and journal papers. This phase corresponds to the wisdom level of the DIKW hierarchy. Furthermore, this phase also embraces external actors who were not originally involved in the evaluation campaign. Indeed, the data employed in the evaluation workflow (i.e. documents, topics, and relevance judgements) as well as the data produced (i.e. experiments, measures and statistics, and reports) are usually made freely available to the scientific community, which exploits them to produce new knowledge in the form of scientific papers. Scientific production is central to the evaluation workflow, because it involves all the data used and produced in the process, all the actors who participated in the campaign, and external actors who may exploit and elaborate upon the data.

In this article, we focus in particular on this last step, providing an RDF model of the resources involved in the scientific production and management and leveraging it as the starting point for extracting expert profiles and topics, which are used both to semantically enrich the underlying scientific data and to link them to other data sources in the LOD cloud.

2.2 Modelling and managing IR experimental data

A crucial question in IR, common to other research fields as well, is how to ensure the best exploitation and interpretation of the valuable scientific data employed and produced by experimental evaluation, possibly over large time spans. For example, the importance of describing and annotating scientific datasets is discussed by [32], who notes that this is an essential step for their interpretation, sharing, and reuse. However, this question is often left unanswered in the IR field, since researchers are more interested in developing new algorithms and methods for innovative systems than in modelling and managing their experimental data [5, 6]. As a consequence, we started an effort aimed at modelling IR experimental data and designing a software infrastructure able to manage and curate them, which led to the development of the Distributed Information Retrieval Evaluation Campaign Tool (DIRECT) system [4], as well as to raising awareness and building consensus in the research community and beyond [10, 11, 54, 97]. DIRECT enables the evaluation workflow described in the previous section and manages the scientific data produced during a large-scale evaluation campaign, as well as supporting the archiving, access, citation, dissemination, interaction, and sharing of the experimental results [3, 7, 12, 13, 49]. To the best of our knowledge, DIRECT is the most comprehensive tool for managing all the aspects of the IR evaluation methodology, the experimental data produced, and the connected scientific contributions.

DIRECT has been used since 2005 for managing and providing access to CLEF experimental evaluation data. Over these years, the system has been extended and revised according to the necessities and requirements expressed by the community [4]. At the time of writing, DIRECT handles about 35 million documents, 14 thousand topics, around 4 million relevance judgements, 5 thousand experiments and 20 million measures. This data has been used by about 1500 researchers from more than 70 countries worldwide. Overall, DIRECT data has been accessed and downloaded by more than 650 visitors, thus showing that the DIRECT system is well suited to address most of the community's requirements concerning the access and use of IR experimental data.

There are other projects with similar goals, but with a narrower scope. One such project is the Open Relevance Project (ORP)Footnote 9 which is a “small Apache Lucene sub-project aimed at making materials for doing relevance testing for Information Retrieval, Machine Learning and Natural Language Processing into open source”; the goal of this project is to connect specific elements of the evaluation methodology—e.g. experimental collections, relevance judgements and queries—with the Apache Lucene environment to ease the work of developers and users. Unfortunately, the project was discontinued in 2014. Moreover, ORP neither considers all the aspects of the evaluation process such as the organization of an evaluation campaign in tracks and tasks or the management of the experiments submitted by the participants to a campaign, nor does it take into account the scientific production connected to the experimental data, which is vital for the enrichment of the data themselves as well as, for instance, the definition of expert profiles.

Another relevant project is EvaluatIR.orgFootnote 10 [15] which is focused on the management and comparison of IR experiments. It does not model the whole evaluation workflow and acts more as a repository of experimental data rather than as an information management system for curating and enriching them.

There are other efforts carried out by the IR community which are connected to DIRECT, even though they have different purposes. One relevant example is the TIRA (Testbed for Information Retrieval Algorithms) Web service [56], which aims at publishing IR experiments as a service; this framework does not take into account the whole evaluation process as DIRECT does and is more focused on modelling and making available “executable experiments”, which are out of the scope of DIRECT. We can also mention some other efforts made by the community to provide toolkits to support the different phases of machine learning/IR experiments, such as WEKAFootnote 11 and the SimDL framework [69], which integrates a digital library with simulation infrastructures to provide automated support for research components. Although these services are relevant to the field, they are not directly related to DIRECT, the aim of which is to model and manage the whole evaluation process in IR and to provide access to the evaluation products rather than to propose new evaluation methodologies or to provide researchers with new tools for carrying out their activities.

Thorough modelling and management of experimental data and the related scientific publications are fundamental for creating new knowledge on top of these data; to this purpose, DIRECT and the modelled evaluation workflow are the starting point we consider for exposing experimental collections and related scientific contributions as LOD on the Web. To the best of our knowledge, this is the first semantic model proposed for representing IR experimental data. Furthermore, as a relevant outcome of this approach, we also show how it is possible to exploit scientific contributions for enriching experimental data as well as for automatically defining expert profiles on a series of identified scientific topics.

2.3 Expert finding and profiling

Expert finding is the task of locating individuals knowledgeable about a specific topic, while expert profiling is the task of constructing a brief overview of the expertise topics of a person. So far, these tasks have been deemed interesting mainly for their application in enterprise settings. However, scientific communities could benefit from such tasks and tools as well, as they enable and strengthen collaboration. In an academic setting, existing work on expert finding focused on the task of finding qualified reviewers to assess the quality of research submissions [73, 80]. In this work, we consider its applications for dissemination and sharing of experimental results in IR.

Initial solutions for expert finding were developed under the area of competency management [46]. These approaches are based on manual construction and querying of databases about knowledge and skills of an organization’s workforce, placing the burden and responsibility of maintaining them on the employees themselves [72]. A disadvantage of this approach is that because the information about experts and expertise is highly dynamic, considerable efforts are required to keep competency databases up to date. This prompted a shift to automated expert finding techniques that support a more natural expertise location process [37].

Expert finding can be modelled as an information retrieval task using queries provided by users to perform a full-text search for experts instead of documents. The goal of the search is to create a ranking of people who are experts on a given topic, instead of a ranking of relevant documents. Much progress in evaluating expert search systems was made through the three consecutive enterprise tracks organized by TREC [18], which provided a common ground for comparing different systems and approaches. In this context, the expert finding task is modelled using statistical language modelling [22, 77] or data fusion techniques [71].

The importance of expert profiling when developing solutions for expert search is discussed in [21], without addressing the problem of discovery and identification of expertise topics. The authors assume that a controlled vocabulary of terms is readily available for the considered domain. Currently, such a resource is not available in our application setting; we therefore propose an automatic solution for the extraction of expertise topics by adapting existing term extraction and key phrase extraction approaches.

An extensive analysis of expert profiling is presented in [84], where the language model proposed by [21] is included as one of the features in a machine learning approach. Other features include a simpler binary model of relevance and the frequency of an expertise topic in expert profiles from the training set. Expertise topics, called tags in that work, are assumed to be known in advance, similar to [21], and are collected through self-assessment. An important observation is that the quality of expertise topics is more important than their relevance to a particular person. In their experiments, the most important feature with respect to its performance contribution is the frequency of the expertise topic, a feature that is independent of the particular employee.

We build on this work by using a quality-related measure of expertise topics together with relevance-based measures for expert profiling. As in competency management approaches, an intermediate conceptual level between documents and experts is introduced; however, the automatic extraction of expertise topics avoids their limitations, such as the manual gathering of data and quickly outdated profiles.

Fig. 2 An example of an RDF graph showing how expertise topics and expert profiles are used for enriching IR experimental data. The colors assigned to the classes identify the different areas of the DIRECT model, which are described in Sect. 4

3 Use case: discover, understand, and reuse IR experimental data

As previously discussed, to allow for a better understanding and reuse of experimental IR data and to increase their potential, visibility, and discoverability on the Web, we start from a well-established and comprehensive conceptual model of experimental data—realized by the DIRECT system—and provide a mapping to an RDF model that lets us expose these data as LOD on the Web. This is the first step towards improving the possibility of discovering experimental data and enriching them to augment their understandability and reusability. Following this line of work, we adopt an automatic approach for extracting expertise topics from the contributions connected to the experimental data and then use them for enriching the contributions themselves and their authors, connecting them with the LOD cloud, and defining expert profiles.

In this section, we discuss an example of the outcomes of the semantic modelling and automatic enrichment processes applied to the use case of discovering, understanding, and reusing the experimental data. Figure 2 shows an RDF graph, which provides a visual representation of how the experimental data are enriched. In particular, we can see the relationship between a contribution and an author enriched by expertise topics, expert profiles, and connections to the LOD cloud. As reported in Table 1, the different classes are associated with different colors to identify the different areas of the RDF model they belong to. For instance, the class representing the User is colored dark green, like the classes of the management area described in Sect. 4; the classes regarding expertise topics and expert profile enrichment are colored blue, the classes related to the measures within the experiment area are colored yellow, the classes related to the evaluation activities (tasks, tracks, and campaigns) are colored light green, and the Contribution class is colored orange.

In this instance, the author (Jussi Karlgren) and the contribution (KarlgrenEtAl-CLEF2012) are data derived from the evaluation workflow, whereas all the other information is automatically determined by the enrichment process. The adopted methodology for expertise topic extraction determined two main topics, “reputation management” and “information retrieval”, which are related to the KarlgrenEtAl-CLEF2012 contribution. These topics are connected to the contribution by instantiating the RDF model shown in Fig. 5 and discussed below, using the Link class, which acts as an RDF blank nodeFootnote 12. We can see that KarlgrenEtAl-CLEF2012 is featured by “reputation management” with a score of 0.53 and by “information retrieval” with 0.42, meaning that both these topics are subjects of the contribution; the scores (normalized in the interval [0, 1]) give a measure of how much this contribution is about a specific topic, and we can see that in this case it is concerned a bit more with reputation management than with information retrieval. Furthermore, the backward score gives us additional information by measuring how authoritative a contribution is with respect to a scientific topic. In Fig. 2, we can see that KarlgrenEtAl-CLEF2012 is authoritative for reputation management (backward score of 0.87), whereas it is not a very important reference for information retrieval (backward score of 0.23). Summing up, if we consider the relation between a contribution and an expertise topic, the score indicates the pertinence of the expertise topic within the contribution, whereas the backward score indicates the pertinence of the contribution within the expertise topic. The higher the backward score, the more pertinent the contribution is for the given topic.
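To make the structure of this fragment concrete, the following Python sketch (using rdflib) builds the contribution–topic part of the graph in Fig. 2. It is an illustrative reconstruction, not the LOD-DIRECT implementation: the ims namespace URI, the resource URIs, and the literal value of ims:relation are assumptions, while the property names follow the model described in Sect. 4.

```python
from rdflib import Graph, Namespace, BNode, Literal, URIRef
from rdflib.namespace import XSD, OWL

# Assumed namespace and resource URIs (only the "ims" prefix is given in the paper).
IMS = Namespace("http://lod-direct.dei.unipd.it/ims#")
BASE = Namespace("http://lod-direct.dei.unipd.it/")

g = Graph()
g.bind("ims", IMS)

contribution = BASE["contribution/KarlgrenEtAl-CLEF2012"]   # hypothetical URI
topic = BASE["concept/reputation_management"]                # hypothetical URI

# The Link acts as a blank node carrying the typed, weighted relation.
link = BNode()
g.add((link, IMS["has-source"], contribution))
g.add((link, IMS["has-target"], topic))
g.add((link, IMS["relation"], Literal("feature")))           # relation label as in Fig. 7b
g.add((link, IMS["score"], Literal(0.53, datatype=XSD.double)))
g.add((link, IMS["backward-score"], Literal(0.87, datatype=XSD.double)))

# Semantic grounding of the topic in the LOD cloud (illustrative DBpedia resource).
g.add((topic, OWL.sameAs, URIRef("http://dbpedia.org/resource/Reputation_management")))

print(g.serialize(format="turtle"))
```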

Table 1 Colors used in the RDF graphs and DIRECT areas

This information is confirmed by the expert profile data; indeed, looking at the upper left part of Fig. 2, the author Jussi Karlgren is considered “an expert in” reputation management (backward score of 0.84), even if it is not his main field of expertise (score of 0.46).

All of this automatically extracted information enriches the experimental data, enabling a higher degree of reusability and understandability of the data themselves. In this use case, we can see that the expertise topics are connected via an owl:sameAs property to external resources belonging to the DBpediaFootnote 13 linked open dataset. These connections are automatically defined via the semantic grounding methodology described below and enable the experimental data to be easily discovered on the Web. In the same way, the authors and contributions are connected to the DBLPFootnote 14 linked open dataset.

Fig. 3 The resource management area classes and properties

In Fig. 2, we can see how the contribution (KarlgrenEtAl-CLEF2012) is related to the experiment (profiling_kthgavagai_1) on which it is based. This experiment was submitted to the RepLab 2012 task of the CLEF 2012 evaluation campaign. It is worthwhile to highlight that each evaluation campaign in DIRECT is defined by the name of the campaign (CLEF) and the year it took place (e.g. 2012 in this instance); each evaluation campaign is composed of one or more tasks identified by a name (e.g. RepLab 2012), and the experiments are treated as submissions to the tasks. Each experiment is described by a contribution which reports the main information about the research group that conducted the experiment, the system they adopted and developed, and any other useful detail about the experiment.

Fig. 4 The Experiment area classes and properties

We can see that most of the reported information is directly related to the contribution and allows us to explicitly connect the research data with the scientific publications based on them. Furthermore, the experiment is evaluated from the “effectiveness” point of view using the “accuracy” measurement, which has a score of 0.77. Retaining and exposing this information as LOD on the Web allows us to explicitly connect the results of the evaluation activities to the claims reported by the contributions.

Table 2 Main datatype properties of the resource management and contribution area classes reported in Figs. 3 and 5

4 Data modelling for enrichment and expert profiling

The detection of scientific topics related to the data produced by the experimental evaluation and the creation of expert profiles mainly concern three areas covered by the evaluation workflow, which we call the “resource management area” (Fig. 3), the “experiment area” (Fig. 4), and the “scientific production area” (Fig. 5). As described above, DIRECT covers all of these workflow aspects, which leads to a rather complex system, the presentation of which is beyond the scope of this paper [4]. Nevertheless, the conceptual model of the DIRECT resource management and scientific production areas has been mapped into an RDF model and adopted for enriching and sharing the data produced by the evaluation activities.

Within this model, we consider a Resource as a generic class sharing the same meaning of resource in RDF [92] where “all things described by RDF are called resources. [A resource is] the class of everything.” In DIRECT a Resource represents the class of everything that exists in IR experimental evaluation.

The resource management area models the more general and coarse-grained resources involved in the evaluation workflow—i.e. users, groups, roles, namespaces, and concepts—and the relationships among them. Furthermore, it handles the provenance of the data. All the classes of this area are defined as subclasses of the general Resource class and are represented in Fig. 3 along with the properties connecting them; for the sake of readability, we omitted from the figure the datatype properties which are reported in Table 2.

In Fig. 3, it is also possible to see how the classes of the management area are related to other vocabularies (reported in Table 3) in the LOD cloud through the schema:isSimilarTo and owl:sameAs properties; the first is used to establish a semantic relation between two similar concepts, whereas the second is used to establish a formal equality between them. These relationships open up new entry points to the DIRECT dataset and new reasoning possibilities over the experimental data.

The User class, related to the Agent class of the foaf vocabulary, represents the actors involved in the evaluation activities such as researchers conducting experiments, organizers of a campaign, assessors, data scientists, and authors of scientific contributions. Furthermore, the User class, like the foaf:Agent class, may also include non-human agents such as software libraries. The function of a user in the evaluation workflow is defined by the Role class, and users can be grouped together via the Group class. A user can play none, one, or more roles: for instance, a user can be both an assessor and a researcher submitting experiments, i.e. a participant in the campaign. On the other hand, there are roles played by more than one user; for instance, a campaign can have one or more participants, e.g. researchers who carry out experiments for writing a paper. A group is a resource arranging together users with some common characteristics; for instance, there could be a group formed by all the users belonging to a specific research group or an ad hoc group created just for one project or collaboration. The Role class is related to the role class in the org and swco vocabularies, whereas Group is related to the corresponding foaf class.

The Namespace class refers to a logical grouping of resources, allows the disambiguation of homonyms, and is related to the namespace class of the vann vocabulary (see Table 3). The use of namespaces in DIRECT is different from the namespace mechanism in eXtensible Markup Language (XML) and RDF, which is used “to associate the schema with the properties in the assertions”Footnote 15; indeed, in DIRECT, namespaces are used to organize resources of the same kind but coming from different domains. For instance, in the context of experimental evaluation, we could host data coming from CLEF and TREC and assign them to two different namespaces, to clearly separate them. In the RDF model of DIRECT, along with the general Resource described above, there is another general class called Namespace Identifiable Resource, as we can see in Fig. 3; this is a subclass of Resource that is always associated with a namespace. Thus, in the RDF model of DIRECT, there are two kinds of general resources: those without a namespace and those with one. In Fig. 3, we can see that User, Group and Role have a namespace, but Namespace itself and Provenance-Event do not. The resources with a namespace are those that can be logically grouped or disambiguated using some of their inner characteristics, such as the affiliation for users or the venue for contributions; the resources without a namespace are those that do not need to be grouped or disambiguated, such as the provenance events, which are handled internally by the system and thus do not need to be disambiguated (no two events have the same name or identifier) or logically grouped.

Table 3 RDF namespaces and prefixes of the vocabularies adopted in DIRECT for the resource management, experiment, and scientific production areas

Finally, the Provenance-Event class keeps track of the full lineage of each resource managed by DIRECT since its first creation, allowing users with adequate access permissions to reconstruct its full history and modifications over time. This class is related to the Activity class in the prov ontology (see Table 3). As shown in Fig. 3, Provenance-Event is composed of two object properties and three datatype properties, where:

  • who is the property associating the provenance event with the user who caused the event;

  • what is the property associating the provenance event with the specific resource originated by the event—note that every resource in the model can be related to a provenance event;

  • when is the datatype property associating the provenance event with the timestamp at which the event occurred;

  • why is the datatype property associating the provenance event with the motivation that originated the event, i.e. the operation performed by the system that led to a modification of the resource;

  • predicate is the datatype property associating the provenance event with the action carried out in the event, i.e. CREATED, READ, or DELETED.

Modelling provenance is key to the definition of expert profiles and topic extraction, because it is a means for controlling the quality and integrity of the data produced by the evaluation workflow. As we discussed above, the data produced by experimental evaluation are not only raw data; they are the product of a series of transformations, which involve inputs from scientists and experts of the field. Retaining what was done with the data is crucial if we want to verify their quality and to reproduce the experiments [33], share [30], and cite [85] the raw data or their elaborations. Moreover, these data are used for scientific production, which in turn is exploited for expert profiling—two activities that must rely on high-quality data. The Provenance-Event class allows us to record the five aspects (i.e. who, what, when, why, and predicate) required for keeping the lineage of data [41] and, consequently, for ensuring the reliability of the information we extract and infer from these data.
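The following minimal sketch (again using rdflib, with assumed namespace and resource URIs) shows how a single provenance event could be recorded with the five properties just listed; it is illustrative and not taken from the DIRECT code base.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, XSD

IMS = Namespace("http://lod-direct.dei.unipd.it/ims#")   # assumed namespace URI
BASE = Namespace("http://lod-direct.dei.unipd.it/")

g = Graph()
g.bind("ims", IMS)

event = BASE["provenance-event/42"]            # hypothetical identifier
user = BASE["user/assessor-17"]                # hypothetical identifier
judgement = BASE["relevance-judgement/1234"]   # hypothetical identifier

g.add((event, RDF.type, IMS["Provenance-Event"]))
g.add((event, IMS["who"], user))               # the user who caused the event
g.add((event, IMS["what"], judgement))         # the resource originated by the event
g.add((event, IMS["when"], Literal("2012-06-15T10:31:00", datatype=XSD.dateTime)))
g.add((event, IMS["why"], Literal("relevance assessment for a RepLab 2012 topic")))
g.add((event, IMS["predicate"], Literal("CREATED")))
```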

In Fig. 4, we can see the classes and properties of the experiment area and the relationships of these classes with external classes in the LOD cloud. This area can be divided into two main parts: the first comprises the Run, Track and Evaluation Activity classes, modelling the experiments (i.e. runs in the experimental evaluation campaign vocabulary), and the second comprises the Quality Parameter, Measurement, Measure, Descriptive Statistic and Statistic classes, modelling the evaluation of the experiments.

The first part allows us to model an evaluation campaign composed of several runs submitted to a track which is part of an evaluation activity. We can see that the class Run has a recursive property called isComposedBy, which allows for defining an experiment as a composition of smaller experiments; such experiments are quite common in a typical IR scenario, where a run is composed of one ranked list of objects for each topic in an experimental collection. In this case, a ranked list for a given topic represents an experiment and the run, which is the union of the ranked lists of all topics, represents a group of experiments, which in DIRECT is defined as an “experiment of experiments”. The class Run allows us to handle a run as a whole or as individual parts, each corresponding to a single topic in the collection.

The second part allows us to model the measurements and the descriptive statistics calculated from the runs. It is built following the model of quality for Digital Libraries (DL) defined by the DELOS Reference Model [38], a high-level conceptual framework that aims at capturing the significant entities and their relationships within the digital library universe with the goal of developing more robust models of it. We extended the DELOS quality model and mapped it into an RDF model as detailed in [9]. The quality domain in the DELOS Reference Model takes into account the general definition of quality provided by the International Organization for Standardization (ISO), which defines quality as “the degree to which a set of inherent characteristics fulfils requirements” [66], where requirements are needs or expectations that are stated, generally implied or obligatory, while characteristics are distinguishing features of a product, process, or system [8, 9]. A Quality Parameter is a Resource that indicates, or is linked to, performance or fulfilment of requirements by another Resource. A Quality Parameter is evaluated by a Measurement, is measured by a Measure assigned according to the Measurement, and expresses the assessment of a User. With respect to the definition provided by ISO, we can note that: the “set of inherent characteristics” corresponds to the pair (Resource, Quality Parameter); the “degree of fulfilment” fits in with the pair (Measurement, Measure); finally, the “requirements” are taken into consideration by the assessment expressed by a User.

Quality Parameters allow us to express the different facets of evaluation. In this model, each Quality Parameter is itself a Resource and inherits all its characteristics, such as the property of having a unique identifier. Quality Parameters provide information about how, and how well, a Resource performs with respect to some viewpoint (e.g. “effectiveness” or “efficiency”) and resemble the notion of quality dimension in [24]. They express the assessment of a User about the Resource under examination. They can be evaluated according to different Measurements (e.g. “accuracy” as in Fig. 2 or commonly used measurements such as “precision” or “recall”), which provide alternative procedures for assessing different aspects of a Quality Parameter and assigning it a value, i.e. a Measure. Finally, a Quality Parameter can be enriched with metadata and annotations. In particular, the former can provide useful information about the provenance of a Quality Parameter, while the latter offer the possibility to add comments about a Quality Parameter, interpreting the obtained values, and proposing actions to improve it.
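As an illustration of how this quality model ties a Resource, a Quality Parameter, a Measurement, and a Measure together, the sketch below encodes the accuracy value of the use case in Sect. 3. The property names (ims:has-quality-parameter, ims:evaluated-by, ims:measured-by) and the URIs are assumptions introduced only for this example, since the model fixes the class names and their verbal relations rather than concrete property identifiers.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import XSD

IMS = Namespace("http://lod-direct.dei.unipd.it/ims#")      # assumed namespace URI
BASE = Namespace("http://lod-direct.dei.unipd.it/")

g = Graph()
run = BASE["run/profiling_kthgavagai_1"]                    # hypothetical URI
effectiveness = BASE["quality-parameter/effectiveness"]     # hypothetical URI
accuracy = BASE["measurement/accuracy"]                     # hypothetical URI

# The Quality Parameter of the run is evaluated by a Measurement and measured by a
# Measure (the accuracy value 0.77 reported in the use case of Sect. 3).
g.add((run, IMS["has-quality-parameter"], effectiveness))   # illustrative property name
g.add((effectiveness, IMS["evaluated-by"], accuracy))       # illustrative property name
g.add((effectiveness, IMS["measured-by"], Literal(0.77, datatype=XSD.double)))
```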

Fig. 5 The scientific production area classes and properties

One of the main Quality Parameters in relation to an IR system is its effectiveness, which is its capability of answering user information needs by retrieving relevant items. This Quality Parameter can be evaluated according to many different Measurements, such as precision and recall [83]: precision evaluates effectiveness in the sense of the ability of the system to reject useless items, whereas recall evaluates effectiveness in the sense of the ability of the system to retrieve useful items. The actual values for precision and recall are Measures and are usually computed using standard tools, such as trec_eval Footnote 16, which are Users, but in this case not human.

Furthermore, the Descriptive Statistic class models the possibility of associating statistical analyses with the measurements; for instance, a classical descriptive statistic in IR is mean average precision (MAP), which is the mean, over all the topics of a run, of the average precision (AP) measurement, which is calculated topic by topic.
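To make the relation between per-topic measurements and run-level descriptive statistics concrete, the following sketch computes AP per topic and MAP over a run from binary relevance judgements; it is a standard textbook formulation, not the DIRECT implementation.

```python
def average_precision(ranked_docs, relevant):
    """AP for one topic: average of the precision values at the ranks of relevant documents."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(run, qrels):
    """MAP: the mean of the per-topic AP values of a run.

    run:   {topic_id: [doc_id, ...]}  ranked result list for each topic
    qrels: {topic_id: {doc_id, ...}}  set of relevant documents for each topic
    """
    ap_values = [average_precision(docs, qrels.get(topic, set()))
                 for topic, docs in run.items()]
    return sum(ap_values) / len(ap_values) if ap_values else 0.0
```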

Lastly, another important class of the model is Concept, which is defined as an idea or notion, a unit of thought. It is used to define the type of relationships in a semantic environment or to create a vocabulary (e.g. contribution types), and it resembles the idea of concept introduced by the Simple Knowledge Organization System (SKOS) [93, 94], to which it is related via the schema:isSimilarTo property. Concept is a subclass of Namespace Identifiable Resource and thus its instances are always associated with a namespace.

In DIRECT every vocabulary we create or import is handled via the Concept class. As an example, let us consider the term “Book” taken from the “Advanced Knowledge Technology reference ontology” which has http://www.aktors.org/ontology/portal# as Uniform Resource Identifier (URI) and prefix “aktors” (see Table 3). In Fig. 6, we can see how the model reported in Figs. 3 and 5 is instantiated for representing this term. We can see that the URI of the “aktors” ontology is retained by the URI of the instance of the Namespace class (which in the figure is renamed as “aktors URI” for convenience), whereas the prefix is represented by the datatype property ims:prefix. In Table 3, we report the vocabularies adopted in DIRECT for the resource management and scientific production areas.

Fig. 6 The RDF graph of an instantiation of the model shown in Fig. 5, showing how the terms “Publication” and “Book” are associated with the terms in the Advanced Knowledge Technology reference ontology (i.e. aktors)

In Fig. 5 we can see the classes and the properties of the scientific production area; also in this case we show the relationships between the classes of this area and the external classes in the LOD cloud. This area of the RDF model is central for the expert profiling activity, because it handles scientific contributions, their relations with scientists and authors, and the scientific topics that can be extracted from them. Figure 5 reports three main classes which are Concept, Contribution and Link.

The Contribution class represents every publication concerning the scientific production phase of the evaluation workflow. We can see that it is related to Concept via the ims:contribution-type property which can be instantiated as shown in Fig. 6.

The Contribution class is related to four similar classes from external datasets: Document from the bibo and foaf vocabularies and Publication from the salt and swpo ones.

Fig. 7 Two RDF graphs instantiating the model shown in Fig. 5. a An instantiation for representing an expert profile. b An instantiation for associating a contribution with a scientific topic and vice versa

The Link class connects two resources via the ims:has-source and ims:has-target properties with a typed relationship realized through a concept connected to the link via the ims:relation property. This allows us to create typed relationships between two generic resources involved in the evaluation workflow. We can instantiate the graph in Fig. 5 in several ways; a first example is shown in the right part of Fig. 6, where we represent two terms (i.e. “Publication” and “Book”) belonging to a vocabulary. This example can be extended by representing a taxonomy of terms belonging to one or more vocabularies. In Fig. 6, we can see how the “Book” term can be related through an “is-a” relation to the more general term “Publication”. So, in this case Link is instantiated by a generic “Link” resource, which relates the two concepts “Book” and “Publication” via the ims:has-source and ims:has-target properties. The ims:relation property allows us to define the type of the relationship—“is-a” in this case—between the two associated concepts.

The concept “Book” is associated with the instance “contributionX” of Contribution by means of the ims:contribution-type property. Moreover, in the lower part of Fig. 6, we can see how a contribution is associated with an author via the swrc:has-author property, representing a user, “userY”, who is the author of “contributionX”.

Link has two datatype properties, ims:score and ims:backward-score, which allow us to add weights to any typed relationship; both score and backward score are xsd:double values in the interval [0, 1]. For instance, we can establish a relation between a user and a concept with two scores on it to state that the user is an expert in a given scientific topic. This lets us define expert profiles: we can say that “userY is an expert in Information Retrieval”, where “userY” is an instance of the User class and “information retrieval” is a term defined as an instance of Concept; the score represents the strength of the relation between the user and the concept, and the backward score represents the strength of the relation between the concept and the user. This means that the relationship between User and Concept is not symmetric; for instance, we can say that “userY” is an expert in “information retrieval” with a score of 0.9, meaning that information retrieval is the main area of expertise for this user. On the other hand, there may be people more expert in information retrieval than “userY”, so the backward score can be set to only 0.1, which would mean that “userY” is just one of the experts in “information retrieval” and that we expect to find other users with a higher expertise level (backward score). The RDF graph of the user profile just described is shown in Fig. 7a.

In Fig. 7b, we can see another possible use of Link, in this case for representing the relationship between a contribution and a scientific topic. Indeed, semantic enrichment techniques are employed for extracting scientific topics from the data produced by the evaluation workflow and then relating them to pertinent contributions. We can see that “contributionX” is related to the scientific topic “information retrieval” via an ims:relation called “feature”; also in this case the typed relation between contribution and concept is weighted: the score is set to 0.7, meaning that “contributionX” mainly deals with “information retrieval”, whereas the backward score is set to 0.3, meaning that, among contributions about “information retrieval”, “contributionX” is not one of the most prominent ones.
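Once expertise topics and expert profiles are encoded with this Link pattern, finding the experts for a topic reduces to a query over the graph. The sketch below ranks the users linked to a given concept by their backward score; the data file name, the concept URI, and the exact relation label (“expert in”, based on the description of Fig. 7a) are assumptions made for illustration.

```python
from rdflib import Graph

g = Graph()
g.parse("direct-profiles.ttl", format="turtle")   # hypothetical dump of the enriched data

query = """
PREFIX ims: <http://lod-direct.dei.unipd.it/ims#>
SELECT ?user ?backward
WHERE {
  ?link ims:has-source     ?user ;
        ims:has-target     <http://lod-direct.dei.unipd.it/concept/information_retrieval> ;
        ims:relation       "expert in" ;
        ims:backward-score ?backward .
}
ORDER BY DESC(?backward)
"""

for user, backward in g.query(query):
    print(user, backward)
```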

Fig. 8 The Turtle serialization of a contribution (i.e. CLEF2012wn-RepLab-KarlgrenEt2012b) returned by the LOD-DIRECT system

5 Accessing the experimental data

The described RDF model has been realized by the DIRECT system, which allows for accessing the experimental evaluation data enriched with the expert profiles created by means of the techniques described in the next sections. This system is called LOD-DIRECT and is available at http://lod-direct.dei.unipd.it, the base path of all resource URIs described below.

The data currently available include the contributions produced by the CLEF evaluation activities, the authors of the contributions, information about CLEF tracks and tasks, provenance events, and the above described measures. Furthermore, this data has been enriched with expert profiles and expertise topics which are available as linked data as well.

At the time of writing, LOD-DIRECT allows access to 2229 contributions, 2334 author profiles, and 2120 expertise topics. Overall, 1659 experts have been identified and, on average, there are 8 experts per expertise topic (an expert can, of course, have more than one area of expertise).

LOD-DIRECT serializes the defined resources and allows access to them in several different formats, such as XML, JSON, RDF+XML, TurtleFootnote 17 and Notation3 (n3)Footnote 18.

The URIs of the resources are constructed following the pattern below (a minimal construction sketch is given after the component list):

base-path/resource-name/id;ns

where,

  • base-path is http://lod-direct.dei.unipd.it;

  • resource-name is the name of the resource to be accessed as defined in the RDF model presented above;

  • id is the identifier of the resource of interest;

  • ns is the namespace of the resource of interest; this applies only to namespace identifiable resources.
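The sketch below illustrates this pattern; the concrete resource names and namespaces used in the calls are hypothetical examples and may differ from those actually used by LOD-DIRECT.

```python
BASE_PATH = "http://lod-direct.dei.unipd.it"

def resource_uri(resource_name, identifier, namespace=None):
    """Build a LOD-DIRECT resource URI following base-path/resource-name/id;ns.

    The ';ns' suffix is appended only for namespace identifiable resources."""
    uri = f"{BASE_PATH}/{resource_name}/{identifier}"
    if namespace is not None:
        uri += f";{namespace}"
    return uri

# Hypothetical examples; the actual resource names and namespaces may differ.
print(resource_uri("contribution", "CLEF2012wn-RepLab-KarlgrenEt2012b"))
print(resource_uri("user", "jussi-karlgren", namespace="clef"))
```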

As an example, the URI corresponding to the contribution resource shown in Fig. 2 with identifier CLEF2012wn-RepLab-KarlgrenEt2012b is:

In Fig. 8, we can see the Turtle serialization returned by the URI above. The serialization of a contribution is composed of four main parts: (i) the prefixes, reporting all the required information about the vocabularies adopted by the RDF model to represent the given resource; (ii) the authors of the contribution, of which there are four in this case, including “Jussi Karlgren”, who is the expert in “Reputation Management” reported in the use case in Fig. 2; (iii) the serialization of the contribution itself, which includes information such as the title and the link to the corresponding digital object; and, (iv) the metadata describing the RDF representation of the contribution.

The metadata reported in Fig. 8 is an instance of the metadata returned for each resource in the LOD-DIRECT system; this metadata is “intended as a bridge between the publishers and users of RDF data” as in the case of VoID (Vocabulary of Interlinked Datasets)Footnote 19. As a matter of fact, the LOD-DIRECT system employs the VoID description principles; for instance, the author and the rights related to the considered resource are described by means of the Dublin Core vocabulary (i.e. dc:creator and dc:rights) as prescribed by the VoID specification.

LOD-DIRECT comes with a fine-grained access control infrastructure which takes care of monitoring the access to the various resources and functionalities offered by the system. On the basis of the requested operation, it performs:

  • authentication, i.e. it asks for user credentials before allowing an operation to be performed;

  • authorization, i.e. it verifies that the user currently logged in holds sufficient rights to perform the requested operation.

The access control policies can be dynamically configured and changed over time by defining roles, i.e. groups of users entitled to perform given operations. This allows institutions to define and put in place their own rules in a flexible way according to their internal organization and working practices.

The fine-grained access control to resources is managed via groups of users, which can have different access permissions. The general rules are as follows:

  • private resources: they can be read and modified only by the owner of the resource;

  • shared resources: they can be read and modified by the owner of the resource; then, a list of groups can share the resource with different access permissions, namely “read only”, which means that the users of that group can only read but not modify the resource, and “read/write”, which means that the users of that group can read and modify the resource;

  • public resources: they can be read by everybody and can be modified by the owner of the resource; in addition, as for shared resources, a list of groups can share the resource with either “read only” or “read/write” permissions.

The access control infrastructure allows us to manage experimental data that cannot be publicly shared, such as log files coming from search engine companies.
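As an illustration of how the read/write rules listed above could be enforced, the sketch below implements the private/shared/public checks; it is a simplified model introduced for clarity, not the actual LOD-DIRECT access control code.

```python
from dataclasses import dataclass, field

@dataclass
class ManagedResource:
    owner: str
    visibility: str = "private"   # "private", "shared", or "public"
    group_permissions: dict = field(default_factory=dict)  # group -> "read only" | "read/write"

def can_read(user, user_groups, res):
    """Owner always reads; public is readable by everybody; shared is readable by listed groups."""
    if res.visibility == "public" or user == res.owner:
        return True
    if res.visibility == "shared":
        return any(g in res.group_permissions for g in user_groups)
    return False

def can_write(user, user_groups, res):
    """Owner always writes; shared and public resources are writable by groups with read/write."""
    if user == res.owner:
        return True
    if res.visibility in ("shared", "public"):
        return any(res.group_permissions.get(g) == "read/write" for g in user_groups)
    return False
```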

6 Semantic enrichment

In this section, we describe several methods for semantically enriching experimental IR data, modelled as described above, by analysing the unstructured data available in scientific publications. Figure 9 presents an overview of the semantic enrichment of documents and authors based on term and topical hierarchy extraction. First, in Sect. 6.1, we propose a method to automatically extract expertise topics from a domain-specific collection of publications using a term extraction approach. Then, in Sect. 6.2, we present a preliminary approach for enriching expertise topics by grounding them in the LOD cloud. An approach for expert profiling based on automatically extracted expertise topics is discussed in Sect. 6.3. In Sect. 6.4, we present several measures that can be used to rank experts for a given topic, making use of an automatically extracted hierarchy of terms.

Fig. 9 Data flow of the semantic enrichment approach

6.1 Expertise topic extraction

Topic-centric approaches for expert search emphasize the extraction of key phrases that can succinctly describe expertise areas, also called expertise topics, using term extraction techniques [28]. An advantage of a topic-centric approach compared to previous work on expert finding [23] is that topical profiles can be constructed directly from text, without the need for controlled vocabularies or for manually identifying terms. Expertise topics are extracted from a domain-specific corpus using the following approach. First, candidate expertise topics are discovered from text using a syntactic description for terms (i.e. nouns or noun phrases) and contextual patterns that ensure that the candidates are coherent within the domain. A domain model is constructed using the method proposed in [29] and then noun phrases that include words from the domain model or that appear in their immediate context are selected as candidates. Candidate terms are further ranked using the scoring function s, defined as:

$$s(\tau ) = |\tau | \log f(\tau ) + \alpha \, e_\tau , \qquad (1)$$

where \(\tau \) is the candidate string, \(|\tau |\) is the number of words in candidate \(\tau \), \(f(\tau )\) is its frequency in the corpus, and \(e_\tau \) is the number of terms that embed the candidate string \(\tau \). The parameter \(\alpha \) weights the embeddedness score \(e_\tau \) in the linear combination. In Table 4, we report the top-ranked expertise topics extracted from IR publications using the described method. These topics describe core concepts of the domain such as search engine, IR system, and retrieval task, as well as prominent subfields of the domain including image retrieval, machine translation, and question answering.

Only the best 20 expertise topics are stored for each document, ranked by their overall score \(s(\tau )\) multiplied by their tf-idf score. In this way, each document is enriched with key phrases, taking into consideration the quality of a term for the whole corpus in combination with its relevance for a particular document.
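To make the ranking concrete, the following minimal sketch combines the termhood score of Eq. 1 with a per-document tf-idf score to keep the 20 best topics; the value of \(\alpha \) and the data structures (corpus frequencies, embedding-term counts, tf-idf scores) are illustrative assumptions, as the text does not prescribe them.

```python
import math

ALPHA = 3.5   # illustrative weight for the embeddedness score; not reported in the text
TOP_K = 20    # number of expertise topics kept per document

def termhood(candidate, corpus_freq, embedding_count):
    """Eq. 1: s(tau) = |tau| * log f(tau) + alpha * e_tau."""
    n_words = len(candidate.split())
    return n_words * math.log(corpus_freq[candidate]) + ALPHA * embedding_count.get(candidate, 0)

def top_topics(doc_candidates, corpus_freq, embedding_count, doc_tfidf):
    """Keep the best 20 topics of a document, ranked by termhood * tf-idf."""
    scored = {c: termhood(c, corpus_freq, embedding_count) * doc_tfidf[c]
              for c in doc_candidates}
    return sorted(scored, key=scored.get, reverse=True)[:TOP_K]
```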

Table 4 Top 20 expertise topics extracted from IR scientific publications

6.2 Enriching expertise topics using LOD

Expertise topics can be used to provide links between IR experimental data and other data sources. These links play an important role in cross-ontology question answering, large-scale inference and data integration [75]. Also, existing work on using knowledge bases in combination with IR techniques for semantic query expansion shows that background knowledge is a valuable resource for expert search [44, 89]. Additional background knowledge, as found on the LOD cloudFootnote 20, can inform expert search at different stages. For example, manually curated concepts can be leveraged from a large number of domain-specific ontologies and thesauri. Also, the LOD cloud contains a large number of datasets about scientific publications and patent descriptions that can be used as additional evidence of expertise.

A first step towards exploiting this potential is to provide an entry point into the LOD cloud through DBpediaFootnote 21, one of its most widely connected data sources. Two promising approaches for semantic term grounding on DBpedia are described and evaluated in Sect. 7.2.1. Our goal is to associate as many terms as possible with a concept from the LOD cloud through DBpedia URIs—as shown in the use-case presented in Sect. 3. Where available, concept descriptions are collected as well and used in our system. Initially, we find all candidate URIs using the following DBpedia URI pattern.

http://dbpedia.org/resource/{DBpedia_label}

Here DBpedia_label is the expertise topic as extracted from our corpus. A large number of candidates are generated from a multi-word term, as each word of the concept label can start with either a lowercase or an uppercase letter in the DBpedia URI. As an example, let us consider the expertise topic “Natural Language Processing”, for which all possible case variations are generated to obtain the following URI:

http://dbpedia.org/resource/Natural_language_processing

To ensure that only DBpedia articles that describe an entity are associated with an expertise topic, we discard category articles and keep only articles whose dbpedia-owl:title, or the final part of the candidate URI, matches the topic. Multiple morphological variations of each expertise topic are extracted from our corpus and stored; each of these variations is used to search for a URI, increasing the number of matches.
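The candidate generation step could be sketched as follows; this is a minimal illustration of the case-variation idea described above, not the authors' implementation.

```python
from itertools import product

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"

def candidate_uris(expertise_topic):
    """Generate candidate DBpedia resource URIs for a (possibly multi-word) topic by
    varying the capitalisation of the first letter of every word."""
    variants = [(w.lower(), w.capitalize()) for w in expertise_topic.split()]
    for combo in product(*variants):
        yield DBPEDIA_RESOURCE + "_".join(combo)

# "natural language processing" yields 2^3 = 8 candidates, among them
# http://dbpedia.org/resource/Natural_language_processing
```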

6.3 Expert profiling

Expertise profiles are brief descriptions of a person’s expertise and interests that can inform the selection of experts in different scenarios. Whenever we refer to an expertise profile throughout this work, we mean a topical profile. Even when a person frequently writes about a subject area, the way they combine this area with other topics is more informative, because a person is rarely an expert on every aspect of a topic [73]. In [25], several requirements are identified for expertise profiles, including coherence, completeness, conciseness, and diversity. The same study states that an important requirement for expertise topics is that they be at the right level of specificity.

Following [21], we define a topical profile of a candidate as a vector of expertise topics along with scores that measure the expertise of a candidate. Therefore, the expertise profile p of a researcher r is defined as:

$$\begin{aligned} p(r) = \{s(r, t_1), s(r, t_2),\ldots ,s(r, t_n)\}, \end{aligned}$$
(2)

where \(t_1, t_2,\ldots ,t_n\) are the expertise topics extracted from a domain-specific corpus.

A first step in constructing expertise profiles is to identify terms that are appropriate descriptors of expertise. A large number of expertise topics can be extracted for each document, but only the top-ranked key phrases are considered for expert profiling, as described in the previous section. Once a list of expertise topics is identified, we proceed with assigning scores to each expertise topic for a given expert. We rely on the notion of relevance, effectively used for document retrieval, to associate expertise topics with researchers. Researchers’ interests and expertise are inferred based on their scientific contributions. Each expertise topic mentioned in one of these contributions is assigned to their expertise profile using an adaptation of the standard IR measure tf-idf [16]. The set of contributions authored by a researcher is aggregated into a virtual document that allows us to compute the relevance of an expertise topic for each researcher. In the case of multi-author publications, the authors are considered to contribute equally to each of the topics mentioned in the paper. This assumption does not always hold; therefore, profiles tend to be more accurate when multiple publications authored by a person are available. An expertise topic is added to the expertise profile of a researcher using the following scoring function:

$$\begin{aligned} s_{ep}(r,t) = termhood(t) \cdot tfirf(t,r), \end{aligned}$$
(3)

where \(s_{ep}(r, t)\) represents the score for an expertise topic t and a researcher r, termhood(t) represents the score computed in Eq. 1 for the topic t, and tfirf(t, r) stands for the tf-irf measure, the tf-idf adaptation described above, computed for the topic t on the aggregated document of researcher r. In this way, we construct profiles with terms that are representative for the domain as well as highly relevant for a given researcher.
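A minimal sketch of this profiling step is shown below; the exact tf-irf formulation is not spelled out in the text, so a standard tf-idf computed over the researchers’ virtual documents is assumed here for illustration.

```python
import math
from collections import Counter

def build_virtual_documents(publications_by_researcher):
    """Aggregate each researcher's publications (given as lists of topic mentions)
    into one 'virtual document', represented as a Counter of topic occurrences."""
    return {r: Counter(t for pub in pubs for t in pub)
            for r, pubs in publications_by_researcher.items()}

def tfirf(topic, researcher, virtual_docs):
    """Assumed tf-irf: topic frequency in the researcher's virtual document, weighted
    by the inverse number of researchers whose documents mention the topic."""
    tf = virtual_docs[researcher][topic]
    n_r = sum(1 for d in virtual_docs.values() if d[topic] > 0)
    return 0.0 if tf == 0 else tf * math.log(len(virtual_docs) / n_r)

def expertise_profile(researcher, virtual_docs, termhood_scores):
    """Eq. 3: s_ep(r, t) = termhood(t) * tfirf(t, r) over the extracted expertise topics."""
    return {t: termhood_scores[t] * tfirf(t, researcher, virtual_docs)
            for t in termhood_scores}
```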

6.4 Expert finding

Expert finding is the task of identifying the most knowledgeable person for a given expertise topic. In this task, several competent people have to be ranked based on their relative expertise on the topic. Documents written by a person can be used as indirect evidence of expertise, assuming that experts often mention their areas of interest. We rely on the tf-irf measure described in the previous section to measure the relevance of a given expertise topic for a researcher. Each researcher is represented by an aggregated document constructed by concatenating all the documents authored by that person. Therefore, the relevance score R(r, t) that measures the interest of a researcher r for a given topic t is defined as:

$$\begin{aligned} R(r,t) = tfirf(t,r). \end{aligned}$$
(4)

Expertise is closely related to the notion of experience. The assumption is that the more a person works on a topic, the more knowledgeable they are. We estimate the experience of a researcher on a given topic by counting the number of publications that have the topic assigned as a top-ranked key phrase. Let \(D_{r,t}\) be the set of documents authored by researcher r that have the expertise topic t as a key phrase. Then, the experience score E(r, t) is defined as:

$$\begin{aligned} E(r,t) = |D_{r,t}|, \end{aligned}$$
(5)

where \(|D_{r,t}|\) is the cardinality, or the total number of documents, in the set of documents \(D_{r,t}\). It can be argued that it is not only the number of publications that indicates expertise, but the quality of those publications as well. We leave for future work the integration of publication impact in this score, measured using citation counts modelled by the DIRECT conceptual model and available in the RDF graph of the exposed experimental data.

Relevance and expertise measure different aspects of expertise and can be combined to take advantage of both features as follows:

$$\begin{aligned} RE(r,t) = R(r,t) \cdot E(r,t). \end{aligned}$$
(6)
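Continuing the sketch from Sect. 6.3, the relevance, experience, and combined scores of Eqs. 4–6 could look as follows; top_topics_by_doc and authorship are hypothetical mappings (document to its top-ranked key phrases, researcher to authored documents).

```python
def relevance(researcher, topic, virtual_docs):
    """Eq. 4: R(r, t) = tfirf(t, r), reusing the tf-irf sketch above."""
    return tfirf(topic, researcher, virtual_docs)

def experience(researcher, topic, top_topics_by_doc, authorship):
    """Eq. 5: E(r, t) = |D_{r,t}|, the number of the researcher's documents that
    have the topic among their top-ranked key phrases."""
    return sum(1 for doc in authorship[researcher] if topic in top_topics_by_doc[doc])

def relevance_experience(researcher, topic, virtual_docs, top_topics_by_doc, authorship):
    """Eq. 6: RE(r, t) = R(r, t) * E(r, t)."""
    return (relevance(researcher, topic, virtual_docs)
            * experience(researcher, topic, top_topics_by_doc, authorship))
```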

Both the relevance score and the experience score rely on query occurrences alone. However, a topical hierarchy can provide valuable information about hierarchical relations between expertise topics and can improve expert finding results. Taxonomies are not always available and are difficult to maintain; therefore, we consider an automatic approach for extracting hierarchical relations. Take, for example, the topical hierarchy presented in Fig. 10, which was automatically constructed from publications of the CLEF evaluation campaign using the method proposed in [64]. When searching for experts in image retrieval, we can make use of the information that image annotation and visual features are closely related expertise topics subordinated to the topic of interest. In the same way, when searching for experts on the expertise topic question answering, we can use information about the subordinated terms QA system and answer extraction.

Fig. 10 Topical hierarchy automatically constructed for the CLEF evaluation campaign

When the subtopics of an expertise topic are known, we can evaluate the expertise of a person based on their knowledge of the more specialized fields. A previous study showed that experts have deeper knowledge at more specific category levels than novices [88]. We introduce a novel measure for expertise, called Area Coverage, that measures whether an expert has in-depth knowledge of an expertise topic, using an automatically constructed topical hierarchy. Let Desc(t) be the set of descendants of a node t in the topical hierarchy; then, the Area Coverage score C(i, t) is defined as:

$$\begin{aligned} C(i,t) = \frac{|\big \{ t' \in Desc(t) : t' \in p(i) \big \}| }{|Desc(t)|}, \end{aligned}$$
(7)

where p(i) is the profile of an individual i constructed using the method presented in the previous section. In other words, Area Coverage is defined as the proportion of a term’s descendants that appear in the profile of a person. For expertise topics that have no descendants, Area Coverage is defined as 1.

Finally, the score REC(i, t) used to rank people for expert finding is defined as follows:

$$\begin{aligned} REC(i,t) = RE(i,t) \cdot C(i,t). \end{aligned}$$
(8)

This score combines several indicators, measuring the expertise of a person based on the relevance of an expertise topic, the number of documents about the given topic, and their depth of knowledge of the field, captured by Area Coverage.
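A sketch of Eqs. 7 and 8, continuing the previous snippets; descendants is a hypothetical mapping from a topic to its set of descendants in the automatically constructed hierarchy.

```python
def area_coverage(profile, topic, descendants):
    """Eq. 7: fraction of a topic's descendants that appear in the person's profile;
    defined as 1 for topics without descendants."""
    desc = descendants.get(topic, set())
    if not desc:
        return 1.0
    return sum(1 for t in desc if t in profile) / len(desc)

def rec_score(researcher, topic, virtual_docs, top_topics_by_doc, authorship,
              profile, descendants):
    """Eq. 8: REC(i, t) = RE(i, t) * C(i, t)."""
    return (relevance_experience(researcher, topic, virtual_docs,
                                 top_topics_by_doc, authorship)
            * area_coverage(profile, topic, descendants))
```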

7 Experimental evaluation

7.1 Experimental setup

7.1.1 Expert search datasets

Evaluating expert search systems remains a challenge, despite a number of datasets that have been made publicly available in recent years [18, 20, 86]. Traditionally, relevance assessments for expert finding were gathered either through self-assessment or based on opinions of co-workers. On the one hand, self-assessed expert profiles are subjective and incomplete; on the other hand, opinions of colleagues are biased towards their social and geographical network. We address these limitations by exploiting expertise data generated in a peer-review setting [27]. Our aim is to collect a representative dataset of experts in information retrieval along with their publications and expertise topics. We consider conference workshops in the related fields of IR, DL, and recommender systems (RS). About 25 thousand publications are gathered along with data about 60 workshops. Each workshop is associated with 15 experts on average, and almost 500 expertise topics were manually extracted to describe these events.

To construct a test collection covering all these research fields, we used the DBLP Computer Science BibliographyFootnote 22. Our initial motivations for constructing a test collection around DBLP are twofold: (1) the fields of IR, DL, and RS are well covered in DBLP, and (2) a special version of the DBLP dataset, augmented with citation information, is available from the team behind ArnetMiner, which allows for investigations into the use of citation information for expert search.

To make the augmented DBLP collection suited to expert search evaluation, we need realistic topic descriptions and relevance judgements at the expert level. Workshops organized at major conferences covering the fields of IR, DL, and RS are used to collect relevance judgements. To identify relevant workshops, we visited the websites of the CIKM, ECDL, ECIR, IIiX, JCDL, RecSys, SIGIR, TPDL, WSDM, and WWW conferences, which have substantial portions of their programs dedicated to IR, DL, and RS. We collected links to workshop websites for all workshops organized at those conferences between 2001 and 2012. This resulted in a list of 60 different workshops with websites.

As a starting point, a test collection covering the aforementioned fields is constructed using the augmented DBLP dataset released by the team behind ArnetMiner. This dataset is an October 2010 crawl of the DBLP dataset containing 1,632,442 different papers with 2,327,450 citation relationships between papers in the datasetFootnote 23. As this augmented dataset contains publications from all fields of computer science, we filtered out all publications not belonging to IR, DL, and RS by restricting the collection to publications in relevant journals, conferences, and workshops. This step and all of the steps listed below were completed in June 2012.

The list of relevant venues was created in two steps. First, we generated a list of core venues by extracting all papers published at the conferences used for topic creation: CIKM, ECDL, ECIR, IIiX, JCDL, RecSys, SIGIR, TPDL, WSDM, and WWW. We selected these conferences because, as hosts of the topic workshops, they are likely to be relevant venues for PC members to publish in. This resulted in a dataset containing 9046 different publications from these core venues. However, restricting ourselves to these venues alone means we could be missing out on experts who tend to publish more in journals and workshops. We therefore extended the list of core venues with other venues tracked by DBLP that also have substantial portions of their programs dedicated to IR, DL, and RS. Venues that only feature incidental overlap with IR, such as the Semantic Web conference, are not included. We also excluded venues with fewer than five publications in the augmented DBLP dataset. While this does exclude the occasional on-topic publication in venues that are predominantly about other topics, we believe that this strategy covers the majority of relevant publications. This additional filtering step results in a final list of 78 curated venues (core plus additional) covering a total of 24,690 publications.
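The venue-curation step could be approximated with a few lines of Python; the (paper, venue) representation and the argument names are assumptions for illustration only.

```python
from collections import Counter

CORE_VENUES = {"CIKM", "ECDL", "ECIR", "IIiX", "JCDL",
               "RecSys", "SIGIR", "TPDL", "WSDM", "WWW"}
MIN_PUBLICATIONS = 5  # venues below this threshold are dropped

def curate_collection(publications, candidate_venues):
    """Keep publications from core venues plus additional IR/DL/RS venues that have
    at least five publications in the dataset; publications are (paper_id, venue) pairs."""
    counts = Counter(venue for _, venue in publications)
    curated = CORE_VENUES | {v for v in candidate_venues
                             if counts[v] >= MIN_PUBLICATIONS}
    return [(pid, venue) for pid, venue in publications if venue in curated]
```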

In addition to citation information, the augmented DBLP dataset was also extended with abstracts wherever available. However, the team behind ArnetMiner was only able to add abstracts for 33.7 % of the 1.6 million publications (and 43.5 % of the 24,690 publications in our test collection). We therefore attempted to download the full-text versions of all 24,690 publications using Google Scholar. We constructed a search query consisting of the last name of the first author and the full title without surrounding quotesFootnote 24. We then extracted the download link from the top result returned by Google Scholar (if available). We were able to find download URLs for 14,823 of the 24,690 publications in our filtered DBLP dataset, which corresponds to a recall of 60.04 %, where recall is defined as the percentage of papers in our filtered DBLP dataset for which we could find download URLs. While this is not as high as we would like, it does represent a substantial improvement over the percentage of abstracts present in the augmented DBLP dataset. Moreover, a recall rate of 100 % is impossible to achieve, as tutorials, keynote abstracts, and even entire proceedings are typically not available online in full text, even though they are present in the DBLP dataset.

Around 90.15 % of the download URLs obtained in this manner were functional, which means we were able to download full-text publication files for 13,363 publications (or 54.12 % of our entire curated dataset). We checked 100 randomly selected full-text files to verify that they are indeed the publications we were looking for, achieving a precision of 97 % on this sample. We therefore assume that the false-positive rate of our approach is acceptably low. This augmented DBLP collection is publicly availableFootnote 25.

Besides the information retrieval dataset described above, we report results obtained for similar datasets in two other computer science fields, namely Semantic Web and computational linguistics. Table 5 gives an overview of the considered datasets in terms of the number of documents, workshops, authors, and expertise topics.

Table 5 Overview of workshop-based test collections for information retrieval (IR), computational linguistics (CL), and Semantic Web (SW)

These domain-specific datasets contain a large number of scientific publications focused on a given field of research. This allows us to investigate expertise in a given research community, whereas previous studies on expert search concentrated on analysing expertise inside knowledge-intensive organizations. The UvT dataset, introduced in [20], contains information about the employees of Tilburg University collected from a publicly accessible expertise database. The UvT dataset is more heterogeneous than the workshop datasets, as it gathers information from manually provided summaries of research and courses, personal homepages, as well as publications. Table 6 gives an overview of the size of the UvT dataset. The UvT dataset is also topically more diverse than the workshop-based datasets, covering broad areas of study such as economics, law, information technology, public administration, and criminology. Although expertise topics are available in Dutch and English, in our experiments we considered only the 981 expertise topics available in English.

Table 6 Overview of the UvT Expert Dataset, including research descriptions (RD), course descriptions (CD), publications (PUB), and personal homepages (HP)

About 7 % of the publications are available as full content, with most publications being available as citations only. The large and diverse set of expertise topics, combined with the limited availability of textual descriptions, leads to challenges related to data sparseness. Nevertheless, the expert finding and expert profiling tasks are easier on the UvT dataset. This is due to the fact that most documents are high-quality summaries of expertise and that the dataset contains relatively few people. Additionally, there is little overlap between expert profiles, because in a university fewer people have similar interests than in a research community.

7.1.2 Baseline approaches

The approaches proposed in Sect. 6 are evaluated against two IR methods for expert finding and expert profiling [19]. Both methods model documents and expertise topics as bags of words and take a generative probabilistic approach, ranking expertise topics t by the probability P(t|i) that they are generated by the individual i [19]. The same probability is used for ranking expertise topics in a person’s profile, as well as for finding knowledgeable people in expert finding. The first model constructs a multinomial language model \(\theta _i\) for each individual over the vocabulary of the documents authored by them. This is similar to our approach, which computes the relevance of a topic for an individual on a document that aggregates all the documents authored by that person.

The assumption is that expertise topics are sampled independently from this multinomial distribution. Therefore, the probability P(t|i) can be computed as:

$$\begin{aligned} P(t|i) = P(t|\theta _i) = \prod _{w \in t} P(w|\theta _i)^{n(w,t)}, \end{aligned}$$
(9)

where n(w, t) is the number of times the word w appears in the expertise topic t. Smoothing with collection word probabilities is applied to estimate \(P(w|\theta _i)\). The smoothing parameter is set with an unsupervised method: Dirichlet smoothing is used, with the average number of words associated with a person as the smoothing parameter.

The second baseline model, also introduced in [19], estimates a language model \(\theta _d\) for each document in the set \(D_i\) of documents authored by the individual i. Words from an expertise topic t are sampled independently from each document model, and the probabilities of generating the topic are summed over these documents. In this case, the probability P(t|i) is calculated using the following equation:

$$\begin{aligned} P(t|i) = \sum _{d \in D_i} P(t|\theta _d) = \sum _{d \in D_i} \prod _{w \in t} P(w|\theta _d)^{n(w,t)}. \end{aligned}$$
(10)

Again, the probability \(P(w|\theta _d)\) is estimated using the same unsupervised smoothing method. In this case, the smoothing parameter for Dirichlet smoothing is the average document length in the corpus.
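For illustration, the two baselines with Dirichlet smoothing could be sketched as follows; word counts are assumed to be dict-like mappings, and the smoothing parameter mu is set as described above (average words per person for the first model, average document length for the second).

```python
import math

def dirichlet_lm(doc_counts, collection_counts, mu):
    """Dirichlet-smoothed unigram model: P(w|theta) = (c(w,d) + mu * P(w|C)) / (|d| + mu)."""
    doc_len = sum(doc_counts.values())
    coll_len = sum(collection_counts.values())
    def prob(word):
        p_coll = collection_counts.get(word, 0) / coll_len
        return (doc_counts.get(word, 0) + mu * p_coll) / (doc_len + mu)
    return prob

def lm1_score(topic_words, person_counts, collection_counts, mu):
    """Eq. 9: product over topic words of P(w|theta_i) for the person's aggregated text."""
    prob = dirichlet_lm(person_counts, collection_counts, mu)
    return math.prod(prob(w) for w in topic_words)

def lm2_score(topic_words, person_doc_counts, collection_counts, mu):
    """Eq. 10: sum over the person's documents of the per-document topic probabilities."""
    total = 0.0
    for doc_counts in person_doc_counts:
        prob = dirichlet_lm(doc_counts, collection_counts, mu)
        total += math.prod(prob(w) for w in topic_words)
    return total
```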

7.1.3 Evaluation measures

Given the tasks at hand, several evaluation measures for document retrieval can be used. The expert profiling and the expert finding tasks are evaluated based on the quality of ranked lists of expertise topics and of experts, respectively. From an evaluation point of view, this is not different from evaluating a ranked list of documents with binary relevance judgements—i.e. a document is either relevant or not with respect to a given topic. The most basic evaluation measures used in IR are precision and recall. The first measure is given by the ratio between the number of relevant documents retrieved and the total number of retrieved documents. The second is given by the ratio between the number of relevant documents retrieved and the total number of relevant documents for a given topic. Other frequently used effectiveness measures include:

Precision at N (P@N) [90] This is the precision computed when N results are retrieved, which is usually used to report early precision at the top 5, 10, or 20 results.

Average precision (AP) [58] Precision is computed at the rank of each retrieved relevant result, and the values are averaged over the total number of relevant documents for the topic.

Reciprocal rank (RR) [40] This is the reciprocal of the rank of the first retrieved relevant document; it is defined as 0 when the output does not contain any relevant documents.

To obtain a more stable measurement of performance, these measures are commonly averaged over the set of queries. In our experiments, we report the values for mean average precision (MAP) and mean reciprocal rank (MRR). In this setting, recall is less important than achieving high precision for the top-ranked results, because it is more important to recommend true experts than to find all experts in a field.
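A compact sketch of these measures, under the binary-relevance assumption described above (item identifiers and relevance sets are illustrative):

```python
def average_precision(ranking, relevant):
    """AP: precision at the rank of each retrieved relevant item, averaged over the
    total number of relevant items for the query."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def reciprocal_rank(ranking, relevant):
    """RR: 1 / rank of the first retrieved relevant item, 0 if none is retrieved."""
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def mean_over_queries(metric, runs):
    """MAP / MRR: average a per-query metric over (ranking, relevant-set) pairs."""
    return sum(metric(ranking, relevant) for ranking, relevant in runs) / len(runs)
```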

7.2 Experiments

7.2.1 Semantic grounding of expertise topics

Two approaches for grounding expertise topics on DBpedia are evaluated in this section. The first approach matches a candidate DBpedia URI with an expertise topic, using the string as it appears in the corpus. The second approach makes use of the lemmatized form of the expertise topic. Stemming was also considered, but this approach resulted in a decrease in performance, as stems are more ambiguousFootnote 26. To evaluate our URI discovery approach, we build a small gold standard dataset by manually annotating 186 expertise topics with DBpedia URIs. First of all, we note that about half of the analysed expertise topics have a corresponding concept in DBpedia. One of the main reasons for the low coverage is that DBpedia is a general-purpose knowledge source with limited coverage of specialized technical domains.

Although both approaches achieve similar results in terms of F-score, the approach that makes use of lemmatization (A2) achieves better precision, as can be seen in Table 7. Surprisingly, lemmatization yields lower recall together with the higher precision, which might be due to the small size of the dataset. To extract descriptions or definitions of concepts, we rely on the dbpedia-owl:abstract property, or on the rdfs:comment property in the absence of the former. For now, we are interested in English definitions; therefore, we consider only triples tagged with the language tag lang=’en’. Even though English descriptions are available for a larger number of topics, this tag is not always present; therefore, we can retrieve descriptions only for a smaller number of topics. A manual analysis of matching errors showed that expertise topics that include an acronym (e.g. “NLG system” instead of “Natural Language Generation system”) are more difficult to associate with a DBpedia concept, as acronyms are often ambiguous.
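As a small illustration of the description-extraction step, the following rdflib sketch returns the English dbpedia-owl:abstract of a grounded topic, falling back to rdfs:comment; the usage comment is only illustrative and assumes the resource's RDF serialization has already been retrieved.

```python
from rdflib import Graph, URIRef, Namespace
from rdflib.namespace import RDFS

DBO = Namespace("http://dbpedia.org/ontology/")

def english_description(graph, resource_uri):
    """Return the English dbpedia-owl:abstract of a grounded topic, falling back to
    rdfs:comment; only literals carrying the 'en' language tag are considered."""
    subject = URIRef(resource_uri)
    for prop in (DBO.abstract, RDFS.comment):
        for literal in graph.objects(subject, prop):
            if getattr(literal, "language", None) == "en":
                return str(literal)
    return None

# Illustrative usage (assuming the resource's RDF has been downloaded locally):
# g = Graph(); g.parse("Natural_language_processing.ttl")
# english_description(g, "http://dbpedia.org/resource/Natural_language_processing")
```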

Table 7 Precision and recall for DBpedia URI extraction

Other general-purpose data sources, such as FreebaseFootnote 27, or domain-specific data sources can be linked in a similar manner. A complex problem that we do not address in this work is the disambiguation of an expertise topic when multiple concepts from different domains can be matched. Usually, DBpedia provides a disambiguation page for such cases. In our implementation, we did not analyse concepts that redirect to a disambiguation page, grounding only those expertise topics that are specific enough to be used in a single domain.

7.2.2 Expert profiling

The topic-centric approach (TC) proposed in Sect. 6.3 can be applied for expert profiling without the need for controlled vocabularies, as expertise topics are directly extracted from text. In contrast, the language modelling approach used as a baseline in this section can only be used on datasets where such resources are readily available. The results for the expert profiling task on the IR dataset are presented in Table 8.

Table 8 Expert profiling results for the language modelling approach (LM) and the topic-centric approach (TC)

The language modelling approaches achieve better results on the IR and the UvT datasets, with the LM2 approach outperforming the LM1 approach on most measures. The gap between the language modelling approaches and the TC approach is narrower on the IR dataset. Not surprisingly, our method for extracting expertise topics underperforms when applied to a corpus that covers diverse expertise areas, such as the UvT dataset. Another difference between these datasets is the number of documents available for each person. The LM1 and LM2 models achieve the worst results on the SW dataset, where only 8 % of the people are associated with more than three documents.

Table 9 Expert finding results for the language modelling approach (LM), experience (E), relevance and experience (RE), and relevance, experience and area coverage (REC)

7.2.3 Expert finding

We compare several topic-centric methods for expert finding with two language modelling baselines. The results for the expert finding task are presented in Table 9. The expert finding methods evaluated in this section include experience (E), relevance and experience (RE), and relevance, experience, and area coverage (REC). These methods are described by Eqs. 5, 6, and 8, respectively, in Sect. 6.4. The Area Coverage measure makes use of a topical hierarchy; therefore, we automatically construct a topical hierarchy for IR using the method proposed in [64].

Fig. 11 Sample hierarchical relations for the IR domain

Figure 11 shows a small extract from this hierarchy that correctly identifies “information retrieval” as the root of the taxonomy as well as several subfields including “digital libraries”, “interactive information retrieval”, and “cross language information retrieval”.

Table 10 Graph size for topical hierarchies constructed for computational linguistics (CL), Semantic Web (SW), information retrieval (IR), and Tilburg University (UvT)

A short summary of the constructed topical hierarchies for each domain is presented in Table 10. Depending on the number of documents available in each dataset, a different number of expertise topics is extracted and subsequently considered for constructing a topical hierarchy. The CL dataset is the largest one, allowing us to filter edges in a pre-processing step based on the number of documents that provide evidence for a relation. An edge is added to the noisy graph only if at least three different documents provide evidence for the relation. This setting is not used for the smaller datasets because it reduces the number of edges and the connectivity of the graph. For the same reason, the window size used to count co-occurrences of terms is larger for the smaller datasets than for the CL dataset. The topical hierarchy for the IR domain is constructed by considering all co-occurrences between two expertise topics within a window of five words. A larger window size would increase the number of edges, but the relations would be less reliable. Since more than 98 % of the nodes are already connected by an edge, we did not increase the window size. Figure 12 presents an overview of node degree in the information retrieval hierarchy. More than half of the terms are specific terms that have no descendants, but a considerable number of nodes have several child nodes.

Fig. 12 Overview of node degree for the information retrieval hierarchy (logarithmic scale)
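The edge-construction step described above (co-occurrence counting within a five-word window, with a minimum of three supporting documents for the CL dataset) could be sketched as follows; the tokenized-document representation is an assumption for illustration.

```python
from collections import defaultdict

WINDOW = 5        # co-occurrence window, in words
MIN_EVIDENCE = 3  # minimum number of supporting documents (used for the CL dataset only)

def cooccurrence_edges(tokenized_docs, topics, min_evidence=1):
    """Count, for every pair of expertise topics, the documents in which they co-occur
    within the window; keep pairs supported by at least `min_evidence` documents.
    `tokenized_docs` maps a document id to its list of tokens; multi-word topics are
    assumed to have been collapsed into single tokens beforehand."""
    support = defaultdict(set)
    for doc_id, tokens in tokenized_docs.items():
        for i, tok in enumerate(tokens):
            if tok not in topics:
                continue
            for other in tokens[i + 1:i + 1 + WINDOW]:
                if other in topics and other != tok:
                    support[frozenset((tok, other))].add(doc_id)
    return {pair: docs for pair, docs in support.items() if len(docs) >= min_evidence}
```

The directed hierarchy itself is then derived from this noisy co-occurrence graph by the method of [64], which is not reproduced here.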

We note that the topic-centric approaches (E, RE, REC) outperform the language modelling approaches on domain-specific datasets such as the CL, SW, and IR datasets. Our experimental results lead us to the conclusion that the more specialized a dataset, the less reliable the relevance-based assessment of expertise. In the case of the Semantic Web dataset, which is the most focused one, using the relevance-based measure (RE) even decreases performance compared to the experience score (E). Language modelling approaches outperform topic-centric approaches only on the UvT dataset, which is the broadest of the four considered datasets. This is because expertise profiles overlap to a larger degree in focused datasets that describe a narrow domain. For example, it is easier to distinguish between experts in history and mathematics using relevance-based methods, but more difficult to distinguish between two Semantic Web experts who address similar topics in their publications.

In terms of MAP, using a topical hierarchy through Area Coverage improves the results on all datasets except the IR dataset; in terms of P@5, the results improve on all datasets except the UvT dataset. These results confirm our hypothesis that automatically constructed topical hierarchies can inform expert finding.

8 Conclusion

In this paper, we discussed the data modelling and the semantic enrichment of IR experimental data, as produced by large-scale evaluation campaigns. We described in detail the evaluation workflow used for information access systems and proposed an RDF model for two areas of the workflow, namely resource management and scientific production. This model is used as a common basis for semantic enrichment and for augmenting the discoverability, accessibility, and reusability of the experimental data. Unstructured data in the form of scientific publications were used to inform the extraction of various types of semantic enrichment. Expertise topics were automatically extracted and used to describe documents and to create expert profiles. Several topic-centric measures for expert finding were proposed, allowing users to identify knowledgeable members of the community. In this way, we created new relationships among existing data, allowing a more meaningful interaction with them.

We introduced an evaluation dataset for expert search in IR, relying on scientific publications available online and on implicit expertise information about workshop committee members. Our experiments show that it is possible to construct expertise profiles using automatically extracted expertise topics and that topic-centric approaches for expert finding outperform state-of-the-art language modelling approaches on most of the considered datasets.

In particular, besides the methodological contributions described above, the main reusable deliverables of the paper are:

  • an accurate RDF data model for describing IR experimental data in detail, available at http://ims.dei.unipd.it/data/rdf/direct.3.10.ttl;

  • a dataset about CLEF contributions, extracted expertise topics and related expert profiles, developed according to the methods proposed in the paper;

  • the online accessible LOD DIRECT system, available at http://lod-direct.dei.unipd.it/, to access the above data in different serialization formats: RDF+XML, Turtle, N3, XML, and JSON.

Future work will concern the application of these semantic modelling and automatic enrichment techniques to other areas of the evaluation workflow. For example, expert profiling and topic extraction could be used to automatically improve and enhance the descriptions of the individual experiments submitted to an evaluation campaign, which are typically not very rich and often cryptic—for example “second iteration with tuned parameters” as description—and to automatically link experiments to external resources, e.g. describing the used components, such as stemmers or stop lists, and systems. Finally, the RDF model defined within DIRECT opens up the possibility of integrating established DL methodologies for data access and management, which increasingly exploit the LOD paradigm [45, 62, 68, 87]. This would enable broadening the scope of, and the connections between, IR evaluation and other related fields, providing new paths for semantic enrichment of the experimental data. Furthermore, we shall extend the DIRECT provenance event section by keeping track of the role and the groups to which a user belonged when a specific action on a resource was taken.

The DIRECT RDF model can also play a significant role in the call for better transparency and reproducibility in science [17]. Indeed, it can be paired with data citation methodologies [34, 35, 78] to connect results in scientific papers with the actual data on which they are based, as well as to sustain scientific claims, as proposed in [85].

Additionally, we plan to improve the automatically constructed taxonomy used in this work by making use of hierarchical relations provided in the DBpedia category structure and a disambiguation approach for grounding expertise topics.

Lastly, we plan to tackle the problem of named entity disambiguation when it arises, as the dataset grows and the number of users (i.e. contribution authors) expands with its use. Indeed, this issue does not affect the current dataset, given the relatively small size of the IR community we consider here, but it will have to be taken into account if we enlarge the boundaries of the system.