Introduction

Recent trends in research assessment, the development of altmetrics (Cronin and Sugimoto 2014), the crucial role of data, the complexity of research assessment, the need for granularity and increasingly demanding policy needs all call for new ways of data integration and management.

There have been several initiatives of governments and research projects on these matters. However, the main problems of integration of data on Science, Technology and Innovation (STI), such as the data quality issues; the comparability problems; the lack of standardization, interoperability and modularization; the difficulties in the creation of concordance tables among different classification schemes; the difficult and costly extension and update of the integrated database, are far from being solved.

The quantitative analysis of Science and Technology is becoming a “big data” science, with an increasing level of “computerization”, in which large and heterogeneous datasets on various aspects are combined. In this context, understanding and formally specifying the meaning of data is of paramount importance.

Within this framework, optimistic views supporting “the end of theory” in favour of data-driven science (Kitchin 2014) have been opposed by more critical positions in favour of theory-driven scientific discoveries (Frické 2015). A more balanced view has emerged from a critical analysis of the existing literature (Ekbia et al. 2015), leading the information systems community to analyse more deeply the critical challenges posed by the development of big data (Agarwal and Dhar 2014). It has been rightly highlighted that “Data are not simply addenda or second-order artifacts; rather, they are the heart of much of the narrative literature, the protean stuff that allows for inference, interpretation, theory building, innovation, and invention” (Cronin 2013, p. 435).

The need to account for STI activities in order to sustain their funding in the current difficult economic and financial situation increasingly calls for rigorous empirical evidence to support informed policy making.

The need to overcome the logic of rankings and the new trends in indicator development, including granularity and cross-referencing, can be explored and exploited in open data platforms with a clear description of the main concepts of the domain (Daraio and Bonaccorsi 2015). The complexity and multidimensionality of research assessment and scholarly impact (Moed and Halevi 2015) call into question the traditional approach to indicator development. Diverse institutional missions and different policy environments and objectives require different assessment processes and indicators. In addition, the range of people and organizations requiring information about university-based research is growing. Each group has specific but also overlapping requirements (AUBR 2010, p. 51).

The assessment of research has to take into account a range of different types of research output and impact. See Table 1 for a non-exhaustive outline: it includes forms that are becoming increasingly important, such as research data files and communications submitted to social media and scholarly blogs. The first column indicates the main types of impact a particular output may have. A distinction is made between scientific-scholarly impact and wider impact outside the domain of science and scholarship, denoted as “societal”, a concept that embraces technological, economic, social and cultural impact.

Table 1 Types of research outputs, impacts and indicators (Source: adapted from Moed and Halevi 2015)

A more detailed list of possible outputs by research area is reported in the specifications of the Panel Criteria in the Research Excellence Framework in the UK (REF 2012, p. 51). See also AUBR (2010) and Moed and Halevi (2015) for further details.

It is also important to include the inputs in the research assessment process; they should be jointly analysed with the outputs to assess the overall impact of the process (see e.g. Daraio et al. 2015a, for a conditional multidimensional approach to rank higher education institutions).

To meet all these new trends and policy needs a shift in the paradigm of data integration for research assessment is needed. In this paper we advocate an Ontology-Based Data Management (OBDM) approach to integrate heterogeneous data sources, including big scholarly data (such as publications and citations) to support the assessment of research and develop “science of science” policy models.

The paper unfolds as follows. In the next section we illustrate the main problems of heterogeneous data integration. Section 3 presents the main advantages of an OBDM approach and outlines its implementation through Sapientia, the ontology of multidimensional research assessment. Section 4 illustrates the usefulness of an OBDM approach to specify STI indicators in an innovative way. Section 5 shows how an OBDM approach may be useful to develop science of science policy models, while Sect. 6 concludes the paper.

Difficulties in accessing and managing distributed and heterogeneous data

While the amount of data stored in current information systems and the processes making use of such data continuously grow, turning these data into information, and governing both data and processes are still tremendously challenging tasks for Information Technology. The problem is complicated due to the proliferation of data sources and services both within a single organization, and in cooperating environments. The following factors explain why such a proliferation constitutes a major problem with respect to the goal of carrying out effective data governance tasks:

  • Although the initial design of a collection of data sources and services might be adequate, corrective maintenance actions tend to re-shape them into a form that often diverges from the original conceptual structure.

  • It is common practice to change a data source (e.g. a database) so as to adapt it both to specific application-dependent needs, and to new requirements. The result is that data sources often become data structures coupled to a specific application (or, a class of applications), rather than application-independent databases.

  • The data stored in different sources and the processes operating over them tend to be redundant, and mutually inconsistent, mainly because of the lack of central, coherent and unified coordination of data management tasks.

The result is that information systems of medium and large organizations are typically structured according to a “silos”-based architecture, constituted by several independent and distributed data sources, each one serving a specific application. This poses great difficulties with respect to the goal of accessing data in a unified and coherent way. Analogously, processes relevant to the organizations are often hidden in software applications, and a formal, up-to-date description of what they do on the data and how they are related to other processes is often missing. The introduction of service-oriented architectures is not a solution to this problem per se, because the fact that data and processes are packed into services is not sufficient for making the meaning of data and processes explicit. Indeed, services become other artifacts to document and maintain, adding complexity to the governance problem. Analogously, data warehousing techniques and the separation they advocate between the management of data for the operation level, and data for the decision level, do not provide solutions to this challenge. On the contrary, they also add complexity to the system, by replicating data in different layers of the system, and introducing synchronization processes across layers.

All the above observations show that a unified access to data and an effective governance of processes and services are extremely difficult goals to achieve in modern information systems. Yet, both are crucial objectives for getting useful information out of the information system, as well as for taking decisions based on them.

This explains why organizations spend a great deal of time and money on understanding, governing, managing and integrating the data stored in different sources and the processes/services that operate on them, and why this problem is often cited as a key and costly Information Technology challenge faced by medium and large organizations today (Bernstein and Haas 2008).

In the next section we advocate for an OBDM (Lenzerini 2011) approach as a promising direction for addressing the above challenges.

Our proposal: an Ontology-Based Data Management approach (OBDM)

In this paper we argue that Sapientia, the ontology of the multi-dimensional research assessment with its underlying OBDM approach, may be a powerful tool to coordinate, integrate and maintain the data needed for STI policy development.

The key idea of OBDM is to resort to a three-level architecture, constituted by the ontology, the sources, and the mapping between the two. The ontology is a conceptual, formal description of the domain of interest to a given organization (or, a community of users), expressed in terms of relevant concepts, attributes of concepts, relationships between concepts, and logical assertions characterizing the domain knowledge. The data sources are the repositories accessible by the organization where data concerning the domain are stored. In the general case, such repositories are numerous, heterogeneous, each one managed and maintained independently from the others. The mapping is a precise specification of the correspondence between the data contained in the data sources and the elements of the ontology.

The main purpose of an OBDM system is to allow information users to query the data using the elements in the ontology as predicates. In this sense, OBDM can be seen as a form of information integration, where the usual global schema is replaced by the conceptual model of the application domain, formulated as an ontology expressed in a logic-based language. With this approach, the integrated view that the system provides to information users is not merely a data structure accommodating the various data at the sources, but a semantically rich description of the relevant concepts in the domain of interest, as well as the relationships between such concepts. The distinction between the ontology and the data sources reflects the separation between the conceptual level, the one presented to the user, and the logical/physical level of the information system, the one stored in the sources, with the mapping acting as the reconciling structure between the two levels. This separation brings several potential advantages.
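The three-level architecture and the idea of querying through the ontology can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the predicate names (`Researcher`, `Publication`, `authorOf`), the two toy sources and the `Mapping` helper are hypothetical and are not taken from Sapientia or from any OBDM tool.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Ontology level: the vocabulary users query with (hypothetical names).
ONTOLOGY = {"Researcher", "Publication", "authorOf"}

# Source level: two independent repositories with unrelated layouts (made-up data).
HR_ROWS = [{"emp_id": "r1", "role": "professor"}, {"emp_id": "r2", "role": "postdoc"}]
BIBLIO_ROWS = [
    {"doi": "10.1/a", "author_ids": "r1;r2"},
    {"doi": "10.1/b", "author_ids": "r2"},
]


@dataclass
class Mapping:
    """Mapping level: states how rows of one source populate one ontology predicate."""
    predicate: str
    rows: list
    to_tuples: Callable[[dict], Iterable[tuple]]


MAPPINGS = [
    Mapping("Researcher", HR_ROWS, lambda r: [(r["emp_id"],)]),
    Mapping("Publication", BIBLIO_ROWS, lambda r: [(r["doi"],)]),
    Mapping("authorOf", BIBLIO_ROWS,
            lambda r: [(a, r["doi"]) for a in r["author_ids"].split(";")]),
]


def virtual_facts(mappings):
    """Apply every mapping and collect the resulting facts per ontology predicate."""
    facts = {name: set() for name in ONTOLOGY}
    for m in mappings:
        if m.predicate not in ONTOLOGY:
            raise ValueError(f"unknown ontology predicate: {m.predicate}")
        for row in m.rows:
            facts[m.predicate].update(m.to_tuples(row))
    return facts


FACTS = virtual_facts(MAPPINGS)


def publications_of(researcher):
    """A query phrased only in ontology terms; it never mentions source columns."""
    return sorted(doi for author, doi in FACTS["authorOf"] if author == researcher)


print(publications_of("r2"))
```

The query function refers only to `authorOf`, so the layout of the bibliographic source could change without touching it; only the corresponding mapping would need to be revised.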

Firstly, the ontology layer in the architecture is the obvious means for pursuing a declarative approach to information integration, and, more generally, to data governance. By making the representation of the domain explicit, we gain re-usability of the acquired knowledge, which is not achieved when the global schema is simply a unified description of the underlying data sources.

Secondly, the mapping layer explicitly specifies the relationships between the domain concepts on the one hand and the data sources on the other hand. Such a mapping is not only used for the operation of the information system, but also for documentation purposes. The importance of this aspect clearly emerges when looking at large organisations where the information about data is scattered across separate pieces of documentation that are often difficult to access and rarely conform to common standards. The ontology and the corresponding mappings to the data sources provide a common ground for the documentation of all the data in the organisation, with obvious advantages for the governance and the management of the information system.

A third advantage has to do with the extensibility of the system. One criticism that is often raised against data integration is that it requires merging and integrating the source data in advance, and this merging process can be very costly. However, the ontology-based approach we advocate does not require the data sources to be fully integrated at once. Rather, after building even a rough skeleton of the domain model, one can incrementally add new data sources or new elements therein, when they become available, or when needed, thus amortising the cost of integration. Therefore, the overall design can be regarded as the incremental process of understanding and representing the domain, the available data sources, and the relationships between them. The goal is to support the evolution of both the ontology and the mappings in such a way that the system continues to operate while evolving, along the lines of “pay-as-you-go” data integration (Sarma et al. 2008). Table 2 summarizes the main advantages of the OBDM approach.
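The incremental, pay-as-you-go style of integration described above can be sketched as follows. The `ObdmSystem` class, the predicate name and both sources are hypothetical and only illustrate the idea that a new source is attached by adding a mapping, while existing queries keep working unchanged.

```python
from collections import defaultdict


class ObdmSystem:
    """Toy registry of ontology predicates and mappings; not a real OBDM engine."""

    def __init__(self, predicates):
        self.predicates = set(predicates)
        # predicate name -> list of zero-argument functions returning sets of instances
        self.mappings = defaultdict(list)

    def add_mapping(self, predicate, extractor):
        if predicate not in self.predicates:
            raise KeyError(f"unknown ontology predicate: {predicate}")
        self.mappings[predicate].append(extractor)

    def instances(self, predicate):
        """Union of what every registered mapping yields for this predicate."""
        result = set()
        for extractor in self.mappings[predicate]:
            result.update(extractor())
        return result


# Step 1: a rough ontology skeleton and a single mapped source (made-up data).
system = ObdmSystem({"Publication"})
journal_db = [{"doi": "10.1/a"}, {"doi": "10.1/b"}]
system.add_mapping("Publication", lambda: {row["doi"] for row in journal_db})
before = system.instances("Publication")

# Step 2: a second source becomes available later; only a new mapping is added.
repository_dump = [("10.1/b", "preprint"), ("10.1/c", "preprint")]
system.add_mapping("Publication", lambda: {doi for doi, _kind in repository_dump})
after = system.instances("Publication")

print(sorted(before))
print(sorted(after))
```

The same call, `system.instances("Publication")`, is used before and after the second source is attached; the cost of integrating that source is paid only when it is actually needed.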

Table 2 Main advantages of an OBDM approach over a traditional “silos”-based approach

The notions of OBDM were introduced in Calvanese et al. (2007), Poggi et al. (2008), Lenzerini (2011), and originated from several disciplines, in particular, Information Integration, Knowledge Representation and Reasoning, and Incomplete and Deductive Databases. The central notion of OBDM is therefore the ontology, and reasoning over the ontology is at the basis of all the tasks that an OBDM system has to carry out. In particular, the axioms of the ontology allow one to derive new facts from the source data, and these inferred facts greatly influence the set of answers that the system should compute during query processing. In the last decades, research on ontology languages and ontology inferencing has been very active in the area of Knowledge Representation and Reasoning. Description Logics (DLs, Baader et al. 2007) are widely recognized as appropriate logics for expressing ontologies, and are at the basis of the W3C standard ontology language (OWL). These logics permit the specification of a domain by providing the definition of classes and by structuring the knowledge about the classes using a rich set of logical operators. They are decidable fragments of mathematical logic, resulting from extensive investigations on the trade-off between expressive power of Knowledge Representation languages, and computational complexity of reasoning tasks. Indeed, the constructs appearing in the DLs used in OBDM are carefully chosen taking into account such a trade-off (Calvanese et al. 2007). As indicated above, the axioms in the ontology can be seen as semantic rules that are used to complete the knowledge given by the raw facts determined by the data in the sources. In this sense, the source data of an OBDM system can be seen as an incomplete database, and query answering can be seen as the process of computing the answers logically deriving from the combination of such incomplete knowledge and the ontology axioms. 
Therefore, at least conceptually, there is a connection between OBDM and the two areas of incomplete information (Imielinski and Lipski 1984) and deductive databases (Ceri et al. 1990).
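The role of the axioms in completing the raw source facts can be made concrete with a small sketch. It is restricted to simple class inclusions ("every A is a B") and uses materialization by forward chaining, which is an assumption chosen for brevity; the description logics used in OBDM are richer, and actual systems may answer queries differently, for example by rewriting them. All class names and individuals are hypothetical.

```python
# Ontology axioms of the form "every instance of the first class is an instance
# of the second" (hypothetical class names).
SUBCLASS_OF = {
    ("JournalArticle", "Publication"),
    ("ConferencePaper", "Publication"),
    ("Publication", "ResearchOutput"),
}

# Raw facts retrieved from the sources through the mappings: (class, individual).
SOURCE_FACTS = {
    ("JournalArticle", "p1"),
    ("ConferencePaper", "p2"),
    ("Publication", "p3"),
}


def saturate(facts, axioms):
    """Add every fact implied by the inclusion axioms until nothing new appears."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for sub, sup in axioms:
            new = {(sup, ind) for cls, ind in derived if cls == sub} - derived
            if new:
                derived |= new
                changed = True
    return derived


def answers(cls, facts):
    """Individuals known to belong to the given class."""
    return {ind for c, ind in facts if c == cls}


raw = answers("Publication", SOURCE_FACTS)
inferred = answers("Publication", saturate(SOURCE_FACTS, SUBCLASS_OF))

print(sorted(raw))
print(sorted(inferred))
```

Looking only at the raw facts, `p3` is the single known publication; once the axioms are taken into account, `p1` and `p2` are answers as well, which is the sense in which the source data behave like an incomplete database.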

The OBDM approach has been implemented in a research assessment framework within a research project funded by the University of Rome La Sapienza, whose output is Sapientia, the ontology of multidimensional research assessment (Footnote 1).

The main objective of Sapientia is to model all the activities relevant for the evaluation of research and for assessing its impact. By impact, in a broad sense, we mean any effect, change or benefit, to the economy, society, culture, public policy or services, health, the environment or quality of life, beyond academia (REF 2012). Sapientia 1.0 was closed on the 22nd of December 2014, and was organized in 14 modules (Overview, Agent, Activity, Research Activity, Educational Activity, Conferring degrees activity, Publishing activity, Preservation activity, Funding activity, Inspecting activity, Producing activity, Space, Taxonomy and Time), including around 350 symbols (concepts, relations and attributes).

We are consolidating Sapientia, completing its documentation and investigating its interoperability with other existing initiatives, such as STAR Metrics, CERIF (http://www.eurocris.org), CASRAI (www.casrai.org), ISNI (www.isni.org) and so on. We found that our ontology is complementary to the existing initiatives, and that the top-down approach we followed in its design and development is fully interoperable with the initiatives cited above. Sapientia will be published on-line afterwards.

The current version of Sapientia, version 2.0, includes 11 modules that are organized according to Fig. 1, whose main agents and activities for each module are reported in Fig. 2.

Fig. 1 The 11 modules of Sapientia 2.0: the ontology of multidimensional research assessment

Fig. 2 Main agents and activities of Sapientia 2.0

As illustrated in Fig. 1, the Sapientia ontology models the main activities (Module 2) carried out by the agents (Module 1). It includes a core set of modules, namely Research (Module 3), Education (Module 4) and Outcomes, including production, services and other third mission activities (Module 8). These activities are part of an extended set of modules which includes an ancillary module of Research (Module 5 Publishing) and two other modules containing activities relevant to fostering the relationships among the core set of modules (i.e. Module 6 Resources, including funding and projects, and Module 7 Review). The 11 modules that compose Sapientia are briefly described in Table 3.

Table 3 Description of the Sapientia 2.0’s modules

An OBDM approach to specify Science, Technology and Innovation (STI) indicators in an innovative way

The increased availability of data sources, the need to combine several assessment criteria and their actual use call for an overarching structure to overcome the main problems in STI indicator development, which are listed below (and summarized in Table 4, left column):

  • Concepts are not clearly defined (e.g. what is a “publication”?)

  • Informal definitions can be based on everyday language

  • One concept name may refer to different concepts

  • Ad hoc definitions of indicators based on available datasets or specific user needs

  • Indicators are not re-usable in future contexts

  • Database content is not fully transparent

  • Aggregate indicators cannot be decomposed into smaller units.

Table 4 Problems in STI design and benefits of an OBDM approach

Table 4 (right column) reports the ways in which an OBDM approach may help in addressing the above-mentioned problems. In Daraio et al. (2015b) we describe in detail the ability of Sapientia to specify the performance indicators proposed by the Assessment of University-Based Research (AUBR 2010).

An OBDM approach offers the possibility to develop indicators according to the following dimensions (see Table 5).

Table 5 Dimensions of indicators in an OBDM framework

The main benefits of this approach for indicators’ designers and users (summarized in Table 4, right column) are the formal specification of the indicators, which is made independently of the data, and the opportunity to compute “comparable” indicator values at different levels of aggregation. Moreover, it offers a reference system to check the quality and comparability of the heterogeneous data sources and it provides an unambiguous way to define and compute the indicators. Finally, the knowledge on the indicator system (concepts and data sources) is embedded in a formal framework. This knowledge can be transferred more easily to new generations of producers and users.
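As an illustration of an indicator that is specified once, in ontology terms, and then computed at several levels of aggregation, consider the following sketch. The relations `AUTHOR_OF`, `MEMBER_OF` and `PART_OF`, the identifiers and the indicator itself (a count of distinct publications per unit) are hypothetical examples, not indicators defined in Sapientia.

```python
# Atomic facts expressed in ontology terms (made-up data).
AUTHOR_OF = {("r1", "p1"), ("r2", "p1"), ("r2", "p2"), ("r3", "p3")}
MEMBER_OF = {"r1": "dept_cs", "r2": "dept_cs", "r3": "dept_math"}
PART_OF = {"dept_cs": "univ_x", "dept_math": "univ_x"}

# Each aggregation level says how to go from a researcher to the unit of analysis.
LEVELS = {
    "researcher": lambda r: r,
    "department": lambda r: MEMBER_OF[r],
    "university": lambda r: PART_OF[MEMBER_OF[r]],
}


def publication_count(level):
    """Number of distinct publications per unit at the requested level."""
    unit_of = LEVELS[level]
    publications = {}
    for researcher, publication in AUTHOR_OF:
        publications.setdefault(unit_of(researcher), set()).add(publication)
    return {unit: len(pubs) for unit, pubs in publications.items()}


for level in ("researcher", "department", "university"):
    print(level, publication_count(level))
```

Because the indicator is computed from atomic facts rather than from pre-aggregated figures, the co-authored publication `p1` is counted once for `dept_cs` even though it appears in the counts of both `r1` and `r2`; the department value is therefore not simply the sum of the individual values.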

Using Sapientia for science of science policy

The adoption of an OBDM approach allows us to contribute to enriching the methodologies available for science of science policy (Fealing et al. 2011) and research assessment.

We consider the building of descriptive, interpretative, and policy models of our domain as a distinct step with respect to the building of the domain ontology. The ontology will intermediate the use of data in the modelling step, and should be rich enough to allow the analyst the freedom to define any model she considers useful to pursue her analytic goal.

Obviously, the actual availability of relevant data will constrain both the mapping of data sources on the ontology, and the actual computation of model variables and indicators of the conceptual model. However, the analyst should not refrain from proposing the models that she considers best suited to her purposes, or from expressing, using the ontology, the quality requirements and the logical and functional specifications for her ideal model variables and indicators. This approach has many merits, and in particular:

  • it permits the use of a common and stable ontology as a platform for building different models and indicators;

  • it directs the efforts to enrich data sources and verify their quality;

  • it makes transparent and traceable the process of approximation of variables and models when the available data are less than ideal;

  • it makes use of every source at the best level of aggregation, usually the atomic one (see examples in the following), allowing subsequent, multilevel and multidimensional aggregations.

In this framework, exploratory data analysis and the building of synthetic indicators are only an intermediate step of the modelling effort, which aims at the interpretation of behaviours, the explanation of differences in performance, and the identification of causal chains of phenomena. That leads to the development of a policy-design model, whose inputs are policy instruments, and whose outputs are performance indicators for research activities and economic welfare.

The learning and theory building process requires feedback that could also concern the ontology level: the addition of new concepts and data, through the specialization of general concepts or the enlargement of the ontology commitment, could reflect the intermediate achievements of the learning process, such as the need to improve the theories submitted to test.

More often, however, a well-conceived ontology will withstand the competency test implied by new models and theories, and the most serious constraint on model development will be the impossibility of a complete mapping between the ontology and the sources, i.e. the lack of data. This is a negative result only in the short term. In the medium and long term, the dialogue within the community of researchers that use the ontology as a workbench will result in a joint effort towards other stakeholders in order to improve the detail, quality and scope of data collection.

Moreover, the shared use of logically sound definitions for indicators increases the ability of analysts to compare their studies and to test old and new theories.

Consider as an example the important issue of the assessment of the effects of scale economies on the performance of a research institution and of its affiliates. The results can widely differ if you set the analysis at different levels of aggregation: all the public research and education institutions of single countries, single universities, faculties, let’s say, of Science and Technology, departments of Computer Science, research groups, or individuals within these groups.

Moreover, at different aggregation levels, the possible moderating variables or causes of different performances can widely differ. Legislation and regulation, public funding, teaching fees and duties matter at the national level. Geography, characteristics of the local economic and cultural system, effectiveness of research and recruiting strategy, budgeting and infrastructures matter at the university or department level. Intellectual ability of researchers, history and stability of the group, ability to recruit doctoral students and worldwide network of contacts matter at the level of research groups and individuals.

Time is a crucial dimension of research modelling. We pursue a modelling approach based on processes, i.e. collections of activities performed by agents through time, following Georgescu-Roegen (1970, 1972, 1979). Therefore, to represent the knowledge production activities, at an atomic level, we aim to consider both stock inputs, such as the cumulated results of previous research activities (those available in relevant publications, and those embodied in the authors’ competences and potential) and the infrastructure assets, and flow inputs, such as the time devoted by the group of authors to current research projects. Similarly, we aim to analyse the output of teaching activities, considering the joint effect of resources such as the competence of teachers, the skills and the initial education of students, and educational infrastructures and resources. Moreover, service activities of research and teaching institutions provide infrastructural and knowledge assets that have an impact on the innovation of the economic system; therefore, the perimeter of our domain should allow us to consider the different channels of transmission of that impact: mobility of researchers, career of alumni, applied research contracts, joint use of infrastructures, and so on. In this context, different theories and models of the system of knowledge production could be developed and tested.
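A process observed at the atomic level, with its stock and flow inputs, could be recorded along the following lines. This is only a sketch of a possible data structure: the `ResearchProcess` class, its field names and all figures are hypothetical and do not reproduce the way Sapientia models processes.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchProcess:
    """One atomic knowledge-production process observed from `start` to `end` (years)."""
    agents: tuple
    start: int
    end: int
    # Stock inputs: available throughout the process (e.g. cumulated prior results, assets).
    stocks: dict = field(default_factory=dict)
    # Flow inputs: quantities devoted to this process (e.g. time spent by the authors).
    flows: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)

    @property
    def duration_years(self):
        return self.end - self.start + 1

    def output_per_flow(self, output, flow):
        """Units of the given output obtained per unit of the given flow input."""
        return self.outputs[output] / self.flows[flow]


# Made-up example: a two-author project running over three calendar years.
project = ResearchProcess(
    agents=("r1", "r2"),
    start=2012,
    end=2014,
    stocks={"prior_publications": 40, "lab_equipment_units": 3},
    flows={"person_months": 30},
    outputs={"publications": 6},
)

print(project.duration_years)
print(project.output_per_flow("publications", "person_months"))
```

Keeping stocks and flows in separate fields makes it explicit which inputs are related to the duration of the process and which are simply available to it, a distinction that matters when processes of different length are compared.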

To bridge the gaps existing in the literature, and to integrate existing bottom-up initiatives in a coherent, theoretically based platform, we suggest an OBDM approach.

We need a change in the overall approach to the assessment of Science and Technology: metrics and indicators can have negative effects on the scientific community because they encourage a reductionist philosophy; on the contrary, we propose using well-defined concepts and data to build interpretative models, in order to compare and discuss theories (Footnote 2). That can be useful both to promote a pluralistic community of analysts, and to build consensus on less superficial evaluation procedures of researchers and institutions (Footnote 3). Moreover, indicators are often produced in closed circles, collecting ad hoc databases, with no built-in interoperability, updating and scalability features.

We have to move towards an environment in which data are publicly available, collected and maintained on stable platforms, where ontologies give confidence in the precise meaning of data to people who propose models and to those who evaluate them. These repositories of knowledge can evolve following the analytical needs of the research community and the policy institutions, instead of starting from scratch each time a new research project starts. We propose our Sapientia ontology as a starting point to be opened, shared with the community and further developed and integrated with existing bottom-up initiatives as well as with new theories and paradigms.

Conclusions

The rapid expansion of big data and open data, the altmetrics movement, the complexity of research assessment and increasingly demanding policy needs call for new ways of data integration and interoperability among many heterogeneous data sources, including Big Scholarly Data, such as publications and citations.

Although there have been several initiatives of governments and research projects, the main problems of integration of data on STI are far from being solved. The existing initiatives, indeed, do not solve the main problems related to the integration of heterogeneous sources of data, such as the data quality issues; the comparability problems; the lack of standardization, interoperability and modularization; the difficulties in the creation of concordance tables among different classification schemes; the difficult and costly extension and update of the integrated database built on independent and heterogeneous databases.

In this paper we argue that the ontology of the multi-dimensional research assessment (Sapientia) with its underlying OBDM approach may be a powerful tool to coordinate, integrate and maintain the data needed for STI policy development. The OBDM approach we propose is a form of integration of information in which the global schema of data is substituted by the conceptual model of the domain, formally specified through an ontology.

Our approach, implemented in the Sapientia ontology, offers a transparent platform on which to base the evaluation process; makes it possible to define and specify unambiguously the indicators on which the evaluation is based; allows us to track their evolution over time; makes it possible to analyse the feedback of the indicators on the behavior of scholars and to detect opportunistic behaviors; and provides a monitoring system to track over time the changes in the established evaluation criteria and their consequences for the research system. We claim that a higher availability and a more transparent view of scholarly outcomes may improve the understanding of basic science by society at large and the communication of research outcomes to the public, which, in the present economic phase, increasingly takes a value-for-money approach to the funding of science.

Furthermore, our approach, by providing a stable but flexible and extensible platform, might be able to foster the involvement and contribution of scholars to the evaluation process and therefore may contribute to the development of the Web of Scholars.

Although much research on this issue remains to be carried out, we argue that this approach could be very promising for resolving the important open questions mentioned in this work, and that a new line of research based on an OBDM approach could successfully contribute to solving some of the key issues raised in this paper.