1 Introduction

Evaluation in science is increasingly becoming routine work based on metrics [1]. Given the wide diversity of metrics, it is important that they are chosen carefully, e.g. in compliance with the research institution’s strategic goals.

Research indicators can be used for different purposes [2]: science policy making at the state level; distribution of research funding; organization and management activities, e.g. in human resource management for recruiting or promoting employees involved in research; content management and decisions at the level of individual researchers, e.g. where and what to publish; and providing consumer information, e.g. university rankings that include science indicators. These usage examples correspond to the group of research performance indicators and do not include input indicators, such as the number of researchers.

At the institutional level, research evaluation can be performed to support the achievement of strategic goals. Whereas quantitative measures such as the number of scientific papers, the amount of funding, the number of scientific staff and many others are commonly used for such evaluation, the strategy of the institution can be set to achieve ambitious scientific goals. Therefore, a question arises as to how the more quality-oriented aspects of research outcomes can be measured. For example, to allocate funds to excellent institutions at the state level, measurement methods from three categories are used: peer review-based models, publication count-based models and citation-based models [3].

We propose a data integration architecture for the analysis of bibliometric information, which is one of the research performance indicators. Bibliometric indicators can also be classified in more detail as quantity, quality and structural indicators [4]. An example of a quantity indicator is the number of publications. An example of a quality indicator is the h-index. Structural indicators allow connections to be evaluated, for example, co-authorship across different fields, institutions or countries.

To supply an appropriate dataset for evaluating metrics of both types, quantitative and qualitative, a suitable framework should be provided. It should ensure that neither incomplete nor faulty data are used, that metric computation formulas are discussed and valid, and that the computed metrics are interpreted correctly. To provide such a framework with the best possible features, data from various available sources should be integrated to achieve an overall view of the scientific activity of an institution, while also solving data quality issues.

The principles characterizing best practice in metrics-based research assessment are given in the “Leiden manifesto” [1]. Some of these principles should be considered when building a data collection and integration system for effective science evaluation: for example, data collection should be transparent, the institutions and persons being evaluated should be able to verify the data provided for evaluation, and the indicator values should be updated regularly.

Research information management has many properties that are typical of data integration scenarios: many data sources with inconsistent data models, heterogeneity, and many involved stakeholders with diverse goals [5]. Knowledge from the data integration field can be used to simplify data collection and integration in the research information field. For example, a uniform data model or standard can be used [5].

One such standard is CERIF (Common European Research Information Format), which includes information about projects, persons, publications, organizations, and other entities. Many research information systems in Europe are built in alignment with this standard [6]. However, other standards and models are also significant for describing research information, among them DOI (Digital Object Identifier) for identifying publications and ORCID (Open Researcher and Contributor ID) for identifying authors of publications.

Today there are many efforts to implement information systems that support research evaluation activities. Institutions develop their own products or use commercial or non-commercial ones to maintain data about research results. Many information sources have been used at the University of Latvia (LU) for some time to gain insight into the actual state of research activities and their outcomes, but this process needed improvement by providing integrated information oriented to scientific excellence. The requirements for research evaluation in Latvia are declared in regulations issued by the government, which prescribe how the funding for scientific institutions is calculated [7, 8]. As stated in these regulations, the productivity of scientific work is evaluated according to the number of publications indexed in Scopus or Web of Science (WoS). These quantitative data must be extended with the data necessary for the computation of qualitative indicators.

The paper presents a publication data management system for excellence-based research analysis at LU. The system integrates data available in the university information system with data from the library information system as well as with data obtained via API from the Scopus and WoS databases. The paper discusses data integration flows and data integration problems, including data quality issues. A data model of the integrated dataset is also presented. Based on this data model and the integrated data, examples of quality-oriented metrics and the results of their analysis are provided.

2 Related Work

A research information system in Scandinavia [9] is an example of a system that is implemented and used in Denmark, Finland, Norway, and Sweden and mostly contains integrated, high-quality bibliometric data. The system is used for performance-based funding. Notably, this system also has its own publication indicator, which weights the results from different fields so that they can be compared.

In the field of data integration, recent approaches focus on mappings between models and on integration processes. In research information systems, too, mappings between different systems should be considered important parts, since they provide specifications for data integration processes [5].

The Polish Performance-research funding system allows 65 parameters to be evaluated. In total, 962 research units provided data about more than a million research outcomes for a 4-year period. The data were collected through a questionnaire submitted via the Information System on Higher Education in Poland. The study [10] was performed to identify the most important metrics in order to facilitate the transition to a more targeted system that meets the excellence requirements, in which only the most important metrics are reported. The research showed that many of the existing metrics are not significant for the evaluation.

The Italian experience [11] describes the implementation of a research information system in Italy, where 66 institutions introduced IRIS, a system based on DSpace [12] and customized for the Italian environment. ORCID was also used at the national level. Entities, attributes and relations in this system are compliant with the CERIF ontology [13]. The large amount of collected data made it possible to understand the overall situation in research and, for example, to develop new publication strategies. The authors of the study also mention problems with data quality, since not all institutions control the data collection process or validate the data provided by researchers.

3 Data Model

To support analysis and evaluation of the scientific activity of LU members involved in research, both employees and students, we propose a system architecture that integrates information about publications from all accessible data sources that include the library information system ALEPH, LU management information system LUIS, SCOPUS and WoS databases. To maintain publication information and support reporting and analysis, the publication data from multiple sources are linked and stored in a repository in LUIS. The data model of the repository is depicted in Fig. 1.

Fig. 1. Publication repository data model

The central class to store bibliographical data about a publication as well as number of citations in SCOPUS and WoS databases is Publication. The bibliographical data in this class are entered by the author of the publication or the faculty staff or they are populated during the data loading process from ALEPH or SCOPUS databases. For each publication indexed by SCOPUS or WoS, we also include the corresponding number of citations as well as publication identifiers used in both databases to maintain a link with SCOPUS and WoS as data sources.

The information about authors of the publication is reflected by the classes Author and SCOPUS Author. The class Author represents ordered authors of the publication, which are affiliated with LU as well as with other institutions.

Authors recognized as affiliated with LU are also linked to LU Person class, which stores personal and other information used for different functionality of LUIS. For foreign authors, we store only their name as it appears on the paper. If a publication is indexed by SCOPUS, we also collect author information from it, which includes name, surname, H-index and author ID assigned by SCOPUS, which is used in the author matching process.

We also store the information about the affiliation of the publication with LU department or faculty, which is represented as the class LU Department. This information is obtained automatically from data about the work place of the author and may be corrected by the responsible faculty or library staff. If the publication is indexed by SCOPUS, we also store the information about the institutions the publication authors are affiliated with in the class SCOPUS Affiliation. This information is necessary to analyze connections with co-authors from different institutions or countries.

For the analysis of the quality of publications, we use not only the citation number, but also other citation metrics provided by SCOPUS. The absolute values of such metrics are calculated for journals and serial conference proceedings yearly and their values are represented by the class Journal Citation Metrics. We also collect information about the open access status of the journal or conference proceedings and include it in the class Serial Title. In SCOPUS database, journals are also ranked among other journals that belong to the same subject areas according to their values for CiteScore metric [14], an alternative to WoS Journal Impact Factor. Journal rank information is represented by the class Journal Rank and it is connected with the corresponding subject area (class SCOPUS Subject). For reporting on publications of different OECD categories, we store the correspondence of SCOPUS subject areas and Field of Science and Technology (FOS) categories.
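The classes described in this section can be pictured with a minimal Python sketch. Since Fig. 1 is not reproduced here, the attribute names and types below are assumptions inferred from the text, not the actual LUIS schema; only the class names follow the description above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LUPerson:
    person_id: int          # hypothetical internal LUIS identifier
    full_name: str


@dataclass
class ScopusAuthor:
    scopus_author_id: str   # author ID assigned by SCOPUS
    name: str
    surname: str
    h_index: Optional[int] = None
    lu_person: Optional[LUPerson] = None   # set once author matching succeeds


@dataclass
class Author:
    position: int           # order of the author on the paper
    name_on_paper: str
    lu_person: Optional[LUPerson] = None   # None for authors outside LU
    scopus_author: Optional[ScopusAuthor] = None


@dataclass
class ScopusSubject:
    name: str
    fos_category: Optional[str] = None     # corresponding OECD FOS category


@dataclass
class JournalRank:
    year: int
    subject: ScopusSubject
    rank: int               # rank among journals of the same subject area


@dataclass
class JournalCitationMetrics:
    year: int               # metrics are calculated yearly
    snip: Optional[float] = None
    sjr: Optional[float] = None
    cite_score: Optional[float] = None


@dataclass
class SerialTitle:
    title: str
    open_access: bool = False
    metrics: list[JournalCitationMetrics] = field(default_factory=list)
    ranks: list[JournalRank] = field(default_factory=list)


@dataclass
class Publication:
    title: str
    year: int
    doi: Optional[str] = None
    scopus_id: Optional[str] = None        # link to the SCOPUS record
    wos_id: Optional[str] = None           # link to the WoS record
    scopus_citations: Optional[int] = None
    wos_citations: Optional[int] = None
    authors: list[Author] = field(default_factory=list)
    serial_title: Optional[SerialTitle] = None
    lu_departments: list[str] = field(default_factory=list)
    scopus_affiliations: list[str] = field(default_factory=list)
```

In this sketch, a publication that is not indexed by SCOPUS simply keeps `scopus_id`, `scopus_citations` and `serial_title` empty, which mirrors the optional nature of the SCOPUS-related classes in the model.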

4 Scenarios of Obtaining Publication Data

To accumulate the most complete list of publications authored by LU staff and students in the central repository in LUIS, we gather publication data from different sources, link publications to the corresponding members of LU staff, correct errors and duplicates, and provide the consolidated information through a set of reports used by the management of LU. The following subsections discuss the four scenarios of obtaining publication data as well as the data flows related to each scenario.

4.1 Publication Data Added by Authors

LU employees, PhD and Master’s degree students are able to add information about their co-authored publications to the repository themselves. The process by which publication data are entered by authors is depicted in Fig. 2. The LUIS system maintains user profiles for all LU members. Among various other information about a user, a profile includes a section devoted to research, which in turn contains a list of the author’s publications obtained from the repository. An author can supplement this list by adding newly published articles or articles that were not loaded automatically. To avoid the creation of duplicates, before adding them the author is automatically asked to search the LUIS repository and the library information system ALEPH for his/her publications that are not yet linked with the author’s profile. If the desired article is found, it can be added to the profile. In this case, the author does not need to supply any additional information about the article.

Fig. 2. Publication data added by authors

If, however, the article is not present in either of the systems, the author has to specify the type and subtype of the publication (for example, journal article, book chapter, book, etc.) and supply bibliographical information. In addition, the author must indicate the status of the publication (published, submitted for publication, developed or under development), attach the publication file (full text or book cover) and indicate whether it can be accessed publicly in the e-resource repository of LU. After the author has finished entering the publication data, they are transferred to the library information system ALEPH. To ensure the best possible data quality, members of the library staff validate the publication data, correct any errors and approve the publication. Finally, the publication data are synchronized back with LUIS and become available for evaluation and reports.

Besides entering new data, LUIS users can unlink from their profiles any mistakenly attached publications that were automatically loaded into the repository from other sources, or confirm that author matching was performed correctly.

4.2 Publication Data Added by Faculty Staff

Data about publications authored by the faculty members can also be entered into the repository by specially designated faculty staff (Fig. 3). The procedure for adding data is similar to the one that is performed by authors. The differences are that faculty staff can record data about publications authored by other faculty members, correct erroneous links between publications and authors and adjust the list of affiliated LU departments.

Fig. 3. Publication data added by faculty staff

4.3 Publication Data Obtained from SCOPUS and Web of Science

There are two external sources of publication data that are used in the data loading process: the SCOPUS and WoS citation database systems. SCOPUS offers an API, which makes it possible to search for publications authored by LU members and to extract bibliographical data of such publications as well as various citation metrics. The extraction and loading process (Fig. 4) is run daily. Articles that were published during the last 2 years are inserted into the repository or updated daily, whereas all other publications are updated on a weekly basis. Bibliographical information is extracted from SCOPUS and loaded into the repository table that corresponds to the class Publication of the repository data model. Data about authors (unique author identifier, name, surname, H-index) and about the affiliations of the publication and its authors are loaded into tables that correspond to the classes Author, SCOPUS Author and SCOPUS Affiliation of the repository data model. Affiliations are associated with authors as well as directly with publications. In addition to bibliographical and author information, citation metrics are also obtained. These include the current number of citations of individual publications as well as citation metrics for the particular journal or conference proceedings: Source Normalized Impact per Paper (SNIP) [15], the SCImago Journal Rank (SJR) [16], and CiteScore [14]. Previously, it was possible to obtain the Impact per Publication (IPP) metric [17], which is no longer available from SCOPUS, so this number is retained for previously loaded publications and is not loaded for new ones.

Fig. 4. Publication data obtained from SCOPUS and Web of Science
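The refresh schedule of the SCOPUS loading process (daily for publications of the last 2 years, weekly for all others) can be sketched as a small predicate. The function name `is_due_for_update`, the choice of Monday as the weekly run day and the reading of "last 2 years" as a difference of calendar years are all assumptions made for illustration; the text does not specify them.

```python
from datetime import date


def is_due_for_update(publication_year: int,
                      run_date: date,
                      recent_years: int = 2,
                      weekly_weekday: int = 0) -> bool:
    """Decide whether a publication is refreshed in the daily run on run_date.

    Assumptions (not stated in the paper): a publication counts as recent
    when its year is at most `recent_years` before the run year, and the
    weekly refresh of older publications happens on one fixed weekday
    (0 = Monday, following datetime.date.weekday()).
    """
    if run_date.year - publication_year <= recent_years:
        return True                      # recent publications: every day
    return run_date.weekday() == weekly_weekday   # older ones: once a week
```

With these assumptions, a 2016 publication is refreshed in every run during 2017, whereas a 2010 publication is refreshed only in the Monday runs.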

The first step of the SCOPUS data loading process, which is executed for every new publication, is a recognition phase. The main goal of this phase is to identify publications that are already registered in the repository but are newly indexed by SCOPUS, in order to avoid the creation of duplicates. The first criterion used for recognition is the Digital Object Identifier (DOI), which is unique for every publication. If no publication with the same DOI is found in the repository, a search based on a similar title and the publication year is performed. To determine the existing publication with the most similar title in the repository, Jaro-Winkler similarity [18] is used, because titles may be spelled in different ways and data quality issues are sometimes present. Different thresholds for Jaro-Winkler similarity were tested, and the experimental evaluation of the matching results revealed that the most suitable threshold is 0.93; this coefficient is currently used to consider titles of publications similar.
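The recognition phase can be illustrated with a minimal sketch: a DOI comparison first, followed by a Jaro-Winkler comparison of titles among publications of the same year. The Jaro-Winkler function follows the standard definition (prefix scale 0.1, prefix length capped at 4); the `Pub` record, the `recognize` function, the case and whitespace normalization, and the inclusive comparison with the threshold are assumptions for illustration and are not taken from the LUIS implementation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order, k = 0, 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a common prefix of up to max_prefix chars."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * scale * (1 - j)


@dataclass
class Pub:                      # hypothetical, simplified publication record
    title: str
    year: int
    doi: Optional[str] = None


def normalize(title: str) -> str:
    # Assumed normalization: case-insensitive, collapsed whitespace.
    return " ".join(title.casefold().split())


def recognize(incoming: Pub, repository: Iterable[Pub],
              threshold: float = 0.93) -> Optional[Pub]:
    """Return the repository record matching `incoming`, or None."""
    repository = list(repository)
    if incoming.doi:                                  # primary criterion: DOI
        for pub in repository:
            if pub.doi and pub.doi.casefold() == incoming.doi.casefold():
                return pub
    best, best_score = None, threshold                # secondary: title + year
    for pub in repository:
        if pub.year != incoming.year:
            continue
        score = jaro_winkler(normalize(pub.title), normalize(incoming.title))
        if score >= best_score:
            best, best_score = pub, score
    return best
```

Restricting the title comparison to publications of the same year keeps the number of comparisons small and reflects the "similar title and publication year" criterion; whether the production system tolerates a year offset is not stated in the text.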

If the recognition process detects an existing publication in the repository or if a publication has already been previously updated with SCOPUS data, we update the number of citations for the publication as well as journal citation metrics and establish a link with a corresponding Scopus record for newly indexed publications by means of filling SCOPUS ID attribute of the Publication class (Fig. 1).

If a processed publication is new to the system, a new instance of the Publication class is created with all the bibliographical information obtained from the SCOPUS database. Publication authors are represented as instances of the Author and SCOPUS Author classes. Citation metrics and journal rank data are created, or updated if information about the journal has been loaded previously.

For a new publication, the author matching process is performed: for each author affiliated with LU, a corresponding instance of the LU Person class is searched for and, if found, it is associated with the corresponding instance of the Author class. The primary criterion for the author matching process is the SCOPUS author identifier, which uniquely identifies authors whose publications have been previously loaded into the repository. If the search by identifier is unsuccessful, the matching process uses the secondary criterion, which is a combination of the author’s name and surname. Matching authors by exact name and surname produces insufficient results, because publication authors tend to use different spellings of their names and surnames that do not always correspond to their full names. Furthermore, a full name can contain special characters of the local language, which may be substituted with English characters in the publication in different ways. For example, the Latvian letter ‘Š’ may be substituted with the English letter ‘S’ or with the two symbols ‘SH’. To solve this data quality issue, author matching based on names and surnames is performed using Jaro-Winkler similarity between the full name as it appears in the publication and the official full name of the author, i.e. the LU Person instance with the highest Jaro-Winkler similarity coefficient that exceeds a threshold is linked to the publication. We use the same similarity threshold of 0.93, which was selected based on experimental matching results. After a match is found, we also establish an association between the instance of the SCOPUS Author class and the corresponding instance of the LU Person class, so that the SCOPUS author identifier can be used as the primary criterion for matching future publications.
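The two-stage author matching described above can be sketched as a small cascade: a lookup by SCOPUS author identifier, a fallback to name similarity, and the recording of the identifier after a successful match. The class and method names are hypothetical. The similarity function is passed in as a parameter; the paper uses Jaro-Winkler, whereas `difflib_ratio` below is only a standard-library stand-in (it is not Jaro-Winkler, so the 0.93 threshold would need recalibration if it were used in practice).

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, Iterable, Optional


@dataclass
class LUPerson:                 # hypothetical, simplified person record
    person_id: int
    full_name: str


def difflib_ratio(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT the Jaro-Winkler used in the paper."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


class AuthorMatcher:
    def __init__(self, persons: Iterable[LUPerson],
                 similarity: Callable[[str, str], float],
                 threshold: float = 0.93) -> None:
        self.persons = list(persons)
        self.similarity = similarity
        self.threshold = threshold
        # SCOPUS author ID -> LU person, filled in as matches are found.
        self.by_scopus_id: Dict[str, LUPerson] = {}

    def match(self, scopus_author_id: str, name_on_paper: str) -> Optional[LUPerson]:
        # Primary criterion: an identifier seen in an earlier load.
        known = self.by_scopus_id.get(scopus_author_id)
        if known is not None:
            return known
        # Secondary criterion: the most similar official full name,
        # accepted only if its score exceeds the threshold.
        best, best_score = None, self.threshold
        for person in self.persons:
            score = self.similarity(name_on_paper, person.full_name)
            if score > best_score:
                best, best_score = person, score
        if best is not None:
            # Remember the link so that future publications match by ID.
            self.by_scopus_id[scopus_author_id] = best
        return best
```

Once the identifier is recorded, later publications of the same author are matched even when the name on the paper is abbreviated or transliterated differently, which is the reason for establishing the SCOPUS Author to LU Person association.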

When a new publication is loaded into the repository, the second phase of the process, publication data synchronization with the library information system ALEPH, is executed. During this phase, publication data are exported to ALEPH, bibliographical information is supplemented and any errors are manually corrected by the library staff to maintain the best possible data quality, and finally the updated data are imported back into the repository.

Another data source used in the integration process is the WoS web services. The version of the web services available at the University of Latvia does not include journal citation metrics and provides only limited bibliographical information, the number of citations, full names and, in some cases, Researcher identifiers of publication authors if they have registered such an identifier in the WoS database. Since the affiliation of authors is not available, the author matching process for WoS data turned out to produce too many incorrectly identified authors; therefore, a decision was made to add new Web of Science publications to the repository manually. In addition, the integration process matches publication data obtained from WoS with publications already available in the repository. Just as for publications loaded from SCOPUS, the primary criterion used for matching is the DOI and the secondary criteria are the title and publication year. The integration process regularly updates the WoS number of citations for recognized publications.

4.4 Publication Data Added by Library Staff

There is a considerable number of publications that are authored by LU members but not indexed by SCOPUS, especially in the humanities. Information about such publications is necessary for an accurate evaluation of the scientific activity of the institution. Therefore, library staff members manually add bibliographical information about publications to the library system ALEPH (Fig. 5). This is done when a librarian comes across a new publication authored by an LU member in a journal or conference proceedings, or when information about a new publication is obtained from the list of recently indexed publications in the Web of Science database, which is distributed monthly by Web of Science. When new publication data appear in ALEPH, a synchronization process integrates the bibliographical information into the repository to ensure that it always represents an overall view of the publication data.

Fig. 5. Publication data added by library staff

The synchronization process includes an author matching phase. During this phase, a corresponding LU person is searched for each author of a publication, using the same algorithm based on full name similarity as in the other matching phases. If the corresponding LU person is found, the publication is attached to his/her LUIS profile. We use the same minimal Jaro-Winkler similarity coefficient of 0.93 to consider names similar.

In addition to entering new publications into ALEPH, library staff members are also responsible for correcting and supplementing bibliographical data for publications added by authors and by faculty staff and for publication data imported from the Scopus database.

5 Data Analysis: Case Study

The context of the data collection and integration can be characterized by the total number of publications of LU researchers over the last 30 years, which is 42417. Of these, 6967 publications are indexed by Scopus and 7764 publications are indexed by WoS. The case study was performed using data for the LU Faculty of Computing, and the time frame chosen for the evaluation of research results was 2012–2016. We have already performed an initial evaluation of research performance by means of quantitative metrics [19]. To give context for the further data analysis, some numbers, e.g. the publication count and the number of Scopus publications, should be mentioned. The total number of publications at the faculty decreased from 99 in 2012 to 79 in 2015, but the number of Scopus publications is growing and reached 55 in 2015.

The goal of the case study was to define metrics based on the data attributes provided by the data model of the integrated publication information system, and to use them to evaluate the quality of the publications, identifying positive trends as well as problems with quality.

The quality aspects of a publication can be indirectly described by the quality characteristics of its source, e.g. journal quartiles that are computed from the citation count of all publications in the journal, because we can presume that a journal in the highest quartile Q1 will accept the best publications. Another group of quality indicators describes the quality of publications directly and is computed from the citation counts of the publications. The rest of this section describes different analysis scenarios for research output evaluation with quality metrics that can be implemented with the new publication module and data integration infrastructure.

For the 1st analysis scenario, the following research question was formulated: “How many faculty publications in Scopus are published in sources with and without computed quartiles?” Subsequently, a more detailed analysis was performed to find out how the publication count is divided among the quartiles.

The results are shown in Fig. 6(a) and (b). They reveal an unsatisfactory trend for the faculty: in 2015, the number of publications in sources without quartiles was increasing. The detailed analysis showed that in all years except the last one, 2016, the largest number of publications belongs to quartile Q3. With regard to excellence, the number of Q1 publications has been increasing over the last 3 years.

Fig. 6. Publication count in Scopus sources (a) with and without quartiles (b) detailed count by quartiles
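The counts behind the 1st analysis scenario can be computed from the integrated data with a simple aggregation. The sketch below assumes each Scopus publication is reduced to a `(year, quartile)` pair, where the quartile is `None` when the source has no computed quartile; the function names and this record shape are hypothetical and are not part of the described system.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

Record = Tuple[int, Optional[str]]   # (publication year, "Q1".."Q4" or None)


def count_with_and_without_quartile(records: Iterable[Record]) -> Counter:
    """Count publications per (year, has_quartile) pair."""
    counts: Counter = Counter()
    for year, quartile in records:
        counts[(year, quartile is not None)] += 1
    return counts


def count_by_quartile(records: Iterable[Record]) -> Counter:
    """Count publications per (year, quartile), ignoring sources without one."""
    counts: Counter = Counter()
    for year, quartile in records:
        if quartile is not None:
            counts[(year, quartile)] += 1
    return counts
```

The first function corresponds to the split shown in panel (a) and the second to the breakdown in panel (b); a `Counter` returns 0 for year and quartile combinations that do not occur.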

For the 2nd analysis scenario, the following research question was formulated: “How many faculty publications in Scopus are not cited, compared with all publications and with the publications in sources with computed quartiles?” Figure 7 shows that the proportion of uncited publications remains unchanged in sources with computed quartiles, but is growing among all Scopus publications.

Fig. 7. Publications that are not cited
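The proportions compared in the 2nd analysis scenario can be sketched as follows. The record shape `(year, quartile, citation_count)` and the function name `uncited_share` are assumptions for illustration; the share is reported as `None` when a year has no publications in the respective group.

```python
from typing import Dict, Iterable, Optional, Tuple

Record = Tuple[int, Optional[str], int]   # (year, quartile or None, citations)


def uncited_share(records: Iterable[Record]
                  ) -> Dict[int, Tuple[Optional[float], Optional[float]]]:
    """Per year: (uncited share among all Scopus publications,
    uncited share among publications in sources with a quartile)."""
    # year -> [all, uncited among all, with quartile, uncited with quartile]
    totals: Dict[int, list] = {}
    for year, quartile, citations in records:
        row = totals.setdefault(year, [0, 0, 0, 0])
        row[0] += 1
        row[1] += citations == 0
        if quartile is not None:
            row[2] += 1
            row[3] += citations == 0
    return {
        year: (
            row[1] / row[0] if row[0] else None,
            row[3] / row[2] if row[2] else None,
        )
        for year, row in totals.items()
    }
```

Comparing the two shares per year is what makes the trend visible: a growing share among all publications together with a stable share among publications with quartiles points to uncited papers accumulating in sources without computed quartiles.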

For the 3rd analysis scenario, the following research question was formulated: “What are the citation counts in Scopus sources with and without computed quartiles?” Figure 8(a) shows that the citation count in Scopus sources without computed quartiles is decreasing, while at the same time the citation count in the years 2012–2015 remains stable. A more detailed analysis (see Fig. 8(b)) shows a significant citation count for Q3 publications; however, this can be explained by the larger number of Q3 publications compared with the other quartiles.

Fig. 8. Citation count in Scopus sources (a) with and without quartiles (b) detailed citation count by quartiles

These results can encourage individual researchers to try publishing their work in sources one quartile higher, while for the faculty as a whole the shift from Q3 to Q2 sources may be the most promising and realistic.

6 Conclusions

The main contribution of this paper is an architecture that implements different data flows and integrates them into one consistent system for research output evaluation. The architecture is based on the idea of ensuring data quality, control, and transparency of the integration process by involving publication authors in different roles, as information providers or approvers.

The model of the integrated dataset was determined mostly by the existing information systems at the university and by the external interfaces (WoS and Scopus). The model is integrated with the global data model of the management information system of the University of Latvia, so research evaluation can be extended with other types of metrics, not only bibliometric ones.

The proposed architecture supports the calculation of both quantitative and qualitative metrics for the evaluation of publications. However, this paper concentrates mostly on the qualitative ones to support scientific excellence. The quantitative metrics can help to set further goals, based on the findings of the performed analysis.