1 Introduction

Information Science has its own scientific status as a social science. By its interdisciplinary nature, it interfaces with Mathematics, Logic, Linguistics, Psychology, Computer Science, Production Engineering, Graphic Arts, Communication, Librarianship, Administration, and other related scientific fields [3]. With regard to Data Provenance, both Information Science and Computer Science use the structures of scientific workflows, which are abstractions related to the source of data and are used to support the modeling of scientific experiments. Provenance is related to the audit, screening, lineage, and source of data. It can also be considered metadata that describes the origin of the data and the entire path taken to achieve the results of an experiment [10, 17].

This paper proposes a method for collecting provenance data related to anemia indices. According to specialists of a particular hemotherapy center in Brazil (which, for privacy reasons, is referred to here as “X Hemotherapy Center”), anemia is a generic name for a series of conditions characterized by a deficiency in hemoglobin concentration or in the production of red blood cells. Hemoglobin is a blood component whose function is to carry oxygen from the lungs to all cells in the body.

A current study shows that 30% of the world’s population is anemic, especially children under 2 years old and women of different age groups, although anemia can also occur in men and the elderly. In addition, it is estimated that 27% to 50% of the population is affected by iron deficiency, especially in lower-income and developing populations. In Brazil, the data vary according to the study and the population group analyzed. Overall, however, it is estimated that 40% to 50% of children have anemia [25]. In this sense, it is important to emphasize that the original database of the health institution studied in this paper does not establish systematic relationships between the stored variables that would allow analysis of the statements generated by the specialists when refusing blood donors because of anemia rates.

No computational analysis of the stored variables is performed to identify donations refused because of anemia indices and to plan possible future prevention; only expert-generated statements are available. These statements, recorded by biomedical specialists, do not always agree with the database variables, which hinders broader analysis. By incorporating the statements defined by the experts regarding anemia rates, it becomes possible to obtain a reduced dataset of higher quality. The analysis performed on the reduced dataset provided more reliable answers about a given biological phenomenon.

It was therefore important to create a framework to support the proposed method for data provenance activities and the storage of the expert statements in an auxiliary database. Thus, the proposed method can ensure that statements made by experts during the process of blood donor refusal for anemia are reliable for new information flows and for the generation of new knowledge.

The proposed method is based on an adaptation of the Provenance Data Model (PROV-DM), which consists of a computational strategy capable of ensuring that expert-generated statements are passed on from the original X Hemotherapy Center database to an auxiliary database. Its main goal is to provide for a broader analysis and improve the quality of blood donations. In the end, the proposed method was able to manage the statements generated by specialists during the process of blood donor refusal for anemia indices, which were obtained from reports generated by the X Hemotherapy Center database from 2000 to 2018. The structuring of the method is based on data provenance stages, the use of scientific workflows and the needs found throughout the research for the treatment of digital data.

This research was developed in the scope of the Doctoral Program in Information Science from the Federal University of Santa Catarina, Brazil.

2 Literature Review

2.1 Information Science

Information Science is an area that is directly or indirectly linked to information technologies in the use of its methods of organization and representation of information in research development. Information technologies are key elements in the development of Information Science, as the creation of technological tools promotes the development of theories in order to achieve the goals set by this science in relation to the problems it is dedicated to solving [19].

The focus of Information Science implies both sociological and epistemological approaches, covering phases such as generation, collection, organization, interpretation, storage, retrieval, dissemination, transformation, and use of information. Information Science is an interdisciplinary science that draws on several disciplines of technological knowledge and tries to contribute to the generation of new scientific knowledge [3, 7].

2.2 Anemia in Blood Donors

In 1999, members of the United Nations International Children’s Emergency Fund (UNICEF), the United Nations University (UNU), the World Health Organization (WHO) and the Micronutrients Initiative (MI) showed that 3.5 billion people worldwide have iron deficiency anemia and that iron deficiency may be present in 80% of the world’s population [23].

One of the factors most frequently observed when assessing anemia in blood donor candidates at hemotherapy centers throughout Brazil is the hematocrit level (the percentage of blood volume occupied by red blood cells), together with hemoglobin. These are the main items extracted from the statements in the original X Hemotherapy Center database, along with the numbers of fit and unfit candidates for blood donation and their respective screening. In Brazil, Ordinance RDC 153 of July 2004, enacted by the Ministry of Health, establishes the minimum acceptable hemoglobin and hematocrit values for a blood donation. These values are 13 g/dl hemoglobin and 39% hematocrit for men, and 12.5 g/dl hemoglobin and 38% hematocrit for women [13].

In the same Ordinance, the Ministry of Health also determined the minimum interval that must be respected between blood donations. This interval should be eight weeks for men and twelve weeks for women because the shorter the interval, the greater the chance of developing anemia [13].
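The thresholds and intervals quoted above can be expressed as a simple check. The following Python sketch is illustrative only: the function name, the dictionary layout, and the reason labels are assumptions made here, and real donor screening involves many other criteria that are not modeled.

```python
from datetime import date

# Minimum values as quoted in the text for Ordinance RDC 153 (2004).
# The dictionary layout and all names below are illustrative assumptions.
MIN_VALUES = {
    "male": {"hemoglobin_g_dl": 13.0, "hematocrit_pct": 39.0, "interval_weeks": 8},
    "female": {"hemoglobin_g_dl": 12.5, "hematocrit_pct": 38.0, "interval_weeks": 12},
}


def anemia_refusal_reasons(sex, hemoglobin_g_dl, hematocrit_pct,
                           last_donation=None, today=None):
    """Return the anemia-related reasons for which a candidate would be refused."""
    limits = MIN_VALUES[sex]
    reasons = []
    if hemoglobin_g_dl < limits["hemoglobin_g_dl"]:
        reasons.append("low hemoglobin")
    if hematocrit_pct < limits["hematocrit_pct"]:
        reasons.append("low hematocrit")
    if last_donation is not None:
        today = today or date.today()
        weeks_since = (today - last_donation).days / 7
        if weeks_since < limits["interval_weeks"]:
            reasons.append("interval too short")
    return reasons
```

An empty list means that none of the three conditions quoted from the Ordinance was violated; it does not mean that the candidate is fit in the broader clinical sense.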

2.3 Scientific Workflows

Scientific experiments consist of observing a phenomenon through data analysis, and using the results obtained to prove or disprove a hypothesis. Due to the need to organize, process, control, and analyze the experiment, its representation is made through a cycle whose steps are composition, execution, and analysis. A scientific workflow is an abstraction of this process, which allows the formal specification of the steps to be performed in a scientific experiment [11].
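The composition, execution, and analysis cycle can be illustrated with a very small sketch. The Python code below is not a scientific workflow management system; the step names and the candidate records are hypothetical and serve only to show how a formally specified sequence of steps can be executed while every intermediate result is kept.

```python
def run_workflow(steps, data):
    """Execute the steps in order and keep a trace of every intermediate result."""
    trace = []
    for name, step in steps:
        data = step(data)
        trace.append((name, data))
    return data, trace


# Composition: hypothetical steps over hypothetical candidate records.
steps = [
    ("select_unfit", lambda rows: [r for r in rows if r["status"] == "unfit"]),
    ("extract_reasons", lambda rows: [r["reason"] for r in rows]),
    ("count_reasons", len),
]

candidates = [
    {"status": "fit", "reason": None},
    {"status": "unfit", "reason": "low hematocrit"},
    {"status": "unfit", "reason": "fever"},
]

# Execution, followed by analysis of the result and of the recorded trace.
result, trace = run_workflow(steps, candidates)
```

Because the trace stores the output of each step, the same run can later be inspected or repeated from any intermediate point.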

An example of the use of scientific workflows would be to capture the steps taken to create a new drug, i.e. the source of the data that led to the creation of its formula. In this sense, whenever this formula is improved, the original data would be available for reuse and replication of the experiment.

To benefit from provenance data, these data have to be captured, modeled, and stored for future reference. Information on the provenance of stored data can be managed by various Scientific Workflow Management Systems (SGWfCs) [15, 16]. Some SGWfCs, such as Taverna [18], Kepler [2], and Pegasus [11], allow workflow steps to be captured during their execution. However, these systems often adopt proprietary models to capture the provenance generated in executions [8].

In this paper, a specific workflow was developed to demonstrate data capture using an auxiliary database without the need to change the original database.

2.4 Data Provenance

Data provenance is the complementary documentation of a given piece of data that contains the description of “how”, “when”, “where”, and “why” it was obtained and “who” obtained it [5]. When buying a work of art, it is important to know its origin from its inception, including all former owners, since this information is essential to establish the value of the work. The same is true of data, where provenance makes it possible to ensure data quality and accuracy [21].
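As a minimal illustration of this definition, a provenance record can be sketched as a structure with one field per question. The class name, the field names, and the example values below are assumptions introduced for illustration; they do not come from the X Hemotherapy Center database.

```python
from dataclasses import dataclass, asdict
from datetime import datetime


@dataclass(frozen=True)
class ProvenanceRecord:
    """Complementary documentation attached to one piece of data."""
    data_id: str    # identifier of the data being described
    who: str        # who obtained the data
    when: datetime  # when it was obtained
    where: str      # where it was obtained
    how: str        # how it was obtained
    why: str        # why it was obtained


# Hypothetical example values, for illustration only.
record = ProvenanceRecord(
    data_id="report-2018",
    who="screening specialist",
    when=datetime(2018, 12, 31),
    where="X Hemotherapy Center",
    how="exported from the donation registration system",
    why="annual summary of donor refusals",
)
```

The record is frozen so that, once stored, the description of the origin cannot be changed silently.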

In this sense, whenever provenance is captured automatically, it can be divided into levels [10]: a) workflow: involves describing the execution of a process, i.e. the tasks that are part of it. This level is used by the vast majority of SGWfC-based solutions and, in this case, the system must be adapted to capture the data from the different processes executed; b) activity: capture can occur in two ways. In the first, each executed process or program is modified to capture the provenance data. In the second, specific programs are created to monitor the execution of a given process and capture the provenance data; c) operating system: uses the data provided by the system, storing it in a specific database for provenance analysis.
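The activity level can be sketched with a small monitoring wrapper, which loosely corresponds to the second variant described above: the monitored function is left unchanged and a separate piece of code records what it received and produced. The names `capture_provenance`, `PROVENANCE_LOG`, and `count_anemic` are hypothetical and used only for this illustration.

```python
import functools
from datetime import datetime, timezone

# In-memory store of captured provenance; a real system would persist it.
PROVENANCE_LOG = []


def capture_provenance(func):
    """Wrap a function so that each call is recorded as an activity."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started_at = datetime.now(timezone.utc)
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "activity": func.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "started_at": started_at,
            "ended_at": datetime.now(timezone.utc),
        })
        return result
    return wrapper


@capture_provenance
def count_anemic(reasons):
    """Count refusal reasons that indicate anemia (hypothetical labels)."""
    return sum(1 for r in reasons if r in {"low hematocrit", "low hemoglobin"})


count_anemic(["fever", "low hematocrit", "low hemoglobin"])
```

After the call, the log holds one entry describing the activity, its inputs, its output, and its start and end times.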

By using data provenance, it is possible to keep a complete record of how a calculation or processing step was performed, which is essential to [6]: a) ensure repeatability; b) catalog the result; c) avoid duplication of effort; and d) retrieve data sources from output data. The main benefits of provenance for data quality are that it [4]: a) communicates data quality (reliability, suitability, accuracy, timeliness, redundancy); b) improves data interpretation through recognition of the source; c) contributes to justifying the use of a given piece of data; d) reduces the possibility of errors in judging the accuracy of the data; e) allows users who are not data experts to understand the processing steps; f) identifies the process used to create scientific data; g) allows updating of data from relational views; h) allows modification of relational view schemas; i) allows the use of historical data sources.

In this sense, the application of data provenance can be observed in the most varied areas, such as digital libraries, food industry, journalism, the traceability of information in social networks and the transparency of commercial applications, among others [9].

Provenance of Knowledge

The term provenance of knowledge includes the source of the so-called meta-information, which is based on obtaining a description of the origin of part of the knowledge, including a description of the reasoning method used to generate it. However, data provenance and knowledge provenance have the same concerns and motivations, differing as to the purpose of the record that will be captured [20].

The provenance of knowledge provides two aspects: a) a personal and more abstract view of a document and its derivations, specifically for the experiment and the person, with the direct contribution of the scientist; and b) a more specific understanding of the data processing domain or its execution process, and may receive contributions from both the scientist and the note-taking curators [22, 23].

In this work, provenance of knowledge is related to the context of observing the statements on anemia rates described by the experts in the reports provided by the X Hemotherapy Center on donors who became unfit for blood donation, determining the reliability of the researcher’s reasoning about a given dataset. The term provenance of knowledge is used here to demonstrate and record the rules and reasoning applied in the sample derivation processes from the reduced dataset, obtained at the X Hemotherapy Center, in relation to the data on anemia rates in unfit blood donors.

3 Related Works

In a PhD thesis written in 2012 at the University of São Paulo, the author proposes a model for describing data provenance for knowledge extraction in hemotherapy information systems based on the Open Provenance Model (OPM), designed to manage provenance records. Other similar applications can be found in the paper entitled “Laboratory and clinical genomic data sharing is crucial to improving genetic health care: the position statement of the American College of Medical Genetics and Genomics” [12]. In this research, the institution responsible for the study presents clinical level patterns by which statements about gene/disease associations and the clinical significance of variants were captured, by means of data provenance techniques, in statements made by experts in shared genomic data systems.

4 Proposal

This paper proposes a method for collecting provenance data related to anemia indices by adapting some components taken from PROV-DM. PROV-DM’s main function is to describe the people, entities, and activities involved in the production of data. In addition, the PROV-DM model provides the conditions for provenance to be demonstrated and exchanged between different systems. For this purpose, a data provenance application was created. This application uses an auxiliary database to store provenance data related to anemia indices. It abstracts some attributes from the original database through researcher analysis. When the database is searched, the provenance process helps to track and flag the data back to their source, as well as their movement between different data sources [21].

The proposed method sought to store the statements on the numbers of blood donation candidates considered unfit for donation because of their anemia rates, taken from reports provided by the X Hemotherapy Center system, in Brazil, from 2000 to 2018. The X Hemotherapy Center board provided 19 years of reports from its blood donation registration system, containing various attributes and at least 80 reasons for refusing blood donors. All information provided preserved the confidentiality of blood donors.

These reports were transformed into a CSV file by a software application written in Java with the Eclipse IDE. In addition, a local PostgreSQL database was created to serve as the auxiliary database, i.e. the local repository of provenance. This local repository was populated with records taken from the original database provided by the X Hemotherapy Center. Without changing the structure of the original database, it was possible to observe the predominance of anemia indices found in blood donors at the X Hemotherapy Center over 19 years.
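The idea of loading a CSV extract into an auxiliary database, without touching the original one, can be sketched as follows. This is not the application described above: Python and an in-memory SQLite database are used here as stand-ins for the Java application and the PostgreSQL repository, and the CSV layout, table name, and reason labels are hypothetical, since the actual report format is not given.

```python
import csv
import io
import sqlite3

# Hypothetical CSV extract; the real report layout is not given in the text.
CSV_TEXT = """year,sex,reason
2001,F,low hematocrit
2001,F,fever
2001,M,low hemoglobin
2002,F,low hemoglobin
"""

# Assumed labels for the two anemia-related refusal reasons.
ANEMIA_REASONS = ("low hematocrit", "low hemoglobin")


def load_auxiliary_db(csv_text):
    """Create the auxiliary database and populate it from the CSV extract."""
    conn = sqlite3.connect(":memory:")  # stand-in for the local PostgreSQL database
    conn.execute("CREATE TABLE refusal (year INTEGER, sex TEXT, reason TEXT)")
    rows = csv.DictReader(io.StringIO(csv_text))
    conn.executemany(
        "INSERT INTO refusal (year, sex, reason) VALUES (:year, :sex, :reason)",
        rows,
    )
    conn.commit()
    return conn


def anemia_refusals_by_year(conn):
    """Count anemia-related refusals per year and sex."""
    query = (
        "SELECT year, sex, COUNT(*) FROM refusal "
        "WHERE reason IN (?, ?) GROUP BY year, sex ORDER BY year, sex"
    )
    return conn.execute(query, ANEMIA_REASONS).fetchall()


conn = load_auxiliary_db(CSV_TEXT)
summary = anemia_refusals_by_year(conn)
```

The original source is only read, never modified, which mirrors the constraint that the structure of the original database must remain unchanged.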

Finally, the workflows were generated with Taverna Workbench Core, and the JasperReports library was used to generate a report that was, subsequently, transformed into a reduced dataset (see Table 1).

Table 1. Number of donors (fit, unfit, anemic unfit, percentage of anemic unfit and screening).

4.1 Case Study

PROV-DM is divided into six components that contain both the elements and the possible relationships between them. They are [24]: a) Entities and Activities: Entities can represent any object (real or imaginary), and Activities represent the processes that use and generate Entities; b) Agents and Responsibilities: Agents are Entities that influence, directly or indirectly, the execution of the Activities, receive attributions from other Agents, and may have some kind of connection (ownership, rights, etc.) with other Entities; c) Derivations: describe the relationship between different Entities during the transformation cycle performed by the Activities, making it possible to demonstrate the dependency between the used and generated Entities; d) Alternative: describes the relationship between different views of the same Entity; e) Collections: Entities that have members, which are also Entities, and whose provenance may be shown collectively; f) Annotations: provide mechanisms for adding annotations to elements of the model.
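A few of these components can be sketched with plain Python structures. The code below is not a PROV-DM implementation and does not use any provenance library; the relation names (used, wasGeneratedBy, wasAssociatedWith, wasDerivedFrom) follow PROV-DM terminology, while the classes, the `ProvenanceGraph` helper, and the example entities are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    name: str


@dataclass(frozen=True)
class Activity:
    name: str


@dataclass(frozen=True)
class Agent:
    name: str


@dataclass
class ProvenanceGraph:
    """Stores relations as (kind, subject, object) triples."""
    relations: list = field(default_factory=list)

    def relate(self, kind, subject, obj):
        self.relations.append((kind, subject, obj))

    def sources_of(self, entity):
        """Follow wasDerivedFrom links back to the original entities."""
        direct = [o for k, s, o in self.relations
                  if k == "wasDerivedFrom" and s == entity]
        found = []
        for source in direct:
            found.append(source)
            found.extend(self.sources_of(source))
        return found


# Hypothetical elements loosely modeled on the scenario described in the text.
reports = Entity("annual reports")
csv_file = Entity("CSV file")
dataset = Entity("reduced dataset")
conversion = Activity("report conversion")
researcher = Agent("researcher")

graph = ProvenanceGraph()
graph.relate("used", conversion, reports)
graph.relate("wasGeneratedBy", csv_file, conversion)
graph.relate("wasAssociatedWith", conversion, researcher)
graph.relate("wasDerivedFrom", csv_file, reports)
graph.relate("wasDerivedFrom", dataset, csv_file)
```

Calling `graph.sources_of(dataset)` walks the derivation chain from the reduced dataset back to the reports, which is the kind of dependency the Derivations component is meant to expose.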

Figure 1 presents the proposed method. Four steps are proposed in order to accomplish the task: i) Entities; ii) Activities, Agents, and Derivations; iii) Alternative and Collections; and iv) Notes.

Fig. 1. Proposed method

These steps comprise the adaptation of some PROV-DM components for provenance management, the creation of a specific workflow for data extraction, and the analysis of 19 years of anemia indices found in candidates who were considered unfit for blood donations.

To better explain the proposed method presented in Fig. 1, the following subsections present each step with its respective components.

Step 1 (Entities)

Data provided by the X Hemotherapy Center data source are collected and selected so that they can undergo the second step of this transformation process.

Step 2 (Activities, Agents, and Derivations)

In this step, “ACTIVITIES, AGENTS, and DERIVATIONS” are represented by a workflow created specifically for the cycle of activities required to prepare the collected data before they are manipulated and studied, with regard to the statements reported by the experts. Here the workflow aims to extract data that can be regarded as reliable for analyzing the anemia indices that make blood donors unfit (rejections). Soon after, the extracted data are transformed into a CSV file for the creation of an auxiliary database used to generate tables and graphs (data preparation and transformation process).

Thus, the “ACTIVITIES” are represented by the dataset collected from the X Hemotherapy Center over 19 years, containing dates, times, consumption, processes, transformation, modification, relocation, and use of the original data in relation to anemia rates. The activities are performed by the agents. The “AGENTS” are represented by both the blood donor candidates and the health specialists who appear in the X Hemotherapy Center reports. The “DERIVATIONS” represent the transformation of the X Hemotherapy Center, after the application of the method proposed here, into an entity that will serve as a reference model for the application of this method in other Brazilian hemotherapy centers with the same structure.

The second step starts with the workflow created specifically for the process performed at the X Hemotherapy Center, which is the process of collecting provenance data related to anemia rates, so that these data can be evaluated and analyzed with regard to their origin, thus generating new knowledge.

Step 3 (Alternative and Collections)

“ALTERNATIVE” represents the view of the X Hemotherapy Center expressed in the statements on anemia indices of blood donation candidates considered unfit. From the statements in the reports provided, it was possible to quantify the data declared by the experts and generate the results of the analysis. “COLLECTIONS” represents the collection of anemic candidate data to be inserted into the local source data repository. Here, provenance of knowledge concepts can also be applied for possible audits of source data. It is important to note that the “COLLECTIONS” component stores data that constitute documents: each document has its own provenance, but the file itself also has its origin information (who kept it, which documents it contained at what time, how it was assembled, etc.). Therefore, in addition to the procedures for collecting the provenance data, i.e. the provenance of knowledge applied to this method, it was possible to provide an overview of the provenance of the data described in the reports on the anemia indices of candidates unfit for donation. This made possible a better understanding of the risks and the reasons for the rejection of blood donations.

Step 4 (Notes)

After the provenance data are stored in a local repository, in the fourth step, “NOTES” represents the record of the important points on anemia indices, such as analysis, consultations, exploration, annotations, and reuse of data for further research. Here, it is possible to generate reports that can be cross-referenced with other data, as needed by health specialists. Consequently, this step creates an interaction between refined data and expert reporting.

5 Results

The reports provided by the X Hemotherapy Center are annual, each covering January 1 to December 31 of a year from 2000 to 2018, which simplifies and condenses the presentation of the data for these 19 years. The body of each report contained a series of attributes, from which only the data that had the potential to generate the expected results were selected.

The selected attributes were: i) number of male and female fit donors as well as the number of male and female unfit donors, all aged from 16 to 60 years old or older, including first-time, repeat or sporadic donors; ii) anemia indices of male and female unfit donors, i.e. low hematocrit and low hemoglobin, considered the reasons for refusal according to statements made by experts in the submitted reports, which in fact built the set of information necessary to generate the process of data provenance and provenance of knowledge; and finally iii) the screening performed by the experts in each year surveyed, i.e. the discovery of diseases through blood donation at the X Hemotherapy Center. These attributes can be better observed in Fig. 2 below.

Fig. 2. Auxiliary database with selected attributes.

The auxiliary database presented in Fig. 2 shows the selected attributes taken from 19 years of reports provided by the X Hemotherapy Center. They form a massive set of information that could be retrieved and grouped into a reduced dataset in order to be analyzed.

The selected attributes are the following (named with Brazilian Portuguese abbreviations): a) ano (donation reference year); b) indanemfem (female anemia index); c) indanemmasc (male anemia index); d) quantaptomasc (number of male fit donors); e) quantaptofem (number of female fit donors); f) quantinapfem (number of female unfit donors); g) quantinapmasc (number of male unfit donors); and finally h) trimed (screening by the attending physicians).
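The attribute list above can be sketched as a table definition. The column names come from the text, but the table name, the column types, and the reading of the anemia indices as counts are assumptions made here; SQLite is again used as a stand-in for the PostgreSQL repository, and the inserted row is made up and deliberately lies outside the 2000 to 2018 period.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the auxiliary database
conn.execute(
    """
    CREATE TABLE anemia_provenance (
        ano           INTEGER PRIMARY KEY,  -- donation reference year
        indanemfem    INTEGER,  -- female anemia index (assumed here to be a count)
        indanemmasc   INTEGER,  -- male anemia index (assumed here to be a count)
        quantaptomasc INTEGER,  -- number of male fit donors
        quantaptofem  INTEGER,  -- number of female fit donors
        quantinapfem  INTEGER,  -- number of female unfit donors
        quantinapmasc INTEGER,  -- number of male unfit donors
        trimed        INTEGER   -- screening by the attending physicians
    )
    """
)

# Made-up row for illustration only; it is not taken from Table 1.
conn.execute(
    "INSERT INTO anemia_provenance VALUES (1999, 20, 5, 400, 300, 100, 100, 0)"
)

# Percentage of unfit donors refused for anemia, per sex.
row = conn.execute(
    """
    SELECT ano,
           ROUND(100.0 * indanemfem / quantinapfem, 2),
           ROUND(100.0 * indanemmasc / quantinapmasc, 2)
    FROM anemia_provenance
    """
).fetchone()
```

With such a table, the percentage of anemic unfit donors can be derived by a query instead of being read from expert statements alone.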

Table 1, shown below, is populated with data extracted from the outcome of the analysis undertaken on the auxiliary database. This database contains statements about the anemia indices of unfit donors. All the attributes shown in Table 1 supported the development of the specific workflow used to create the provenance data method for the X Hemotherapy Center.

Table 1 shows the number of fit and unfit blood donation candidates, both male and female, as well as the percentage of anemic unfit candidates. Importantly, the candidates rejected for anemia rates were identified by filtering at least 80 possible reasons for refusal.

These reasons for refusal are diverse, ranging from simple ones such as fever, weight loss, and flu symptoms, among others, to more complex ones such as diabetes, heart disease, cancer, HIV, etc. In this work, we gathered the unfit donors, who had several reasons for refusal. From this subset, we extracted those unfit because of anemia and present that percentage for each year. Table 1 also shows the number of screenings performed by doctors each year, in other words, the number of blood donation candidates who discovered a disease during the donation process.

The 19 years of data analyzed comprised 197,551 blood donation candidates. Of this total, 114,813 were male; after blood tests, 89,657 were found fit and 25,156 were found unfit for blood donation. Anemia was present in 1,011 male candidates, or 4.02% of the male candidates unfit for blood donation. Of the remaining 82,738 candidates, all female, 57,633 were found fit and 25,105 were found unfit for blood donation after blood tests. Among them, there were 4,039 anemic female candidates, or 16.09% of the female candidates unfit for blood donation.
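The totals and percentages reported above can be checked with a few lines of arithmetic. The Python sketch below uses only the figures stated in this section; the variable names are illustrative.

```python
# Figures as reported in the text for the 2000-2018 period.
male_fit, male_unfit, male_anemic = 89_657, 25_156, 1_011
female_fit, female_unfit, female_anemic = 57_633, 25_105, 4_039

male_total = male_fit + male_unfit        # all male candidates
female_total = female_fit + female_unfit  # all female candidates
total = male_total + female_total         # all candidates

# Share of unfit candidates refused because of anemia, per sex.
male_pct = round(100 * male_anemic / male_unfit, 2)
female_pct = round(100 * female_anemic / female_unfit, 2)

print(male_total, female_total, total)
print(male_pct, female_pct)
```

The percentages are computed over the unfit candidates of each sex, not over all candidates, which is why the female share is roughly four times the male share even though both groups contain a similar number of unfit candidates.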

Table 1 also shows that from 2001 to 2009 there was no screening, but anemia rates continued to fluctuate among male and female donors, with a female predominance. The highest number of screenings was observed in 2018, which suggests some specific, as yet unidentified, reason related to blood donations that year. The year 2001 shows the highest anemia rates observed for both men (11.54%) and women (47.51%). By comparison, in 2018, when screening was highest, the anemia rates were well below those of 2001.

These comparisons help draw estimates and thresholds for new studies in the area of hemotherapy from a data provenance perspective, which could ultimately contribute to the prevention of anemia rates in the X Hemotherapy Center. In order to provide a more detailed comparison, Fig. 3 shows a graph of the profile of anemic unsuitable donor candidates during the 19 years analyzed.

Fig. 3. Refusals due to anemia (male and female unfit)

It can be observed that men’s anemia rates are lower than women’s. This can be explained by the interval between blood donations for women and men. In fact, the X Hemotherapy Center has reported that women tend to be at risk of developing anemia earlier than men, owing to more frequent blood donations within a shorter period of time, or even repeated donations.

Other factors associated with higher rates of anemia in women are pregnancy and the monthly blood loss from menstruation. For men, continued blood loss caused by some type of bleeding, combined with blood donation, or regular blood donation itself may be related to the risk of developing anemia [1]. According to the literature, the recommendations to reduce anemia rates in both the male and female populations are iron supplementation after blood donation and a longer waiting time between donations [13, 14].

6 Conclusions

This paper proposed a method for data collection in a Brazilian hemotherapy center, based on the data provenance approach. This method proved to be important for generating a reduced dataset, in order to allow knowledge extraction and mining of a volume of data generated over a 19-year period.

The application of the concept of data provenance, along with the provenance of knowledge and scientific workflow techniques, in the development of the proposed method led to the conclusion that these elements together may, in fact, contribute to the advancement of research in Brazilian hemotherapy centers. They helped guide data collection in relation to the anemia index statements found in the 19 years of data presented in the reports provided by the X Hemotherapy Center. Moreover, they contributed to generating knowledge through the analysis of the anemia rates found. It could be observed that most of the anemic population studied was female; women more frequently develop anemia in conjunction with other factors after blood donations, which may be a research factor for the discovery of other diseases. The method proposed here can be applied to other Brazilian hemotherapy centers that have the same features and security policies presented here.

The main contributions of the proposed method are: a) the improvement of the analysis that discovers anemia indices that make blood donation unfeasible; b) the provision of a useful view of donor groups in their hemotherapy centers; c) the raising of the quality of blood donations by creating preventive mechanisms against blood donor evasion; d) the possibility, when necessary, of querying the local source data repository in order to create data-quality metrics and to audit these data. These contributions demonstrate that the combined use of data provenance and provenance of knowledge distinguishes this work from what can be found in the literature and, in fact, contributes to the relevance of the results found in this research.

After several searches in the literature, and considering the few works found, the relevance of the studied subject became clear to us, since we could not find similar proposals. We believe that methods for data collection that are able to highlight, synthesize, and explain elements in the area of Hemotherapy may be a guide for the development of new research in this area.

It also became clear that Computer Science approaches, such as the Provenance of Data, combined with an Information Science viewpoint could be very useful to the context of hemotherapy information systems. Computer Science provides the technological support for the development of data provenance. On the other hand, Information Science provides the methods and techniques for informational treatment, making use of technological applications to apply the provenance of knowledge.

7 Future Perspectives

Some paths can be envisaged as follow-ups to this research: a) collecting data directly from the hemotherapy center’s database with the help of an already-connected auxiliary database (a local provenance data repository composed of anemia declarations). This would be done without modifying the original database structure, i.e. by automatically generating a CSV file (or another format) in a cloud computing structure capable of handling large amounts of data, providing better data quality for future research; b) automating the data description process, i.e. generating predictions for more complex analysis by using data provenance as a preventive factor for anemia indices that result in blood-donation refusals; c) improving the data provenance method by performing data cross-referencing processes (reasons for refusal of blood donations in all hemotherapy centers in Brazil); and d) integrating more hemotherapy centers into this research, or even performing the study in other regions of Brazil, by adapting the proposed method whenever necessary.

Another future perspective is to use data provenance to help prevent anemia problems by indicating the reasons why they occur more frequently in certain regions of Brazil. We can also envisage evaluating how data provenance models could assist in generating more complex and complete analyses, in order to discover the most prominent anemia rates by Brazilian region.

For all of the aforementioned scenarios to become reality, we believe that Brazil should improve the data storage infrastructure and computational tools available to Brazilian hemotherapy centers. Furthermore, it would be advisable to consider the creation of research institutes all over the country, in order to study and prevent anemia and other reasons for refusing blood donations. That indeed would be a challenge.