1 Introduction

Integrating data-driven digital technologies with smart infrastructures for management and analytics offers enormous opportunities for improving quality of life [40] and industrial competitiveness [56]. However, the vast amount of data generated in scientific and industrial domains demands computational methods for ingestion, integration, and analysis, as well as for the transformation of big data into knowledge. The problem of data integration has been extensively addressed by the database community [15, 22]. As a result, a large number of integration frameworks [11, 21, 28, 31, 37] have been developed; they implement data integration systems following the local-as-view (LAV) and global-as-view (GAV) paradigms [32]. Furthermore, query processing has played a relevant role in solving data integration on the fly: graph-based traversal [33, 7] and federated query processing [1, 16] are representative approaches for enabling data integration at query execution time. Although these approaches have made remarkable contributions, the problem of scaling up to big data transformation remains unsolved. The main drawback of existing approaches is the lack of techniques able to manage both structured and unstructured sources (e.g., clinical notes, images, scientific publications).

Fig. 1

Motivating Example. Heterogeneous sources of knowledge. a Unstructured data sources, e.g., clinical notes, medical images, and clinical tests, encode invaluable knowledge about a patient's medical condition. b Factors that impact the effectiveness of a treatment need to be identified to increase a patient's survival time. c Various biomedical repositories maintain knowledge collected by the scientific community about facts that can contribute to the prescription of effective treatments. Data sources range from structured (e.g., COSMIC) to unstructured (e.g., PubMed); short texts in structured data sources may also encode relevant knowledge (e.g., drug interactions). Heterogeneity problems across sources need to be solved to extract the required knowledge. a Electronic health records, b impacts on treatment effectiveness, c biomedical data sources

Our Research Goal: Our main objective is to tackle the integration of structured and unstructured data in a way that the meaning of the integrated data can be described, explored, and used to uncover relevant insights. We focus on biomedical data sources in the context of the EU H2020 funded project iASiS and show how the problem of data integration may hinder the prescription of personalized treatments. Given a collection of data sets (structured and unstructured), the problem of data integration is to decide whether two entities in the data sets refer to the same real-world entity. Integrating data sets requires the recognition and resolution of interoperability conflicts across these data sets, as well as fusion policies for merging equivalent entities [10]. Given the wide variety of entities, the state of the art has focused on integration methods that reduce manual work and maximize accuracy and precision [11, 18, 19]. To overcome interoperability conflicts caused by the wide variety of existing formats (from short notes to scientific publications), several techniques for processing unstructured data have been proposed. Natural language processing (NLP) contributes to integrating structured and textual data by providing linguistic annotation methods at different levels [36, 38, 41], e.g., syntactic parsing, named entity recognition, word sense disambiguation, and entity linking. Further, visual analytics techniques facilitate the extraction and annotation of entities from non-textual data sources [27, 5]. Annotations from ontologies and controlled vocabularies extracted from unstructured data represent the basis for determining relatedness among the annotated entities by means of similarity measures, as well as for identifying matches between highly similar entities.

Approach: The main idea of this paper is to present a knowledge-driven framework that resorts to knowledge extraction, ontologies, and data integration techniques in order to create a knowledge graph. The knowledge graph comprises both the data and the knowledge that describes the main characteristics of the integrated data. The proposed approach represents a building block for supporting clinicians during disease diagnosis and treatment prescription.

Contributions: The principal contributions of this paper are the presentation of the results of applying the knowledge-driven framework to various biomedical data sources, as well as the promising outcomes observed by analyzing the generated knowledge graph. Although the framework as a whole is not available as open source, the components that perform entity linking and knowledge graph management are publicly available. The remainder of this article is structured as follows: Sect. 2 motivates the data integration problem over biomedical data sets. Sect. 3 describes our knowledge-driven framework, and Sect. 4 summarizes the principal results of implementing this framework in the iASiS project. Related work is presented in Sect. 5, and finally, Sect. 6 concludes and gives insights into future work.

Fig. 2

Definition of a Knowledge Graph. a A knowledge graph is presented as the intersection of the formal models able to represent facts of various types and levels of abstraction using a graph-based formalism. b Knowledge representation models are characterized according to the represented facts and levels of abstraction. a Knowledge graph, b a spectrum of knowledge representation

2 Motivating Example

We motivate our work with the myriad sources of knowledge about the condition of a lung cancer patient (Fig. 1), as well as with typical integration problems caused by well-known data complexity issues, e.g., variety, volume, and veracity. Electronic health records (EHRs) (Fig. 1a) preserve the knowledge about the conditions of a patient that needs to be considered for effective diagnosis and treatment prescription. Albeit informative, EHRs usually preserve patient information in an unstructured way, e.g., textual notes, images, or genome sequencing. Furthermore, EHRs may include incomplete and ambiguous statements about the medical history of a patient. In consequence, knowledge extraction techniques are required to mine and curate relevant information for an integral analysis of a patient, e.g., age, gender, life habits, mutations, diagnostics, treatments, and familial antecedents. In addition to evaluating information in EHRs, physicians depend on their experience or on available sources of knowledge to predict potential adverse outcomes, e.g., drug interactions, side effects, or resistance (Fig. 1b). Diverse repositories and databases make available crucial knowledge for the complete description of a patient's condition and the potential outcome (Fig. 1c). Nevertheless, the sources are autonomous and utilize diverse formats that range from unstructured scientific publications in PubMed to dumps of structured data about cancer-related mutations in COSMIC. To illustrate, the effect of the interaction between two drugs is reported in DrugBank as short text, e.g., the effect of the interaction between Simvastatin and Paclitaxel. In order to detect the facts that can impact the effectiveness of a particular treatment, e.g., Paclitaxel, a physician has to search through these diverse data sources and identify the potential adverse events and interactions. Data complexity issues like volume and diversity impede an efficient integration of the knowledge required to predict the outcomes of a treatment.

The proposed knowledge-driven framework resorts to techniques of knowledge extraction and representation to create a knowledge graph where data from disparate data sources is integrated. A knowledge graph represents entities and their relations; ontologies and controlled vocabularies are utilized to describe the meaning of the relations, as well as to annotate entities in a uniform way in the knowledge graph. The Unified Medical Language System (UMLS), the Human Phenotype Ontology (HPO), and the Gene Ontology (GO) are examples of such ontologies. Furthermore, entity linking techniques are part of the framework and allow entities in the knowledge graph, e.g., the drug Paclitaxel, to be linked to equivalent entities in existing knowledge graphs, e.g., in DBpedia and in Bio2RDF. The linked knowledge graphs compose a federation, and a federated query engine is able to execute queries against the various knowledge graphs. Finally, (un)supervised techniques are built on top of the knowledge graphs to support informed diagnosis and personalized treatments.

3 Our Approach

3.1 Preliminaries

Fig. 2 presents the main characteristics of a knowledge graph. First, a knowledge graph is depicted as a data structure that represents data, knowledge, and actionable insights using a graph data model (Fig. 2a). Graph data models enable the representation of entities and their relations, and naturally model mono- and multi-valued attributes, the neighborhoods of an entity, different types of relations, and recursively specified relations. Moreover, graph data models naturally scale to a large number of relations between two entities and enable the traversal and exploration of these connections. These features make graph models suitable for representing different types of concepts in a knowledge graph. We define a knowledge graph as follows:

Definition 1

A Knowledge Graph is a directed graph defined as a triple KG = (O, V, E), where:

  • O is an ontology that comprises classes and relations, as well as rules that define the meaning of the relations.

  • V is a set of nodes in the knowledge graph; nodes in V correspond to classes or instances of classes in O.

  • E is a set of directed labeled edges in the knowledge graph that relate nodes in V. Edges are labeled with relations in O.
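Although Definition 1 is formalism-agnostic, it can be instantiated directly in RDF, the formalism adopted later in this section. The following minimal sketch, assuming the Python library rdflib and purely illustrative URIs, builds a tiny knowledge graph in which O contributes the class Drug and the relation interactsWith, V contains two drug instances, and E contains one labeled edge relating them.

```python
from rdflib import Graph, Namespace, Literal, RDF, RDFS

EX = Namespace("http://example.org/kg/")   # hypothetical namespace

kg = Graph()

# Ontology O: a class and a relation, with a human-readable label
kg.add((EX.Drug, RDF.type, RDFS.Class))
kg.add((EX.interactsWith, RDF.type, RDF.Property))
kg.add((EX.interactsWith, RDFS.label, Literal("interacts with")))

# Nodes V: two instances of the class Drug
kg.add((EX.Paclitaxel, RDF.type, EX.Drug))
kg.add((EX.Simvastatin, RDF.type, EX.Drug))

# Edges E: a directed edge labeled with a relation defined in O
kg.add((EX.Paclitaxel, EX.interactsWith, EX.Simvastatin))

print(kg.serialize(format="turtle"))
```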

As stated in the previous definition, nodes in a knowledge graph can represent items of data, abstract concepts, or a combination of both. This property enables the characterization of a spectrum of knowledge graphs, as indicated in Fig. 2b. The spectrum goes from less to more expressive graphs. Data graphs correspond to the least expressive knowledge graphs; they comprise nodes representing entities and edges depicting the relations between them, but the semantics of the relations is not encoded in the graph. Ontologies include abstract concepts or classes (represented as nodes) and predicates representing the relations between these classes (the edges of an ontology); the meaning of the predicates is represented using rules. Knowledge bases model knowledge about facts and abstract concepts, but not necessarily using a graph data model; rule-based formalisms like Datalog [8] or PSL [20] have been used to represent knowledge bases. Finally, knowledge graphs comprise not only facts about entities and their relations, but also the classes to which these entities belong and the meaning of these relations. In contrast to knowledge bases, knowledge graphs are represented using graph data models; thus, they naturally model data (entities) and knowledge (the meaning of relations) as first-class citizens.

Additionally, knowledge graphs can be modeled using diverse knowledge representation formalisms; the selection of the formalism depends on the type of statements that will be expressed in a knowledge graph. For example, the Resource Description Framework (RDF) is a data model that resorts to the idea of making statements about resources in expressions of the form subject-predicate-object, known as triples. Subjects are represented as resources in the form of URIs or blank nodes; predicates define the relation between subject and object and are in the form of URIs, while objects can be of any type. RDF Schema (RDFS) is an extension of basic RDF that allows for the definition of classes and relations, as well as hierarchies of classes and relations. More expressive formalisms like the Web Ontology Language (OWL) make available a larger number of operators, which enable the representation not only of classes, relations, and hierarchies, but also of class and property constraints, negative statements, general equivalence relations, and cardinality restrictions. In the knowledge graphs considered in this paper, operators from RDF, RDFS, and OWL are used. Further, some predicates are also utilized to express metadata about classes and relations; for example, the predicates rdfs:label, rdfs:comment, dcterms:modified, and dcterms:creator describe the labels, comments, last modification date, and creator of classes and properties, respectively.

The data sources depicted in Fig. 1 are characterized by various conflicts that hinder a scalable solution to the problem of data integration. Heterogeneity conflicts include:

  (i) Structuredness: data sources differ in their degree of structure.

  (ii) Schematic: different schemata are utilized by the data sources.

  (iii) Domain: different interpretations of the same universe of discourse are followed.

  (iv) Representation: different representations are used to model the same concept.

  (v) Language: different languages are utilized for modeling data or metadata.

  (vi) Granularity: data is represented at different levels of detail in different data sources.

Fig. 3

Knowledge Graph Overview. Big data sources are ingested, curated, and integrated into a knowledge graph. Diverse knowledge extraction methods enable the transformation of unstructured data and the description of the extracted facts using ontologies. Federated query processing and visualization tools enable the exploration of the knowledge graph, and knowledge discovery techniques facilitate the uncovering of relevant patterns

3.2 A Knowledge-driven Framework

We devise a knowledge-driven framework able to transform and integrate heterogeneous data into knowledge graphs. Fig. 3 depicts an overview of the framework; it is composed of four main components: Data Ingestion, Semantic Data Integration, Exploration and Visualization, and Evaluation and Knowledge Discovery.

1-Data Ingestion: Big data is collected from different data sources; the collected data is mainly characterized by the three dominant dimensions of the Vs model: volume (very large data sets), variety (sources in multiple data formats and models), and veracity (data with potential biases, ambiguities, and noise). To overcome interoperability issues caused by data variety, distinct knowledge extraction methods are part of the framework. Typical extraction methods include:

  (i) Natural language processing, to extract facts from unstructured data sources and represent the extracted knowledge in the form of triples, i.e., subject, predicate, and object [41]. Ontologies and controlled vocabularies are used to guide the extraction process, as well as to annotate the extracted facts with their terms.

  (ii) Visual analysis and image processing, to extract relevant facts from non-textual material like videos and images [27, 5].

  (iii) Genomic analysis, to identify mutations and genetic variations from microarrays [30, 53].

Once data is ingested, different techniques are used for data curation, e.g., statistical methods for completing missing values [46] (such as multiple imputation and maximum likelihood estimation), clustering techniques for duplicate detection [26], and crowdsourcing [4].
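As an illustration of the statistical curation step, the following sketch (assuming pandas, scikit-learn, and a hypothetical patient table) fills in missing numeric values by mean imputation; in practice, multiple imputation or maximum likelihood estimation would replace this simple strategy.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical patient records with missing values
patients = pd.DataFrame({
    "age":  [63, None, 71, 58],
    "ecog": [1, 2, None, 0],
})

# Mean imputation as a minimal stand-in for multiple imputation / MLE
imputer = SimpleImputer(strategy="mean")
curated = pd.DataFrame(imputer.fit_transform(patients), columns=patients.columns)
print(curated)
```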

Fig. 4

An Example of a Pipeline for Data Integration. a The pipeline receives unstructured data (step 1) and structured data sources (step 2). EHR Analysis extracts relevant facts and annotates them using UMLS terms, e.g., C0144576 and C0074554 represent Paclitaxel and Simvastatin, respectively. Entity and Predicate Linking are performed to extract the effect of drug-drug interactions from data collected from DrugBank. Mappings between the identifiers of drugs in UMLS and DrugBank enable the integration of the patient data with the drug interactions (step 3). b A portion of the RDF subgraph representing a patient and the interactions of his prescribed drugs. a Data integration pipeline, b a portion of a patient in the KG

2-Semantic Data Integration: The integration of matching entities is performed over the knowledge graph by exploiting the concepts, relations, taxonomies, and rules represented in it. First, the collected and curated big data is modeled using a unified schema and stored in a knowledge graph. Then, entity recognition and linking are employed to transform textual values in the knowledge graph, e.g., descriptions and comments, into structured facts. Finally, different methods are combined to curate and complete the represented facts. Knowledge graph creation relies on mapping-driven algorithms guided by mapping rules that describe entities using the unified schema. Additionally, the controlled vocabularies and ontologies used by the knowledge extraction tools are represented as RDF triples as well; links between these ontologies are also included in the knowledge graph to enable the identification of entities across different vocabularies. Similarity-based methods are used for entity matching; the similarity measures exploit the knowledge encoded in the knowledge graph. Hybrid approaches combine reasoning processes on top of the knowledge graph with the wisdom of experts, and enable curation and knowledge completion [2].
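The following sketch illustrates, under simplifying assumptions, the idea of mapping-driven knowledge graph creation: instead of an RML engine, a plain Python function plays the role of a mapping rule and turns a structured record (a hypothetical patient row) into RDF triples of a hypothetical unified schema using rdflib.

```python
from rdflib import Graph, Namespace, Literal, RDF

SCHEMA = Namespace("http://example.org/unified-schema/")   # hypothetical unified schema

def map_patient(record, graph):
    """Hypothetical mapping rule: one source record -> RDF triples of the unified schema."""
    patient = SCHEMA["patient/" + record["id"]]
    graph.add((patient, RDF.type, SCHEMA.Patient))
    graph.add((patient, SCHEMA.age, Literal(record["age"])))
    graph.add((patient, SCHEMA.prescribedDrug, SCHEMA["drug/" + record["drug_cui"]]))
    return graph

kg = Graph()
map_patient({"id": "p1", "age": 63, "drug_cui": "C0144576"}, kg)   # Paclitaxel (UMLS CUI)
print(kg.serialize(format="turtle"))
```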

3-Exploration and Visualization: SPARQL endpoints enable independent access to the knowledge graphs; they are Web services that provide Web interfaces to query RDF data following the SPARQL protocol. Queries against federations of SPARQL endpoints are posed through federated SPARQL query engines, which are devised following the generic mediator and wrapper architecture [54, 55]. Lightweight wrappers translate SPARQL subqueries into the required SPARQL endpoint calls and translate the endpoint answers into the query engine's internal structures. The mediator rewrites the original query into subqueries that can be executed by the SPARQL endpoints; it then gathers the results of evaluating the subqueries and combines them to produce the query answer. The federated query engine is able to exploit the semantics encoded in the knowledge graph during source selection, query decomposition, optimization, and execution. Visualization tools facilitate the exploration of patterns in the knowledge graph.
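As a simplified illustration of how a single endpoint of such a federation is accessed, the following sketch queries the public DBpedia SPARQL endpoint with the SPARQLWrapper library; a federated engine additionally performs source selection, query decomposition, and result merging on top of calls of this kind.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# One endpoint of the federation (here: the public DBpedia endpoint)
endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery("""
    SELECT ?label WHERE {
        <http://dbpedia.org/resource/Paclitaxel>
            <http://www.w3.org/2000/01/rdf-schema#label> ?label .
        FILTER (lang(?label) = "en")
    }
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["label"]["value"])
```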

4-Evaluation and Knowledge Discovery: Machine learning methods are utilized to identify patterns in the knowledge graph. These methods are enhanced with contextual knowledge represented in the knowledge graph with the aim of identifying accurate predictions whose meaning can be described.

4 The Knowledge-driven Framework for Supporting Personalized Medicine

iASiS is a 36-month H2020-RIA project that started in April 2017. iASiS aims at transforming clinical and pharmacogenomics big data into actionable knowledge for the support of personalized medicine in two life-threatening diseases: lung cancer and dementia. The knowledge-driven framework depicted in Fig. 3 is applied to integrate anonymized clinical data, biological sample analyses, medical images, genomics, medications, and scientific publications into the iASiS knowledge graph. UMLS and HPO are used for annotating concepts extracted from unstructured items of data. The instantiation of the framework is as follows:

1-Data Ingestion: The knowledge extraction methods of the iASiS project (developed by different partners) are described as follows:

  (i) Electronic Health Record (EHR) Analysis: NLP methods (Menasalvas et al. [36]) resort to named entity recognition to extract relevant entities from unstructured clinical notes and to annotate the extracted concepts with terms from UMLS. These techniques allow for the extraction of 39 properties from 739 lung cancer patients [50].

  (ii) Genomic Analysis: Data mining tools, e.g., catRAPID (Livi et al. [34]), identify protein-RNA associations with high accuracy. Publicly available datasets, e.g., data from GTEx, GEO, and ArrayExpress, are used for the integration with transcriptomic data; genes are annotated with identifiers from different databases, e.g., HUGO or UniProt/Swiss-Prot, as well as with HPO.

  (iii) Image Analysis: Several machine learning algorithms (Ortiz et al. [44]) are applied to learn predictive models able to classify medical images and detect areas of interest, e.g., lung cancer tumors or imaging biomarkers. Further, image annotation methods semantically describe these areas of interest using ontologies [12, 47].

  (iv) Open Data Analysis: An NLP pipeline is followed to extract UMLS terms from scientific publications in PubMed, as well as relations between the extracted terms (Nentidis et al. [42]). This pipeline resorts to MetaMap for UMLS term extraction and SemRep for relation extraction; it has enabled the collection of 166,073 UMLS terms from 250,688 publications. A greatly simplified sketch of this kind of annotation is shown below.
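The pipeline above relies on MetaMap and SemRep; as a greatly simplified, self-contained illustration of UMLS term annotation, the following sketch performs plain dictionary lookup over a hypothetical, hand-written mapping of surface forms to CUIs (the two CUIs are those of the running example).

```python
import re

# Hypothetical mini-dictionary mapping surface forms to UMLS CUIs
UMLS_DICT = {
    "paclitaxel": "C0144576",
    "simvastatin": "C0074554",
}

def annotate_umls(text):
    """Return (surface form, CUI) pairs found in the text by dictionary lookup."""
    annotations = []
    for surface, cui in UMLS_DICT.items():
        if re.search(r"\b" + re.escape(surface) + r"\b", text.lower()):
            annotations.append((surface, cui))
    return annotations

print(annotate_umls("Patient treated with Paclitaxel and Simvastatin."))
# [('paclitaxel', 'C0144576'), ('simvastatin', 'C0074554')]
```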

2-Semantic Data Integration. Data collected from biomedical open data sources and the data sets generated by the knowledge extraction methods are integrated in this step. Open data sources include COSMIC, DrugBank, SIDER, and STITCH. Albeit structured, the open data sets may contain unstructured fields that encode valuable knowledge, e.g., the description of the interactions between two drugs in DrugBank or the approved indications of a drug in DBpedia. Entity and predicate linking methods (Sakor et al. [51]) are employed to extract entities and relations and to link them to terms in UMLS or DBpedia. A unified schema is used to represent the data in the iASiS knowledge graph. GAV mappings, expressed using the RDF Mapping Language (RML) [14], specify mapping rules to transform data into RDF triples in the iASiS knowledge graph.

Fig. 4a depicts, with an example, the pipeline followed to create RDF triples and to perform data integration. EHR analysis [36] is performed to extract relevant facts from the clinical notes and represent these facts using UMLS. For simplicity, we present only some of the facts: age, gender, toxic habits, chemotherapy drugs, drugs for comorbidities, familial antecedents, and mutated genes (EGFR, ALK, ROS1). The execution of the RML mappings enables the creation of an RDF graph describing the patient and his relations. Note that the drugs for chemotherapy and comorbidities are annotated with the corresponding UMLS terms, i.e., C0144576 and C0074554 for Paclitaxel and Simvastatin, respectively. In addition, entity and predicate linking [51] is performed, and the effect of the interaction between Paclitaxel and Simvastatin is represented as an RDF graph (step 2). Since this data is extracted from DrugBank, the drugs are identified with DrugBank identifiers. Matchings between the UMLS and DrugBank identifiers are found by performing string matching between the names of the drugs in DrugBank and the preferred names in UMLS; an illustrative sketch of this step follows below. These matchings (represented as dashed lines) are used for generating an RDF graph that relates the UMLS identifiers of Paclitaxel and Simvastatin with the effects and impact of the interaction (step 3). Fig. 4b presents the final RDF graph, where the patient described in the clinical notes and the interactions between his treatments are represented. The same data integration procedure is performed for associating a patient with the side effects of his/her prescribed drugs; the scientific publications in PubMed where his/her conditions, treatments, and biomarkers are reported; information about the diseases associated with his/her mutations; and potential mutations that may impact the effectiveness of his/her treatments. Moreover, the entity and predicate linking techniques by Sakor et al. [51] are also utilized to link entities in the iASiS knowledge graph with equivalent entities in DBpedia and Bio2RDF. For drugs alone, the approach by Sakor et al. [51] was able to identify 960 correct links to DBpedia out of 968 drugs, while DBpedia Spotlight [13], a state-of-the-art entity linking tool, identified only 929 correct links.
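A minimal sketch of the name-based matching step, assuming two small in-memory dictionaries that stand in for excerpts of DrugBank and UMLS; normalized exact string matching is used here, whereas the actual pipeline may rely on more tolerant string similarity.

```python
# Hypothetical excerpts: DrugBank id -> drug name, UMLS CUI -> preferred name
drugbank = {"DB01229": "Paclitaxel", "DB00641": "Simvastatin"}
umls = {"C0144576": "paclitaxel", "C0074554": "simvastatin"}

def match_identifiers(drugbank, umls):
    """Match DrugBank and UMLS identifiers via normalized drug names."""
    by_name = {name.lower(): cui for cui, name in umls.items()}
    return {db_id: by_name[name.lower()]
            for db_id, name in drugbank.items()
            if name.lower() in by_name}

print(match_identifiers(drugbank, umls))
# {'DB01229': 'C0144576', 'DB00641': 'C0074554'}
```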

The current version of the iASiS knowledge graph has 1.3 billion triples and 46 RDF classes; on average, each entity has 6.98 relations and each class is connected to 2.87 other classes. Classes include Drugs, Publications, Mutations, Lung Cancer Patients, Biomarkers, Genes, Side Effects, Proteins, Enzymes, Transporters, and Annotations. The class Annotation has 69,910,644 instances related to the rest of the classes in the knowledge graph. To generate the iASiS knowledge graph, 103 RML mapping rules were defined and curated by four knowledge engineers. As a result of following the semantic data integration pipeline illustrated in Fig. 4a, the lung cancer patients were linked to the interactions between their prescribed drugs. Fig. 5 presents the density distribution of pairs of interacting drugs in lung cancer treatments; almost 50% of the patients are taking at least one pair of drugs whose interaction has been registered in DrugBank. On average, the patients in the iASiS knowledge graph receive treatments with 1.7 reported interactions. This information is extremely valuable for clinicians because, by traversing the knowledge graph, they can easily identify the potential interactions of the drugs and prescribe more effective and less toxic treatments.

Fig. 5

Integration of Drug Interactions. Frequency density of pairs of drugs prescribed to patients in the knowledge graph that have known interactions (source: DrugBank). As observed, there is at least one drug interaction for almost 50% of the population, and on average there are 1.7 interactions per patient

3-Exploration and Visualization. MULDER is a federated query engine [16] that enables the execution of queries against the federation composed of the iASiS knowledge graph, DBpedia, and Bio2RDF. MULDER receives SPARQL queries as input and performs the tasks of source selection, query decomposition, and optimization by exploiting metadata about the classes and the connections between these classes in the knowledge graphs. Moreover, MULDER relies on adaptive physical operators, e.g., the symmetric join [18] and gjoin [1], and is able to produce results incrementally as soon as they are collected from the knowledge graphs.

In order to illustrate the features of MULDER, we report on the results of an experiment over the complex queries of the LSLOD benchmark [25]. The state-of-the-art federated query engine ANAPSID [1] is included in the study. ANAPSID and MULDER resort to the same set of physical operators; thus, the differences observed between them are expected to be a consequence of the quality of the generated plans. LSLOD [25] is a benchmark composed of ten knowledge graphs from the life sciences domain: ChEBI (Chemical Entities of Biological Interest), KEGG (Kyoto Encyclopedia of Genes and Genomes), DrugBank, TCGA-A (a subset of The Cancer Genome Atlas), LinkedCT (Linked Clinical Trials), SIDER (Side Effects Resource), Affymetrix, Diseasome, DailyMed, and Medicare. The goal of the experiment is to evaluate the performance of MULDER on large data sets from the biomedical domain and on complex queries. We evaluate the efficiency in terms of the continuous generation of query answers and use the measure dief@t proposed by Acosta et al. [3]. This metric measures the continuous efficiency of an engine in the first t time units of query execution; it is computed as the area under the curve (AUC) of the answer distribution until time t; a sketch of this computation is shown below. Additionally, we report on multiple metrics that evaluate the overall performance and completeness, i.e., the inverse of the time for the first tuple (\(\text{TFFT}^{-1}\)), the inverse of the total execution time (\(\text{ET}^{-1}\)), the number of answers (Comp), and the throughput (T); all of them are "higher is better". Fig. 6 reports on the results of these metrics. As observed, in queries CQ2 and CQ8, ANAPSID did not produce any results before reaching the time-out (300 s). In the rest of the queries, both MULDER and ANAPSID are able to produce all the query answers. With the exception of CQ4 and CQ10, MULDER continuously produces results faster. Surprisingly, MULDER and ANAPSID generated the same plans for CQ4 and CQ10; however, the implementation of the physical operators leads to a faster execution of these plans in ANAPSID. These results suggest that MULDER plans allow for continuous performance during the answer generation process. In the context of the iASiS framework, this feature is extremely relevant because users demand answers quickly and continuously.
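To clarify how dief@t is obtained, the following sketch (assuming a hypothetical trace of answer arrival times for one engine and one query) approximates the area under the cumulative answer curve up to time t with the trapezoidal rule.

```python
import numpy as np

def dief_at_t(answer_timestamps, t):
    """Approximate dief@t: AUC of the cumulative answer count until time t."""
    times = sorted(ts for ts in answer_timestamps if ts <= t)
    counts = list(range(1, len(times) + 1))   # cumulative number of answers at each arrival
    times.append(t)                            # extend the curve to time t
    counts.append(counts[-1] if counts else 0)
    return np.trapz(counts, times)

# Hypothetical trace: answer arrival times (in seconds) for one query
print(dief_at_t([0.5, 0.8, 1.2, 2.0, 2.5], t=3.0))
```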

Fig. 6

Performance of Federated Query Engines. ANAPSID and MULDER are compared in terms of continuous behavior. Axes correspond to: the inverse of the time for the first tuple (\(\text{TFFT}^{-1}\)), the inverse of the total execution time (\(\text{ET}^{-1}\)), the number of answers produced (Comp), the throughput (T), and dief@t. All metrics are 'higher is better'. Complex queries are from the LSLOD benchmark. ANAPSID produces empty results for CQ2 and CQ8. MULDER plans exhibit better performance than ANAPSID plans (CQ1, CQ3, CQ5, CQ6, and CQ7). The plans for CQ4 and CQ10 are the same, but the ANAPSID query engine has a better continuous performance than MULDER

4-Evaluation and Knowledge Discovery. Knowledge discovery techniques are used to uncover patterns in the iASiS knowledge graph. Patterns include common characteristics of patients depending on their toxic habits, familial antecedents, or comorbidities. We define a similarity measure as a function that quantifies the similarity of two patients. The patient similarity combines similarity values of the main characteristics of the two patients: age, gender, mutated genes, toxic habits, the evolution of the tumor, the mutations, and the patient performance status (ECOG). The similarity values of these characteristics are computed with different similarity measures:

  (i) lists are compared using Spearman's rho, while the Jaccard similarity coefficient is utilized for sets;

  (ii) similarity between drugs is computed based on the chemical structure of the drugs (SIMCOMP);

  (iii) side effects are compared using the Human Phenotype Ontology similarity (HPOSim); and

  (iv) the UMLS similarity measure is used for UMLS terms.

Fig. 7

Knowledge Analytics. a A function able to quantify the similarity between two lung cancer patients is described in terms of frequency density; the function takes into account the treatments, the evolution of the tumors, the mutations, and the patient performance. The reported results suggest that a large number of patients react similarly to the treatments; however, more studies are required to validate this observation. b Communities of lung cancer patients and a summary of the observed features: age, toxic habits, and EGFR mutations. The distributions of the observed features differ from those of the whole population, enabling the study of patients with unique characteristics. a Density distribution of patient similarity, b communities of lung cancer patients

The combination of the similarity values is computed in terms of a triangular norm. Fig. 7a depicts the density distribution of the similarity values for pairs of lung cancer patients in the iASiS knowledge graph. We can observe that a considerably large portion of the patient population has relatively high similarity values, suggesting that a large number of patients have similar reactions to the prescribed treatments. Further analysis with clinical partners is required to validate the meaning of the observed similarity values. Furthermore, we apply community detection algorithms to discover patterns among patients that share similar properties in the iASiS knowledge graph. We resort to semEP (Semantics Based Edge Partitioning Problem) [45] for computing communities of patients based on the similarity values. semEP creates a minimal partitioning of the input graph such that the density of each community is maximal; the community density represents the degree of similarity of the entities in a community. Fig. 7b reports on the results of computing semEP over the iASiS knowledge graph. The main properties of the patients involve mutations of lung cancer related genes, e.g., EGFR, demographic attributes, smoking habits, treatments, and tumor stages. The studied population is composed of 739 patients. The goal of the study is to identify the four communities of patients (out of 13 communities) whose characteristics differ from those of the whole population; the Kolmogorov-Smirnov test was used to rank the communities. Fig. 7b reports on these four communities; a heatmap plot describes the percentage of patients in each community or cluster in terms of age, gender, EGFR mutation, and smoking habits. For example, the patients in Cluster-1 are not current smokers and a considerable number of them are non-smokers; in addition, the biomarker EGFR is negative for many of them. These results are preliminary and require further study by the clinical partners of the project. However, they suggest that these techniques can uncover patterns among the observed features of patients.
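As a minimal sketch of the aggregation step, the following snippet combines hypothetical per-attribute similarity scores with the product t-norm; the actual framework may rely on a different triangular norm.

```python
from functools import reduce

def combine_tnorm(similarities):
    """Combine per-attribute similarity scores with the product t-norm."""
    return reduce(lambda a, b: a * b, similarities, 1.0)

# Hypothetical per-attribute similarities between two lung cancer patients
scores = {
    "age": 0.9, "gender": 1.0, "mutated_genes": 0.75,
    "toxic_habits": 1.0, "tumor_evolution": 0.8, "ecog": 1.0,
}
print(round(combine_tnorm(scores.values()), 3))   # overall patient similarity: 0.54
```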

5 Related Work

The problem of devising data integration frameworks has been extensively treated in the literature [23]. The mediator and wrapper architecture proposed by Wiederhold [54] and the data integration system approach presented by Lenzerini [32] represent the basis for the state of the art [15, 24]. The Semantic Web community has proposed various approaches that enable the integration and processing of Web data. KARMA [31] is a semi-automatic tool able to generate mapping rules between structured sources in different formats, e.g., CSV, JSON, and XML, and a unified schema. Albeit effective during the mapping definition phase, KARMA does not provide any support for the steps of data integration, curation, management, and analytics. DIG [19] and MINTE [10, 11, 18] also enable the creation of knowledge graphs, but they mainly focus on solving the problem of entity matching effectively. LDIF [6], LIMES [43], Sieve [37], Silk [29], and the RapidMiner LOD Extension [49] also tackle the problem of data integration; however, they resort to similarity measures and link discovery methods to match equivalent entities from different RDF graphs. With the aim of transforming structured data in tabular or nested formats like CSV, relational, JSON, and XML into RDF knowledge graphs, diverse mapping languages have been proposed. Exemplary mapping languages and frameworks include the RDF Mapping Language (RML) [14], R2RDF [52], and R2RML [48]. Additionally, a vast amount of research has been conducted to propose effective approaches for ontology alignment [17, 39, 9], as well as to effectively perform the curation of knowledge graphs [2, 35, 4]. Our knowledge-driven framework, while generic, facilitates the integration of existing components; thus, it can benefit from these tools to effectively solve the problem of transforming data into actionable knowledge.

6 Conclusion and Future Directions

We present a knowledge-driven framework able to integrate knowledge extraction, semantic data integration, query processing, and knowledge analytics for supporting decision and policy making. We have described the application of the framework in the biomedical domain and shown its potential for uncovering patterns that can enable the explanation of treatment interactions and the characterization of patients. The framework is part of the iASiS platform, and clinicians are starting to evaluate its outcomes. Although we focus on the biomedical domain, the general knowledge-driven framework has also been applied in other domains [11], e.g., law enforcement, job market applications, and smart manufacturing. In these domains, we observed that the framework is not only easy to configure, but also provides accurate results. We hope that the proposed techniques will help clinicians and data practitioners in the complex task of extracting valuable knowledge from heterogeneous datasets. In the future, we plan to define a hybrid approach that combines the wisdom of domain experts and users with the accuracy of machine learning approaches, to facilitate the evaluation of the knowledge graph and the uncovered insights.