1 Introduction and Motivation

The management of data generated in several disciplines, including Oceanography and Meteorology, currently faces great challenges. These stem largely from the recent exponential increase in data volume and in the diversity of sources, driven by technological growth and advances in remote ocean observatories [1]. In addition, very different data types must be handled together: physicochemical, geological, meteorological and biological data must all be integrated for the resulting analyses and information products to be meaningful for scientific, governmental, and productive purposes [2]. Taking into account the definition of Big Data (BD) [3], both ocean observation and weather data fit the “5V” characterization of BD (volume, velocity, value, veracity, and variety). Therefore, data management in this context can be considered a typical Big Data case [4]. For scientists, this situation presents both challenges and opportunities regarding access to and integration of the data they need to conduct novel research, which may lead to new discoveries enabled by the integration of multidisciplinary information sources [5, 6]. Both the Horizon 2020 program (H2020; footnote 1) of the European Union and, at the national level, the strategic plan Argentina Innovadora 2020, established by the Ministry of Science, Technology and Innovation (MINCyT) of Argentina, consider BD and data science fundamental disciplines for addressing complex, wide-ranging issues that require an interdisciplinary approach and a broad projection in the use of information. In research activities focused on the South Atlantic, data collection campaigns are scarce, and an adequate information management system is not readily available.
Therefore, it is necessary to develop systems capable of managing data integration and delivery, both for direct and indirect use by the participating research groups and institutions and for external users that require information (e.g., government agencies and third parties).

One advantageous feature of BD is its ability to manage information in schema-free formats that are agnostic with respect to technology and that allow for schema evolution, which is typically needed in the Natural Sciences. This permits practical internal representations tailored to specific purposes, for instance the management of datasets in graph form. The Semantic Web (SW) [7] addresses these needs by enabling the Linked Data (LD) Web [8], where data objects are uniquely identified and the relationships between them are defined explicitly. LD is a powerful and compelling approach to storing, disseminating and consuming scientific data from various disciplines [6, 9, 10]. It enables the publication, exchange and connection of data on the Web and offers a new means of integration and interoperability. More recently, the term knowledge graph (KG) has emerged [11]; it is used in research and business, generally in close association with SW technologies, LD, large-scale data analysis and cloud computing. The popularity of KGs is tied to the launch of the Google Knowledge Graph in 2012 (footnote 2) and to the introduction of other large databases by major technology companies, such as Yahoo, Microsoft, AirBnB and Facebook, which have created their own KGs to enhance semantic search [12]. Successful uses of KGs are not limited to industry: in the oceanographic domain, and in the life sciences in general, there is growing recognition of the advantages of SW technologies [13,14,15,16,17,18].

Related to these problems, two previous works addressed the creation of an oceanographic linked dataset, both developed jointly with the Centro de Investigación y Transferencia Golfo San Jorge (CIT-GSJ-CONICET): a proposal for publishing oceanographic campaign metadata [19], and the definition of the initial steps toward an oceanographic KG called OceanGraph KG [20]. Based on the experience gained in this previous work, a series of recommendations related to the interoperability and information integration of the Integrated Ocean Observing System (IOOS) [21] was proposed.

This paper gives an overview of OceanGraph KG and of recent efforts to integrate heterogeneous oceanographic and meteorological data. Sect. 2 presents the underlying idea of OceanGraph KG and its main features. Sect. 3 discusses its usefulness through case studies. Finally, Sect. 4 presents lessons learned and future guidelines.

2 OceanGraph KG Overview

The first version of OceanGraph, developed to integrate heterogeneous data by taking advantage of a KG, was described in [20]. OceanGraph bases its main structure on the relationships established between the selected datasets. The main classes that we define and reuse are: campaigns, occurrences, papers, researchers, environmental variables and positions. When a user queries OceanGraph about a researcher, the expected results include the oceanographic campaigns in which she or he was involved (from the National Marine Data System (NMDS; footnote 3)), the datasets collected (from the Global Biodiversity Information Facility (GBIF; footnote 4) and the Ocean Biogeographic Information System (OBIS; footnote 5)), and the papers that researcher wrote (from Springer Nature SciGraph; footnote 6). In the same way, a user can query data related to the occurrence of a species, and the KG should return the campaigns in which it was observed, information about the person who collected it, the exact place and date, and associated variables that may be of importance (e.g., weather or other environmental conditions during collection).

2.1 Ontologies and Vocabularies Used

To ensure that our data are available to multiple scientific communities, resource descriptions should adopt well-known standards. Below we describe the main resources related to the oceanographic domain, as well as the standards selected to model information on agents and organizations. Data providers both define their own ontologies and reuse existing ones.

- The Natural Environment Research Council’s (NERC) Vocabulary Server (NVS) [14] provides access to standardized lists of terms that are used to facilitate data mark-up, interoperability and discovery in the marine science domain. NVS is published as Linked Data on the Web using the data model of the Simple Knowledge Organization System (SKOS; footnote 7).

- GeoSPARQL [22], developed by the Open Geospatial Consortium (OGC; footnote 8), defines an ontology that supports geospatial semantics. This ontology, based on well-known OGC standards, is intended to provide a basis for the standardized exchange of geospatial RDF data, with support for querying and qualitative spatial reasoning through the W3C standard SPARQL [23].

- Darwin Core Standard [24] provides a stable, straightforward and flexible structure for compiling and sharing biodiversity data from different sources. OceanGraph uses it to describe properties and concepts related to occurrences of marine species.

- The Geolink [15] dataset includes diverse information, such as port stops made by oceanographic cruises, physical sample metadata, funding for research projects, and staff. This dataset is based on an ontology design pattern (ODP) that is generic enough to be adapted to the modeling needs of OceanGraph.

- BiGe-Onto [25] is an ontology designed to manage Biodiversity and Marine Biogeography data. BiGe-Onto is built around the idea of occurrence (the observation of a species in a place at a given time). Since the censuses are observations of SES at a specific time and place, we consider that BiGe-Onto fits the nature of our data. BiGe-Onto also reuses appropriate vocabularies to represent information from these domains. The most important of them is Darwin Core (DwC) [24], from which it reuses several classes that will be considered here: Occurrence, Event, Taxon and Organism. BiGe-Onto also reuses foaf:Person, void:Dataset and dcterms:Location. Our ontology models occurrences that are related to other concepts through the following relationships.

  • bigeonto:associated. Each occurrence is described in terms of the existence of an organism that was observed at a specific place and time. The organism and the taxon are related through the bigeonto:belongsTo property.

  • bigeonto:has_event. The occurrence has a location (since occurrences are species observations), given by the relation bigeonto:has_location, and the location belongs to a specific environment through bigeonto:caracterizes. The Relations Ontology (RO; footnote 9) defines the relationships between bigeonto:Environment and the classes of the Environment Ontology (EnvO) [26].

  • dwciri:recordedBy. Unlike its analog dwc:recordedBy, this property admits non-literal ranges, so it can point to URIs that describe the people, groups or organizations involved in the occurrence, e.g., to relate a person to their ORCID.

  • dwciri:inDataset. Relates occurrences to the dataset to which they belong.

- SSN/SOSA [27]. To describe sensors and their oceanographic observations, we use the Semantic Sensor Network (SSN) ontology and, in particular, the Sensor, Observation, Sample and Actuator (SOSA) ontology, which describes the elementary classes and properties needed to represent observed quantities such as depth, temperature and salinity. Both vocabularies are suitable for a variety of applications, such as large-scale scientific monitoring and satellite imagery. The SSN ontology is an OWL vocabulary developed by the W3C in collaboration with the Open Geospatial Consortium (OGC), which favors its reuse in many other applications.
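
As an illustration of the SOSA vocabulary mentioned above, the following minimal Python sketch builds the triples for a single temperature observation and prints them in an N-Triples-like form. It is not taken from OceanGraph: the example.org URIs, the sensor name, the value and the timestamp are hypothetical, and the NVS-style parameter URI is an assumed form of the identifier for the P01 code TEMPCU01. Only the SOSA terms themselves (sosa:Observation, sosa:madeBySensor, sosa:observedProperty, sosa:hasSimpleResult, sosa:resultTime) come from the W3C vocabulary.

```python
SOSA = "http://www.w3.org/ns/sosa/"
XSD = "http://www.w3.org/2001/XMLSchema#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def observation_triples(obs_uri, sensor_uri, property_uri, value, result_time):
    """Describe one sensor observation with core SOSA terms."""
    return [
        (obs_uri, RDF_TYPE, SOSA + "Observation"),
        (obs_uri, SOSA + "madeBySensor", sensor_uri),
        (obs_uri, SOSA + "observedProperty", property_uri),
        # Literals are stored already quoted and typed.
        (obs_uri, SOSA + "hasSimpleResult", f'"{value}"^^<{XSD}double>'),
        (obs_uri, SOSA + "resultTime", f'"{result_time}"^^<{XSD}dateTime>'),
    ]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    lines = []
    for s, p, o in triples:
        obj = o if o.startswith('"') else f"<{o}>"
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


# Hypothetical observation; all identifiers and values are illustrative.
triples = observation_triples(
    obs_uri="http://example.org/obs/1",
    sensor_uri="http://example.org/sensor/ctd-1",
    property_uri="http://vocab.nerc.ac.uk/collection/P01/current/TEMPCU01/",  # assumed URI form
    value=12.4,
    result_time="2015-03-10T14:00:00Z",
)
print(to_ntriples(triples))
```

Keeping the observed property as a URI rather than a free-text label is what lets observations from different providers be matched on the same controlled term.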

2.2 Cross-linking

One challenge in improving information discovery is generating links between the different URIs of the KG. The interlinking of OceanGraph datasets was carried out semi-automatically. People who participate in an oceanographic campaign commonly publish their results afterwards in scientific journals. The case of a data paper (a scientific paper that describes data) is more complex still, since it comprises the publication itself plus the primary data that supports it in OBIS or GBIF. OceanGraph allows people or species to be linked across repositories, thus ensuring semantic interoperability between datasets. To generate the links we use the SILK framework (footnote 10) and its declarative language Silk-LSL (Link Specification Language), with which the user specifies the type of RDF links to be discovered between datasets and the conditions that must be met. For example, to relate researchers who obtained data in a campaign to the results published in OBIS or GBIF, the Levenshtein distance is used to disambiguate entities by calculating the similarity between them.

This operator receives two inputs, dwc:recordedBy (footnote 11) and foaf:name. If the two values are similar enough to indicate the same person, SILK generates a link between the corresponding resources using the owl:sameAs property. Figure 1 shows the relationships used to integrate OceanGraph datasets.
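
The matching idea can be sketched in plain Python. This is not the Silk-LSL specification used for OceanGraph: the normalization (trimming and lowercasing), the similarity formula, the 0.85 threshold, and all person URIs and names below are assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the distance to [0, 1]; 1.0 means identical after normalization."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def same_as_links(recorded_by, foaf_names, threshold=0.85):
    """Emit (uri, 'owl:sameAs', uri) for name pairs at or above the threshold."""
    links = []
    for source_uri, recorded_name in recorded_by.items():
        for target_uri, name in foaf_names.items():
            if similarity(recorded_name, name) >= threshold:
                links.append((source_uri, "owl:sameAs", target_uri))
    return links


# Hypothetical people: names as they might appear in dwc:recordedBy and foaf:name.
recorded_by = {"ex:obis/person/9": "J Perez", "ex:gbif/person/4": "M. Ruiz"}
foaf_names = {"ex:campaign/person/1": "J. Perez", "ex:campaign/person/2": "A. Gomez"}
links = same_as_links(recorded_by, foaf_names)
```

A threshold this simple will produce false matches for short or common names, which is one reason a semi-automatic process with manual review is a sensible choice.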

2.3 Availability

One of the most important design decisions when developing a KG is the platform that supports it. After several performance comparisons, we chose GraphDB (footnote 12) because it allows quick integration of new information sources; parses structured data in CSV, XLS, JSON, XML and other formats; generates RDF data and stores it in a local or remote SPARQL endpoint; and, last but not least, supports cleaning the input data with a generic scripting language. GraphDB lets users explore the hierarchy of RDF classes and their instances (Class hierarchy menu). In the same way, users can inspect the relationships between the KG classes and visually explore how many links were created between instances of different classes (Class relationship). To access the OceanGraph dataset, the user must log in at http://web.cenpat-conicet.gob.ar:7200/login with the following credentials (user: oceangraph, password: ocean.user). OceanGraph KG is also available for download in [28] under a CC BY 4.0 license. Table 1 summarizes the main links for exploring the knowledge graph in various ways.

Table 1. Main features of OceanGraph KG.

3 Big Data Use-Cases

As a result of the process described in the previous sections, a set of nodes and links was created to connect references from the input data to entities and relationships within the KG. We extended this generic approach to integrate different functionality modes that are typical in BD contexts.

3.1 Complementing Information with SN SciGraph

As the development and adoption of novel research devices grows exponentially, it is becoming harder to track all the documents related to a given scientific subject. The SciGraph dataset integrates data sources from Springer Nature and collects information about the research landscape: research projects, publications, conferences, funding agencies and others. This dataset [29] includes around 35 million records and is refreshed on a monthly basis.

It is often necessary to connect researchers or other stakeholders who contribute to the same subject. This is specifically the case in the oceanographic domain, where one needs to identify the researchers who took part in an oceanographic campaign and connect them with researchers elsewhere in the world who work on the same subjects. In the case study of this paper, the research subject is physical oceanography.

Fig. 1. Conceptual diagram of OceanGraph KG. For simplicity, only the main object properties are shown, which allow relationships between the classes of each dataset to be established.

We explore the instances of the sg:Subject class and their related subjects using the skos:narrower property (core#narrower). As can be seen in Fig. 2, there are five subjects directly related to physical oceanography (ocean science, marine biology, climate sciences, etc.).

Fig. 2. Exploring terms related to the concept physical oceanography using the GraphDB visual interface.
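
The kind of exploration done visually in GraphDB can also be expressed as a small traversal over skos:narrower links. The sketch below works on an in-memory list of triples; the concept identifiers (ex:physical-oceanography, ex:topic-a, and so on) are hypothetical placeholders and are not taken from SciGraph, and the direction of the links is an assumption.

```python
from collections import defaultdict

NARROWER = "skos:narrower"


def build_index(triples):
    """Index objects by (subject, predicate) for quick neighbor lookup."""
    index = defaultdict(set)
    for s, p, o in triples:
        index[(s, p)].add(o)
    return index


def narrower_terms(index, concept, max_depth=1):
    """Concepts reachable from `concept` via skos:narrower within max_depth hops."""
    found, frontier = set(), {concept}
    for _ in range(max_depth):
        # Follow one more hop, skipping concepts that were already collected.
        frontier = {o for c in frontier
                    for o in index.get((c, NARROWER), ())} - found - {concept}
        found |= frontier
    return found


# Hypothetical subject hierarchy (placeholder identifiers).
triples = [
    ("ex:physical-oceanography", NARROWER, "ex:topic-a"),
    ("ex:physical-oceanography", NARROWER, "ex:topic-b"),
    ("ex:topic-a", NARROWER, "ex:topic-c"),
]
index = build_index(triples)
direct = narrower_terms(index, "ex:physical-oceanography")
two_hops = narrower_terms(index, "ex:physical-oceanography", max_depth=2)
```

With max_depth=1 the function returns only directly related subjects, which corresponds to the first ring of nodes in a visual graph view.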

3.2 Macroecological Analyses

A very common requirement of macroecological analyses, particularly those that consider the environmental drivers of species distributions and how those distributions are expected to shift as the climate changes, is to match species occurrences with environmental variables. This case study shows how KG information can be exploited through the relationships between occurrences and environmental variables; as the study variable we use the temperature of the water body. In particular, we need to associate the following elements: (i) the occurrences of the species under study, (ii) the region of interest (in our case Golfo Nuevo), (iii) a specific time frame and (iv) the measurements of the water body temperature.

The first step is to define the region under study, and then to retrieve the occurrences of the chosen species in a specific time frame. To handle temporal concepts, we use the Time Ontology [30]. Since NERC provides URIs for each of the variables that we need to analyze, we only need to look up the URI for the temperature of the water body, which is defined as SDN:P01::TEMPCU01. Table 2 shows an RDF fragment that includes the concepts involved in performing the analysis.

Table 2. RDF serialization of the concepts involved in macroecological analysis.

Listing 1.1 shows the SPARQL query that we implemented; it associates occurrences of Merluccius hubbsi (a fish species of particular scientific and productive interest) with the temperature in a particular region. To do this, we define Golfo Nuevo as an instance of geo:Polygon, and then look for observations of Merluccius hubbsi whose associated locations are instances of the class geo:point. One advantage of adopting GeoSPARQL is that we can perform spatial operations, e.g., determine whether a point is contained within a polygon, for which we use the function geof:sfWithin. As a last step, we obtain the temperature (also georeferenced), which NERC defines as TEMPCU01. To execute the query in GraphDB, see the link in footnote 13. This example shows how our proposed KG-based data integration effort bridges the gap between the sometimes isolated data collection initiatives that exist worldwide and a centralized, uniform data access that may be automated. A standardization such as the one provided by our proposal further enables the next and more fruitful BD stages, including massive automated data analysis, online real-time actionable dashboards, and visual analytics.

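Listing 1.1. SPARQL query relating occurrences of Merluccius hubbsi to water temperature measurements in Golfo Nuevo.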
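
The logic of the case study (a spatial containment test, a time-frame filter, and a join with temperature measurements) can be sketched outside SPARQL in plain Python. This is an illustration under stated assumptions, not the authors' query: the ray-casting routine only mimics the kind of test that geof:sfWithin performs on a point and a polygon, the rectangle is a placeholder and not the real outline of Golfo Nuevo, the records are invented, and pairing every occurrence in scope with every temperature reading in scope is an assumed join rule.

```python
from datetime import date

# Placeholder rectangle as (lon, lat) vertices; NOT the real outline of Golfo Nuevo.
REGION = [(-65.1, -43.0), (-64.0, -43.0), (-64.0, -42.5), (-65.1, -42.5)]


def point_in_polygon(lon, lat, polygon):
    """Ray casting: True if (lon, lat) lies inside the polygon ring."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def in_scope(record, polygon, start, end):
    """A record qualifies if it falls in the time frame and inside the region."""
    return (start <= record["date"] <= end
            and point_in_polygon(record["lon"], record["lat"], polygon))


def occurrences_with_temperature(occurrences, temperatures, species, polygon, start, end):
    """Pair each qualifying occurrence with each qualifying temperature value."""
    hits = [o for o in occurrences
            if o["species"] == species and in_scope(o, polygon, start, end)]
    temps = [t for t in temperatures if in_scope(t, polygon, start, end)]
    return [(o, t["value"]) for o in hits for t in temps]


# Invented records for illustration only.
occurrences = [
    {"species": "Merluccius hubbsi", "lon": -64.5, "lat": -42.8, "date": date(2015, 3, 10)},
    {"species": "Merluccius hubbsi", "lon": -63.0, "lat": -42.8, "date": date(2015, 3, 11)},  # outside region
    {"species": "Merluccius hubbsi", "lon": -64.6, "lat": -42.7, "date": date(2016, 1, 5)},   # outside time frame
]
temperatures = [
    {"lon": -64.4, "lat": -42.7, "date": date(2015, 3, 10), "value": 14.2},
    {"lon": -60.0, "lat": -40.0, "date": date(2015, 3, 10), "value": 16.0},  # outside region
]
pairs = occurrences_with_temperature(
    occurrences, temperatures, "Merluccius hubbsi",
    REGION, date(2015, 1, 1), date(2015, 12, 31),
)
```

In a triple store the same filters are expressed declaratively, which avoids exporting and re-joining the datasets by hand.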

4 Conclusion

Based on the results of this experience, KGs proved powerful and flexible enough to integrate diverse datasets. However, the integration process required to map input data correctly into a KG can be laborious, since automated techniques have so far been unable to fully capture the semantics of input data. Through the OceanGraph development process, we learned several lessons on how LD can contribute to addressing important BD challenges, especially in the area of oceanographic data.

First, the number of linked datasets grows every year, and these datasets are interrelated in an increasingly entangled body of scientific information. This presents new challenges, which require treating scalability and performance as crucial aspects of any future facility [31]. It is here that LD needs to incorporate BD techniques and methodologies, specifically in data management.

Second, from the BD perspective, it is also a priority to start incorporating linked data results. Currently only a few large companies are able to take advantage of BD [32], which is unfortunate, since individual scientists, small research groups, nongovernmental agencies, and other stakeholders engaged in potentially relevant activities are at a disadvantage relative to large commercial interest groups. In this foreseeable scenario, some questions that arose in other contexts begin to become visible. Among others we can mention [33]: How can individual users delve into BD in a fruitful manner? Once useful data have been found, how can they be made understandable to laypersons with little or no prior data science knowledge? How can data be handled without privacy or licensing breaches? How can data generated in different cultures and languages (or even charsets) be rendered useful? What standards for data and metadata are necessary? How can data from different repositories be linked? What governance standards should be supported or even enforced to guarantee privacy, traceability, auditing, and the other technical, ethical and legal features that systems like this must implement?

BD is bound to reach scientific enterprises worldwide, but its value will increase, and the whole community will be able to take advantage of it, only when it becomes transparent and usable by the largest number of users [34]. From this perspective, it is necessary to consider that BD, at least in the context of scientific enterprises, requires multi- and interdisciplinary integration and that, within such a decentralized scenario, the new challenges are also associated with meaning.