1 Introduction

The ocean environment is closely tied to ecological systems and human life, furnishing humankind with diverse resources and services. The ocean supplies oxygen, regulates the climate, sequesters carbon, and provides food and medicine, all of which are of great significance to the survival and development of human society.

With the rapid development of information technologies, the data acquired from ocean observation platforms grows exponentially every day. Ocean data is mainly obtained through various observation devices on land, at the sea surface, underwater, and in aerospace, and accumulates from different time periods, scales, and regions. Compared with ordinary data, ocean spatiotemporal data places more emphasis on dynamic processes, which are mainly reflected in ocean phenomena. The spatiotemporal process of an ocean phenomenon not only occupies a certain spatial scope but also maintains a certain continuity in time: its characteristics differ across temporal states and across moments, and some characteristics change constantly.

The spatiotemporal process of ocean environmental data plays a primary role in ocean environment research. There are many types of ocean data, and data from different sources and of different types vary in format and record type. In practice, multiple data formats often must be used together, which brings great inconvenience. After obtaining ocean data, researchers process the data in different ways and extract valuable information from it. To obtain valuable information that meets their needs, it is necessary to reorganize and represent the data in a readable and operable way so that it can be further exploited and utilized.

Data representation is the transfer of our experience of the actual world into the computational domain; it determines how data is stored, processed, and transmitted [1]. However, the inherent complexity of ocean data poses a great challenge to representation. Traditional representation methods cannot present the spatial and temporal features of the ocean effectively. Therefore, multiple updated and enhanced representations have been proposed to describe the dynamic and flexible processes of ocean data. Among these methods, the semantic web [2] has gradually become the main trend for representing spatiotemporal data.

The semantic web can be traced back to 1956, when Richens first proposed the semantic net, or semantic network [3]. It was first used for knowledge-based system reasoning and problem solving. Later, MYCIN [4] was designed as an early rule-based expert diagnosis system. Then RDF [5] and OWL [6] were introduced as the core schemas of the semantic web. A series of open-domain semantic webs and ontologies followed, including Cyc [7], Freebase [8], DBpedia [9], YAGO [10], and PROSPERA [11].

In 2012, the knowledge graph [12] was first introduced by Google. Since then, knowledge graphs have gained great popularity and been explored further. Knowledge Vault [13], a knowledge fusion framework, was developed on the basis of knowledge graphs for large-scale knowledge. A knowledge graph is essentially a large semantic web for describing concepts, entities, and their relationships in the objective world [14]. It provides a more human-like way to represent information and knowledge in the computer world, and is characterized as large-scale, semantically rich, high-quality, and structure-friendly.

The graph-based structure of a knowledge graph can effectively represent and store the spatiotemporal characteristics of ocean data through entities and relationships. Geographic knowledge graphs usually reflect spatiotemporal features in data; typical examples include LinkedGeoData [15] and LinkedSpatiotemporalData [16]. In constructing an ocean knowledge graph, advanced theories, techniques, and methods are applied to embed spatiotemporal elements into the structure, including spatiotemporal entity recognition, spatiotemporal disambiguation, semantic extension, and so on. In this survey, Section 2 introduces representation methods for ocean spatiotemporal data based on its characteristics, Section 3 elaborates the design concept of a knowledge graph, and Section 4 introduces the construction steps of an ocean spatiotemporal knowledge graph. Section 5 comprehensively explains single-node and multi-node methods for processing ocean data. Section 6 presents the performance evaluation of ocean spatiotemporal knowledge graphs, and Section 7 summarizes this survey.

2 Representation of ocean data

Data representation is a reflection of real-world data in a computer-readable and operable form, providing an approach to analyzing raw data. Simple data representations include binary digits, numeric data, character data, etc. As data formats, structures, and volumes grow more complex, representation methods must also be "upgraded" into different forms, including tables, graphs, vectors, functions, distributions, data models, etc. For ocean data, a representation should fully depict the heterogeneity and spatiotemporal characteristics of the data.

2.1 Ocean data characteristics

Ocean data is vast and diverse, including meteorology data, hydrology data, hydroacoustics data, seafloor topography and geomorphology, ocean chemistry data, etc. Apart from the characteristics of big data, that is, high volume, high variety, high velocity, high value, high veracity, and high validity, ocean data is also characterized by multiple ocean-specific properties. The major characteristics that impact ocean data representation include:

  • High volume. Ocean observation programs cover almost all the oceans worldwide and carry out huge amounts of periodic and real-time data collection. The volume of ocean data grows continuously, and the overall volume has reached the exabyte (EB) level.

  • Heterogeneous. Ocean data is acquired from a wide range of sources, including ocean surveys, observation platforms, remote sensing, and so on. The formats and quality of these data also vary with their observation methods, extraction models, structures, applications, and analyses. These characteristics make ocean data heterogeneous and high-dimensional.

  • Dynamic. The ocean is a highly dynamic system with rapidly changing data flows. With advances in observation methods and devices, and improvements in data processing, ocean data is collected at second-level intervals, so the information in ocean databases changes constantly and updates become more frequent.

  • Spatiotemporal. Ocean data inherently carries both spatial and temporal attributes. On the spatial scale, ocean data covers nearshore, offshore, polar, sea-surface, and deep and distant ocean regions, etc. On the temporal scale, ocean data spans variability from seconds, minutes, hours, and days to seasons, years, and even multiple centuries. Ocean data exhibits different characteristics at different spatiotemporal levels.

Therefore, an ocean spatiotemporal data representation must cover all these characteristics in order to describe ocean data accurately and provide a solid foundation for further analysis.

2.2 Data representation methods

Researchers have applied different representation methods to depict spatiotemporal ocean data and have continuously improved them. For example, a map is a typical representation of spatially distributed data, but maps generally reflect only surface information and carry no information about the deeper layers of the earth. To address this limitation, Chung et al. [1] analyzed three classical representations (probability measures, Dempster-Shafer evidential belief functions, and fuzzy logic functions) in 1993, applying favorability functions to represent information from m layers of the earth. In [17], a spatial data representation with dynamic graphics was proposed, along with a classification method whereby maps can incorporate dynamism [18]. In 1996, Tuohy et al. [19] proposed a geophysical data representation method based on interval B-spline functions, which facilitates data archiving and reduces data storage. A spline is a piecewise polynomial curve that performs well in multi-dimensional data interpolation. In 2010, Bibby et al. [20] proposed a hybrid representation method for the ocean environment, in which stationary objects are represented by point features and the trajectories of dynamic objects are represented by cubic splines.

However, these methods either cannot present the spatiotemporal process completely or cannot capture the dynamic process well. Therefore, the graph-based semantic web has been widely applied to the representation of ocean spatiotemporal data, because graph-based semantics is better suited to representing dynamic and heterogeneous data.

The use of the semantic web dates back to the 2000s [21,22,23]. Several methods have been proposed to incorporate spatiotemporal and other features into ocean data representation. Raskin [21] developed a semantic web for geo-terminology (SWEET) by building a collection of spatiotemporal ontologies. MacGregor [22] designed a semantic primitive (SEW) especially for contextualized data, to perform abstractions on related resources. A mapping between the semantic web and geospatial data processing standards was established for Spatial Data Infrastructures (SDI) [16].

To address the information deluge of recent years, [24] presented an agile data architecture (CRISIS) in 2018 for the real-time representation of multi-source heterogeneous ocean data streams with semantic web technologies. Later, [25] presented a reorganized and enhanced version of [24], including an isolation of functionalities to support multi-source querying and the discovery of alarms. Wang et al. [26] designed a formalized geographic knowledge representation (GeoKG) that describes the evolution of spatiotemporal data. Ren et al. [27] proposed a unified semantic model (OEDO) for representing heterogeneous ocean data via metadata.

3 Design of ocean spatiotemporal knowledge graph

Knowledge graphs are structured semantic knowledge bases for effectively and comprehensively describing concepts in the physical world and the complex relationships between them, by aggregating large amounts of knowledge and creating connections between pieces of information, thus enabling quick responses and knowledge reasoning. By domain, knowledge graphs are usually divided into general knowledge graphs and domain knowledge graphs. A general knowledge graph can be regarded as a structured encyclopedic knowledge base that contains a large amount of real-world common-sense knowledge with high convergence. A domain knowledge graph, also referred to as an industry or vertical knowledge graph, is usually oriented to a specific domain and based on industry data, and has been widely used in industrial fields. In this survey we focus only on domain knowledge graphs for ocean spatiotemporal data. The logical structure of a knowledge graph consists of a data layer and a schema layer.

3.1 Data layer

The data layer stores real-world data. Data forms include structured data, semi-structured data (XML, JSON), and unstructured data (images, recordings, or videos). In the data layer, data or facts are stored as RDF (Resource Description Framework). RDF provides a unified standard for describing entities and resources, and is itself a method of data representation. RDF is formally represented as an SPO (Subject, Predicate, Object) triple, each of which stands for a piece of knowledge in a knowledge graph. RDF consists of nodes and edges: nodes represent entities/resources or attributes, and edges represent relations between entities or between an entity and an attribute. Triples can be written as "entity-relation-entity" or "entity-attribute-attribute value".

Entities are the basic elements of a knowledge graph and refer to specific names of people, organizations, places, dates, times, etc. A relation is a semantic relationship between two entities and is an instance of a relationship defined in the schema layer. An attribute is a description of an entity, a mapping between an entity and an attribute value. However, RDF is limited in its ability to distinguish classes from objects and to define and describe the relations of classes or attributes. On the basis of RDF, researchers developed RDFS (RDF Schema) [28] and OWL (Web Ontology Language) [6]. RDFS is a set of predefined vocabulary for describing RDF, while OWL is an extension of RDFS that provides fast and agile data modeling with effective reasoning.
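To make the triple structure concrete, here is a minimal sketch using the Python rdflib library. The ocean namespace and the class and property names (MesoscaleEddy, observedAt, locatedIn, and so on) are illustrative assumptions, not terms from any standard ocean ontology.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

# Illustrative namespace; not an official ocean ontology.
OCEAN = Namespace("http://example.org/ocean#")

g = Graph()
g.bind("ocean", OCEAN)

eddy = URIRef("http://example.org/ocean#Eddy_2021_001")

# "entity-relation-entity" triple: the eddy is an instance of a class.
g.add((eddy, RDF.type, OCEAN.MesoscaleEddy))
# "entity-attribute-attribute value" triples: spatial and temporal attributes.
g.add((eddy, OCEAN.centerLatitude, Literal(25.3, datatype=XSD.double)))
g.add((eddy, OCEAN.centerLongitude, Literal(130.1, datatype=XSD.double)))
g.add((eddy, OCEAN.observedAt, Literal("2021-06-01T00:00:00Z", datatype=XSD.dateTime)))
# Relation to another entity.
g.add((eddy, OCEAN.locatedIn, OCEAN.NorthPacific))

# Serialize the graph as Turtle (returns a str in rdflib 6.x).
print(g.serialize(format="turtle"))
```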

3.2 Schema layer

The schema layer sits on top of the data layer and is the core structure of the knowledge graph. It acts as the conceptual model and logical foundation of the knowledge graph and provides specification constraints for the data layer. In most cases, an ontology is adopted as the schema layer, and the data layer is constrained by the rules and axioms the ontology defines. A knowledge graph can thus be regarded as an instantiated ontology, its data layer being an instance of the ontology. In the schema layer, nodes represent ontology concepts and edges represent relations between concepts.

3.2.1 Ontology

Ontology originated as a branch of philosophy. In computer science and information technology, an ontology is a specification vocabulary for a shared domain of discourse, comprising definitions of classes, relations, functions, and other objects [29]. An ontology provides a shared vocabulary that can be used to model the types of objects or concepts and the properties and relations that exist within a given domain [30]. The purpose of an ontology is to capture the knowledge of related domains, identify the commonly accepted vocabulary, describe the semantics of concepts through the relations between them, and provide a consensus understanding of the knowledge.

Knowledge in ontologies is represented formally through classes, relations, functions, axioms, and instances. Perez et al. [31] organized ontologies using a taxonomy built around five basic modeling primitives.

  1. Class or concept: A class or concept refers to any thing, such as job descriptions, functions, behaviors, strategies, and reasoning processes. Semantically, it represents a collection of objects; its definition includes the name of the concept, a collection of relations with other concepts, and a description of the concept in natural language.

  2. Relation: A relation is an interaction between concepts in the domain, formally defined as a subset of an n-dimensional Cartesian product; for example, the subClassOf relation.

  3. Function: A function is a special type of relation in which the first (n − 1) elements uniquely determine the nth element, formally defined as F : C1 × C2 × ... × Cn−1 → Cn. For example, memberOf is a function, where memberOf(x, y) means y is a member of x.

  4. Axiom: An axiom is an assertion that always holds, such as "concept B belongs to the scope of concept A."

  5. Instance: An instance represents an element of a class; semantically, an instance represents an object.

Many ontologies already exist, and the process of constructing one varies with the target domain and specific project. Since there is no official standard for ontology construction, researchers have proposed a series of principles from practice, some of which have proved pragmatic. The five principles proposed by Gruber in 1995 [32] are the most influential; they provide the basic ideas and framework for constructing ontologies, though their obvious shortcoming is that they offer only a rather vague standard. It is now generally accepted that constructing a domain-specific ontology requires the involvement of domain experts. The principles for ontology construction include:

  1. Clarity and Objectivity: An ontology should give clear and objective semantic definitions of its terms, supported by objective definitions and natural-language documentation.

  2. Completeness: The definition of a term should be complete and fully express the meaning of the described term.

  3. Coherence: The inferences drawn from the terms should be compatible with the meanings of the terms themselves, i.e., they should support reasoning consistent with their definitions without contradiction. The defined axioms and the natural-language documentation should also be consistent.

  4. Maximum Monotonic Extendibility: Adding general or specialized terms to an ontology should not require modifying its existing conceptual definitions and content, and new terms should be definable from existing concepts.

  5. Minimal Ontological Commitments: An ontology should hold as few constraints on the modeled objects as possible. A commitment in an ontology is the consensus on how to use the shared vocabulary in a consistent and compatible way. In general, ontological commitments need only be sufficient to satisfy specific knowledge-sharing needs, which can be ensured by defining the least-constraining axioms and only the vocabulary needed for communication.

3.2.2 Spatiotemporal ontology

Spatiotemporal ontologies are required to represent the spatial and temporal attributes of related entities. A spatiotemporal ontology is more than an "enhanced" ontology: it must combine business scenarios and domain knowledge, and semantically and spatially extend knowledge concepts, entities, and relationships according to the characteristics of spatiotemporal knowledge. In addition to defining semantic linkages, a spatiotemporal knowledge graph must also describe spatial and temporal interactions, and the key design challenge is how to map between spatiotemporal relations and semantic relations. Galton [33] argued that a fully spatiotemporal ontology must extend field-based and object-based ontologies into spatiotemporal domains, especially for the natural phenomena that the data describe. However, spatiotemporal information processing today faces two major problems: difficulties in information integration caused by incompatible terminology, and a lack of interoperability among different systems [34].

At present, there are two ways to design ontologies for spatiotemporal data. The first is to extend existing ontologies by adding or optimizing spatiotemporal entities and relations to enrich the original semantics. Bittner et al. [35] proposed an ontological theory that describes both dynamic spatiotemporal processes and constant enduring entities. To enhance the exchange and integration of semantically heterogeneous spatiotemporal data, Bittner [34] specified the meanings of terms describing the basic types of entities and relations used in almost every domain and developed a formal logic-based ontology from an axiomatic theory.

Some research builds on existing open-domain ontologies. YAGO2 [36] is a spatially and temporally enhanced knowledge base built from Wikipedia, WordNet, and GeoNames; it extends YAGO [37] (which unifies Wikipedia's extensive lexicon with WordNet's taxonomy at high coverage, significantly improving the efficiency of information extraction) by adding a temporal dimension and a spatial dimension to both entities and facts. In [38], researchers provided a timely extension of YAGO that also extracts temporal facts from Wikipedia. For information integration, an ontology with integrated spatiotemporal entities was developed to fit dynamic phenomena [39]. Kurte et al. [40] offered an ontological framework that integrates spatiotemporal dimensions for describing dynamic patterns. Hornsby et al. [41] proposed a method for tracking spatiotemporal change based on object identity; this work systematically derived the semantics associated with changes and could extract more types of dynamic spatiotemporal change than earlier research. In [42], structured spatiotemporal querying over several Open Data sets was proposed by adding geo-entities, temporal entities, and links between them.

The other approach is to construct a unified spatiotemporal ontology from scratch. Grenon [43] proposed a realist formal spatiotemporal ontology, presented as a theory whose framework can be applied to various spatiotemporal domains. Carstensen [44] presented a new proposal for designing spatiotemporal ontologies rooted in cognitively motivated spatial semantics, and leveraged selective attention to define an ontological upper structure covering spatiotemporal domains. In [45], researchers developed ontologies that specifically resolve the semantic ambiguity of spatiotemporal entities. Grenon et al. [46] presented another modular ontology to describe changing, dynamic features as well as snapshots in time. Arpinar et al. [47] provided a geospatial ontology (SWETO) that integrates analytics across the spatial, temporal, and thematic dimensions of information.

4 Construction of ocean spatiotemporal knowledge graph

There are two methods for constructing a knowledge graph: top-down and bottom-up. The top-down method first defines the ontology and data schema for the knowledge graph and then extracts ontology and schema information from high-quality data into the knowledge base, with the help of structured data sources such as encyclopedic websites. The bottom-up method extracts resource schemas from public data by technical means, selects the schemas with higher confidence, adds them to the knowledge base after manual review, and then constructs the top-level ontology schema afterwards. Bottom-up construction organizes entities inductively to form bottom-level concepts and then gradually abstracts upward to form top-level concepts; the schemas can be converted from existing standards or generated by mapping high-quality domain data sources. At present, knowledge graph construction generally adopts the bottom-up method, so we discuss only bottom-up construction in this survey.

The basic process of constructing a spatiotemporal knowledge graph is shown in Figure 1. The construction comprises six steps: knowledge modeling, knowledge storage, knowledge extraction, knowledge fusion, knowledge computation, and application. It starts with raw data processing, where the data may be structured, unstructured, or semi-structured. Knowledge elements, that is, entities and relations, are then extracted by a series of automated or semi-automated techniques and stored in the schema layer and data layer of the knowledge base.

Figure 1: Knowledge graph construction steps

4.1 Knowledge modeling and knowledge storage

Knowledge modeling abstracts knowledge according to its characteristics and the actual demands of the industry, within the knowledge graph paradigm. Knowledge modeling is essentially the same process as representation, which has been discussed in Section 2.

Knowledge storage directly influences the efficiency of data querying and application. At present, there are two general approaches. The first is to store knowledge in a standardized format such as RDF, which has been discussed in Section 3. The other is to use graph databases for storage, which we discuss in detail in Section 5.

4.2 Knowledge extraction

The entities, attributes, and relations among entities are extracted from various types of data sources, and the ontological knowledge representation is formed on that basis. Knowledge extraction is a technique for automatically extracting structured information such as entities, relationships, and entity attributes from semi-structured and unstructured data. Different data sources call for different extraction techniques. For structured data (e.g., maps, gazetteers), spatial entities, attributes, and their relations are extracted automatically from the database by establishing mappings between database concepts and knowledge graph ontologies and applying rule-based reasoning. For semi-structured data (e.g., web tables and list data), corresponding template extractors can be built. For unstructured data (e.g., webpage text or other text), an existing knowledge graph can be used to build a training set by distant supervision, and extraction is then performed with deep learning algorithms. Knowledge extraction includes entity extraction, relation extraction, and attribute extraction.

4.2.1 Entity extraction

Entity extraction, also called named entity recognition (NER), automatically identifies named entities in text. Its main tasks are to identify named entities and to classify them.

DeepDive [48] is a knowledge extraction tool developed at Stanford University. It extracts structured knowledge from less-structured data and performs statistical inference without requiring users to implement machine learning algorithms themselves. In [49], researchers developed a knowledge extraction system that links ontological classes to influenza-related spatiotemporal text data on Twitter. In [50], a knowledge extraction approach was presented that combines temporal and spatial information retrieval in text documents.
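As an illustration of NER, the sketch below runs spaCy's general-purpose English pipeline over an invented ocean-related sentence; a real ocean knowledge graph pipeline would presumably use a model fine-tuned on domain corpora with domain-specific entity labels.

```python
import spacy

# General-purpose English pipeline; a production ocean KG would use a model
# fine-tuned on ocean-domain corpora with custom labels (e.g., PHENOMENON).
nlp = spacy.load("en_core_web_sm")

text = ("The Kuroshio Current transported unusually warm water "
        "toward the East China Sea between March and May 2021.")

doc = nlp(text)
for ent in doc.ents:
    # ent.label_ is the predicted entity class (e.g., LOC, GPE, DATE).
    print(ent.text, "->", ent.label_)
```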

4.2.2 Relation extraction

The text corpus obtained after entity extraction yields a series of discrete named entities (nodes). To collect semantic information, the association relations (edges) between entities must be extracted from the related text so that multiple entities or concepts can be linked into a web-like knowledge structure. According to their dependence on annotated data, entity relation extraction methods can be classified into supervised learning methods, semi-supervised learning methods, unsupervised learning methods, and open extraction methods.

Supervised learning is the fundamental approach to entity relation extraction. The main idea is to train machine learning models on labeled training data and then classify the relations in test data. Supervised methods include rule-based, feature-based, and kernel-based methods. The rule-based method summarizes rules, manually or via machine learning, according to the domains of the text to be processed, and then extracts entity relations by template matching; rule-based spatiotemporal relation extraction derives spatiotemporal relations from text corpora using syntactic rules [51, 52]. The feature-based method is simple and effective: useful information (including lexical and syntactic information) is extracted from the context of relation instances as features, feature vectors are constructed, and relation extraction models are trained over these feature vectors, as sketched below. Kernel-based relation extraction includes word-sequence kernels, dependency-tree kernels, shortest-path dependency-tree kernels, convolutional tree kernels, and combined kernels. Kernel-based methods are widely used for spatiotemporal relation extraction because of their effectiveness in analyzing heterogeneous data and handling massive numbers of documents [53].
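The following is a minimal sketch of the feature-based idea using scikit-learn: the context around an entity pair is turned into a TF-IDF feature vector and a classifier predicts the spatiotemporal relation. The toy contexts, labels, and the <E1>/<E2> placeholders are illustrative; real systems add lexical and syntactic features (POS tags, dependency paths) rather than raw words alone.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled corpus: the text around an entity pair, with the
# spatiotemporal relation it expresses (labels are invented).
contexts = [
    "<E1> is located 40 km east of <E2>",
    "<E1> lies within the boundary of <E2>",
    "<E1> occurred three days before <E2>",
    "<E1> overlapped in time with <E2>",
]
labels = ["direction", "topology-inside", "temporal-before", "temporal-overlap"]

# Context words/bigrams become the feature vector for the classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(contexts, labels)

# With such tiny training data the prediction is only indicative.
print(model.predict(["<E1> formed two days before <E2>"]))
```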

Semi-supervised relation extraction summarizes entity-relation sequence patterns from the contexts that contain the relations, and then uses those patterns to discover more relation seed instances, forming a new set of relations. Semi-supervised methods assist researchers in labeling specialized spatiotemporal data without expert knowledge [54].

Unsupervised relation extraction does not rely on an annotated entity-relation corpus. It consists of two steps: clustering relation instances and selecting relation types. Lu et al. [55] proposed an unsupervised learning method based on variational autoencoders to extract information from spatiotemporal data.

Open relation extraction avoids building corpora manually for specific relation types, and can discover relation types and extract relations automatically. By mapping high-quality entity-relation instances to large-scale text, training data can be obtained from external domain-independent knowledge bases (such as DBpedia, YAGO, OpenCyc, and Freebase) through text alignment. Open relation extraction addresses the intrinsic difficulty of training an individual extractor for every single relation [56].

4.2.3 Attribute extraction

Attribute extraction extracts the attribute information of a specific entity from different information sources. Data mining methods can be used to mine the relations between entity attributes and attribute values directly from text.

4.3 Knowledge fusion

After knowledge extraction, the spatiotemporal knowledge obtained from different data sources has certain complementarities and differences, such as non-uniform classification systems, ambiguous geospatial entities, feature descriptions at different levels of detail, conflicting entity relations, and other redundancy and inconsistency issues. Knowledge fusion is an effective way to resolve the heterogeneity of a knowledge graph by associating the different identified entities in different data with the same underlying entity. Techniques of knowledge fusion include entity disambiguation and entity linking.

Knowledge fusion is an effective method for improving the quality of knowledge, disambiguating it, and obtaining its true value, especially for heterogeneous data [57]. Spatiotemporal knowledge fusion includes additional steps such as time-series cleaning, spatiotemporal cleaning of stale data [58], and stream data cleaning [57].
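As a minimal sketch of entity linking with a spatial consistency check, the following assumes a toy in-memory entity table; the entity names, coordinates, and thresholds are invented for illustration, and production systems would use richer context models and proper geodesic distances.

```python
from difflib import SequenceMatcher

# Candidate entities already in the knowledge graph, keyed by canonical name.
# Names and coordinates below are illustrative.
kg_entities = {
    "Kuroshio Current": {"lat": 30.0, "lon": 135.0},
    "East China Sea":   {"lat": 29.0, "lon": 125.0},
}

def link_entity(mention, mention_lat, mention_lon,
                name_threshold=0.75, max_dist_deg=5.0):
    """Link a mention to a KG entity using name similarity plus a
    crude lat/lon spatial consistency check."""
    best, best_score = None, 0.0
    for name, attrs in kg_entities.items():
        score = SequenceMatcher(None, mention.lower(), name.lower()).ratio()
        close = (abs(attrs["lat"] - mention_lat) <= max_dist_deg and
                 abs(attrs["lon"] - mention_lon) <= max_dist_deg)
        if score >= name_threshold and close and score > best_score:
            best, best_score = name, score
    return best  # None means a new entity should be created

print(link_entity("kuroshio current", 30.5, 134.2))  # -> "Kuroshio Current"
```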

4.4 Knowledge computation

After information extraction and knowledge fusion, a series of basic fact representations has been acquired from the raw, chaotic data. The next step is to obtain a structured, networked knowledge system and an update mechanism through knowledge computation. The main steps of knowledge computation are ontology construction, discussed in Section 3, and knowledge reasoning, which is mainly used to complete the knowledge graph and verify its quality.

In addition to ontology-based reasoning, reasoning based on general rules and common sense is widely used in knowledge graphs. Spatiotemporal knowledge graphs are capable of both temporal and spatial reasoning. Temporal reasoning can supplement a target query with temporal constraints so that the result meets the temporal demand. It can be regarded as a constraint satisfaction problem in which the variables represent temporal objects and the constraints between variables correspond to the temporal relations between objects (see the sketch below). Similarly, spatial reasoning yields an understanding of multiple spatial objects and the spatial properties embedded in them, and covers reasoning over multiple spatial relations such as topology, direction, and distance. Logic-based geo-knowledge languages have also been introduced to support declarative spatiotemporal reasoning [59]. Mantle et al. [60] implemented ParQR, a parallel, distributed Qualitative Spatial-Temporal Reasoning (QSTR) system built on Apache Spark, to reason over massive spatiotemporal data. An integrated spatiotemporal reasoning approach is presented in [61], which infers spatiotemporal representations over an underlying ontology.
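A minimal sketch of temporal reasoning as constraint satisfaction: intervals are the variable domains, Allen-style relations (before, overlaps) are the constraints, and a brute-force search finds consistent assignments. The phenomena and interval values are invented for illustration.

```python
from itertools import product

# Allen-style relations over half-open intervals (start, end).
def before(a, b):   return a[1] < b[0]
def overlaps(a, b): return a[0] < b[0] < a[1] < b[1]

# Candidate intervals, e.g., possible lifetimes of ocean phenomena.
domains = {
    "eddy":      [(1, 5), (3, 8)],
    "upwelling": [(2, 6), (6, 9)],
    "bloom":     [(7, 10), (4, 6)],
}
# Constraints: the eddy overlaps the upwelling; the upwelling is before the bloom.
constraints = [("eddy", "upwelling", overlaps), ("upwelling", "bloom", before)]

names = list(domains)
for values in product(*(domains[n] for n in names)):
    assign = dict(zip(names, values))
    if all(rel(assign[x], assign[y]) for x, y, rel in constraints):
        print("consistent assignment:", assign)
```

Running this prints the single consistent assignment (eddy (1, 5), upwelling (2, 6), bloom (7, 10)); real QSTR systems such as ParQR replace the brute-force search with composition tables and constraint propagation.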

4.5 Ocean knowledge graph application

The construction of an ocean spatiotemporal knowledge graph presents ocean-related information in a structured way, which helps us better understand and predict variations in the ocean environment. Based on the structured ocean knowledge, more supportive and executable decisions can be made.

5 Management of spatiotemporal ocean data

Storing and analyzing ocean and marine environmental data is an important way of understanding our planet and preparing in advance for potentially adverse ocean conditions. Moreover, the marine spatiotemporal data collected from various sources (such as meteorological satellites, land-based weather stations, meteorological balloons, buoys, ships, underwater sensors, etc.) has reached the petabyte level, and traditional centralized data processing can no longer meet the needs of ocean spatiotemporal data management. How to store and utilize this ocean spatiotemporal big data is an urgent problem. In terms of the number of nodes, ocean spatiotemporal data management can be divided into two categories: the single-node storage and processing model and the distributed multi-node storage and processing model. The two types are introduced below.

5.1 Single-node

The traditional relational database management system (RDBMS) is the typical single-node processing model. Many researchers have extended traditional RDBMSs into spatiotemporal RDBMSs that support spatiotemporal data storage, which have been widely used in industry, such as PostGIS for PostgreSQL [62], Oracle Spatial [63], IBM DB2 Spatial Extender [64], SpatiaLite for SQLite [65], MySQL Spatial [66], and Microsoft SQL Server [67]. In addition, GPU hardware can help accelerate graph computation [68, 69]. These spatiotemporal RDBMSs are stable, mature, and efficient, and include efficient SQL query engines. Among them, only PostgreSQL's PostGIS and Oracle Spatial support the storage and processing of spatial raster data. PostgreSQL's PostGIS, Oracle Spatial, and SQL Server, among others, follow OGC standards [70] and support the full set of spatial relational and analytical functions defined in the ISO SQL/MM (Part 3) [71] standard. Therefore, queries such as spatial joins and spatial-extent queries can be performed in these databases.
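As an example of such a query, the sketch below issues a spatial range query with standard PostGIS functions (ST_Contains, ST_MakeEnvelope) from Python via psycopg2; the connection parameters and the observations table with its columns are illustrative assumptions.

```python
import psycopg2

# Placeholder connection parameters for a local PostGIS-enabled database.
conn = psycopg2.connect(dbname="ocean", user="postgres",
                        password="secret", host="localhost")

# Spatial range query: buoy observations inside a lon/lat bounding box and a
# time window. Assumes a table observations(obs_time timestamptz,
# sst double precision, geom geometry(Point, 4326)) with a GiST index on geom.
sql = """
    SELECT obs_time, sst
    FROM observations
    WHERE ST_Contains(ST_MakeEnvelope(%s, %s, %s, %s, 4326), geom)
      AND obs_time BETWEEN %s AND %s;
"""
with conn, conn.cursor() as cur:
    cur.execute(sql, (120.0, 20.0, 130.0, 30.0,
                      "2021-06-01", "2021-06-30"))
    for obs_time, sst in cur.fetchall():
        print(obs_time, sst)
```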

However, these spatiotemporal database systems, being extensions of traditional RDBMSs, inherit their lack of distributed data storage and processing capabilities. Such single-node services are limited by I/O bottlenecks, lack parallel computing capability, and are difficult to scale horizontally. As the amount of marine spatiotemporal data increases, their latency grows and their performance declines, making it difficult to process PB-level marine spatiotemporal data. Moreover, marine spatiotemporal data has complex sources, diverse structures, and varying quality, which makes it hard to model in a spatiotemporal RDBMS. Although traditional RDBMSs can scale horizontally through data sharding, the tabular storage format still makes it difficult to support distributed storage and processing of ocean data.

5.2 Multi-node

Multi-node data processing uses distributed computing technology to process data; distributed computing is a concept relative to centralized computing. A distributed network consists of several computers that can communicate with each other, each with its own processor and storage devices. Huge computing tasks that were originally concentrated on a single node are distributed across the computers of the network for parallel processing in a load-balanced manner. As shown in Figure 2, each cluster of a distributed storage system generally has a master control node, which schedules the data load balance across nodes. Worker nodes report their load to the master node through heartbeats; the master node computes each worker's workload and the data to be migrated, generates migration tasks, and places them in a migration queue for execution. A simplified sketch of this scheduling loop follows Figure 2.

Figure 2: A cluster of distributed storage systems
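A toy sketch of the master node's scheduling loop described above, under the simplifying assumption that load is a single number reported per heartbeat; node names and the tolerance are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Heartbeat:
    node: str
    load_gb: float  # storage load reported by a worker's heartbeat

def plan_migrations(heartbeats, tolerance_gb=10.0):
    """Return (src, dst, amount) migration tasks to enqueue, moving data
    from the most-loaded worker to the least-loaded until loads are within
    the tolerance."""
    tasks = []
    nodes = sorted(heartbeats, key=lambda h: h.load_gb)
    while nodes[-1].load_gb - nodes[0].load_gb > tolerance_gb:
        src, dst = nodes[-1], nodes[0]
        amount = (src.load_gb - dst.load_gb) / 2
        tasks.append((src.node, dst.node, amount))
        src.load_gb -= amount
        dst.load_gb += amount
        nodes.sort(key=lambda h: h.load_gb)
    return tasks

hb = [Heartbeat("w1", 120.0), Heartbeat("w2", 40.0), Heartbeat("w3", 80.0)]
print(plan_migrations(hb))  # -> [('w1', 'w2', 40.0)]
```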

To ensure the high reliability and availability of a distributed storage system, the data on each node must be replicated into multiple copies and backed up, as shown in Figure 3. Generally there is only one primary replica, which provides read/write services, and there can be multiple backup replicas, which provide read-only services. In a distributed data processing system, data can be synchronized to multiple storage nodes through a replication protocol, which ensures consistency among the copies.

Figure 3: Distributed storage system data replication

Multi-node management methods for ocean spatiotemporal data include spatiotemporal databases based in part on traditional RDBMSs, as well as newer data processing approaches. The newer approach was proposed by Carlo Strozzi [72] in 1998, who called it NoSQL. NoSQL advocates non-relational data storage, which is well suited to the semi-structured and unstructured formats and large volume of ocean spatiotemporal data. Although NoSQL cannot completely replace traditional RDBMSs, it has a very wide range of applications in ocean spatiotemporal data storage and processing. Examples include key-value databases such as Redis [73] and Oracle NoSQL [74]; column-family (wide-column) databases such as Cassandra [75] and HBase; document databases such as MongoDB [76] and Couchbase [77]; and graph databases such as Nebula [78] and Neo4j [79].

For traditional RDBMSs, researchers continue to provide new extensions to meet the processing needs of marine spatiotemporal data. Here we mainly discuss the PostGIS extension of PostgreSQL, which supports OGC-compliant spatiotemporal SQL queries. Horizontal sharding of ocean spatiotemporal data enables horizontal scaling when the data exceeds the capacity of a single node, and read scalability can also be achieved by leveraging Pgpool-II [80] and streaming replication. Distributing ocean spatiotemporal data among multiple nodes through sharding effectively reduces the I/O bottleneck [81], and balancing I/O load across machines in a microservice setting also helps meet Quality-of-Service (QoS) goals [82].

In addition, there are several ways to achieve horizontal scaling and parallel query acceleration through data sharding; for example, PostGIS can integrate with Citus or Postgres-XL [83], or use PL/Proxy [84]. After version 9.6, PostgreSQL added a built-in sharding mechanism, the foreign data wrapper (FDW), which enables PostgreSQL to access data from external sources. Ocean spatiotemporal data can therefore be stored across the nodes of a cluster in a distributed manner, and each data partition can be accessed directly from disk or main memory through the FDW. Furthermore, in PostgreSQL 12, PostGIS 3.0 supports parallel sequential scan, parallel join, and parallel aggregation for parallel spatial query processing.
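A hedged sketch of the FDW mechanism using the postgres_fdw extension: a remote shard is registered as a foreign server and one of its tables is exposed locally as a foreign table. The server name, host, credentials, and table layout are illustrative assumptions.

```python
import psycopg2

conn = psycopg2.connect(dbname="ocean", user="postgres", host="localhost")
conn.autocommit = True

# Register a remote shard and expose its table as a local foreign table.
ddl = """
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER IF NOT EXISTS shard1
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard1.internal', dbname 'ocean');

CREATE USER MAPPING IF NOT EXISTS FOR postgres
    SERVER shard1 OPTIONS (user 'postgres', password 'secret');

-- The foreign table looks local, but its rows live on shard1.
CREATE FOREIGN TABLE IF NOT EXISTS observations_shard1 (
    obs_time timestamptz,
    sst double precision
) SERVER shard1 OPTIONS (schema_name 'public', table_name 'observations');
"""
with conn.cursor() as cur:
    cur.execute(ddl)
    cur.execute("SELECT count(*) FROM observations_shard1;")
    print(cur.fetchone())
```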

Traditional RDBMSs store data in tables, which makes it difficult to support today's marine spatiotemporal data in multiple formats from many different sources. However, the JSON and JSONB data types were added to PostgreSQL in 2012 and 2014, respectively, and SQL/JSON compliant with the SQL-2016 standard was introduced in PostgreSQL 12. Ocean spatiotemporal data can therefore be queried and indexed using JSON and JSONB in PostgreSQL [85].
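A short sketch of storing and querying heterogeneous observation records as JSONB, with a GIN index that the containment operator @> can use; the table, field names, and sample document are illustrative.

```python
import psycopg2

conn = psycopg2.connect(dbname="ocean", user="postgres", host="localhost")

with conn, conn.cursor() as cur:
    # Store heterogeneous observation records as JSONB; the GIN index
    # accelerates containment (@>) queries.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS raw_obs (id serial PRIMARY KEY, doc jsonb);
        CREATE INDEX IF NOT EXISTS raw_obs_doc_idx ON raw_obs USING GIN (doc);
    """)
    cur.execute("INSERT INTO raw_obs (doc) VALUES (%s)",
                ('{"platform": "buoy-42", "sst": 28.4, "depth_m": 0}',))
    # ->> extracts a field as text; @> tests JSON containment.
    cur.execute("SELECT doc->>'sst' FROM raw_obs WHERE doc @> %s",
                ('{"platform": "buoy-42"}',))
    print(cur.fetchall())
```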

Finally, spatiotemporal databases based on traditional RDBMSs can adopt distributed file systems such as HDFS for distributed processing, or use in-memory computing frameworks such as Spark [86] and Flink [87] to accelerate computation.

NoSQL database systems have had distributed support designed in from the beginning, giving them advantages such as fault tolerance, scalability, high availability, and high flexibility. NoSQL systems that support spatiotemporal data storage and processing currently include Redis, Oracle NoSQL, MongoDB, Couchbase, Neo4j, Nebula, TigerGraph, Cassandra, etc. Among them, Redis is a key-value storage system whose geo operations are based on a Geo Set data structure built on the Sorted Set, with a geohash spatial index that speeds up query processing. Oracle NoSQL supports a SQL-like query language covering all common geometry objects, geohash indexes, and a set of operators for working with spatiotemporal data.
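A minimal sketch of Redis geo commands from Python. Raw commands are issued via execute_command to stay compatible across redis-py versions (GEOSEARCH requires Redis 6.2 or later); the key and member names are illustrative.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# GEOADD stores each member in a sorted set scored by its geohash.
r.execute_command("GEOADD", "buoys", 130.1, 25.3, "buoy-42")
r.execute_command("GEOADD", "buoys", 131.0, 26.0, "buoy-43")

# Buoys within 200 km of a query point, nearest first.
hits = r.execute_command(
    "GEOSEARCH", "buoys",
    "FROMLONLAT", 130.0, 25.0, "BYRADIUS", 200, "km", "ASC")
print(hits)  # e.g., [b'buoy-42', b'buoy-43']
```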

Couchbase and MongoDB are distributed, document-oriented, high-performance NoSQL database management systems with native support for spatiotemporal data. Both Couchbase's GeoCouch [88] extension and MongoDB support common GeoJSON objects such as point, linestring, polygon, and collections. The GeoCouch extension is built on R-trees, allows bounding-box (BBox) spatiotemporal queries, and works alongside Couchbase's SQL-like query language N1QL. MongoDB has no SQL-like query language but provides a set of spatiotemporal operators such as nearSphere, geoIntersects, and geoNear to perform spatial queries.
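A minimal sketch of MongoDB's geospatial support via pymongo: a 2dsphere index over a GeoJSON field and a $nearSphere query; the database, collection, and documents are illustrative.

```python
from pymongo import MongoClient, GEOSPHERE

client = MongoClient("mongodb://localhost:27017")
coll = client["ocean"]["stations"]  # illustrative database/collection names

# A 2dsphere index enables spherical geometry queries over GeoJSON fields.
coll.create_index([("loc", GEOSPHERE)])
coll.insert_one({"name": "buoy-42",
                 "loc": {"type": "Point", "coordinates": [130.1, 25.3]}})

# $nearSphere returns documents ordered by spherical distance (meters).
query = {"loc": {"$nearSphere": {
    "$geometry": {"type": "Point", "coordinates": [130.0, 25.0]},
    "$maxDistance": 200_000}}}
for doc in coll.find(query):
    print(doc["name"])
```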

Nebula Graph [78] is an open-source, distributed, easily scalable native graph database that can host ultra-large datasets with hundreds of billions of vertices and trillions of edges while providing millisecond-level queries. It adopts a shared-nothing architecture and supports scaling up and down without stopping the database service. Version 2.6 introduced full support for geospatial data, including the storage, computation, and indexing of ocean spatiotemporal data. Nebula Graph currently supports marine spatiotemporal data of the Geography type, which models geographic locations represented by latitude-longitude pairs in the earth's coordinate system. It also supports the efficient SQL-like query language nGQL, including spatiotemporal function queries (contain, cover, intersect, and so on) over common geometric objects (point, linestring, polygon, and collections).

6 Performance evaluation on spatiotemporal ocean data

Given the massive volume of spatiotemporal data, its processing becomes a key problem. Performance evaluation of spatiotemporal data mainly considers interactive performance, reflected in response time, and system scalability.

In [89], an evaluation of the distributed spatial databases GeoMesa and ElasticSearch was conducted on the number of records returned and on response time as functions of the number of records, the area of the query polygons, and the size of the temporal window, respectively; the results show that GeoMesa queries outperform ElasticSearch queries. Yu et al. [90] implemented the spatiotemporal computing framework GeoSpark and showed that it outperforms SpatialHadoop in spatial co-location in terms of response time. Researchers have also designed benchmarks specifically for evaluating spatiotemporal databases, such as SEQUOIA [91] and Paradise Geo-Spatial [92]. Makris et al. [93] evaluated the spatiotemporal performance of the NoSQL database MongoDB against the open-source RDBMS PostgreSQL, with results revealing better performance for PostgreSQL on all queries. In [94], researchers conducted a performance evaluation of five Spark-based spatial analytics systems (Magellan, SpatialSpark, Simba, LocationSpark, GeoSpark) with different spatial queries and data types; among these, GeoSpark proved to be the most complete spatial analytics system, supporting all queries and data types.

7 Summary

The dramatic growth of ocean data has created challenges for ocean data processing. The inherent heterogeneity, spatiotemporal nature, and constant change of ocean data also make it difficult to represent. As traditional methods no longer satisfy the demands of ocean data processing, graph-based data processing structures have been extensively adopted. In this survey, we systematically review processing methods for ocean spatiotemporal data. We discuss data representation methods and the design and construction of ocean knowledge graphs. The main methods for spatiotemporal data representation and knowledge graph construction are summarized in Table 1. In addition, we compare different management techniques for ocean spatiotemporal knowledge, as well as performance evaluations of ocean spatiotemporal data.

Table 1 Summary of data processing methods on ocean spatiotemporal data