1 Introduction

Richard Wossidlo was a German folklorist in Mecklenburg (1859–1939). During his ethnological field research, he wrote all facts on small paper slips. Fig. 1 shows him in front of the box shelf with his nearly two million field research notes.

Fig. 1
figure 1

Richard Wossidlo in front of his box shelf with field research notes. (© Karl Eschenburg)

These small notes contain a lot of well-defined references to scholars, narrators, contributors, places, villages, landscape, points and periods in time, literature, research papers, etc.

Richard Wossidlo not only collected all information on every day life in Mecklenburg, he also recorded all variations of narrative materials, and from whom he learned things, where and when. This was done not only by himself but also for all narrators and contributors he was corresponding with. He also put remarks on related research on such paper slips. As a result, he ended up with a so-called convolute of such notes on small paper slips enclosed in an envelope and then organized by different topics in the boxes. Fig. 2 shows such a box with some convolutes.

Fig. 2
figure 2

Field research notes in envelopes and boxes from Wossidlo’s archive

A main outcome of Richard Wossidlo’s research was the edition of the “Mecklenburger Wörterbuch”, an ethnological and linguistic encyclopedia for the Mecklenburg landscape.

Fig. 3
figure 3

Excerpt from lexeme “Austharke” in the Mecklenburg Encyclopedia

To support the folklorists and ethnologists in their research, all the field research notes and related documents collected by Richard Wossidlo have been digitized, exposed on 35mm film for long-term preservation, and a digital archive system for the community was built.

A main effort of the Wossidlo Digital Archive (WossiDiA) project was the digitization process [53]. The main steps are shown in Fig. 4. The WossiDiA system contains not only all these different, digitized documents. It also allows for managing, linking, navigating, and querying these different, highly interconnected documents.

For the folklorist or ethnologist it is of immense interest, how the encyclopedia entries, like the one shown in Fig. 3, are derived from the set of these small notes.Footnote 1 A real challenge is to collect, aggregate and summarize all the information scattered over a network of notes. Therefore, right from the beginning in the development of the WossiDiA system, a graph based approach was used [52] in addition to a semi-structured XML document model [2].

Collecting all the related information from the paper slip network is done by applying graph based operations like content, structure and type based graph filtering combined with hyperedge contraction, aggregation and summarizing. The resulting graph can then be mapped to a logical document similar to published lexeme. This document is a set of hyperedges with a certain content model (cf. Sect. 3.2).

The WossiDiA system was the main motivation and source of inspiration which led to the development of Hydra.PowerGraph. Hydra.PowerGraph consists of a hypergraph database system PowerGraph and a digital archive framework Hydra.

  • PowerGraph is build as an extension of the object-relational database system PostgreSQL. It uses PostgreSQL’s spatio-temporal extensions and object-relationl features like user defined types and inheritance.

  • Hydra (a HYpergraph of Documents in a Relational Archive) contains building blocks for digitization workflows, supports spatio-temporal contexts, GIS-based presentation of query results and has a set of frequently used object types for persons, places, events and others requested by digital archive developers.

Fig. 4
figure 4

WossiDiA digitization workflow

The remainder of this article is organized as follows: First of all, we are motivating why a hypergraph based model is preferred over a more structured, constrained database or a simple graph database model in building a large archive system for data from ethnological field research. The introduction will briefly indicate WossiDiA as our main PowerGraph usage scenario, for which the hypergraph system will be utilized prototypically.

Then, we outline the fundamentals of such a directed hypergraph model in a semi-formal way in Sect. 3. It is followed by a basic description of the architecture and the query processing techniques implemented on top of an object-relational database system in Sect. 4. Sect. 5 illustrates how directed hypergraphs were used as a modelling paradigm in the WossiDiA system and presents some statistics of the graph data.

Related work is discussed in Sect. 6. The article is summed up by a presentation of open questions and further research and forthcoming projects in Sect. 7.

2 Motivation – Hypergraphs for the Study of Complex Cultural Phenomena

Before we will motivate directed, typed hypergraphs as a modelling paradigm for digital archives, we give a short introduction to CIDOC/CRM.

2.1 CIDOC/CRM

CIDOC/CRM [18] is a conceptual model primarily designed for the exchange of historical information in the course of archives, museums, and libraries. The conceptual model is meant for the abstract description of objects (called domain entities) and their relationships (properties) to other entities (called range) and maybe vice versa. The model supports some basic types like strings, date, timestamp, and coordinates of places, which are mandatory for the spatio-temporal aspects of historical information.

Fig. 5
figure 5

The “Austharke” and exchange of gifts scenario expressed with CIDOC/CRM

The small example in Fig. 5 depicts the CIDOC/CRM representation for the exchange of gifts, an excerpt from the Wossidlo archive (cf. Fig. 6 for the hypergraph representation in the WossiDiA system). The rectangles represent the entity types with their instance’s values below. Properties are shown as labeled arcs directed from the domain entity to their range.

CIDOC/CRM concepts also include some kind of inheritance for building specialization hierarchies on entity and property types and is self-describing and extensible by entities having type properties ([P2] has type). This allows for substituting entities of a sub-type where its super-type is allowed in a property, e. g. substituting [E22] Man-Made Object if [E19] Physical Object is requested as the domain entity or range of a property. CIDOC/CRM comes along with hundreds of these pre-defined entities and properties. Nevertheless, since the main design goal was data exchange, not querying and retrieval, it only defines structures and does not contain any query language facilities.

Obviously, the hypergraph model is much more concise and compact than the conceptual CIDOC/CRM. Nevertheless, CIDOC/CRM is just a conceptual model never ment for storing or querying data. One always has to map it onto a logical model like the relational one or a graph data model.

Now, we will argue why a hypergraph model is more concise and compact not only in modelling digital archive data but also usable as an implementation or logical model.

2.2 The hypergraph model

Hypergraphs are an unusually apt model for complex cultural phenomena such as folklore [44, 55]. In this graph model, which departs significantly from standard single mode graphs, multiple classes of nodes are connected across a series of potential relationships. In “normal” graphs, a single edge can connect any two nodes. In contrast, hypergraphs generalize this idea by introducing hyperedges which can connect any arbitrary set of nodes. Consequently, a hypergraph can be seen as a set of hyperedges with each hyperedge consisting of a set of nodes [11]. Thus, hypergraphs can express everything that a normal graph can but also allowing the connection of all directly related nodes with a single hyperedge. In this manner, many-to-many relationships between several objects/nodes, usually represented by strongly connected components in a normal graph, are represented as a single hyperedge in a hypergraph. Because of this core characteristics, hypergraphs are well suited to model complex heterogeneous data sets, with many types of objects and their multiple relationships. Hypergraphs are not only a commonly accepted formalism in database theory [9, 19] but are also used as a formal basis in several semantic data models [41].

2.3 The WossiDiA hypergraph model

In the Wossidlo Digital Archive WossiDiA (cf. Fig. 1) hypergraph model, nodes can represent stories, storytellers, places, collectors, named entities, keywords, story actants. The relationships across and between these entities are captured as hyperedges. A hypergraph representation for the aforementioned example (exchange of gifts) is given in Fig. 6.

Fig. 6
figure 6

Excerpt from the WossiDiA system: topology and activities connected with topic “Gabentausch” (German for gift exchange) modelled with hyperedges

While these structures have certain characteristics that make them somewhat harder to work with than in single mode graphs, advances in tensor decomposition and network theory have made hypergraph models a powerful representation of complex data that does not unnecessarily reduce or compress the high dimensionality of these data. A central observation from network science for the application of hypergraphs to heterogeneous cultural data is that local phenomena are repeated within a huge set of similar subgraph with slightly different peculiarities.

Another powerful feature of the hypergraph model for folklore is that standard relations can be modeled efficiently, while avoiding the limiting factors of standard relational databasesFootnote 2. For any arbitrary query, one wants to be able to construct a summary document representing the information needs of that query, such as: What is the network of contributors in the Wossidlo archive and how does that network depend on context (location, time, topics, social background, etc) over time? Since no single document exists to answer this question, the construction of such a summary can grow rapidly in complexity. In graph theory, this summary operator is related to the concept of graph minors [13] which can be constructed from a graph by applying sequences of edge removals, edge contraction and deleting singular nodes. Importantly, the hypergraph model supports both content and structure based queries through typed nodes and hyperedges and provides adequate operators for querying both.

2.4 Directed and typed hypergraphs

The WossiDiA hypergraph model (cf. Sect. 3) consists of directed and typed hypergraphs. Direction adds more semantics to the hyperedges, where a directed hyperedge can be seen as a mapping of a node set to another node set, which is similar to functional dependencies, like one-to-many or many-to-many relationships in more common relational database models.

A major addition to the WossiDiA hypergraph model [43] is the typing of nodes and hyperedges, which groups similar objects (nodes within the hypergraph) with similar properties, e. g., persons, places, datetime, tags, etc., and stores them more efficiently in specialized node containers. Query performance is improved by search paths and optimized type-specific operators. For example, top-k-queries return the k best spatial objects stored as nodes of type place which are near a certain place. The WossiDiA hypergraph system also uses typed hyperedges resembling the content models of semi-structured data in XML. Regarding query evaluation, there are two modes, typed and untyped querying. The first mode restricts graph processing to certain node and hyperedge types, e. g., if one is only interested in information about the places a story was recorded at.

2.5 Querying hypergraphs

Since each hypergraph can be represented by a bipartite graph (cf. Sect. 3.1), operators and algorithms for normal graphs can be exploited for hypergraph processing. In addition, there are numerous specialized hypergraph algorithms, such as the k‑shortest hyperpath algorithm [25] which take advantage of the hyperedge concept for fast search. Additional algorithms include shortest paths computation, node connectivity testing, cycle detection using the specialized Graham reduction, sub-graph matching, node and hyperedge contraction, and some forms of graph products. These operators are used as building blocks for more descriptive graph query languages like GraphQL [33] or SPARQL [49].

2.6 Requirements leading to PowerGraph concepts

In the end, requirements from the project setting led to the development of Hydra.PowerGraph. The following list shows a mapping of those requirements to PowerGraph concepts:

  • The system should support various appropriate metadata representations, e. g.:

    • structural metadata describing the archive topologies, e. g. slip, envelope, box, shelf (will be mapped to different node and edge types in PowerGraph),

    • administrative and technical metadata on digital artefacts (node types),

    • descriptive metadata for persons, places, events, activities (node types)

  • The ethnological thesaurus or system of concepts classifying the field research notes (edge types: part-of, is-related-to, …)

  • Setting everything into the right historical context by spatial data, timespans with granule (node and egde types)

  • Representing fuzzy data and uncertainty (special edge types)

  • Relating all contributors and observers from the field research network (linking, edge types)

  • Mashup with other systems like OPACs, document repositories, GOV and GND databases (node types)

Next, we will see to what extent a hypergraph based data model can cope with these requirements.

3 A Data Model Based on Directed Typed Hypergraphs

There are several reasons for using directed hypergraphs as a logical data model for the large amount of highly interconnectedFootnote 3 data in the WossiDiA digital archive.

Nodes are used for modeling entities or objects of the application domain, whereas hyperedges represent relationships among a set of nodes. The hyperedge arcs (or links) represent different roles an object (node) can play in a certain relationship. Thus, nodes represent the ground truth or the static information particles while hyperedges are more on the dynamic and evolving side of information representation.

In the PowerGraph system, the directed hypergraph model is typed: nodes and hyperedges are associated with a type. Nodes, arcs/roles and hyperedges can have attributes, which are typed, too.

Fig. 7
figure 7

Activities involving person, place, timespan, and role modelled as hyperedges

3.1 Introducing Directed Hypergraphs

3.1.1 Undirected Hypergraphs

Following Claude Berge [11] a hypergraph is a family \(({E}_{1},{\dots},{E}_{m})\) of non-empty subsets of nodes or vertices \(V\), i. e. \(G=(V,E)\) with \(E{\subseteq}P(V)\setminus{\emptyset}\) (cf. Fig. 8a)Footnote 4. The incidence graph is a bipartite graph with hyperedges mapped to node partitions and incidences mapped to edges (Fig. 8b). Undirected hypergraphs have applications in database schema design, social network (peer-groups) analysis, protein complex networks, chemical reaction and many other life science problems. Fig. 7 shows a simple typed but undirected hypergraph from the WossiDiA system.

Fig. 8
figure 8

Different representations of hypergraphs. a subsets of nodes, b bipartite edge incidence, c back/forward-arcs, d arc incidence

3.1.2 Directed Hypergraphs

Directed hypergraphs (Ausiello et al. [7,8,9]; Gallo et al. [22, 23]) are a generalization of directed graphs (digraphs) and they can model binary relations among subsets of a given set. The basic idea in the generalization of digraphs is the subdivision into source and destination, which is trivial in common digraphs since both consist of exactly one element each.

A directed hypergraph then can be defined as \(G=(V,A)\) with a set of vertices \(V\) and the set of hyperarcs (hyperedges) \(A\), where a hyperarc is a pair \((T,H)\), \(T\) and \(H\) being subsets of vertices \(V\). \(T\) is called the b‑arc (backward-arc or tail) and \(H\) the f‑arc (forward-arc or head) of the hyperarc. Similar to undirected hypergraphs, a directed hypergraph can be visualized using hyperarcs (Fig. 8c) or based on arc incidence (Fig. 8d).

Additionally, Gallo postulates \(T\) and \(H\) to be disjoint, i. e., \(T{\cap}H={\emptyset}\) but allows both to be empty sets. In contrast, Ausiello defines a hyperarc to be a pair \((T,h)\) with \(h{\in}V\), \(T\neq\emptyset\), and does not exclude \(h{\in}T\). The destination set is restricted to one vertex. Gallo’s definition [23] is more general since it allows for m:n-relationships whereas Ausiello [9] restricts to m:1-relationships. In this sense, Ausiello is a special case of Gallo.

3.1.3 Directed Hypergraphs in PowerGraph

The definition of a hyperedgeFootnote 5 follows Gallo but with an essential extension: we introduce the concept of roles in a hyperedge, to allow a vertex or node to participate in a hyperedge multiple times as long as the role differs. With the set of roles \(R\), we define the hypergraph \(G\) as a pair \((V,A)\) with a set of vertices or nodes \(V\) and a set of hyperedges \(A=(T,H)\) with \(T{\subseteq}V\times R\) and \(H{\subseteq}V\times R\), i. e., incidence is defined not only on nodes but (role, node) pairs. Furthermore, \(T\) and \(H\) are not disjoint. There can be nodes which are element of both, \(T\) and \(H\), i. e., if \((v,r){\in}T{\cap}H\) the incidence is bi-directional.

Table 1 compares our definition of a directed hypergraph with that of Gallo and Ausiello.

Table 1 Comparison of different directed hypergraph definitions

3.1.4 Directed and Typed Hypergraphs in PowerGraph

The type system of the directed hypergraph model is defined by sets of node types \(\Gamma^{V}\), edge types \(\Gamma^{A}\), roles (link types) \(R\) and a set of mappings \(\{\alpha^{V},\alpha^{A},\rho^{V},\rho^{A},\delta\}\) with: \(\alpha^{V}:V\mapsto\Gamma^{V}\), \(\alpha^{A}:A\mapsto\Gamma^{A}\), \(\rho^{V}:R\mapsto\Gamma^{V}\), \(\rho^{A}:R\mapsto\Gamma^{A}\), and \(\delta:R\mapsto\{-1,0,1\}\). \(\delta\) defines the direction for b‑arcs as \(-1\), f‑arcs as \(1\) and as bi-directional if \(0\).

Following integrity constraints must hold for the types of a hyperedge:

  • \(\forall(v,r)\in T{\cup}H:\rho^{V}(r)=\alpha^{V}(v)\)

  • \(\forall(v,r)\in T{\cup}H:\rho^{A}(r)=\alpha^{A}(A)\)

  • \(\forall(v,r)\in T:\delta(r)\in\{-1,0\}\)

  • \(\forall(v,r)\in H:\delta(r)\in\{0,1\}\)

Aside from this more formal definition, we will now describe how typed and directed hypergraphs in PowerGraph can be used for data modeling based on examples from the WossiDiA system.

3.2 Data Modeling Using Directed Hypergraphs

Existing graph database systems didn’t care much about node and edge types to allow for flexibility in modeling semi-structured data of various formats. In Hydra.PowerGraph we make use of typed information. By introducing node and hyperedge types, special data structures and associated operations can be exploited in the database system. As a result, the user gains efficiency and effectiveness during query evaluation and execution. He benefits from a rich set of type-specific operations in query formulation.

Since the WossiDiA graph database is built on top of an object-relational database system, PostgreSQL, operations on certain node types like point in time, time-span and temporal relationships (e. g., before, after, within, concurrently) can be used. Operations on spatial data (point, line segments, polygons derived from GPS-coordinates) and spatial relationship querying (e. g., contains, overlaps, touches, crosses, intersects) can be seamlessly integrated into the data model.

These advantages come at the cost of defining and implementing application specific types before using the system for storing and retrieving. From the user supplied data type definition, explicit data structures (in fact database tables) for each node type can be created separately. The content of the hyperedges created by the user is checked against the hyperedge type. On the one hand, the system can assist in creating hyperedge information by suggesting relevant (allowed) node types participating in a certain hyperedge. On the other hand, the system can ensure that all aspects (nodes) are inserted into the database, no data is omitted.

When querying the archive the user doesn’t have to know about node and hyperedge types, but he can make use of certain query operators provided by PowerGraph.

3.2.1 Basic Node Types for Storing Domain Specific Data

Nodes in the hypergraph model represent entities, events, material or immaterial objects in the application domain. The basic node type has just an identity. All other node types are derived from this basic node type and must be specified by the user. Nevertheless, the Hydra.PowerGraph system comes with a set of built-in node types commonly used in historic information systems namely the Person, Place, and Word node types. The definition of the person node type is exemplified below:

CREATE NODETYPE person (     firstname STRING,     lastname STRING,     birthdate TIMESPAN,     appellation STRING );

All node types inherit attributes of their super node types and add an individual set of attributes. Node types are identified by their unique name. The attributes are also typed and make use of the data types supplied by relational database systems like numbers, strings, date, timestamp, etc., but can benefit from extensions like PostGIS for spatial data, too. The PowerGraph system takes care of creating separate database tables for each node type and handles query evaluation using the extended node type-specific attribute set.

3.2.2 The Hyperedge Content Model (HECM)

The idea of not only having user defined node types but also introducing hyperedge types is to enforce restrictions on nodes participating in a hyperedge. In fact, by defining hyperedge types, the user effectively manages the meaningful relationships nodes (or objects, entities) can be part of.

With hyperedges, two aspects are controlled: (1) which nodes with certain types can participate, and how they participate (direction), (2) how the overall hyperedge type is constituted by the participating nodes and a postulated content model.

A hyperedge content model resembles the idea of XML content models or semi-structured data in general. The main difference is in dropping the document order property essential to semi-structured data. There is no ordering of nodes within a hyperedge on the type level. Nevertheless, an application can use link attributes to implement order if necessary.

The basic elements of the content model are the participating incoming and outgoing nodes and their node type and role name. The content model is built from the set of (role, node)-pairs by allowing arbitrary combinations of them using sets (’,’) or unions (’|’), quantifiers (’?’, ’*’, ’+’, exact cardinalities: min-occurs and max-occurs) and grouping using ’(’ and ’)’. Formally, the content model is defined by a context free grammar.

The following depicts an example content model for activities as used in WossiDiA. An activity encompasses date, location, actors, observers, and a description of the activity, it can be defined in GrafL DDL like:

CREATE HYPEREDGE TYPE activity ...        MODEL (  start:date IN, end:date IN,  location:place? IN, actor:person* IN,  action:descr OUT, observer:person? OUT );

3.2.3 Types in Transit – Type Checking Hyperedges

If we initially create a hyperedge with

CREATE HYPEREDGE activity (     start:date(“2016/10/31”),     action:descr(“Halloween”), );

it is an incomplete hyperedge since the mandatory end:date is still missing. After adding this end:date link it is already a complete hyperedge, i. e., the current type matches the hyperedge content model. Nevertheless, it is not a closed hyperedge since we can still add some actor, place, or an observer link.

To cope with incomplete types, we introduce a so called remainder type. Whereas the content model of a hyperedge is the same for all instances of the same hyperedge type, the remainder type represents the nodes which can still be added until the hyperedge is closed. The current type of a hyperedge is one of the following states:

  1. (1)

    incomplete if not all required links are present, or usually the hyperedge is an empty set of (link, node)-pairs,

  2. (2)

    complete if the current type matches the content-model but the remainder type is not empty, i. e., if one can still add some nodes, and

  3. (3)

    closed if nodes can not be added anymore.

As a consequence, there are some content models which will be permanently open regardless how many links we add to the hyperedge. Only content models not containing any link (or link group) with quantifier ’*’ and ’+’ can be closed. Closed hyperedge types are easier to be stored, indexed, type checked and retrieved. Open hyperedge types are more flexible and better suited for semi-structured data but need more effort in management and dynamic type checking.

Hyperedge type checking takes place every time we add or delete a link from a hyperedge. Before adding the first link to or after deleting the last link from the current hyperedge, the type state will be incomplete. If the current type treeFootnote 6 is equivalent to the content model tree of the related hyperedge type, type checking succeeds and state will be complete. To hyperedges in state complete links can still be added until their remainder tree gets empty and in turn the hyperedge type gets closed. To hyperedges in state closed no links can be added, only links can be deleted from this hyperedge.

The basic algorithm for adding links will delete the first occurrence of the link from the tree if there is no quantifier or ’?’ annotated. Links with ’*’ or ’+’ will remain in the tree. After deleting a link from the remainder tree a check for empty groups will eliminate the whole subtree. If the remainder tree becomes empty the type state will be set to closed. Deleting a link will add the last occurrence of that link from the content model to the remainder tree.

Since the content model is expressed in a regular language which is a subset of the original UNIX regular expressions [58], it can be checked in \(O(n)\) time for \(n\) links in the current hyperedge by compiling both the remainder type and the content model into a deterministic finite automaton [40]. The implementation of the dynamic type checking mechanism as implemented in WossiDiA is described in [60].

3.3 Graph Querying

Currently, PowerGraph supports a descriptive query language GrafL based on a set of graph algebra operators. These operators belong to three basic classes:

  • content and structure predicates

  • type-checking operations

  • graph operations

The query language encompasses both content and structure querying, which select nodes and hyperedges based on predicates over their attribute set. Additionally, the query language allows for hyperedge matching by predicates formed over role and/or node types. Simple graph operations allow for testing the connectivity of two nodes, querying the k‑neighborhood of certain nodes or implement hyperedge contraction as a union of all (role, node) pairs of two hyperedges.

Complex graph operations can be implemented using these basic operations, e. g., a synopsis operator will be evaluated as a combination of node and edge contractions and structure retrieval. A k‑shortest path operator uses a special hypergraph algorithm [24].

The query language GrafL is similar to XQuery in querying node, hyperedge content and structure. The graph algorithms are encapsulated into functions. Where XQuery operates on sequences of XML fragments, GrafL is evaluated on node and hyperedge sets and usually returning hyperedge sets. To demonstrate the querying features of GrafL, we present some examples from the WossiDiA application domain.

Queries which retrieve nodes or edges by their content or structure use variables. These are bound to the node or hyperedge sets or some subtypes of them. First, this simple query returns all hyperedges of type Narrative:

LET $e := Edge::Narrative RETURN $e

The following content-based query returns all nodes Person born in “Rostock”:

LET $p := Node::Person,        $o := Node::Place,        $e := Edge::Vital-stats WHERE $e.person = $p     AND $e.birthplace = $o     AND $o.name = “Rostock” RETURN $p

The following structure-based query returns the node type of those nodes having an appellation and coordinates attribute:

LET $p := Node WHERE $p.appellation AND $p.coordinates RETURN $p.type

Graph algorithms which are currently supported in PowerGraph include node-connectivity, shortest-paths, and k‑neighbourhood. This query returns the shortest path (as a list of hyperedges) connecting persons “Ahrens” and “Helms”:

LET $p1 := Node::Person,        $p2 := Node::Person WHERE $p1.last-name = “Ahrens”    AND $p2.last-name = “Helms” RETURN shortest-path($p1, $p2)

The k‑neighbourhood operator returns k = 10 nearest neighbours nodes of type Person of person “Richard Wossidlo”:

LET $p = Node::Person WHERE $p.fist-name = “Richard”    AND $p.last-name = “Wossidlo” RETURN k‑nearest     -neighbour($p, Node::Person, 10)

The last query constructs a new set of hyperedges from existing node sets. For each person nodes $p all places nodes $o are collected from hyperedges $e representing the place of living.

FOR $p IN Node::person LET $e := Edge::livingplace,        $o := node-set($e.place) WHERE $e.person = $p RETURN EDGETYPE allplaces     MODEL (who:person, where:place+)     VALUES(who:$p, where:$o)

For more details of the graph data definition, manipulation and querying features of GrafL can be found in [46].

4 Architecture and Implementation of the Hypergraph Database System

4.1 Object-Relational Storage Structures for Hypergraphs

Using a model, in our case a hypergraph model, always raises several questions around implementation issues, beginning from “how can hypergraph data be represented in the digital archive system”, through “how to store data persistently” to “how to query and evaluate those data efficiently”. Those problems are quite complex; answering these questions in detail is way beyond the limitations and goals of this article. Nevertheless at least one of these aspects will be presented here briefly: storage of hypergraph data. As mentioned before, we use the object-relational database system PostgreSQL as the backend for the graph database. The relational storage structure is shown in Fig. 9.

Fig. 9
figure 9

Relational storage structure of the directed hypergraph model

The backend database is divided into two parts, a catalog schema and a data schema. This helps to distinguish and organize type and model information on the one hand from the actual data on the other hand.

Besides the nodes in the hypergraph, which do not differ fundamentally from nodes in graphs, and therefore are easy to understand, the most important and structurally most relevant constituents of hypergraphs are their hyperedges and the connectors between hyperedges and participating nodes, which we call arcs (without loss of generality). These data are represented on the right side of the figure.

To accommodate further information about edges and arcs, the model allows for attributes to be attached to both of them (EdgeAttr, LinkAttr), on the far right in the figure. Since we have to deal with a single graph only, there is no need to store graph properties, such as ID and type at all. Relationships between attributes and edges/arcs, as well as the relationship between arcs and edge are guaranteed by referential integrity.

All hypergraph constituents, not only edges and arcs, even their attributes are typed, as shown in the catalog (left) part in the figure. Attribute types are always flat, however edge and node types are part of an inheritance hierarchy (therefore they have parent attributes). This makes the definition of hyperedge content models much easier. Hyperedge content models, described in more detail in Sect. 3.2, are important properties of hyperedge types. They define the way nodes participate in hyperedges they are associated with, hence defining the semantics of the hyperedge.

Inheritance of node types also implies inheritance on node storage, which means they all inherit from a base node which resides in the node table. As can be seen in the figure, node attributes are separated from node identities, which makes the system very flexible and independent from DDL operations and database system restrictions on inheritance, when it comes to modeling inherited nodes. However, this does not necessarily represent the implementation in the database. An alternative would hold tables on a per-node-type basis and hence make use of the inheritance capability of PostgreSQL tables and individual specific attribute type supportFootnote 7. A tradeoff between those two storage options would be the split-attributes-by-type approach, which separates attributes of individual types from node identities in their respective own specific data type tables, retaining the database system’s type-specific capabilities, while providing a very flexible storage for attributes of inherited nodes. The decision on how to handle inherited nodes’ storage has not yet been made and is still subject to research.

When running a PowerGraph instance, the whole catalog will be loaded into memory, which is inevitable for all hypergraph operations processed in PowerGraph. Catalog/model changes in the system are made persistent in the backend database immediately, to ensure ACID properties, using the respective backend database system capabilities.

For the sake of high efficiency, parts of the data schema are also cached into main memory. This does not include node, edge, or arc attributes, as they may include large amounts of data, especially nodes. In any case, it is advisable to cache node identities, edge identities, and incidence information, including arc directions, to provide for efficient graph operations. Initial strategies and ideas on how to do that, are described in the following section.

4.2 Linking Cache – Main Memory Graph Representation

The linking cache is a main memory structure taking care of loading graph data and writing back updated parts to the backend database. It exploits two main data structures: (1) a so called plex for graph, node, and hyperedge access and (2) an union-find index for fast node connectivity testing.

4.2.1 The plex implementation of a hypergraph

The plexFootnote 8 is a simple but efficient interface to a variety of graph or tree structured datasets. It consists of four basic sets:

  • containers, set of plexes where the current plex occurs

  • contents, set of plexes contained in the current one

  • sources, set of plexes the current plex originates from

  • destinations, set of plexes the current one is pointing to

Table 2 shows how graphs, hyperedges, and nodes can be represented in the same plex structure:

Table 2 How graphs, hyperedges, and nodes can be represented in the same plex structure

Operations on the plex structure have to be valid under two general, global integrity constraints

$$plex_{i}{\in}plex_{j}.containers\Leftrightarrow plex_{j}{\in}plex_{i}.contents$$
$$plex_{i}{\in}plex_{j}.sources\Leftrightarrow plex_{j}{\in}plex_{i}.destinations$$

and some validity rules specific for graphs, hyperedges, or nodes. These specific constraints are shown together with a sketch of the Java classes using generics for implementing the plex structure.

class plex<T1,T2,T3> {     set<T1> containers;     set<T2> contents;     set<T3> sources;     set<T3> destinations; };    class graph extends plex    <null,edge,null> {        //set<edge> contents;

\(this{\in}edge.containers\)

}; class edge extends plex    <graph,node,node> {        //set<graph> containers;

\(this{\in}graph.contents\)

       //set<node> sources;

\(this{\in}node.destinations\)

       //set<node> destinations;

\(this{\in}node.sources\)

};    class node extends plex    <edge,null,edge> {       //set<edge> containers;

\(this{\in}edge.contents\)

       //set<edge> sources;

\(this{\in}edge.destinations\)

       //set<edge> destinations;

\(this{\in}edge.sources\)

};

4.2.2 The Union-Find Data Structure for Node Connectivity

This data structure [10] is used in addition to the plex structure for graph operations which depend on fast node connectivity testing. Using path compression, it performs well not only on connectivity testing (\(O(lgN)\)), but also on the union operation connecting two nodes (\(O(lgN)\)) or deleting a connection (\(O(1)\)). For secondary storage, there are also efficient data structures for testing connectivity (cf. [59]).

5 The PowerGraph Application Context

The relationships between the PowerGraph system, the Hydra.PowerGraph extension, and the WossiDiA digital archive are shown in Fig. 10 and explained in the following subsections.

5.1 Hydra.PowerGraph

In this paper, we mainly discuss the graph database system PowerGraph which is part of the Hydra framework for digital archive systems. Hydra adds a rich set of pre-defined node and hyperedge types, like the one discussed above. It is accompanied by a flexible workflow engine and digitization and publishing modules [54].

5.2 WossiDiA on Top of Hydra.PowerGraph

WossiDiA is a project about digitization of ethnological documents, written and collected in the late 19th and the early 20th century by Richard Wossidlo, see Fig. 1. Fig. 2 shows a sample of Wossidlo’s hand-written documents.

Fig. 10
figure 10

WossiDiA as Hydra.PowerGraph application

More detailed information about the collection and the project can be found in [43] and [52]. In order to illustrate the type of documents, the complexity of the digitization workflow, and the need of a sophisticated way to describe digital archive topology data with metadata information, connected in a hypergraph type of way, we have included Fig. 4. It gives a general overview about the workflow and depicts how archival contents are acquired and amended in a way that generates the high degree of interconnectivity in the digital archive. The digitization process is outlined in [53].

Table 3 gives an overview of the current hypergraph data quantity structure of the Digital Wossidlo Archive’s PowerGraph instance. The upper part of the table shows the number of nodes and hyperedges in the archive topology, which is a hierarchical structure, consisting of the corpora of the Wossidlo Archive (e. g.: main box shelf with field research notes (ZAW), correspondences with contributors (BKW), alphabetical dictionary source documents (MWW), etc.). The numbers of nodes on the individual levels of the different corpora are listed explicitly to explain the quantity of hyperedges, which are used to hold together the corpora’s topologies. Level 0 represents the highest directory unit, directly below the respective corpus’s root node. Level 3 or 4 represents the actual digital image document nodes. The lower part of the table displays the overall node and hyperedge counts across all corpora. They are juxtaposed with the total numbers of all descriptive metadata nodes and all descriptive hyperedges. The latter are used to link metadata nodes with topology nodes.

Table 3 WossiDiA topology hypergraph quantity structure

A few examples for specific hyperedges types, which are either prominent in WossiDiA or make heavy use of hyperedges, are shown in Table 4. The left column indicates the edge types, which themselves represent the meaning of facts behind the association of participating nodes. The middle column gives a hint on how many of these hyperedges have been created and the right column shows the average fan-out of the edge-node relationship.

Table 4 WossiDiA hyperedge types (selection of)

For example, the hyperedge type Wossi region, which puts places into a regional context with a more prominent place nearby, has only a small number of edges but a huge fan-out of connected nodes. The same is true for fuzzy person and seminar networks. On the other hand, we have lots of edges describing document content (content) and information origin (narrated). Unlike the networks, they naturally don’t have a huge fan-out; it is still greater than two, proving the need of hypergraphs over graphs with edges only allowing for a pair of two nodes. Hyperedge type thesaurus represents the so-called “ethnological canon” which relates terms mostly hierarchically, but features a few very important cross references, being the average \(+0.012\) nodes on top of the regular pair.

6 Related Work

The CIDOC/CRM [18] is a conceptual model primarily designed for the exchange of historical information in the course of archives, museums, and libraries. It has to be mapped to a logical data model to be used for storage and retrieval. This has be done by Thalheim et al. for the relational model in [37]. More details on mapping the CIDOC/CRM to Hydra.PowerGraph’s directed and typed hypergraph model can be found in [38]. This approach was also used as a conceptual model in the Lagomar project which dealt with the inventarisation of the maritime cultural heritage of lagoon regions and employed CIDOC/CRM for data storage [42, 45]. In Lagomar, CIDOC/CRM was mapped to an RDF-triple-store (Jena FrameworkFootnote 9).

Claude Berge [11] established the hypergraph theory. Directed hypergraphs were first introduced by Ausiello et al. [7,8,9] and Gallo et al. [22, 23]) as a generalization of directed graphs.

Angles and Guttierez summarize and compare graph based data models [4]. But the majority of recently developed graph database techniques and approaches does not focus on hypergraphs and does not put too much effort into complex type concepts for graph edges or hyperedges. However, there are some approaches dealing with hypergraphs, such as [16] and [20]. Some use labeled edges such as [16, 17, 29, 34, 35, 51, 61], etc. Although labeled edges can be counted as a name based type system, PowerGraph uses a structure based type system.

A very small number of approaches deals with both hypergraphs and typed edges, e. g. [16]. But even in this case, edge typing means no more than simply using labeled edges. When it comes to widely spread graph database and graph processing systems, such as neo4j and Apache Giraph, the native support of hypergraphs is limited. Building hypergraphs in neo4j means using special node types acting as hyperedges (bipartite graphs), and sophisticated edge types need to be implemented explicitly. As a consequence, query evaluation and optimization do not benefit from node clustering and typed edges.

In contrast to the above-mentioned approaches and systems, PowerGraph provides a complex hypergraph type system, which is based on hyperedge content models. It features the notion of applying restrictions on nodes participating in hyperedge relationships, as described in Sect. 3.2. Using this, the user can define complex constraints in a way similar to content models known from semi-structured data and documents.

With respect to the hypergraph model and edge types, PowerGraph uses a query language resembling XQuery. This approach is different to current graph languages like SPARQL since it mixes graph processing techniques with querying semi-structured data and documents. Angles et al. [5] survey the foundations of graph languages like SPARQL, Cypher and Gremlin. Since RDF-triples can be represented by bipartite graphs as well as hypergraphs can, there should be a way to adopt a SPARQL-like language for broader use in the PowerGraph system. Some researchers [31] argue that hypergraphs may better represent RDF graphs. RDF graphs allow for several representations, for example bipartite graphs [31], labeled directed graphs [28, 32], and hypergraphs [30].

As for managing XML data with relational technologies [26], there are situations where managing graph data with (object-)relational databases can perform as good as a native graph database. Gubichev [27] at least gives some evidences for simple graph matching scenarios.

The main memory structure of PowerGraph, the linking cache, relies on efficient main memory graph representations. Succinct data structures are a basic building block, [21, 36, 47] show more complex discrete data structures such as trees and graphs that can be built using them. Some of the tasks for which they have used include Web graphs [15], XPath indexing [6], partial sums [50], and short read alignment [39].

7 Conclusion

Primarily developed for the WossiDiA digital archive, the directed hypergraph model is versatile enough to build a foundation for not only this special archive structure but more general applications. For the realm of digital libraries, archives, and museums, we have shown that our model is well suited as implementation logic for CIDOC/CRM, a conceptual model well accepted in those domains.

The directed hypergraph model is not only a set of concepts for defining structures but is accompanied with a set of operators forming a graph query language and graph algebra. Such a graph algebra is also a formal basis for implementing efficient algorithms of algebra operators and developing an optimization framework. Nevertheless, a more descriptive query language like GraphQL, SPARQL or Cypher should be adapted to hypergraphs and implemented on top of such an algebra. GrafL is a first try on a hypergraph query language and borrows much from XQuery. The semantics and the optimization is still an open topic in the project and needs further research. Other open questions include more complex graph operators for graph abstraction and summarizing as well as customized index structures for frequent graph pattern matching.

Graph matching will be also a focus of a forthcoming project. It will provide intelligent search and analysis across three of the world’s largest machine actionable folklore collections (Dutch Folktale Database from Meertens Instituut, Amsterdam, the Danish Folktale Database from UCLA, and the Mecklenburger Folklore Database WossiDiA from University of Rostock) presenting the opportunity for large scale data-driven research into traditional folk expressive culture. By facilitating search, discovery and analysis across all three collections, the Hydra.PowerGraph infrastructure will provide researchers an unprecedented opportunity to discover patterns both within and across the target corpora. The proposed research, focusing on storytellers, legends and the dispersion of (historical) beliefs in magic, witchcraft, hauntings and supernatural beings seeks to reveal what ordinary people believed, and how storytelling traditions and story repertoires differed in and across these three areas [1, 56, 57].