1 Introduction

A growing number of researchers in materials science have realized that data-driven techniques can accelerate the discovery and design of new materials. Consequently, many research groups and communities have developed data-driven workflows, including data repositories (for an overview see [14]) and task-specific analytical tools. Materials design is a technological process with many applications. The goal is often to achieve a set of desired materials properties for an application under certain constraints, e.g., avoiding or eliminating toxic or critical raw materials. The development of condensed matter theory and materials modeling has made it possible to run quantum mechanics-based simulations that generate reliable materials data computationally [17]. For instance, [1] shows a database-driven high-throughput materials design flow in which the database is used to find materials with desirable properties. A global effort, the Materials Genome Initiative (footnote 1), has been proposed to govern databases that contain both experimentally known and computationally predicted material properties. The basic idea of this effort is that searching materials databases for desired combinations of properties could help to address some of the challenges of materials design. As these databases are heterogeneous in nature, using them in the materials design workflow poses a number of challenges. For instance, retrieving data from more than one database means that users have to understand and use different application programming interfaces (APIs), or even reconcile different data models. Nowadays, interoperability in materials design is achieved mainly via file-based exchange involving specific formats and, at best, some partial metadata, which is not always adequately documented as it is not guided by an ontology.
The second author is closely involved with another ongoing effort, the Open Databases Integration for Materials Design (OPTIMADE; footnote 2) project, which aims to make materials databases interoperable by developing a common API. This effort would also benefit from semantically enabling the system with an ontology, both for search and for integrating information from the underlying databases.

These issues relate to the FAIR principles (Findable, Accessible, Interoperable, and Reusable), whose purpose is to enable machines to automatically find and use data, and individuals to easily reuse it [23]. In the materials science domain, awareness of the importance of such principles for data storage and management has recently been growing, and research in this area is starting [6].

To address these challenges and make data FAIR, ontologies and ontology-based techniques have been proposed to play a significant role. For the materials design field there is, therefore, a need for an ontology to represent solid-state physics concepts such as materials’ properties, microscopic structure as well as calculations, which are the basis for materials design. Thus, in this paper, we present the Materials Design Ontology (MDO). The development of MDO was guided by the schemas of OPTIMADE as they are based on a consensus reached by several of the materials database providers in the field. Further, we show the use of MDO for data obtained via the OPTIMADE API and via database-specific APIs in the materials design field.

The paper is organized as follows. We introduce some well-known databases and existing ontologies in the materials science domain in Sect. 2. In Sect. 3 we present the development of MDO and introduce the concepts, relations and axiomatization of the ontology. In Sect. 4 we introduce the envisioned usage of MDO as well as a current implementation. In Sect. 5 we discuss the impact, availability and extendability of MDO as well as future work. Finally, the paper concludes in Sect. 6 with a short summary.

Availability: MDO is developed and maintained in a GitHub repository (footnote 3), and is available from a permanent w3id URL (footnote 4).

2 Related Work

In this section we briefly discuss well-known databases as well as ontologies in the materials science field. Further, we briefly introduce OPTIMADE.

2.1 Data and Databases in the Materials Design Domain

In the search for new materials, the calculation of electronic structures is an important tool. Calculations take data representing the structure and properties of materials as input and generate new such data. A common crystallographic data representation that is widely used by researchers and software vendors for materials design is CIF (footnote 5). It was developed by the International Union of Crystallography Working Party on Crystallographic Information and was first online in 2006. One of the widely used databases is the Inorganic Crystal Structure Database (ICSD; footnote 6). ICSD provides data that is used as an important starting point in many calculations in the materials design domain.

As the size of computed data grows, and more and more machine learning and data mining techniques are being used in materials design, frameworks are appearing that provide not only data but also tools. Materials Project, AFLOW and OQMD are well-known examples of such frameworks that are publicly available. Materials Project [13] is a central program of the Materials Genome Initiative, focusing on predicting the properties of all known inorganic materials through computations. It provides open web-based data access to computed information on materials, as well as tools to design new materials. To make the data publicly available, the Materials Project provides an open Materials API and an open-source Python-based programming package (pymatgen). AFLOW [4] (Automatic Flow for Materials Discovery) is an automatic framework for high-throughput materials discovery, especially for crystal structure properties of alloys, intermetallics, and inorganic compounds. AFLOW provides a REST API and a Python-based programming package (aflow). OQMD [19] (The Open Quantum Materials Database) is also a high-throughput database, consisting of over 600,000 crystal structures calculated based on density functional theory (footnote 7). OQMD is designed based on a relational data model. OQMD supports a REST API and a Python-based programming package (qmpy).

2.2 Ontologies and Standards

Within the materials science domain, the use of semantic technologies is in its infancy, with ontologies and standards still under development. The existing ontologies focus either on representing general materials domain knowledge or on specific sub-domains.

Two ontologies that represent general materials domain knowledge, and to which our ontology connects, are ChEBI and EMMO. ChEBI [5] (Chemical Entities of Biological Interest) is a freely available data set of molecular entities focusing on chemical compounds. The representation of such molecular entities as atom, molecule, ion, etc. is the basis in both chemistry and physics. The ChEBI ontology is widely used and integrated into other domain ontologies. EMMO (European Materials & Modelling Ontology) is an upper ontology that is currently being developed and aims at developing a standard representational ontology framework based on current knowledge of materials modeling and characterization. The EMMO development started from the very bottom level, using the actual picture of the physical world coming from applied sciences, and in particular from physics and materials science. Although EMMO already covers some sub-domains in materials science, many sub-domains are still lacking, including the domain MDO targets.

Further, a number of ontologies from the materials science domain focus on specific sub-domains (e.g., metals, ceramics, thermal properties, nanotechnology), and have been developed with a specific use in mind (e.g., search, data integration) [14]. For instance, the Materials Ontology [2] was developed for data exchange among thermal property databases, and MatOnto ontology [3] for oxygen ion conducting materials in the fuel cell domain. NanoParticle Ontology [21] represents properties of nanoparticles with the purpose of designing new nanoparticles, while the eNanoMapper ontology [11] focuses on assessing risks related to the use of nanomaterials from the engineering point of view. Extensions to these ontologies in the nanoparticle domain are presented in [18]. An ontology that represents formal knowledge for simulation, modeling, and optimization in computational molecular engineering is presented in [12]. Further, an ontology design pattern to model material transformation in the field of sustainable construction, is proposed in [22]. All the materials science domain ontologies above target different sub-domains from MDO.

There are also efforts to build standards for data export from databases and data integration among tools. To some extent the standards formalize the description of materials knowledge and thereby create ontological knowledge. A recent approach is Novel Materials Discovery (NOMAD; footnote 8) [7], whose metadata structure is defined to be independent of specific materials science theories or methods, so that it can be used as an exchange format [9].

2.3 Open Databases Integration for Materials Design

OPTIMADE is a consortium gathering many database providers. It aims at enabling interoperability between materials databases through a common REST API. During the development OPTIMADE takes widely used materials databases such as those introduced in Sect. 2.1 into account. OPTIMADE has a schema that defines the specification of the OPTIMADE REST API and provides essentially a list of terms for which there is a consensus from different database providers. The OPTIMADE API is taken into account in the development of MDO as shown in Sect. 3.
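To make the idea of a common REST API more concrete, the following minimal sketch builds a request URL for the `structures` endpoint using the OPTIMADE filter grammar. The base URL is a hypothetical placeholder, no request is actually sent, and the helper name `build_structures_url` is our own illustration rather than part of any OPTIMADE client library.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_structures_url(base_url: str, filter_expr: str, page_limit: int = 10) -> str:
    """Build a URL for an OPTIMADE-style /v1/structures endpoint (nothing is sent)."""
    query = urlencode({"filter": filter_expr, "page_limit": page_limit})
    return f"{base_url.rstrip('/')}/v1/structures?{query}"


# Hypothetical provider base URL; each real provider publishes its own.
BASE_URL = "https://example.org/optimade"
# Structures that contain exactly the four elements Rb, Li, Ti and Cl.
FILTER = 'elements HAS ALL "Rb","Li","Ti","Cl" AND nelements=4'

url = build_structures_url(BASE_URL, FILTER)
# Decode the query string again to confirm the filter survives URL encoding.
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["filter"][0])
```

Because every OPTIMADE provider accepts the same filter syntax, only the base URL would need to change to send the same question to a different database.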

3 The Materials Design Ontology (MDO)

3.1 The Development of MDO

The development of MDO followed the NeOn ontology engineering methodology [20]. It consists of a number of scenarios mapped from a set of common ontology development activities. In particular, we focused on applying scenario 1 (From Specification to Implementation), scenario 2 (Reusing and re-engineering non-ontological resources), scenario 3 (Reusing ontological resources) and scenario 8 (Restructuring ontological resources). We used OWL2 DL as the representation language for MDO. Throughout the process, two knowledge engineers and one domain expert from the materials design domain were involved. In the remainder of this section, we introduce the key aspects of the development of MDO.

Requirements Analysis. During this step, we clarified the requirements by proposing Use Cases (UC), Competency Questions (CQ) and additional restrictions.

The use cases, which were identified through literature study and discussion between the domain expert and the knowledge engineers based on experience with the development of OPTIMADE and the use of materials science databases, are listed below.

  • UC1: MDO will be used for representing knowledge in basic materials science such as solid-state physics and condensed matter theory.

  • UC2: MDO will be used for representing materials calculation and standardizing the publication of the materials calculation data.

  • UC3: MDO will be used as a standard to improve the interoperability among heterogeneous databases in the materials design domain.

  • UC4: MDO will be mapped to OPTIMADE’s schema to improve OPTIMADE’s search functionality.

The competency questions are based on discussions with domain experts and contain questions that the databases currently can answer as well as questions that experts would want to ask the databases. For instance, CQ1, CQ2, CQ6, CQ7, CQ8 and CQ9 cannot be asked explicitly through the database APIs, although the original downloadable data contains the answers.

  • CQ1: What are the calculated properties and their values produced by a materials calculation?

  • CQ2: What are the input and output structures of a materials calculation?

  • CQ3: What is the space group type of a structure?

  • CQ4: What is the lattice type of a structure?

  • CQ5: What is the chemical formula of a structure?

  • CQ6: For a series of materials calculations, what are the compositions of materials with a specific range of a calculated property (e.g., band gap)?

  • CQ7: For a specific material and a given range of a calculated property (e.g., band gap), what is the lattice type of the structure?

  • CQ8: For a specific material and an expected lattice type of output structure, what are the values of calculated properties of the calculations?

  • CQ9: What is the computational method used in a materials calculation?

  • CQ10: What is the value for a specific parameter (e.g., cutoff energy) of the method used for the calculation?

  • CQ11: Which software produced the result of a calculation?

  • CQ12: Who are the authors of the calculation?

  • CQ13: Which software or code does the calculation run with?

  • CQ14: When was the calculation data published to the database?

Further, we proposed a list of additional restrictions that help in defining concepts. Some examples are shown below. The full list of additional restrictions can be found in the GitHub repository (footnote 9).

  • A materials property can relate to a structure.

  • A materials calculation has exactly one corresponding computational method.

  • A structure corresponds to one specific space group.

  • A materials calculation is performed by some software programs or codes.
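The restrictions above can be read as cardinality constraints. As a rough illustration, the sketch below checks two of them on a plain Python record. The class `CalculationRecord` and its field names are hypothetical and not part of MDO. Note also that this is a closed-world check on a single record, whereas OWL reasoning is open-world, so a missing value would not by itself make an ontology inconsistent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CalculationRecord:
    """Hypothetical, simplified stand-in for the data about one calculation."""
    identifier: str
    methods: List[str] = field(default_factory=list)
    software: List[str] = field(default_factory=list)


def check_restrictions(calc: CalculationRecord) -> List[str]:
    """Return the restrictions that this record does not satisfy."""
    violations = []
    # "A materials calculation has exactly one corresponding computational method."
    if len(calc.methods) != 1:
        violations.append("exactly one computational method")
    # "A materials calculation is performed by some software programs or codes."
    if len(calc.software) < 1:
        violations.append("at least one software program or code")
    return violations


complete = CalculationRecord("calc-1", methods=["DFT"], software=["some-code"])
incomplete = CalculationRecord("calc-2")
print(check_restrictions(complete))
print(check_restrictions(incomplete))
```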

Reusing and Re-engineering Non-ontological Resources. To obtain the knowledge for building the ontology, we followed two steps: (1) the collection and analysis of non-ontological resources that are relevant to the materials design domain, and (2) discussions with the domain expert regarding the concepts and relationships to be modeled in the ontology. The collection of non-ontological resources comes from: (1) the dictionaries of CIF and International Tables for Crystallography; (2) the APIs from different databases (e.g., Materials Project, AFLOW, OQMD) and OPTIMADE.

Modular Development Aiming at Building Design Patterns. We identified a pattern related to provenance information in the repository of Ontology Design Patterns (ODPs) that could be reused or re-engineered for MDO. This has led to the reuse of entities in PROV-O [15]. Further, we built MDO in modules considering the possibility for each module to be an ontology design pattern, e.g., the calculation module.

Connection and Integration of Existing Ontologies. MDO is connected to EMMO by reusing the concept ‘Material’, and to ChEBI by reusing the concept ‘atom’. Further, we reuse the concepts ‘Agent’ and ‘SoftwareAgent’ from PROV-O. In terms of representation of units we reuse the ‘Quantity’, ‘QuantityValue’, ‘QuantityKind’ and ‘Unit’ concepts from QUDT (Quantities, Units, Dimensions and Data Types Ontologies) [10]. We use the metadata terms from the Dublin Core Metadata Initiative (DCMI; footnote 10) to represent the metadata of MDO.

3.2 Description of MDO

MDO consists of one basic module, Core, and two domain-specific modules, Structure and Calculation, importing the Core module. In addition, the Provenance module, which also imports Core, models provenance information. In total, the OWL2 DL representation of the ontology contains 37 classes, 32 object properties, and 32 data properties. Figure 9 shows an overview of the ontology. The ontology specification is also publicly accessible at w3id.org (footnote 11). The competency questions can be answered using the concepts and relations in the different modules (CQ1 and CQ2 by Core, CQ3 to CQ8 by Structure, CQ9 and CQ10 by Calculation, and CQ11 to CQ14 by Provenance).

The Core module as shown in Fig. 1, consists of the top-level concepts and relations of MDO, which are also reused in other modules. Figure 2 shows the description logic axioms for the Core module. The module represents general information of materials calculations. The concepts Calculation and Structure represent materials calculations and materials’ structures, respectively, while Property represents materials properties. Property is specialized into the disjoint concepts CalculatedProperty and PhysicalProperty (Core1, Core2, Core3). Property, which can be viewed as a quantifiable aspect of one material or materials system, is defined as a sub concept of Quantity from QUDT (Core4). Properties are also related to structures (Core5). When a calculation is applied on materials structures, each calculation takes some structures and properties as input, and may output structures and calculated properties (Core6, Core7). Further, we use EMMO’s concept Material and state that each structure is related to some material (Core8).
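As an informal illustration of how the Core concepts fit together, the following sketch mirrors Calculation, Structure, Property and CalculatedProperty as Python dataclasses and answers CQ1 and CQ2 on a toy instance. The attribute and function names are our own assumptions and do not reproduce the MDO property names, and the numeric value is a placeholder rather than real data.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Structure:
    label: str


@dataclass
class Property:
    name: str
    value: float
    unit: str


@dataclass
class CalculatedProperty(Property):
    """Specialization of Property for values produced by a calculation."""


@dataclass
class Calculation:
    input_structures: List[Structure] = field(default_factory=list)
    output_structures: List[Structure] = field(default_factory=list)
    output_properties: List[CalculatedProperty] = field(default_factory=list)


def calculated_properties(calc: Calculation) -> List[Tuple[str, float, str]]:
    """CQ1: calculated properties and their values produced by a calculation."""
    return [(p.name, p.value, p.unit) for p in calc.output_properties]


def input_output_structures(calc: Calculation) -> Tuple[List[str], List[str]]:
    """CQ2: input and output structures of a calculation."""
    return ([s.label for s in calc.input_structures],
            [s.label for s in calc.output_structures])


# Toy instance; the band gap value is a placeholder, not taken from any database.
calc = Calculation(
    input_structures=[Structure("initial")],
    output_structures=[Structure("relaxed")],
    output_properties=[CalculatedProperty("band gap", 1.0, "eV")],
)
print(calculated_properties(calc))
print(input_output_structures(calc))
```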

The Structure module, as shown in Fig. 3, represents the structural information of materials. Figure 4 shows the description logic axioms for the Structure module. Each structure has exactly one composition, which represents which chemical elements compose the structure and the ratio of elements in the structure (Struc1). The composition has different representations of chemical formulas. The occupancy of a structure relates the sites to the species, i.e., the specific chemical elements, that occupy each site (Struc2 - Struc5). Each site has at most one representation of coordinates in Cartesian format and at most one in fractional format (Struc6, Struc7). The spatial information regarding structures is essential to reflect physical characteristics such as melting point and strength of materials. To represent this spatial information, we state that each structure is represented by some bases and that a (periodic) structure can also be represented by one or more lattices (Struc8). Each basis and each lattice can be identified by one axis-vectors set or by one length triple together with one angle triple (Struc9, Struc10). An axis-vectors set has three connections to coordinate vectors, representing the coordinates of three translation vectors, which are used to represent a (minimal) repeating unit (Struc11). These three translation vectors are often called a, b, and c. Point groups and space groups are used to represent information about the symmetry of a structure. The space group represents a symmetry group of patterns in three dimensions of a structure, and the point group represents a group of linear mappings which correspond to the group of motions in space to determine the symmetry of a structure. Each structure has one corresponding space group (Struc12). Based on the definition from the International Tables for Crystallography, each space group also has some corresponding point groups (Struc13).
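The two lattice representations mentioned above (an axis-vectors set, or a length triple with an angle triple) carry the same information, and so do fractional and Cartesian site coordinates. The sketch below shows one common way to convert between them, assuming the usual convention that vector a lies along the x axis and vector b lies in the xy plane. The function names are illustrative and are not taken from MDO or from any particular library.

```python
import math


def lattice_vectors(a, b, c, alpha, beta, gamma):
    """Convert lengths (a, b, c) and angles in degrees to three axis vectors.

    Convention assumed here: a along x, b in the xy plane.
    alpha is the angle between b and c, beta between a and c, gamma between a and b.
    """
    al, be, ga = (math.radians(x) for x in (alpha, beta, gamma))
    vec_a = (a, 0.0, 0.0)
    vec_b = (b * math.cos(ga), b * math.sin(ga), 0.0)
    cx = c * math.cos(be)
    cy = c * (math.cos(al) - math.cos(be) * math.cos(ga)) / math.sin(ga)
    cz = math.sqrt(c * c - cx * cx - cy * cy)
    return vec_a, vec_b, (cx, cy, cz)


def fractional_to_cartesian(frac, vectors):
    """Cartesian position r = f1*a + f2*b + f3*c for fractional coordinates."""
    return tuple(sum(f * v[i] for f, v in zip(frac, vectors)) for i in range(3))


# A cubic cell with edge length 2 (arbitrary length units): the cell centre
# at fractional coordinates (0.5, 0.5, 0.5) lies at approximately (1, 1, 1).
cubic = lattice_vectors(2.0, 2.0, 2.0, 90.0, 90.0, 90.0)
centre = fractional_to_cartesian((0.5, 0.5, 0.5), cubic)
print(centre)
```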

Fig. 1. Concepts and relations in the Core module.

Fig. 2. Description logic axioms for the Core module.

Fig. 3. Concepts and relations in the Structure module.

Fig. 4. Description logic axioms for the Structure module.

The Calculation module, as shown in Fig. 5, represents the classification of different computational methods. Figure 6 shows the description logic axioms for the Calculation module. Each calculation is achieved by a specific computational method (Cal1). Each computational method has some parameters (Cal2). In the current version of this module, we represent two different methods, the density functional theory method and the Hartree-Fock method (Cal3, Cal4). In particular, the density functional theory method is frequently used in materials design to investigate the electronic structure. Such a method has at least one corresponding exchange-correlation energy functional (Cal5), which is used to calculate the exchange-correlation energy of a system. There are different kinds of functionals to calculate the exchange-correlation energy (Cal6-Cal11).

Fig. 5. Concepts and relations in the Calculation module.

Fig. 6. Description logic axioms for the Calculation module.

The Provenance module as shown in Fig. 7, represents the provenance information of materials data and calculation. Figure 8 shows the description logic axioms for the Provenance module. We reuse part of PROV-O and define a new concept ReferenceAgent as a sub-concept of PROV-O’s agent (Prov1). We state that each structure and property can be published by reference agents which could be databases or publications (Prov2, Prov3). Each calculation is produced by a specific software (Prov4).

Fig. 7. Concepts and relations in the Provenance module.

Fig. 8. Description logic axioms for the Provenance module.

Fig. 9. An overview of MDO.

4 MDO Usage

In Fig. 10, we show the vision for the use of MDO for semantic search over OPTIMADE and materials science databases. By generating mappings between MDO and the schemas of materials databases, we can create MDO-enabled query interfaces. The querying can occur, for instance, via MDO-based query expansion, MDO-based mediation or through MDO-enabled data warehouses.
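As a rough sketch of what such mappings enable, the example below expands a single condition expressed with an MDO-style term into one condition per database. The mapping table, the term name `BandGap` and the database field names are illustrative assumptions that would need to be checked against each provider's documentation, and the functions are hypothetical rather than part of the prototype.

```python
# Illustrative mapping from an MDO-style term to database-specific field names.
# All field names here are assumptions, not verified against the provider schemas.
FIELD_MAPPINGS = {
    "BandGap": {
        "materials_project": "band_gap",
        "aflow": "Egap",
        "oqmd": "band_gap",
    },
}


def translate_condition(mdo_term, operator, value, database):
    """Rewrite one condition on an MDO-style term for a single database."""
    return {
        "field": FIELD_MAPPINGS[mdo_term][database],
        "operator": operator,
        "value": value,
    }


def expand_condition(mdo_term, operator, value):
    """Expand one condition into a condition per mapped database."""
    return {
        database: translate_condition(mdo_term, operator, value, database)
        for database in FIELD_MAPPINGS[mdo_term]
    }


expanded = expand_condition("BandGap", ">", 5.0)
for database, condition in expanded.items():
    print(database, condition)
```

A mediator would then send each translated condition to the corresponding API and merge the answers, whereas a data warehouse would apply the mappings once, at load time.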

As a proof of concept (full lines in the figure), we created mappings between MDO and the schemas of OPTIMADE and part of Materials Project. Using the mappings we created an RDF data set with data from Materials Project. Further, we built a SPARQL query application that can be used to query the RDF data set using MDO terminology. Examples are given below.

Fig. 10. The vision of the use of MDO. The full-lined components in the figure are currently implemented in a prototype.

Instantiating a Materials Calculation Using MDO. In Fig. 11 we exemplify the use of MDO to represent a specific materials calculation and related data in an instantiation. The example is one of the 85 stable materials published in Materials Project in [8]. The calculation concerns an elpasolite with the composition \(\mathrm{Rb}_2\mathrm{Li}_1\mathrm{Ti}_1\mathrm{Cl}_6\). To avoid overcrowding the figure, we only show the instances corresponding to the calculation’s output structure, and where there are multiple calculated properties, species and sites, we only show one instance of each. Connected to the instances of the Core module’s concepts are instances representing the structural information of the output structure, the provenance information of the output structure and calculated property, and the information about the computational method used for the calculation.

Fig. 11. An instantiated materials calculation.

Mapping the Data from a Materials Database to RDF Using MDO. As presented in Sect. 2.1, data from many materials databases are provided through the providers’ APIs. A commonly used format is JSON. Our current implementation maps all JSON data related to the 85 stable materials from [8] to RDF. We constructed the mappings by using SPARQL-Generate [16]. Listing 1.1 shows a simple example of how to write the mapping for ‘band gap’, which is a CalculatedProperty. The result is shown in Listing 1.2. The final RDF dataset contains 42,956 triples. The SPARQL-Generate script and the RDF dataset are available from the GitHub repository (footnote 12). This RDF dataset is used for executing SPARQL queries such as the one presented below.
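To give a feel for this kind of JSON-to-RDF mapping, the sketch below turns one JSON record into N-Triples using only the Python standard library. It is not the SPARQL-Generate script from the paper. The namespace IRIs, the property name `hasPropertyName`, the instance IRIs under `example.org`, and the JSON keys are all assumptions made for illustration, and the record holds placeholder values rather than real data.

```python
import json

# Assumed namespace and term IRIs (not verified against the published ontology).
MDO_CORE = "https://w3id.org/mdo/core/"
QUDT = "http://qudt.org/schema/qudt/"
UNIT_EV = "http://qudt.org/vocab/unit/EV"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def band_gap_triples(record, base="https://example.org/data/"):
    """Map the 'band_gap' entry of one JSON record to N-Triples lines."""
    prop = f"<{base}{record['material_id']}/band_gap>"
    value = f"<{base}{record['material_id']}/band_gap/value>"
    literal = f'"{float(record["band_gap"])}"^^<{XSD_DOUBLE}>'
    return [
        # The band gap is typed as a CalculatedProperty and given a name.
        f"{prop} <{RDF_TYPE}> <{MDO_CORE}CalculatedProperty> .",
        f'{prop} <{MDO_CORE}hasPropertyName> "band gap" .',
        # QUDT-style value node carrying the number and its unit.
        f"{prop} <{QUDT}quantityValue> {value} .",
        f"{value} <{QUDT}numericValue> {literal} .",
        f"{value} <{QUDT}unit> <{UNIT_EV}> .",
    ]


# Placeholder record; identifier and value are made up for illustration.
record = json.loads('{"material_id": "example-0001", "band_gap": 1.5}')
triples = band_gap_triples(record)
print("\n".join(triples))
```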


A SPARQL Query Example. As an example, we show a SPARQL query related to CQ6 in Listing 1.3. The result contains 7 records, which are shown in Table 1. The query is:

  • “What are the materials for which the value of the band gap is higher than 5 eV?” (The result should contain the formula and the value of the band gap.)
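The shape of such a query can be sketched as follows. The SPARQL text is an illustration only: the prefixes and property names are assumptions and may differ from those in the published ontology and in Listing 1.3. Since no triple store is assumed here, the same selection is then applied in plain Python to toy records whose formulas and values are placeholders, not results from the paper.

```python
# Illustrative SPARQL text; property names and prefixes are hypothetical.
SPARQL_SKETCH = """
PREFIX core: <https://w3id.org/mdo/core/>
PREFIX structure: <https://w3id.org/mdo/structure/>
PREFIX qudt: <http://qudt.org/schema/qudt/>
SELECT ?formula ?value WHERE {
  ?calc core:hasOutputCalculatedProperty ?prop ;
        core:hasOutputStructure ?struct .
  ?prop core:hasPropertyName "band gap" ;
        qudt:quantityValue/qudt:numericValue ?value .
  ?struct structure:hasComposition/structure:hasReducedFormula ?formula .
  FILTER(?value > 5)
}
"""

# Toy records standing in for query solutions; all values are placeholders.
records = [
    {"formula": "Formula-A", "band_gap_eV": 6.0},
    {"formula": "Formula-B", "band_gap_eV": 4.5},
    {"formula": "Formula-C", "band_gap_eV": 5.0},
]


def materials_above(rows, threshold_eV):
    """Return (formula, band gap) pairs with a band gap strictly above the threshold."""
    return [
        (row["formula"], row["band_gap_eV"])
        for row in rows
        if row["band_gap_eV"] > threshold_eV
    ]


result = materials_above(records, 5.0)
print(result)
```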

Table 1. The result of the query

We show more SPARQL query examples and the corresponding results in the GitHub repository (footnote 13).

5 Discussion and Future Work

To our knowledge, MDO is the first OWL ontology representing solid-state physics concepts, which are the basis for materials design.

The ontology fills a need for semantically enabling access to and integration of materials databases, and for realizing FAIR data in the materials design field. This will have a large impact on the effectiveness and efficiency of finding relevant materials data and calculations, thereby increasing the speed and quality of the materials design process. Through our connection with OPTIMADE, and because we have created mappings between MDO and some major materials databases, the potential for impact is large.

The development of MDO followed well-known practices from the ontology engineering point of view (NeOn methodology and modular design). Further, we reused concepts from PROV-O, ChEBI, QUDT and EMMO. A permanent URL is reserved from w3id.org for MDO. MDO is maintained in a GitHub repository, from which the ontology in OWL2 DL, visualizations of the ontology and modules, UCs, CQs and restrictions are available. It is licensed under an MIT license (footnote 14).

Due to our modular approach, MDO can be extended with other modules, for instance, regarding different types of calculations and their specific properties. We identified, for instance, the need for an X-ray Diffraction module to model the experimental diffraction data used to explore the structural information of materials, and an Elastic Tensor module to model data in a calculation that represents a structure’s elasticity. We may also refine the current ontology. For instance, it may be interesting to model workflows containing multiple calculations.

6 Conclusion

In this paper, we presented MDO, an ontology which defines concepts and relations to cover the knowledge in the field of materials design and which reuses concepts from other ontologies. We discussed the ontology development process showing use cases and competency questions. Further, we showed the use of MDO for semantically enabling materials database search. As a proof of concept, we mapped MDO to OPTIMADE and part of Materials Project and showed querying functionality using SPARQL on a dataset from Materials Project.