Introduction

Data-centric approaches have been adopted to dramatically accelerate progress in materials science. This is also the case for the study of nanostructures of materials.1 Thanks to advances in computational power and techniques, theoretical calculations using density functional theory (DFT) can be systematically performed for many different crystals and nanostructures with predictive performance; the results have been stored in open databases, as shown in the other articles in this issue, or in local repositories. Progress in digitally controlled microscopy and spectroscopy has enabled the acquisition of big data from nanostructures with atomic resolution. The combination of such digital data with modern machine-learning (ML) techniques has been used to explore materials and structures and to extract meaningful and useful information and patterns from existing data, that is, for data-driven discovery.

This article reviews recent progress in ideas and tools in nanoinformatics and informational materials science. Actual applications of ML techniques for materials problems will also be demonstrated. Topics include descriptions of materials properties, construction of interatomic potentials, discovery of new inorganic compounds, exploration of potential energy surfaces for efficient characterization of ionic transport, efficient search of interface structures, data analysis of hyper-spectral images by transmission electron microscopy, and design of catalytic nanoparticles.

Descriptions of materials properties

How compounds are represented in a data set is a key factor in controlling the performance of an ML approach. Representations of compounds are called “descriptors” or “features.” A useful strategy is to use a set of quantities, derived from elemental and structural representations of a compound, as descriptors since such representations are abundant in the literature. Kernel ridge regression prediction models for the DFT cohesive energy have been used to evaluate the performance of descriptors derived from elemental and structural representations.2 Our best prediction model has a prediction error of 0.045 eV/atom.2 Therefore, the present method should be useful to search for compounds with diverse chemical properties that are applicable to a wide range of chemical and structural spaces without performing exhaustive DFT calculations. Our previous research confirmed that descriptors based on elemental and structural representations are useful in other applications such as the prediction of thermal and electronic properties.3–5
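As a rough illustration of kernel ridge regression on descriptor vectors, the following sketch fits a Gaussian-kernel model to synthetic data. The descriptors, the toy target function, and the hyperparameters `gamma` and `lam` are assumptions made for this example; they are not the descriptors or settings of Reference 2.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def krr_fit(X, y, gamma=0.5, lam=1e-4):
    """Solve (K + lam * I) alpha = y for the dual coefficients alpha."""
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, gamma=0.5):
    """Predict targets for new descriptor vectors."""
    return gaussian_kernel(X_new, X_train, gamma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 3))   # hypothetical descriptor vectors
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2]     # toy stand-in for a cohesive energy

X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

alpha = krr_fit(X_train, y_train)
pred = krr_predict(X_train, alpha, X_test)
rmse = float(np.sqrt(np.mean((pred - y_test) ** 2)))
print(f"test RMSE on toy data: {rmse:.4f}")
```

In practice, the kernel width and the regularization strength would be chosen by cross validation rather than fixed as above.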

Another application of descriptors is the ML interatomic potential (MLIP). MLIP, which is based on a large data set obtained by DFT calculations, can improve the accuracy and transferability of interatomic potentials.6,7 In the MLIP framework, the atomic energy is modeled by descriptors corresponding to structural representations. For elemental metals, MLIPs have been obtained from DFT calculations using linear regression such as Lasso and linear ridge regression.8,9 MLIPs with structural descriptors dependent only on the distance between two atoms have small prediction errors, enabling the physical properties to be accurately predicted. It is important to use an extended approximation for the atomic energy in transition metals.10,11 For elemental Ti, the optimized angular-dependent MLIP has a prediction error of 0.5 meV/atom, which is much smaller than that of the linearized MLIP with the power of pairwise descriptors of 17.0 meV/atom. This angular-dependent MLIP can predict the physical properties much more accurately than existing IPs.
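The idea of a linear MLIP built from pairwise descriptors can be sketched as follows. This is a minimal toy example, not the potentials of References 8 and 9: the Gaussian radial functions, the Morse-like `reference_energy` that stands in for DFT, the non-periodic random clusters, and the regularization value are all assumptions.

```python
import numpy as np

def pair_distances(pos):
    """All interatomic distances of one structure (no periodic images)."""
    diff = pos[:, None, :] - pos[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    return dist[np.triu_indices(len(pos), k=1)]

def pairwise_descriptors(pos, centers, width=0.4):
    """Sum of Gaussian radial functions over all pairs, one value per center."""
    r = pair_distances(pos)
    return np.exp(-((r[:, None] - centers[None, :]) / width) ** 2).sum(0)

def reference_energy(pos):
    """Toy Morse-like pair energy standing in for a DFT total energy."""
    r = pair_distances(pos)
    return float(np.sum(np.exp(-2.0 * (r - 1.5)) - 2.0 * np.exp(-(r - 1.5))))

rng = np.random.default_rng(1)
n_atoms = 8
centers = np.linspace(0.0, 5.5, 20)   # covers all distances in a 3 x 3 x 3 box
structures = [rng.uniform(0.0, 3.0, size=(n_atoms, 3)) for _ in range(300)]
F = np.array([pairwise_descriptors(p, centers) for p in structures])
E = np.array([reference_energy(p) for p in structures])

# Linear ridge regression: the energy is linear in the structural descriptors.
train, test = slice(0, 240), slice(240, 300)
lam = 1e-6
w = np.linalg.solve(F[train].T @ F[train] + lam * np.eye(len(centers)),
                    F[train].T @ E[train])

rmse_per_atom = float(np.sqrt(np.mean((F[test] @ w - E[test]) ** 2)) / n_atoms)
spread_per_atom = float(E[test].std() / n_atoms)
print(rmse_per_atom, spread_per_atom)
```

Because the toy reference energy is itself a sum of pair terms, pairwise descriptors suffice here; the angular-dependent descriptors discussed above would be needed when the energy has many-body character.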

Recommender system for the discovery of new inorganic compounds

Chemically relevant compositions (CRCs) that form stable crystals and atomic arrangements of inorganic compounds have been collected as inorganic crystal structure databases. We have proposed recommender system approaches to predict currently unknown CRCs from a database.12,13 First, the performance of matrix- and tensor-based recommender system approaches was examined to discover currently unknown CRCs.12 The Tucker decomposition recommender system shows the best performance: 735 test CRCs (24.5%) were identified in the top 3000 compositions. Second, systematic DFT calculations were used to investigate the phase stability of 27 recommended compositions with high predicted ratings. Among the top 27 compositions, 23 currently unknown compounds were found to be stable by the DFT calculations.12 These results indicate that the recommender system has great potential to accelerate the discovery of new compounds.
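A matrix-based recommender can be sketched with a truncated singular value decomposition. This is a toy stand-in rather than the Tucker decomposition of Reference 12; the block-structured rating matrix, the chosen rank, and the number of hidden entries are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20

# Toy rating matrix: 1 marks a "known" composition, 0 an unknown one.
# Rows and columns index two hypothetical composition attributes.
ratings = np.zeros((n, n))
ratings[:10, :10] = 1.0
ratings[10:, 10:] = 1.0

# Hide ten known entries to play the role of not-yet-discovered compositions.
known = np.argwhere(ratings == 1.0)
hidden = known[rng.choice(len(known), size=10, replace=False)]
train = ratings.copy()
train[hidden[:, 0], hidden[:, 1]] = 0.0

# A low-rank reconstruction gives a predicted rating for every entry.
U, s, Vt = np.linalg.svd(train, full_matrices=False)
rank = 2
scores = (U[:, :rank] * s[:rank]) @ Vt[:rank]

# Rank the currently unknown entries by predicted rating.
candidates = np.argwhere(train == 0.0)
order = np.argsort(-scores[candidates[:, 0], candidates[:, 1]])
top = candidates[order[:10]]

recovered = {tuple(ij) for ij in top} & {tuple(ij) for ij in hidden}
print(f"{len(recovered)} of {len(hidden)} hidden entries appear in the top 10")
```

In a real application, the highly rated unknown compositions would then be passed to DFT calculations to test their phase stability, as described above.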

Exploring a potential energy surface for characterizing atomic transport

Atomic transport plays a significant role in various phenomena related to physics, chemistry, and materials science. For the transport of a mobile atom governed by thermally activated processes in a host crystal, the kinetics are fully characterized by the entire potential energy surface (PES) of the mobile atom in the crystal. The entire PES can be theoretically evaluated by exhaustive local structural optimizations around the mobile atom, in which the mobile atom is fixed at each point on a fine grid introduced in the crystal. However, such evaluations require huge computational costs, particularly for first-principles calculations. In this section, a novel ML method is introduced that selectively evaluates only the few dominant points characterizing the atomic transport of interest.14 The method focuses on the global minimum point and the bottleneck point on the optimal path. The latter point is defined as the maximum point on the lowest-energy path between two global minimum points separated by a lattice translation vector.

The basic strategy for identifying the dominant points is to construct a probabilistic Gaussian process (GP) model of the entire PES, which is iteratively updated using the first-principles PEs already computed in the earlier steps. The next point for PE computation is selected by the estimated likelihoods of the dominant points within a Bayesian optimization (BO)-like framework. The intrinsic difficulty is that the optimal path and its bottleneck point are found after acquiring complete information about the entire PES. Use of only the mean and variance at each grid point is never sufficient to estimate the likelihood of the bottleneck point, although these are usually used in typical GP + BO strategies.4,15,16 To overcome this difficulty, multiple randomized PES samples were generated according to a probabilistic GP model, and the optimal path for each PES sample was identified using a dynamic programming (DP)-based algorithm. This enables collections of the global minimum and bottleneck points to be obtained, and these collections are considered to be distributions representing the likelihoods of the dominant points. See the flowchart shown in Figure 1 for reference.14

Figure 1

Flowchart of the machine-learning method based on the Gaussian process, dynamic programming, and Bayesian optimization frameworks for preferential evaluation of several dominant points primarily characterizing an atomic transport of interest.14 Note: PE, potential energy; PES, potential energy surface; DFT, density functional theory.
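The loop in the flowchart can be sketched on a small two-dimensional grid. Everything specific here is an assumption for illustration: the toy PES, the zero-mean Gaussian-kernel GP, the fixed and non-periodic end points `start` and `goal`, and the use of a Dijkstra-type minimax search in place of the DP algorithm of Reference 14.

```python
import heapq
from collections import Counter
import numpy as np

def bottleneck(pes, start, goal):
    """Minimax path search: return (barrier height, barrier cell) of the
    start-to-goal path whose highest point is as low as possible."""
    best = {start: float(pes[start])}
    heap = [(float(pes[start]), start, start)]
    while heap:
        height, cell, top = heapq.heappop(heap)
        if cell == goal:
            return height, top
        if height > best[cell]:
            continue
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            if not (0 <= nxt[0] < pes.shape[0] and 0 <= nxt[1] < pes.shape[1]):
                continue
            h = max(height, float(pes[nxt]))
            if h < best.get(nxt, np.inf):
                best[nxt] = h
                heapq.heappush(heap, (h, nxt, nxt if h > height else top))
    raise ValueError("goal not reachable")

def gp_posterior(coords, obs_idx, obs_val, length=1.5, noise=1e-6):
    """Posterior mean and covariance of a zero-mean GP on all grid points."""
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * d2 / length ** 2)
    K_oo = K[np.ix_(obs_idx, obs_idx)] + noise * np.eye(len(obs_idx))
    K_ao = K[:, obs_idx]
    mean = K_ao @ np.linalg.solve(K_oo, obs_val)
    cov = K - K_ao @ np.linalg.solve(K_oo, K_ao.T)
    return mean, cov

shape = (6, 6)
rr, cc = np.meshgrid(np.arange(6), np.arange(6), indexing="ij")
coords = np.stack([rr.ravel(), cc.ravel()], axis=1).astype(float)
true_pes = np.sin(rr) * np.cos(0.7 * cc)   # toy stand-in for first-principles PEs
start, goal = (0, 0), (5, 5)

rng = np.random.default_rng(0)
observed = [0, 35]                          # flat indices of start and goal
for _ in range(10):
    mean, cov = gp_posterior(coords, observed, true_pes.ravel()[observed])
    w, V = np.linalg.eigh(cov)
    root = V * np.sqrt(np.clip(w, 0.0, None))   # square root of the covariance
    votes = Counter()
    for _ in range(30):                          # randomized PES samples
        sample = (mean + root @ rng.standard_normal(len(mean))).reshape(shape)
        _, top = bottleneck(sample, start, goal)
        votes[top[0] * shape[1] + top[1]] += 1   # sampled bottleneck point
        votes[int(np.argmin(sample))] += 1       # sampled global minimum
    ranked = [i for i, _ in votes.most_common() if i not in observed]
    if ranked:
        nxt = ranked[0]                          # most likely dominant point
    else:
        var = np.diag(cov).copy()
        var[observed] = -np.inf
        nxt = int(np.argmax(var))                # fall back to largest variance
    observed.append(nxt)                         # "compute" the PE at this point

print(len(observed), "of", true_pes.size, "grid points evaluated")
```

The vote counts play the role of the likelihood distributions of the dominant points: grid points that are often the minimum or the bottleneck of a sampled PES are evaluated first.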

The GP + DP + BO method was applied to the isotropic and anisotropic proton diffusivities in c-BaZrO3 with the cubic perovskite structure and t-LaNbO4 with the tetragonal scheelite structure.14 Only 50 and 100 PE computations were required to identify both global minimum and bottleneck points in c-BaZrO3 and t-LaNbO4, respectively. Thus, the ML method based on the extended frameworks of GP, DP, and BO shows high computational efficiency for identifying the dominant points. Note that this method is, in principle, applicable to any kind of atomic-transport phenomenon, including those governed by multiple mobile atoms. Furthermore, other phenomena governed by thermally activated processes (e.g., phase transitions and chemical reactions) are also candidate applications, provided that both initial and final states can be given in a configuration space. The novel ML method should therefore find extensive use as a robust, efficient, and realistic method.14

Interface structure determination from informatics

Interfaces are lattice defects inside materials and influence overall material properties. For example, interfaces in polycrystalline materials (i.e., grain boundaries [GBs]) determine ion-transport properties and high-temperature mechanical properties. Interfaces have different properties from the bulk because their atomic configurations differ from those inside the bulk. Thus, for a comprehensive understanding of interface properties, determination of the atomic structure of the interface is crucial.

However, extensive calculations are necessary to determine even one interface structure because of the geometrical freedom of the interface. The number of atomic configurations to be considered often reaches \(10^4\) even for a model GB, such as coincidence site lattice ΣGBs. As schematically illustrated in Figure 2, the structure and energy calculations for all candidates must be performed, leading to optimized configurations and energies (Ei,j in Figure 2). The most stable configuration with the minimal energy (Ei,min in Figure 2) can then be determined from the DFT/molecular dynamics (DFT/MD) simulation of the interface. This “brute force” computation must be repeated for each type of interface because the interface structure depends on the type of interface (ΣGB1, ΣGB2 in Figure 2). To accelerate interface structure searching, efficient methods based on ML techniques, including virtual screening and BO, have been proposed.16,17

Figure 2

Schematics of interface structure searching using the virtual screening method.16 Note: DFT, density functional theory; MD, molecular dynamics.

Virtual screening is an effective method in time-critical problems and has been used in drug discovery, where a prediction model is constructed using ML from a relatively small data set, and a large database is then built from the actual data together with data predicted by the model. The idea of our virtual screening method to determine the interface structure is illustrated in Figure 2. A prediction model (predictor) is constructed via regression analysis of the training data, in this case, ΣGB1 and ΣGB2. Once the predictor is constructed, the GB energies, Ei,min (i = 3, 4, ..., N), can be predicted from the initial configurations. Next, the promising initial configuration is optimized using the structure and subsequent energy calculations, and then the accurate energy and stable structure are obtained (Stable ΣGB3, 4, ..., N in Figure 2).16

We applied the virtual screening method to the [001] symmetric tilt GB of Cu. The predictor was constructed using two Σ5 and two Σ17 GBs, and a total of 83 descriptors related to the geometrical data, such as bond length and atom density, were used. The predictor was used to determine 12 other GBs of Cu, from Σ13 to Σ125.16

To obtain the stable structures for these GBs, the DFT/MD simulation was performed more than one million times. In contrast, the most plausible candidate can be determined by the predictor, so that only one or a few calculations are necessary for each type of GB. The virtual screening method thus significantly decreases the computational cost to determine the GB structures.
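The screening workflow can be sketched as follows: fit a regression predictor on configurations whose energies are already known, rank the candidate configurations of a new GB by predicted energy, and run the expensive calculation only for the most promising one. The five synthetic descriptors, the linear toy energy model, and the function `expensive_energy` are hypothetical stand-ins; the actual predictor of Reference 16 uses 83 geometrical descriptors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_desc = 5
w_true = rng.normal(size=n_desc)   # hidden toy relation between structure and energy

def expensive_energy(desc):
    """Hypothetical stand-in for a DFT/MD relaxation of one configuration."""
    return float(desc @ w_true)

# Training data: configurations of already-solved GBs with slightly noisy energies.
X_train = rng.normal(size=(400, n_desc))
y_train = X_train @ w_true + 0.01 * rng.normal(size=400)

# Predictor: ridge-regularized linear regression with an intercept column.
A = np.hstack([X_train, np.ones((400, 1))])
coef = np.linalg.solve(A.T @ A + 1e-8 * np.eye(n_desc + 1), A.T @ y_train)

# Candidate initial configurations of a new GB: predict, then rank.
X_new = rng.normal(size=(1000, n_desc))
pred = np.hstack([X_new, np.ones((1000, 1))]) @ coef
best = int(np.argmin(pred))

# Only the top-ranked candidate goes through the expensive calculation.
e_selected = expensive_energy(X_new[best])

# For checking the toy example only: energies of all candidates.
true_new = X_new @ w_true
print(e_selected, true_new.min())
```

A single expensive evaluation replaces one per candidate, which is where the cost saving described above comes from; the quality of the result depends entirely on how well the predictor transfers to the new GB.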

In addition to the virtual screening, we have developed an alternative and powerful method to search for stable interface structures with the aid of a geostatistics approach called kriging.17 Kriging is an effective interpolation method based on BO and GP governed by prior covariance. The kriging method has been applied to determine the GB structures of fcc-Cu, bcc-Fe, MgO, rutile-TiO2, and CeO2.17,18 By using the kriging method, the most stable structure can be determined with fewer than 100 calculations.
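A kriging-style search can be sketched as a loop that refits a GP to the energies computed so far and then picks the next candidate by an acquisition rule. The one-dimensional descriptor, the quadratic toy energy, the kernel length scale, and the lower-confidence-bound rule with a factor of 2 are assumptions for this example and may differ from the settings in References 17 and 18.

```python
import numpy as np

def rbf(a, b, length=0.2):
    """Gaussian covariance between two sets of one-dimensional descriptors."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

x = np.linspace(0.0, 1.0, 200)   # hypothetical descriptor of 200 candidates
energies = (x - 0.3) ** 2        # toy stand-in for computed GB energies

rng = np.random.default_rng(0)
evaluated = [int(i) for i in rng.choice(len(x), size=3, replace=False)]

for _ in range(25):
    xo, yo = x[evaluated], energies[evaluated]
    K = rbf(xo, xo) + 1e-6 * np.eye(len(xo))
    Ks = rbf(x, xo)
    # GP posterior mean and variance at every candidate.
    mean = yo.mean() + Ks @ np.linalg.solve(K, yo - yo.mean())
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    # Lower confidence bound: favor low predicted energy and high uncertainty.
    lcb = mean - 2.0 * np.sqrt(np.clip(var, 0.0, None))
    lcb[evaluated] = np.inf       # never re-evaluate a candidate
    evaluated.append(int(np.argmin(lcb)))

best_energy = float(energies[evaluated].min())
print(len(evaluated), "evaluations, best energy", best_energy)
```

Only 28 of the 200 candidates are evaluated in this toy run; the early picks explore regions of high uncertainty, and later picks concentrate near the predicted minimum.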

Furthermore, the concept of transfer learning, in which learning results from other related tasks are reused, has been combined with kriging. We have confirmed that transfer learning can accelerate the search by a factor of approximately three compared to the original kriging.19 All of these investigations demonstrate that the ML methods of virtual screening, kriging, and transfer learning are powerful tools to accelerate interface structure searching.

Hyperspectral image data analysis through nonnegative tensor factorization

Another intriguing field of “nanoinformatics” involves applications of blind signal separation (BSS) techniques to experimental hyperspectral image data to separate physically interpretable overlapping spectral components and map their spatial distributions. Current digitally controlled scanning electron/transmission electron/probe microscopes (SEM, STEM, SPM, respectively) equipped with spectroscopic detectors enable automatic generation of large data sets consisting of laterally spatially resolved spectral intensities even at atomic resolution, which are expressed in three-dimensional (3D) tensor form (data cube). This data expression is particularly useful for extracting subtle chemical state changes associated with localized defects or impurities, such as inclined interfaces and trace elements, and also signals from data having low signal-to-noise ratios (SNRs).

The problem here is to achieve statistical isolation of a small number of basis spectra and their contributions at individual positions, under the assumption that the spectral intensity at each sample pixel is represented by a linear combination of the basis spectra associated with the underlying chemical states or phases, without any a priori knowledge other than the experimental data.
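The linear mixing model can be illustrated with plain nonnegative matrix factorization (NMF), one common BSS approach, using multiplicative updates. This is a generic sketch on synthetic data: the two Gaussian-shaped basis spectra, the random abundances, and the fixed number of components are assumptions, and no spatial orthogonality constraint is included.

```python
import numpy as np

rng = np.random.default_rng(0)
channels = np.arange(50)

# Synthetic data: each pixel spectrum is a nonnegative mix of two basis spectra.
S_true = np.stack([np.exp(-0.5 * ((channels - 15) / 4.0) ** 2),
                   np.exp(-0.5 * ((channels - 30) / 4.0) ** 2)])
C_true = rng.uniform(size=(400, 2))   # abundances at 400 pixels
X = C_true @ S_true + 0.01 * rng.uniform(size=(400, 50))

k, eps = 2, 1e-12
C = rng.uniform(size=(400, k))        # estimated abundances
S = rng.uniform(size=(k, 50))         # estimated basis spectra

errors = []
for _ in range(300):
    errors.append(np.linalg.norm(X - C @ S) / np.linalg.norm(X))
    # Multiplicative updates keep both factors nonnegative.
    S *= (C.T @ X) / (C.T @ C @ S + eps)
    C *= (X @ S.T) / (C @ S @ S.T + eps)
errors.append(np.linalg.norm(X - C @ S) / np.linalg.norm(X))

print(f"relative residual: {errors[0]:.3f} -> {errors[-1]:.3f}")
```

The rows of `S` are the recovered basis spectra and the columns of `C` are their spatial abundance maps, up to an arbitrary scaling and ordering of the components.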

In order to extract basis spectra from spatially highly mixed data, we have developed a spatially orthogonally constrained nonnegative matrix factorization (NMF) algorithm20 and demonstrated successful applications to chemical state analyses of the cathodes of lithium-ion batteries.21,22 As a further solution for strongly mixed data, we have proposed signal subspace sampling (SSS), which can be used to convert the original data set into a sampled set, preserving the original information and better satisfying the recovery conditions for subsequent BSS methods compared to other techniques; this is particularly effective in cases involving strong spatial and/or spectral mixing.23

NMF can be extended to tensor form to process data sets having higher-order dimensions (e.g., those concurrently obtained through use of more than one type of spectroscopic method at each sampling point). The tensor-form NMF, nonnegative tensor factorization (NTF), can be used to obtain deeper and more accurate insights (which are often unique) than those resulting from analysis of a single data source.24,25 In NTF, each data point of a 3D data cube, \(x_{klm}\), is described as \(x_{klm} = \sum_i \lambda_i a_k^{(i)} b_l^{(i)} S_m^{(i)}\) for a given number of components in a manner that minimizes the sum of the squares of the residuals. Each of the terms in the sum is a data cube, where the information along each mode can be described by a single vector, as shown in Figure 3. A successful application example is given in Reference 26, where significant spectral information is extracted from low-SNR data.
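The decomposition above can be sketched with multiplicative updates for a three-way nonnegative factorization. This is a generic textbook-style scheme on a synthetic tensor, not necessarily the algorithm of References 24 to 26; the tensor sizes and the number of components are assumptions. The weights \(\lambda_i\) are recovered at the end by normalizing the columns of the factor matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic nonnegative data cube built from two components.
A_true = rng.uniform(size=(30, 2))   # mode k
B_true = rng.uniform(size=(20, 2))   # mode l
S_true = rng.uniform(size=(25, 2))   # mode m
X = np.einsum("ki,li,mi->klm", A_true, B_true, S_true)

n_comp, eps = 2, 1e-12
A = rng.uniform(size=(30, n_comp))
B = rng.uniform(size=(20, n_comp))
S = rng.uniform(size=(25, n_comp))

def rel_error(A, B, S):
    """Relative residual of the current rank-n_comp reconstruction."""
    return np.linalg.norm(X - np.einsum("ki,li,mi->klm", A, B, S)) / np.linalg.norm(X)

err_initial = rel_error(A, B, S)
for _ in range(200):
    # Each update lowers the squared residual while keeping the factor nonnegative.
    A *= np.einsum("klm,li,mi->ki", X, B, S) / (A @ ((B.T @ B) * (S.T @ S)) + eps)
    B *= np.einsum("klm,ki,mi->li", X, A, S) / (B @ ((A.T @ A) * (S.T @ S)) + eps)
    S *= np.einsum("klm,ki,li->mi", X, A, B) / (S @ ((A.T @ A) * (B.T @ B)) + eps)
err_final = rel_error(A, B, S)

# Pull the scale of each component into a weight lambda_i.
na, nb, ns = (np.linalg.norm(M, axis=0) for M in (A, B, S))
lam = na * nb * ns
X_hat = np.einsum("i,ki,li,mi->klm", lam, A / na, B / nb, S / ns)

print(f"relative residual: {err_initial:.3f} -> {err_final:.3f}")
```

Each term of the sum is the outer product of one column from each factor matrix, which is the single-vector-per-mode structure described above.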

Figure 3

Schematic representation of the nonnegative tensor factorization concept for a 3D tensor case.24–26 The tensor X stores two types of spectroscopic data concurrently recorded at the sampling points of (x, y). Each data point of a 3D data cube, \(x_{klm}\), is described as \(x_{klm} = \sum_i \lambda_i a_k^{(i)} b_l^{(i)} S_m^{(i)}\) for a given number of components. Each of the terms in the sum is a data cube, where the information along each mode can be described by a single vector, the resolved matrices, S, A, B storing isolated spectral components and their spatial abundances, respectively.

Informatics design of catalytic nanoparticles

One of the primary thrusts of computational materials design is the use of data mining algorithms coupled with DFT calculations, allowing for accelerated and high-throughput calculations.27–29 From this enlarged database, systems are screened for those with calculated properties meeting design requirements. However, an issue that must be addressed is the lack of data density and diversity, resulting in modeling performed only in those regions of the chemical search space where data exist. This results in iterative design of new chemistries because the design is principally done in the data regions where the governing physics are already well defined. An alternative design approach is not to create as much data as possible, but rather to identify the targeted data for transformational design, as has been demonstrated in prior publications.30–35 The objective of this approach is to identify (1) where new data are needed and (2) where additional data do not add new information. This leads to developing computationally efficient design rules based on the extraction of “hidden” physics within the existing knowledge base. This approach thereby guides future experiments and calculations, while simultaneously shrinking the data search space.

As an example, in the design of nanocatalysts (i.e., the design of catalytic chemistries as a function of constituent atom chemistry, atomic neighborhoods, and the site of reaction), the d-band center, dc, of the adsorbent can be used to represent the adsorbate’s binding energy (BE).36,37 However, the relationships between dc and DFT calculated BE do not account for the edges of nanoparticles, but rather assume crystallographic planes. Therefore, expanding this work to the nanoscale requires additional descriptors to capture the BE-chemistry relationships. The descriptors considered include surface strain, electronegativity difference between nanoparticle elements, charge transfer between the surface and subsurface, and weighted elemental descriptors such as atomic number, work function, and melting temperature.

To close the nanodesign gap, we have developed and employed a hybrid informatics methodology that (1) assesses a larger descriptor space; (2) ensures that the descriptor space used for high-throughput modeling captures the governing physics; (3) identifies the minimal amount of data/descriptors needed to represent this physics; (4) develops quantitative structure–property relationships (QSPRs), which link the descriptors (describing the nanoparticle chemistry) with the property of interest (in this case BE); and (5) applies the QSPRs to a “virtual” material search space, which can then be rapidly screened.30 Beyond providing a computationally tractable and data-driven modeling approach, the QSPR also provides physically meaningful relationships by defining how the individual components impact the target material properties. With these models, we can predict the properties for massive search spaces and for materials that are difficult to model via quantum mechanical approaches.

The nanoinformatics design steps are highlighted in Figure 4, where principal component analysis (PCA) was used to assess the correlations in the data, to identify the minimal amount of information required for design, and to ensure that the known physics is sufficiently captured.30 This last point is critical to ensure that the QSPR is not solely a statistical result, but rather is physically driven even if that physics is not fully quantitatively defined. From this descriptor space, QSPRs were developed with high accuracy and robustness for BE of CO and H molecules. The robustness was ensured through the use of cross validation, leading to a proper tradeoff between accuracy and robustness. From the QSPR, we were able to expand the knowledge base to multicomponent nanoparticles,30 as shown in Figure 4. Starting from knowledge of BEs of 11 elements, we predicted with high accuracy the BE of 242 nanoparticles, with the number of potential chemistries expanding significantly as we relax the uncertainty limitations. This demonstrates the potential for significant expansion of the knowledge base and allows us to enter design spaces that have previously been prohibitive to explore.

Figure 4

Informatics-driven expansion of nanocatalyst knowledge base. (a) Dimensionality reduction approach was used to identify the target descriptors and catalytic reactions, and to ensure that the descriptor base captures the governing physics. (b) From the reduced descriptor space, quantitative structure–property relationships were developed for predicting the binding energy (BE) as a function of nanoparticle chemistry, which expanded the data measurements by an order of magnitude with high confidence. Reprinted with permission from Reference 30. © 2014 AIP Publishing. Note: PC, principal component; Dc, d-band center of the bimetallic cluster; Γmelt, average melting temperature; c, strain on the surface plane; W, average work function; Z, average atomic number; δEN, difference in electronegativities of the metals composing the bimetallic.
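The two steps in the caption, dimensionality reduction followed by a cross-validated QSPR, can be sketched as follows. The six correlated synthetic descriptors, the two latent factors, the linear regression model, and the five-fold split are assumptions for illustration; they are not the descriptors or the model of Reference 30. For brevity, the PCA is fitted once on all samples rather than inside each fold.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 120

# Six correlated descriptors generated from two hidden factors (toy data).
latent = rng.normal(size=(n, 2))
mixing = np.array([[1.0, 0.8, 0.5, 0.0, 0.3, -0.6],
                   [0.0, 0.4, -0.7, 1.0, 0.9, 0.2]])
X = latent @ mixing + 0.05 * rng.normal(size=(n, 6))
y = latent @ np.array([1.0, -0.5]) + 0.05 * rng.normal(size=n)  # toy binding energy

# PCA on standardized descriptors via the singular value decomposition.
Xs = (X - X.mean(0)) / X.std(0)
_, s, Vt = np.linalg.svd(Xs, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)
scores = Xs @ Vt[:2].T               # reduced descriptor space (two PCs)

def fit_linear(Z, t):
    """Least-squares linear model with an intercept."""
    coef, *_ = np.linalg.lstsq(np.column_stack([Z, np.ones(len(Z))]), t, rcond=None)
    return coef

def predict_linear(Z, coef):
    return np.column_stack([Z, np.ones(len(Z))]) @ coef

# Five-fold cross validation of the QSPR built on the two leading PCs.
sq_err = []
for fold in np.array_split(rng.permutation(n), 5):
    mask = np.ones(n, dtype=bool)
    mask[fold] = False
    coef = fit_linear(scores[mask], y[mask])
    sq_err.extend((predict_linear(scores[fold], coef) - y[fold]) ** 2)
cv_rmse = float(np.sqrt(np.mean(sq_err)))

# Apply the QSPR to a "virtual" search space of new descriptor vectors.
X_virtual = rng.normal(size=(1000, 2)) @ mixing
scores_virtual = ((X_virtual - X.mean(0)) / X.std(0)) @ Vt[:2].T
be_virtual = predict_linear(scores_virtual, fit_linear(scores, y))

print(f"variance explained by 2 PCs: {explained[:2].sum():.3f}, CV RMSE: {cv_rmse:.3f}")
```

The explained-variance ratio indicates how few components are needed to represent the descriptor set, and the cross-validated error guards against a QSPR that fits the training data but does not generalize to the virtual search space.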

Summary

In this article, we have reviewed recent progress in our ideas and in the tools used for nanoinformatics and materials informatics, as well as actual applications of ML techniques for materials problems. They include descriptions of materials properties, discovery of new inorganic compounds, exploration of potential energy surfaces for characterizing atomic transport, interface structure search, hyperspectral image data analysis, and design of catalytic nanoparticles. Consequently, nanoinformatics is expected to accelerate the exploration of frontiers in materials science and promote the integration of information and utilization of accumulated knowledge regarding nanostructures for the design and innovation of actual materials.