Introduction

Machine learning (ML) methods promise to accurately and automatically analyze large datasets at high speed, revolutionizing materials characterization workflows. Many state-of-the-art ML tools rely on supervised learning techniques, in which models are trained on large amounts of data annotated with the features of interest. The performance of supervised ML models, like neural networks, depends directly on the contents and generating distribution of the dataset used for model training, and, importantly, such models have been shown to extrapolate poorly beyond their training datasets1 and to have limited out-of-distribution generalization behavior2,3. Developing robust ML models for automated analysis of transmission electron microscopy (TEM), a versatile technique for structural and functional materials characterization at the atomic scale, thus requires large image datasets which fully cover the experimental imaging conditions and the variety of samples being imaged. However, manually producing sufficiently large and diverse sets of well-annotated experimental data to train robust, generalizable ML models can be extremely labor-intensive and creates the possibility for both human and experimental biases to negatively impact model performance during deployment. With limited experimental data, it is also difficult to investigate any failures or biases of an ML workflow arising from the choice of data used.

The prohibitive cost of producing high-quality, well-annotated experimental data for supervised learning tasks makes data simulation an attractive alternative for developing effective machine learning models. Synthetic datasets produced through materials simulations offer several distinct advantages over their experimental counterparts. In particular, high-throughput simulation can create arbitrarily large datasets covering the full range of experimental conditions with ground-truth, physics-based annotations, at a lower cost and while avoiding human bias and error in selecting and annotating relevant data. Synthetic data generation also enables consistently reproducible end-to-end model development workflows and the ability to precisely isolate data stream effects—both positive and negative—on trained models. The remaining challenge is then to choose suitable and sufficient datasets for developing ML models to achieve a scientific task of interest, making it important to understand how the data we use influences the accuracy and quality of the scientific inferences we make and how such data curation decisions trade off ML model performance against model development costs.

In the study of nanomaterials, developing suitable and robust machine learning models for experimental use while training only on simulation data requires producing accurate, experimentally similar synthetic data for a sufficiently diverse and representative set of atomic structures, as dictated by experimental needs. Recent applied ML advancements for TEM have successfully utilized simulated training data to train neural network models to analyze crystalline scanning TEM (STEM) and STEM diffraction data4,5, segment and analyze high-resolution TEM (HRTEM) of 2D material structures and nanoparticles6, and denoise HRTEM micrographs7. These prior achievements, enabled by modern TEM simulation methods which can perform large-scale, high-throughput, experimentally realistic simulations across a variety of modalities8,9,10,11,12, have been limited in scope to just a few types of atomic nanostructures5,6,7 or to only periodic crystals4. The narrow set of atomic structures used for training fundamentally limits the application scope of these models and workflows. To lift these scope limitations and enable a wider range of experimental use cases, we need flexible atomic-structure generation tools for simulation that better mimic the variety of complex structures seen in experimental data. However, modern software tools for computational materials science tasks13,14,15 are not primarily designed for complex structure generation and do not facilitate the precise description and generation of arbitrary distributions of complex, nanoscale atomic objects, thus inhibiting the sampling of sufficiently numerous and diverse atomic structures for training datasets.

In this work, we develop an end-to-end workflow for training ML models to perform atomic-resolution image segmentation of experimental HRTEM data of nanomaterials using only large, high-quality synthetic datasets comprising complex, defected atomic nanostructures sampled via a robust structure generation tool. To achieve this, we develop Construction Zone (CZ) (Fig. 1), an open-source software package which enables algorithmic and high-throughput sampling of arbitrary atomic structures, which is then combined with HRTEM simulation to generate metadata-rich databases with physics-based supervision labels. We demonstrate that CZ can be used as a core component in a flexible, all-purpose simulation framework for producing high-quality synthetic materials structures from complex distributions while still offering complete control of the structure generation process.

Fig. 1: Diagram of the modular structure of Construction Zone.

a Atoms are supplied by Generator objects; subtractively carved into convex objects by Volume objects; and combined into Scenes, in which multiple objects interact. A Transformations module provides both standard symmetry operations and more complex modifications to Generators and Volumes. The rest of the package includes miscellaneous utilities for interfacing with other software, premade structural archetypes, and tools for surface analysis and modification. b Example structures generated in Construction Zone, including a multi-grain core-shell oxide nanoparticle with strain-mediated grain alignment42 (left), a heavily-faceted gold nanoparticle on a carbon substrate decorated with molecular ligands (center), and a series of gold nanoislands on a bilayer of MoS2 forming Moiré heterostructures43 (right). The Construction Zone package is available on GitHub at https://github.com/lerandc/construction_zone, with documentation available at https://construction-zone.readthedocs.io/.

We then narrow our focus to image segmentation of nanoparticle systems on amorphous substrates—for which we have well-described, comparable experimental benchmark data—in order to carefully study the data-dependent behavior of ML models used in HRTEM analysis. Image segmentation is a common prerequisite for a variety of nanomaterial characterization tasks in HRTEM, used both to identify and to quantitatively describe the size and shape of regions of interest in nanomaterials such as crystalline regions, crystal structures, atomic interfaces, and atomic defects. Compared to natural images, or even TEM images with directly interpretable contrast such as those taken with HAADF-STEM, HRTEM micrographs taken at ultra-high magnification have complex contrast features which cause classical image segmentation approaches to fail16. Utilizing our data curation framework, we study statistically precise relationships between aspects of the training data—including simulation fidelity, structural composition, and diversity of imaging conditions—and image segmentation performance, aggregating the performance results of several hundred neural networks. We benchmark the performance of our neural-network-based image segmentation models on a series of experimental HRTEM micrographs of clusters of Au and CdSe nanoparticles imaged at ultra-high magnification and evaluate data curation strategies and data-efficient training methods for achieving state-of-the-art neural network performance.

Results

Atomic structure and HRTEM image database generation

Models developed with ML methods for TEM analysis need to be adapted to the specific atomic structures one plans or expects to analyze in experiments; the training dataset should therefore include examples of a wide range of likely atomic structures, so that any experimentally observed atomic structures lie close to the distribution of structures used for training. When curating a dataset to train robust HRTEM analysis models, one needs to effectively and simultaneously capture both broad, high-level aspects of atomic structures and the fine details that fully describe an individual structure. For example, an atomic structure may belong to a broader family of related structures, such as core-shell nanoparticles with similar sizes and chemistries. The same structure can also be described with unique and specific details, such as the placement of a defect plane or the orientation of its lattice with respect to the electron beam. To fully capture and utilize these complementary structural details, we have developed Construction Zone (CZ), an open-source Python package for building arbitrary atomic scenes at the nanoscale. CZ is designed to be robust to general use cases, such that any complex nanostructure can be made, whilst also facilitating a flexible, programmatic workflow, such that large distributions of similar objects can be generated quickly and easily, and structure-generating code is easy to interpret, reuse, and repurpose.

CZ relies on a simple module structure (c.f. Fig. 1a) that combines atomic placement (Generators), nano-object creation (Volumes), and nano-object interaction (Scenes). Structures can be further manipulated or generated with the Transformation class, which contains methods like standard symmetry operations, or by using convenience routines and analytical tools from the auxiliary modules, including functionality for atomic surface analysis and modification. The package derives some of its core functionality from other open-source materials science software packages, namely, PyMatgen13, the Atomic Simulation Environment14, and WulffPack15, and interfaces seamlessly with common simulation workflows. By allowing users to specify atomic features like defects and zone-axis orientations with a high-level, materials-focused interface, CZ enables easier generation of both specific and random nanoscale atomic structures. We showcase some example complex atomic structures ranging from nanoparticles to 2D heterostructures, generated entirely with CZ, in Fig. 1b. In our study, CZ allows us to sample and quickly generate a large number of random, similar nanoparticle structures with complex defects and varying zone-axis orientations, mimicking the collection of nanoparticles that might be imaged in a typical HRTEM experiment, whilst also tracking metadata for each structure, enabling us to draw fully specified training data distributions for machine learning model development.
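As an illustration of this programmatic workflow, the short sketch below builds a single spherical Au nanoparticle. It is a minimal sketch only: the module layout follows Fig. 1a, but the specific constructors and arguments shown (Generator.from_spacegroup, Sphere, Volume, Scene) are indicative rather than authoritative, and the CZ documentation should be consulted for the exact API.

# Indicative sketch of a CZ-style structure build; method and argument names are
# illustrative and may differ from the released API (see the CZ documentation).
import numpy as np
from czone.generator import Generator          # supplies atoms on a crystalline lattice
from czone.volume import Volume, Sphere        # convex region that accepts atoms
from czone.scene import Scene                  # aggregates interacting objects

# FCC gold generator (space group 225, a = 4.08 Angstrom)
gen = Generator.from_spacegroup(Z=[79], coords=np.array([[0.0, 0.0, 0.0]]),
                                cellDims=4.08 * np.ones(3), cellAngs=[90, 90, 90], sgn=225)

# Clip the generator into a 2.5 nm-radius spherical nanoparticle
particle = Volume(alg_objects=[Sphere(center=np.zeros(3), radius=25.0)], generator=gen)

# A Scene collects objects (e.g., the particle plus a substrate Volume), resolves their
# interactions by generation precedence, populates atoms, and writes them to file.
scene = Scene(bounds=np.array([[-40.0, -40.0, -40.0], [40.0, 40.0, 40.0]]), objects=[particle])
scene.populate()
scene.to_file("au_nanoparticle.xyz")

Because the build is expressed programmatically, the radius, orientation, and defect content can be drawn from arbitrary distributions inside a loop, which is how the large structure databases described below are sampled.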

Here, we use CZ as a random structure generator alongside high-throughput TEM simulation to create large, synthetic datasets which we use to train neural networks via supervised learning techniques. These trained neural networks are then evaluated on the benchmark task of nanoparticle segmentation on experimental HRTEM micrographs of Au and CdSe nanoparticles on amorphous carbon substrates16. For each image, the goal is to classify each pixel as either part of a nanoparticle or substrate. Due to the subtle, complex interplay of contrast effects under HRTEM imaging at atomic resolution, classical image analysis techniques like Fourier filtering fail to segment nanoparticles accurately, and are outperformed by neural networks trained on manually-labeled experimental micrographs16. Unsupervised techniques, such as using k-means clustering for pixel classification based on intensity, have only been successfully deployed for TEM images of nanoparticles taken at lower magnification17, where amplitude contrast dominates the signal, atomic lattice texture is not present, and data tend to be less noisy, thus further motivating high-accuracy supervised learning methods as a more successful route for image segmentation at atomic resolution.

To generate our synthetic dataset for training neural networks, we built a data generation pipeline, diagrammed in Fig. 2 and detailed more thoroughly in Supplementary Fig. 1, that begins with randomly generating several thousand spherical Au nanoparticles, placed atop unique amorphous C substrates, with random radii, orientations, locations, and possible twin defects or stacking faults, to account for structural diversity in the target micrographs. We utilize Prismatic8,9,10 to simulate HRTEM images using the multislice algorithm18 and to calculate their corresponding ground-truth supervision labels, i.e., the sets of pixels in the images where nanoparticles are located. For each structure, we simulate HRTEM output waves at 300 kV with a final resolution of 0.02 nm per pixel. From each simulation, multiple images of each nanoparticle structure are sampled under varying imaging conditions and noise. To facilitate targeted data curation when training neural networks, we extensively track and aggregate metadata at each phase, so that specific distributions of simulated data can be easily drawn from the full database.
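The sampling step of this pipeline can be summarized by the following minimal sketch, which draws one structure specification and records it as metadata; the field names and value ranges are illustrative placeholders rather than the exact schema used in our database.

# Minimal sketch of per-structure parameter sampling and metadata tracking in the spirit
# of Fig. 2; the field names and value ranges are illustrative, not the exact schema.
import json
import uuid
import numpy as np

rng = np.random.default_rng(0)

def sample_structure_record():
    """Draw one nanoparticle/substrate specification and record it as metadata."""
    n_defects = int(rng.integers(0, 3))
    return {
        "id": str(uuid.uuid4()),
        "radius_nm": float(rng.uniform(1.0, 2.5)),                 # spherical Au particle size
        "zone_axis": rng.standard_normal(3).tolist(),              # random orientation w.r.t. beam
        "substrate_thickness_nm": float(rng.uniform(2.0, 5.0)),    # unique amorphous C slab
        "n_planar_defects": n_defects,
        # twin defects and stacking faults are sampled with equal probability
        "defect_types": rng.choice(["twin", "stacking_fault"], size=n_defects).tolist(),
        "position_frac": rng.uniform(0.2, 0.8, size=2).tolist(),   # location on the substrate
    }

# Each record later accrues simulation and post-processing metadata, so that specific
# training distributions can be filtered out of the full database.
database = [sample_structure_record() for _ in range(2048)]
print(json.dumps(database[0], indent=2))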

Fig. 2: Data processing pipeline from generated structures to training-ready data.

For each structure, we separate the atoms into subsets used for the training image and for the supervision label, simulate HRTEM image formation with Prismatic8, and use post-processing to sample imaging conditions and noise and to generate the segmentation mask. Metadata are accrued at each step, and the data are stored in a series of staged databases.

Image segmentation of nanoparticles with supervised neural networks

Utilizing our data generation pipeline, we examine how aspects of the training set of simulated HRTEM images, as induced via data curation, affect neural network segmentation performance on experimental HRTEM images. By evaluating both general trends and more granular effects of dataset characteristics on neural network performance, we identify data curation strategies for training high-performance ML models for experimental HRTEM image segmentation using only simulated data. To isolate the effects on model performance due to characteristics of the training dataset, we fix our neural network architecture and optimization hyperparameters and train multiple UNet networks19 for each data condition. For a given data condition, training data are drawn I.I.D. from the simulation database, such that each network has both unique random initializations of learnable weights and independent data streams. Details regarding the composition and relevant sampling routines of each training dataset are described in Supplemental Section 1, alongside a small selection of example images drawn from our training datasets (c.f. Supplementary Fig. 3).
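For concreteness, the following sketch shows one way such metadata-conditioned, I.I.D. draws can be implemented against an aggregated image database; the column names, condition string, and toy database are hypothetical placeholders.

# Sketch of drawing an i.i.d. training subset from the aggregated image database
# for one data condition; column names and values are illustrative placeholders.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy stand-in for the aggregated metadata table produced by the pipeline in Fig. 2
db = pd.DataFrame({
    "image_path": [f"img_{i:05d}.npy" for i in range(20000)],
    "dose_e_per_A2": rng.choice([100, 200, 400, 600], size=20000),
    "substrate_thickness_nm": rng.uniform(2.0, 5.0, size=20000),
})

def draw_training_subset(db, condition, n_images, seed):
    """Filter the database to one data condition, then sample images i.i.d. without replacement."""
    pool = db.query(condition)
    return pool.sample(n=n_images, replace=False, random_state=seed)

# Each network trained for a given condition gets its own seed, hence an independent data stream
subset_net0 = draw_training_subset(db, "dose_e_per_A2 == 400", 1024, seed=0)
subset_net1 = draw_training_subset(db, "dose_e_per_A2 == 400", 1024, seed=1)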

In order to understand our model performance in the context of the intrinsic variability of HRTEM data, we benchmark our neural network performance against three previously published atomic-resolution experimental datasets16,20, taken on two different aberration-corrected TEMs by different operators at ultra-high magnification (about 0.02 nm/pixel), since contrast mechanisms in HRTEM are highly dependent on sample thickness, chemical composition, structure, and experimental conditions. The first dataset comprises images of large Au nanoparticles (5 nm) and agglomerates21; the second, small Au nanoparticles (2.2 nm)22; and the third, small CdSe nanoparticles (2 nm)21. Simultaneous benchmarking on several different experimental datasets provides an opportunity to analyze how robust our trained models are to distributional shift and to determine the generality of training data effects. The previous best results on these datasets, obtained with neural networks trained on experimental data, are F1 scores of 0.89, 0.75, and 0.59 for the large Au16, small Au20, and CdSe16 datasets, respectively.

Neural networks can be trained to segment experimental HRTEM datasets with moderate accuracy even with small simulated datasets containing only 512–1024 images, as indicated by our results in Table 1. Models trained on simulated data optimize performance on the simulated training data quickly and to an extremely high degree of accuracy—frequently achieving F1-scores above 0.90 on the (simulated) validation dataset in a small number of epochs (c.f. Table 1, rightmost column)—whereas their performance on experimental data increases more slowly, continually improving throughout training, even after performance on simulated data has apparently saturated (c.f. Supplementary Fig. 13). Thus, performance on simulated data is not a reliable signal for performance on experimental data, and benchmarking is crucial for successful model deployment. Model performance on validation data can lag behind training performance, with the lag shrinking as dataset size increases, similar to delays observed for models trained on small, algorithmically generated datasets23. In Fig. 3, we visualize characteristic segmentation performance of four neural networks, across a range of accuracies, on the large Au dataset. Poorly performing segmentation models (Fig. 3b, c), after training on just simulated data, can accurately predict segmentation regions on nanoparticles with clear lattice fringes, but might miss similar particles in other micrographs and/or lose significant performance when predicting segmentation regions for more complex structural textures and particles with many grains. Better networks have smoother, more consistent predictions (Fig. 3d, e) but still might miss regions in particles with more complex structures or might have high-frequency spatial fluctuations in their predicted regions (Fig. 3d, rightmost column), which are not physically consistent with nanoparticle structures, potentially indicating that important noise features or aspects of the imaging conditions are not fully captured during data curation. Across the board, neural networks seem to segment nanoparticles more consistently when the particle (or particle grain) has visible lattice fringes, indicating that trained models can distinguish ordered lattice textures from other regions.

Table 1 Best performance from neural networks on segmentation of nanoparticles in HRTEM images after training on various sets of simulated data, as measured on three experimental datasets
Fig. 3: Characteristic performance of neural network models analyzing experimental images of Au nanoparticles after training on only simulated data.

Models shown were trained on the Baseline, Substrate, and Optimized Au datasets and represent median- to strongly-performing examples. Scale bar is 2.5 nm.

The overall cost of the model development process depends on both the cost of procuring effective training datasets and the cost of training the models themselves, and compromises must often be made between acceptable costs and the end quality of the obtained models. As is well known, neural networks trained with supervised learning methods can be greatly improved by using larger training datasets, which can cost much more to obtain and can cause model training time to increase substantially. In Fig. 4a, we measure the performance of neural networks on the large Au experimental dataset after training on datasets of increasing size, where the training data are drawn uniformly at random from all of the aggregated databases (c.f. Supplementary Section 1). Segmentation performance saturates at an F1-score of about 0.9 after about 8000 images are included in the training dataset, after which there are only marginal improvements due to dataset size alone. With smaller datasets, model performance can be highly variable, and the gap between the worst and best trained models steadily decreases as the dataset size increases. By tuning the quality and composition of the simulated data, we can train comparably accurate models with more data-efficient curation strategies, as, for example, with the networks trained on structures with varying substrate thicknesses (c.f. Table 1 and, for more details, Supplementary Table 2), indicated by the gold circle in Fig. 4a.

Fig. 4: Dependence of neural network model performance on dataset size and noise.

Effect of training dataset size (a) and noise from applied electron dose (b) on neural network performance, as measured on experimental data of large Au nanoparticles. Networks in (a) were trained on a uniformly randomly chosen subset of all simulated data in this study, excluding the ‘Optimal’ datasets. Networks in (b) were trained on data drawn from the ‘Thermal’ and ‘Dose Variation’ datasets. The gold circle indicates the performance of the best neural network trained on a dataset containing only 1024 images, which were drawn from a simulated dataset comprising structures with varying substrate thicknesses. For each dataset condition, five randomly initialized neural networks were trained and measured.

At a fixed dataset size, model performance can be highly sensitive to the noise level of the training dataset, as shown for models trained on datasets varying only in applied electron dose in Fig. 4b. On the small Au and CdSe experimental datasets, we find that model performance maintains this sensitivity, but now peaks at lower dosages, with a slow fall-off in performance as noise decreases (c.f. Supplementary Figs. 9 and 10), indicating that noisier data are important for models to learn from. The performance boost observed when using noisier data can potentially be seen as a regularization effect: including noisier data in the training dataset can help ensure that model predictions are stable with respect to noisy perturbations in experimental data. This effect is clearer when measuring the performance of our networks across all of the simulated datasets of varying noise, where the models trained on the lowest-dose data tend to have stable performance across all higher dosages, too (c.f. Supplementary Fig. 6). Differences in the qualitative performance trend across noise levels between the large Au data and the other experimental data might be a result of differing noise distributions in the images arising from differing camera statistics, indicating that a pure Poissonian noise model may not fully match the observed noise distributions in our experimental benchmarks. Given the sensitivity of model performance to the noise level of the training dataset, we recommend sampling noise from wider distributions, so that during training models see examples from a range of signal-to-noise conditions; this can improve the consistency of model training but still requires a careful choice of the noise distribution (c.f. Supplementary Fig. 2). Our results appear to be consistent with prior work on a similar nanoparticle segmentation task24 with a different experimental geometry, in which nanoparticles are mounted on a crystalline substrate and are imaged over vacuum. Given that, in our task, the imaging beam passes through both the nanoparticle and the (amorphous) substrate and the effective electron dose is higher (200–600 e− per Å2 across all datasets), it is likely that the relationship to noise is more complex and that the minimum experimental dose that can be consistently segmented is larger, i.e., a higher signal-to-noise ratio is required for our task.
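As a concrete illustration of the recommendation above, the sketch below applies the scaled-Poisson dose model described in the Methods at a dose sampled per image from a broad range; the dose range shown is a stated assumption, and the pixel size matches our simulations (0.02 nm per pixel, i.e., 0.2 Å).

# Sketch of shot noise at a per-example dose sampled from a broad range, rather than
# a single fixed value; the Poisson model follows the Methods, the range is illustrative.
import numpy as np

rng = np.random.default_rng(0)

def apply_dose_noise(intensity, dose_e_per_A2, pixel_size_A=0.2):
    """Scale a normalized HRTEM intensity to expected counts per pixel and sample Poisson noise."""
    counts_per_pixel = dose_e_per_A2 * pixel_size_A ** 2   # expected electrons per pixel
    noisy_counts = rng.poisson(intensity * counts_per_pixel)
    return noisy_counts / counts_per_pixel                 # back to normalized intensity units

image = np.ones((512, 512))              # stand-in for a simulated HRTEM intensity
dose = rng.uniform(100.0, 600.0)         # draw the dose for this example (e- per Å^2)
noisy_image = apply_dose_noise(image, dose)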

In high-throughput simulation settings, instead of producing many examples over varying noise conditions, individual examples in the training dataset can be made more informative by improving the quality of the TEM simulation, which can improve the robustness of the desired dataset and shift the source of development costs. In our case, improving the fidelity of HRTEM image formation by including the effects of inelastic scattering due to thermal vibrations, losses due to plasmonic excitations, and/or residual aberrations in the optical alignment of the microscope has a significant, positive effect on neural network performance. In Fig. 5, we show the shift in performance of neural networks in the data-scarce regime (N = 512) when trained on data with combinations of applied thermal effects, residual aberrations, and plasmonic losses, as compared to a baseline dataset with no such effects. Visually, the impact of these effects on the simulated image can be subtle (Fig. 5c, d), yet, when added to the simulation data used for training, model performance can increase by as much as 0.1–0.15 in F1-score (Fig. 5b). Importantly, in a regime of stable optimization, including simulation effects appears to be helpful across a wide range of applied dosage only when all additional effects are included. Of these effects, applying thermal effects is the most computationally expensive, as it requires averaging several HRTEM wavefunctions over a set of independent frozen phonon configurations, which linearly increases simulation costs, post-processing time, and memory requirements, though, typically, only a small number of frozen phonons (\(\mathcal{O}(10)\)) are needed to thermally converge TEM simulations of larger atomic structures. Applying the effects of residual aberrations and plasmonic losses is comparatively cheap, and these effects should thus be included in training datasets whenever possible.
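The cost structure of thermal averaging can be made explicit with a minimal sketch: one multislice pass is required per frozen-phonon configuration, and the final image is the incoherent (intensity) average over the resulting exit waves. The exit_wave_for function below is a hypothetical placeholder standing in for a full multislice call.

# Minimal sketch of frozen-phonon thermal averaging; exit_wave_for is a placeholder
# for one multislice pass over a thermally perturbed copy of the structure.
import numpy as np

def exit_wave_for(phonon_seed, shape=(256, 256)):
    """Placeholder returning a complex exit wave for one frozen-phonon configuration."""
    local = np.random.default_rng(phonon_seed)
    return np.exp(1j * local.normal(scale=0.1, size=shape))

n_phonons = 8                            # O(10) configurations typically suffice
image = np.zeros((256, 256))
for seed in range(n_phonons):            # total cost grows linearly with n_phonons
    psi = exit_wave_for(seed)            # one full multislice simulation per configuration
    image += np.abs(psi) ** 2            # incoherent (intensity) average
image /= n_phonons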

Fig. 5: Dependence of neural network model performance on simulation fidelity.

Baseline neural network performance with the cheapest simulation methods (a) and the effect of improving simulation fidelity on network performance, conditional on applied dose, including effects from thermal averaging, residual aberrations, and plasmonic losses (b). Visual examples of baseline quality (c) and applied simulation effects (d) on a simulated image of Au nanoparticles at 300 kV with focal spread and 400 e− per Å2. Training data for these models were drawn from the ‘Baseline’, ‘Thermal’, and ‘Simulation effects’ datasets.

The relationship of dataset composition—such as diversity of atomic structures or imaging conditions—to model performance is more nuanced; in particular, aspects of dataset composition seem to be more important for controlling the variance of model performance than for raising performance ceilings. That is, by including images of nanoparticles from a wide variety of structures or imaging conditions, we can far more reliably ensure that a randomly initialized model will optimize well and learn to segment experimental data effectively. In Fig. 6, we measure the performance of models trained on simulated datasets of fixed size (N = 1024) with varying atomic structural content and imaging conditions. Networks trained on datasets comprising images of samples with varying substrate thicknesses performed better and with lower variance than networks trained on a single, fixed substrate thickness (Fig. 6a, b), indicating that diversity in the atomic structures seen during training is important for image segmentation. Regarding imaging conditions, relative defocus is known to have a strong nonlinear effect on contrast features in HRTEM, which is reflected in a highly variable, nonlinear relationship between model performance and the focal point of the simulated data (c.f. Supplementary Fig. 7). To address this, we can sample images from a larger variety of focal points (Fig. 6c, d), increasing the number of unique focal points (and, by construction, the focal range) seen during training while maintaining the total size of the dataset; this has a small positive effect on segmentation performance and a strong effect on reducing model variance. The number of unique nanoparticle atomic structures, however, does not significantly impact model performance (Fig. 6e, f), as long as a wide variety of imaging conditions is sampled in the training dataset. In general, these compositional effects are more significant when models are optimized more aggressively (Fig. 6a, c, e) by using a larger initial learning rate, and the extra data diversity can be interpreted as having a regularizing effect. Notably, regularization effects arising from an increase in data diversity may come into play only when the training dataset is already a suitable match to the experimental data being analyzed—i.e., when the distributions from which simulated training data and experimental data are drawn are similar. We find that these regularization effects are not noticeable when neural network performance is measured on the large Au or CdSe datasets (Supplementary Figs. 8 and 10), where performance is worse overall, possibly because these experiments measure nanoparticle structures that are structurally dissimilar to the simulated data used to train the networks with varying focal conditions. Changing the nanoparticle size in the training data can have a strong effect on performance (c.f. Supplementary Fig. 11), but this effect does not appear to be straightforwardly related to the size of the nanoparticles in the experimental images.

Fig. 6: Dependence of neural network model performance on dataset composition.

Effect of training dataset composition on neural network performance, at fixed noise levels and dataset size (N = 1024 images), varying a, b the number of thicknesses of the carbon substrate in the simulated structures, c, d the number of unique focal conditions in the training dataset, and e, f the number of unique structures in the dataset. Training data for these models were drawn from the ‘Varying Substrate’ and ‘Smaller Nanoparticles’ datasets.

Incorporating these lessons on data effects altogether, we can design an optimized simulated dataset to target maximal absolute performance on our experimental benchmarks—ultimately, the balance of dataset size, fidelity, and composition will be dictated by the needs of a particular experiment or analytical task, which can include constraints on model size, data curation time, and model time-to-train. Here, we focused on improving performance on both the large Au dataset and the CdSe dataset by increasing the variety of atomic structures seen, including all simulation effects mentioned previously, sampling a wider range of imaging and noise conditions, and using a modestly large training dataset (8000 total images). On these benchmarks, with slight re-tuning of the training hyperparameters, we achieve maximum F1 scores of 0.9189 and 0.7516 for Au and CdSe, respectively, whilst also achieving relatively strong generalization performance (c.f. Supplementary Fig. 12). When using a random 50/50 mixture of the optimized datasets, networks have strong generalization performance across the three datasets, but better peak performance is achieved only on the small Au dataset (c.f. Table 1). For details regarding the composition of these datasets, the training strategy, and performance measurements on other auxiliary datasets, please refer to Supplemental Sections 1 and 3.

Discussion

Our results, taken altogether, indicate that high-accuracy supervised models for analyzing atomic-resolution HRTEM experiments can be trained with sufficiently large, high-quality simulated databases. These synthetic datasets tend to need to be larger than what is typically curated experimentally (around 4000–8000 images), simulated with relatively high fidelity, and must contain appropriate dispersions of atomic structures (especially varying substrates), imaging conditions (particularly defocus), and applied noise to ensure consistent model performance. Critically, many observed effects of data curation strategies can be unnoticeable when networks perform poorly across the board or when the simulated dataset differs significantly from an experimental benchmark, as demonstrated by our tests varying the dataset composition at fixed dataset sizes (Fig. 6). With precise control of the prior distributions from which training datasets were drawn, we were able to isolate these successful data curation strategies for machine learning model development, a task which otherwise can become almost intractable due to the large observed variance of model performance under certain training dataset conditions.

The advantages of simulated data primarily arise from access to arbitrarily large datasets, a broader and controlled distribution of training data, and physics-based ground-truth measurements on the dataset. Our results indicate that, even when dealing with experimental data that arise from a highly complex measurement process, the use of simulated data during the development of ML models can compete with, and even provide distinct performance gains over, the use of experimental data. Alternative approaches for bridging the gap between simulated and experimental data could involve training complementary sets of neural networks to specifically learn how to model and apply noise effects and features not captured by simulation models, e.g., with CycleGANs25, though the extent to which such an approach could be as effective for HRTEM as for annular dark-field STEM is unclear. Well-curated experimental datasets are still immensely valuable—both as reference benchmarks, as used in this study, and as high-quality training data. An important caveat of our results is that all performance metrics shown here ultimately compare physics-based ground-truth labels against expert manual labels. Given the inevitability of human measurement error, bias, and uncertainty, the two sets of labels never match perfectly, leading to a performance gap in our models and thus an artificial performance ceiling. Further work is needed to better understand measurement error in the face of such distribution gaps and to determine how best to develop machine learning models that can make use of such mixed data given systematically different labels.

Ideally, during training, a model would have access to a simulated dataset drawn from a distribution identical to the experimental prior, with enough examples to provide adequate coverage of the experimental distribution in which it will be deployed for analysis. In this setting, ML models only make in-distribution predictions and never face extrapolation errors. In practice, it can be difficult to determine the exact experimental prior, suggesting instead a strategy of drawing training data from a distribution of which the experimental distribution could feasibly be a subset, thus reducing the adverse effects of distributional shift during model deployment. General-purpose tools like Construction Zone, which enable fast sampling of realistic synthetic examples for ML problems and are not limited to any specific domain, are a crucial component of robust data workflows that would enable such sampling strategies. Further, such general-purpose tools enable careful tuning of training datasets to best match experimental prior distributions, which can be performed manually or automatically, for example, as a component in an active learning training loop26,27. Curating the most suitable dataset for a particular ML problem is a crucial component of reliable ML workflows, but it should be pursued in conjunction with finding the best model framework, which can include developing bespoke model architectures28,29 or implementing custom loss functions30 with regularization31 specific to the problem of interest.

In sum, we have developed Construction Zone, a Python package that enables programmatic generation and sampling of atomic nanostructures, designed primarily as a general-purpose tool to help study and design ML workflows at scale for nanoscale materials science problems. In conjunction with HRTEM simulation, we systematically generate large structural and imaging databases for training ML models and demonstrate their utility for atomic-scale characterization with HRTEM. With these databases, we achieve state-of-the-art performance with such models, whilst retaining a high level of control over the data generation process, allowing ML tools to be designed and developed carefully.

Methods

Construction Zone

Nanoscale atomic structures can be designed in Construction Zone (CZ; available at https://github.com/lerandc/construction_zone) using a combination of Generator, Volume, Transformation, and Scene objects. Generators, which supply atoms in space, specify atom positions and can generate atoms in both crystalline and non-crystalline arrangements. Volume objects define the convex regions in which atoms from Generators are accepted. Volumes can be defined by sets of convex algebraic surfaces, convex hulls of point clouds, or intersections of supplied convex objects; non-convex geometries can be created as the union of multiple Volumes. The Transformation module manipulates objects with standard routines like symmetry operations and more complex routines like applying inhomogeneous strain fields or modifying the local chemistry of a structure. Generators and Volumes can be manipulated jointly or individually for full construction flexibility. For example, a grain can be oriented either by rotating a Generator lattice relative to its boundary or by rotating the whole object in global coordinates. Finally, Scenes aggregate objects together. Given a set of atomic objects, a Scene handles their interactions through a generation precedence scheme and prepares data for interfacing with other simulation methods or for writing to file.

CZ also features a small set of auxiliary modules for more complex functionality and convenience. The Surface module provides fast routines for analyzing, querying, and modifying the surface of a generated object using derivatives of alpha-shape algorithms32,33. The Utilities module provides several analysis routines, such as radial distribution function (RDF) analysis and orientation sampling. Lastly, the Prefab module contains routines that generate “pre-packaged” objects like Wulff constructions of nanoparticles (as implemented in ref. 15) or grain structures with planar defects. For a full discussion of underlying routines, available features, and package usage, we refer the reader to the CZ documentation, available at https://construction-zone.readthedocs.io.

TEM simulation

High-resolution TEM image simulations in this study were performed using the multislice algorithm18, as implemented in the Prismatic software package8. In the multislice algorithm, we model the image created in an electron microscope through the interaction of an electron beam and a material sample as the evolution of a 2D complex wavefunction, ψ(r). The evolution of the wavefunction is given by the Schrödinger equation for fast electrons34

$$\frac{\partial \psi(\mathbf{r})}{\partial z}=\frac{i\lambda}{4\pi}\nabla^{2}\psi(\mathbf{r})+i\sigma V(\mathbf{r})\,\psi(\mathbf{r}),$$
(1)

where V(r) is the electrostatic potential of the sample and σ is an interaction constant. V(r) is typically calculated with an isolated atom approach, where the total potential is the sum of independent atomic potentials

$$V(\mathbf{r})=\sum_{i} V_{i}(\mathbf{r})$$
(2)

where the individual atomic potentials are themselves calculated with a parameterized look-up table of electron scattering factors35,36. Alternatively, the scattering potential can be determined for a sample through ab initio techniques11. Once V(r) is determined, it is split into a series of binned slices along the beam direction, and the wavefunction ψ(r) is then evolved through a split-step method, in which the electron beam alternately interacts with the sample

$$\psi(\mathbf{r})=T(\mathbf{r})\,\psi_{0}(\mathbf{r})=e^{i\sigma V_{n}(\mathbf{r})}\,\psi_{0}(\mathbf{r}),$$
(3)

and then propagates in free space

$$\psi(\mathbf{r})=\hat{\mathcal{F}}^{-1}_{\mathbf{k}\to\mathbf{r}}\left\{e^{-i\pi\lambda\Delta z|\mathbf{k}|^{2}}\,\hat{\mathcal{F}}_{\mathbf{r}\to\mathbf{k}}\left[\psi_{0}(\mathbf{r})\right]\right\}.$$
(4)

After the wavefunction has propagated through the entire sample, we obtain the exit wavefunction, which can be further modified to apply defocus and other residual optical aberrations typically seen in HRTEM imaging by applying another transmission operation

$$\Psi(\mathbf{k})=\Psi_{0}(\mathbf{k})\,e^{-i\chi(\mathbf{k})}$$
(5)

where Ψ0(k) is the unaberrated wavefunction in Fourier space and χ is the aberration function. Aberration functions used to model imaging conditions in our study are detailed in Supplementary Table 1 and Supplemental Section 1.
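For reference, the following compact sketch implements the split-step loop of Eqs. (3) and (4) and the post-specimen aberration transfer of Eq. (5) with numpy; the potential slices and numerical constants are placeholders, and production simulations in this work use Prismatic.

# Compact sketch of the multislice split-step loop (Eqs. (3)-(4)) plus the post-specimen
# aberration transfer (Eq. (5)); potentials and constants below are placeholders only.
import numpy as np

def multislice_hrtem(V_slices, dz, wavelength, sigma, pixel_size, chi=None):
    """Propagate a plane wave through potential slices V_slices of shape (n_slices, ny, nx)."""
    _, ny, nx = V_slices.shape
    kx = np.fft.fftfreq(nx, d=pixel_size)
    ky = np.fft.fftfreq(ny, d=pixel_size)
    k2 = kx[None, :] ** 2 + ky[:, None] ** 2
    propagator = np.exp(-1j * np.pi * wavelength * dz * k2)      # free-space step, Eq. (4)

    psi = np.ones((ny, nx), dtype=complex)                       # incident plane wave
    for V in V_slices:
        psi = np.exp(1j * sigma * V) * psi                       # transmission, Eq. (3)
        psi = np.fft.ifft2(propagator * np.fft.fft2(psi))        # propagation, Eq. (4)

    if chi is not None:                                          # residual aberrations, Eq. (5)
        psi = np.fft.ifft2(np.exp(-1j * chi) * np.fft.fft2(psi))
    return np.abs(psi) ** 2                                      # HRTEM image intensity

# Placeholder potentials; at 300 kV, wavelength ~0.0197 Å and sigma ~6.5e-4 rad/(V·Å)
V = 0.1 * np.random.default_rng(0).random((50, 256, 256))
image = multislice_hrtem(V, dz=2.0, wavelength=0.0197, sigma=6.5e-4, pixel_size=0.2)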

Database generation with CZ and HRTEM simulation

We simulate an HRTEM data stream by first generating a large database of semi-realistic structures and then simulating micrographs for those structures under suitable imaging conditions. We coarsely approximate the variety of nanoparticles imaged in ref. 16 as a series of spherical Au nanoparticles of varying diameters with a small number of planar defects, which are equally likely to be twin defects or stacking faults. Each nanoparticle is placed upon a unique amorphous carbon substrate, at a random location on the substrate surface and at a random orientation. We then perform plane-wave multislice simulations of the structures at an acceleration voltage of 300 kV, focused to the center of the atomic model, at a resolution of 0.02 nm per pixel. We simulate output wavefunctions for both the full structure and the nanoparticle alone in vacuum. To create ground-truth segmentation masks, we threshold the phase of the output wavefunctions from the nanoparticles in vacuum, averaged over frozen phonons. Threshold values were determined heuristically. We found phase-thresholding to be much more stable and physically consistent across focal conditions than analogous intensity-thresholding techniques.
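A minimal sketch of this mask construction is given below, under one plausible reading in which the per-pixel phase is averaged over frozen-phonon configurations before thresholding; the threshold value shown is illustrative, not the heuristic value used in our pipeline.

# Sketch of phase-thresholding to build a ground-truth mask from the particle-in-vacuum
# exit waves; the averaging order and threshold value are illustrative assumptions.
import numpy as np

def segmentation_mask(vacuum_exit_waves, threshold=0.05):
    """vacuum_exit_waves: complex array (n_phonons, ny, nx) for the nanoparticle in vacuum."""
    mean_phase = np.angle(vacuum_exit_waves).mean(axis=0)     # phase averaged over frozen phonons
    return (np.abs(mean_phase) > threshold).astype(np.uint8)  # 1 = nanoparticle, 0 = background

waves = np.exp(1j * 0.2 * np.random.default_rng(0).random((8, 256, 256)))  # placeholder data
mask = segmentation_mask(waves)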

With the output wavefunctions, we then apply relevant focal conditions, residual aberration effects, focal spread, thermal effects, plasmonic losses, and electron dosage effects. Focal point, focal spread, and residual aberrations are applied to output wavefunctions using Eq. (5) with the appropriate aberration function; the effect of focal spread is approximated by incoherently averaging the wavefunction over a series of focal points distributed about the intended focal point. Thermal effects are included via the frozen-phonon approximation, in which we incoherently average several exit wavefunctions, corresponding to unique, perturbed copies of the original atomic structure, to approximate the effect of thermal motion. The atomic positions are perturbed by I.I.D. Gaussian noise, with amplitudes given by atomic Debye-Waller factors; for all simulations including thermal effects, we used eight frozen phonons. We approximate plasmonic losses in HRTEM imaging by applying a contrast reduction to the wavefunction intensities before averaging over frozen phonons. The contrast reduction is applied by taking a weighted pixelwise mean of the original intensity and an intensity with a constant added background (renormalized to have its original mean value). Noise from applied electron dosage is modeled by sampling count intensities from scaled Poisson distributions. We note that we do not include potential artifacts from the camera, i.e., via the modulation transfer function, though all experimental datasets were acquired with scintillator-based detectors, for which including effects from the modulation transfer function in the noise sampling process could further improve performance6,24. For more details regarding the composition and sampling of specific datasets, we refer the reader to Supplemental Section 1.
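The focal-spread averaging and plasmonic contrast reduction described above can be sketched as follows; the defocus-only aberration function, the defocus spread, and the weighting factor are illustrative assumptions rather than the values used in our pipeline.

# Sketch of two post-processing steps: incoherent focal-spread averaging and the
# plasmonic contrast reduction; parameter values are illustrative only.
import numpy as np

def apply_defocus(psi, defocus, wavelength, pixel_size):
    """Apply a defocus-only aberration, chi(k) = pi * wavelength * defocus * |k|^2, via Eq. (5)."""
    ny, nx = psi.shape
    kx = np.fft.fftfreq(nx, d=pixel_size)
    ky = np.fft.fftfreq(ny, d=pixel_size)
    chi = np.pi * wavelength * defocus * (kx[None, :] ** 2 + ky[:, None] ** 2)
    return np.fft.ifft2(np.exp(-1j * chi) * np.fft.fft2(psi))

def focal_spread_intensity(psi, focal_points, wavelength, pixel_size):
    """Incoherently average image intensities over defoci distributed about the nominal focus."""
    return np.mean([np.abs(apply_defocus(psi, df, wavelength, pixel_size)) ** 2
                    for df in focal_points], axis=0)

def contrast_reduction(intensity, weight=0.3):
    """Weighted pixelwise mean of the intensity and a flat-background copy with the same mean."""
    flat = intensity + intensity.mean()                 # constant added background (illustrative)
    flat *= intensity.mean() / flat.mean()              # renormalize to the original mean value
    return (1.0 - weight) * intensity + weight * flat

psi = np.exp(1j * 0.2 * np.random.default_rng(0).random((256, 256)))  # placeholder exit wave
defoci = 100.0 + np.linspace(-50.0, 50.0, 7)                          # Å, spread about nominal focus
image = contrast_reduction(focal_spread_intensity(psi, defoci, 0.0197, 0.2))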

Neural network training

For our neural network development, we used a standard UNet architecture16,19,37 with a ResNet-18 (ref. 38) encoder/decoder architecture and three pooling/upsampling stages, resulting in about 14 million trainable parameters (c.f. Supplementary Fig. 4). Models were trained using a mixed categorical cross-entropy and F1-score loss function on the segmentation masks. We performed a brief manual hyperparameter search to determine initial learning rates and learning rate schedules sufficient for stable optimization across a wide variety of the simulated datasets. For most networks whose performance is reported in this study, we used the Adam optimization algorithm39 with initial learning rates of 0.01 and 0.001, a constant learning rate decay of 0.8, and a batch size of 16 images; for more detailed effects of the batch size on training dynamics, we refer the reader to Supplementary Fig. 5. Unless explicitly indicated otherwise, results in the primary text reflect those of networks trained with an initial learning rate of 0.001. Models were trained for a fixed number of epochs (25) with no early stopping; the network parameters that achieved the lowest validation loss were saved via checkpointing. Input data were first normalized, per image, to a range of 0 to 1, and then augmented with orthogonal rotations and random left-right/up-down flips. For experimental data, images were processed with a 3 × 3 median filter and then normalized to a range of 0 to 1, per image, before evaluating neural network test performance. Experimental image patches containing only substrate regions were removed from the training dataset prior to preprocessing. For each simulated training data condition tested, at least five different networks were trained with different random weight initializations and training data subsets (drawn uniformly at random after appropriate filtering). For each network, relevant metadata were saved alongside the training history and model parameters. All training was performed using TensorFlow version 2 (ref. 40) with the Segmentation Models package41. Neural networks were benchmarked for performance on the three experimental datasets using the F1-score

$${\rm{F}}1=\frac{2f(x)\cdot \hat{y}}{\parallel f(x)\parallel +\parallel \hat{y}\parallel }$$
(6)

where f is the neural network model, x is the input image, and \(\hat{y}\) is the corresponding image label. When discretizing the prediction to binary values, the F1 score can be interpreted as a ratio of true positives (identifying nanoparticle region correctly) to a sum of true positives, false positives (identifying substrate as nanoparticle) and false negatives (identifying nanoparticle as substrate) such that \({\rm{F}}1=\frac{TP}{TP+0.5(FN+FP)}\). The F1 score measures 0 when all pixels are misclassified and 1 only when there are zero misclassified pixels. The non-discretized form is used for training and evaluation in this work.
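For reproducibility, the sketch below outlines a training configuration consistent with the description above, assuming the Segmentation Models (Keras) API and TensorFlow 2; the exact arguments, loss weighting, and input pipeline are indicative rather than the precise script used in this work.

# Hedged sketch of the training configuration; argument choices are indicative only.
import tensorflow as tf
import segmentation_models as sm

sm.set_framework("tf.keras")   # may be required depending on the installed version

# UNet with a ResNet-18 backbone; single-channel micrographs can be tiled to three
# channels or the input shape adjusted as needed.
model = sm.Unet("resnet18", classes=2, activation="softmax", encoder_weights=None)

# Mixed categorical cross-entropy + Dice (soft F1) loss on the segmentation masks
loss = sm.losses.CategoricalCELoss() + sm.losses.DiceLoss()

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=loss,
              metrics=[sm.metrics.FScore()])              # soft F1, as in Eq. (6)

callbacks = [
    tf.keras.callbacks.LearningRateScheduler(lambda epoch, lr: lr * 0.8),   # constant decay of 0.8
    tf.keras.callbacks.ModelCheckpoint("best_weights.h5", monitor="val_loss",
                                       save_best_only=True, save_weights_only=True),
]

# Training runs for 25 epochs with no early stopping; batches of 16 augmented image/mask
# pairs are formed in the input pipeline (train_ds, val_ds are tf.data datasets):
# model.fit(train_ds, validation_data=val_ds, epochs=25, callbacks=callbacks)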

Computational resources and benchmarks

All structure generation, simulation, data generation, and machine learning were performed on a workstation computer, using an Intel Xeon Gold 6130 CPU (16 cores, 256 GB RAM) and an NVIDIA Quadro P5000 GPU (16 GB RAM). Table 2 shows representative benchmark timings for each of the main compute tasks in our training pipeline, as logged by our data generation scripts. To understand how our generation pipeline scales with typical modern high-performance computing resources, we ran comparable data generation tasks on the Perlmutter supercomputer at NERSC, using a single CPU node comprising two AMD EPYC 7763 CPUs (64 cores each, 512 GB RAM total). Our database generation workflow can generally be run in an embarrassingly parallel fashion, where multiple processes are spawned to generate larger databases by, e.g., individually handling independent structures, simulations, or images; the timings in Table 2 represent the time per item when running in parallel. In our pipeline, filesystem performance can heavily influence the computational cost, since each intermediate output was saved to disk for reproducibility, reducing the effective parallelism of the approach, especially on the workstation. We suspect the time-to-compute, from the specification of a structure to a set of corresponding training images, can be reduced significantly by instead keeping intermediate results in working memory. The structure generation costs are dominated by the generation of the amorphous carbon substrate, which is both serial and difficult to parallelize, but which could be precomputed. The HRTEM simulation and image generation tasks are both dominated by fast Fourier transform (FFT) operations, in the propagation of the electron wavefunction and the application of focal spread, respectively, and thus benefit greatly from an optimized shared-memory FFT implementation.

Table 2 Performance timings for structure generation, HRTEM simulation, image generation, model training, and model inference using a single workstation and, where available, a single CPU node on the Perlmutter supercomputer at NERSC