
1 Introduction

The McGraw-Hill Dictionary of Scientific and Technical Terms defines synthetic data as any production data applicable to a given situation that are not obtained by direct measurement [1]. Prior to [2], the domain of statistics, especially population statistics, primarily viewed synthetic data as larger datasets that result from merging two or more smaller datasets [3, 4]. The earliest direct reference to synthetic data is a 1971 article describing the creation of tables of synthetic data for use in testing, modifying, and solving problems with marketing data [5]. Other works present methods for creating fully synthetic data based on observed statistics [6, 7]; predicting and testing observational outcomes [8]; generation driven by probability models for use in simulations [9]; and forecasting [10]. The reasons for generating synthetic data include software testing [11,12,13,14], population synthesis [15], hypothesis testing, and the generation of seed data for simulations [16, 17]. Recently, the major reason for generating synthetic data has been to limit the release of confidential or personally identifiable information inherent in the use of real data sources [13, 18,19,20]. Some synthetic data generation (SDG) approaches use real data either directly, or as seed data, in their SDG methods [11, 21, 22]. Caution should be exercised prior to release of such synthetic datasets, as a poorly designed or inappropriate model can still carry the risk of exposing confidential or personally identifiable information. Most contemporary research has focused heavily on data anonymisation, that is, isolating and replacing personally identifiable data with the concomitant goal of maintaining the integrity of the data that an organisation may wish, or be required, to publish [23]. Anonymisation has been dogged by modern methods for re-identification of anonymised data using a person's linkages to publicly available personal information sources, such as the electoral roll and newspaper articles [24,25,26]. As a result, some SDG methods also remain vulnerable to inverse methods and re-identification attacks that ultimately breach personal privacy.

It is not enough to generate random data and hope it will be suitable for the purpose for which it will be used [27]. The data values may be required to fall within a defined set of constraints. For example, a heart rate should be a numerical value that falls within the healthy resting (60–100), exercising (100–160) or disease state (40–60 or 160+) ranges. Some projects require increasingly complicated datasets where not only must the values of single attributes be valid, but all values and interrelationships must be indistinguishable from observed data [28, 29]. This is where the problem of realism becomes imperative, yet it remains unexplored in the current SDG literature [30]. The common sense implication of the term realistic is, as [31] succinctly puts it, synthetic data that becomes “sufficient to replace real data”. The property of realism brings a greater degree of accuracy, reliability, effectiveness, credibility and validity [22]. Most researchers recognise the need for realism [18, 22, 31]; however, many leave realism unexplored in their works, with only two authors giving it some attention [18, 19]. In both cases this was vague and limited to hinting that the aim of realism was that the synthetic data should be a representative replacement for real data [19], and comparably correct in size and distribution [18]. Neither addressed validation of realism in the synthetic data they created. This lack of research attention makes it difficult to imbue realism into SDG methods, and to verify success in doing so. Realism should only be asserted if it has been verified [32, 33]. Scientific endeavours should always be concerned with testing and verification, yet few published approaches present systematic ways to perform validation [34, 35]. We find many SDG methods that claim success in the absence of any systematic means of scientific validation [12, 36,37,38]. Some form of validation is necessary to support claims of realism in the resulting synthetic data [32, 38, 39]; otherwise, the reliability of the approach must be questioned [40]. This work addresses these challenges and presents the ATEN framework, which allows realism to be built into SDG methods while also incorporating validation of realism in the resulting synthetic data.
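As a concrete illustration of the value constraints mentioned above, the following sketch shows constrained generation and checking of a heart-rate attribute. It is a minimal, hypothetical Python fragment written for this chapter; the range names and the upper disease-state bound are assumptions, not part of any cited SDG method.

```python
import random

# Hypothetical physiological ranges in beats per minute; the upper bound for
# the high disease state (220) is assumed for illustration only.
HEART_RATE_RANGES = {
    "resting": (60, 100),
    "exercising": (100, 160),
    "disease_low": (40, 60),
    "disease_high": (160, 220),
}

def generate_heart_rate(state):
    """Draw a heart-rate value constrained to the range for the given state."""
    low, high = HEART_RATE_RANGES[state]
    return random.randint(low, high)

def is_valid(value, state):
    """Validation mirrors generation: the value must fall inside its range."""
    low, high = HEART_RATE_RANGES[state]
    return low <= value <= high

hr = generate_heart_rate("resting")
print(hr, is_valid(hr, "resting"))
```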

The rest of this chapter is organised as follows: First, a review of related works focusing on SDG methods and realism is presented. Second, the ATEN framework and its component approaches, namely THOTH, RA and HORUS, are covered in detail. Third, the ATEN framework is evaluated by applying it to the case of generating synthetic electronic healthcare records (EHRs) for labour and birth. Fourth and finally, the chapter is summarised and concluded.

2 Related Works

A literature search was conducted to identify works describing methods or approaches for synthetic data generation (n = 7,746). This collection was reduced to works that also used the terms realistic (n = 290) or realism (n = 6) in describing either the need or purpose for synthetic data, their method, or the resulting synthetic dataset. The resulting collection included works that identified realism as a primary concern in the generation of synthetic data generally [12, 22, 41], or that discussed developing synthetic data that would be sufficient to replace, or be representative of, real data [13, 19, 31, 42]. Due to the low number of works that identified realism as a factor in synthetic data, a random selection of excluded works was also reviewed. This review found that one third of SDG articles focused on common goals, namely authenticity [11], accuracy with respect to real structures [21], and the replacement of real data [43]. A key observation is the conspicuous absence in the literature of any investigation of realism for synthetic data, along with the lack of rigorous explanation of the approaches used to produce what authors claim to be realistic datasets. In the absence of a clear definition and framework for realism in the context of SDG, any process seeking to verify and validate realism in synthetic data is severely challenged.

Works in the literature present a common narrative for describing their SDG problem justification, operational method, and claimed results. This narrative consists of a common sequence of themes, each presented with two components; the themes are presented in Table 1. For the justification theme, research challenges include limited available data [44, 45] and privacy protection [37, 43]; uses include testing of learning algorithms [45], enabling release of data [43], and prediction [37]. The operation theme includes SDG inputs such as network structures [45], observational statistics [44], and configuration files [37]; methods ranged from random selection [45] and change-behaviour modelling [37], to stochastic simulation using Markov models [44]. The result theme covers actions such as benchmark and performance test simulation [45], comparative graphs [44], and performance analysis [46] used to assess published SDG methods. Resemblance to real networks [45], model advantages and capabilities [44], and likeness of the synthetic data to the synthetic scenario [37] were all reasons authors gave for claiming their SDG method was promising or successful.

Table 1. The common SDG narrative.

SDG approaches set the goal of simply producing synthetic data that is a suitable replacement for real data. The focus is heavily weighted toward the outcome, the synthetic data. Validation of realistic aspects of synthetic data tended to be absent, singular, or simplistic, ranging from direct comparisons of either the entire dataset or individual fields within the synthetic data against observations drawn from the real data [22], to graphical and statistical comparisons between the two [21, 44, 47]. The majority did not discuss validation at all [36, 48, 49]. Disclosure of the validation approach completes a research work and improves its understandability. It would also allow researchers to adequately assess whether or not a project met its goal, and whether the success claimed is truly justified [50]. This characteristic ensures that SDG experiments can be independently verified to the same standard as other scientific endeavours.

3 ATEN: The Framework for Realistic Synthetic Data Generation

It is common to see methodologies with multiple separate, combined, or sequential components presented as a framework [51]. This section presents the ATEN framework shown in Fig. 1. The ATEN framework is a synthesis of three interdependent component approaches, THOTH, RA, and HORUS, which, when used together, infuse realism into synthetic data. Each component of the ATEN framework seeks to answer the related questions in Table 2. The sections that follow describe each of the components of ATEN in detail.

Fig. 1. The ATEN framework [52].

Table 2. ATEN component aims.

3.1 THOTH: The Enhanced Generic Approach to SDG

A review of the way authors described data generation approaches yielded a generic four-step SDG approach, which incorporates the minimum common structural elements shared by all SDG methods. The approach is presented as a waterfall model, primarily due to its cumulative and sequential nature: the next phase is undertaken only upon completion of the previous one [53]. Verification, a required step of any scientific endeavour but one rarely seen in the context of SDG, can only occur during limited opportunities at the end of each step of the approach [53] and after the SDG operation is complete. The following paragraphs present the four-step SDG approach.

  1. Identify the need for synthetic data: This step involves recognising both the need and the justification, or reason, for creating synthetic data. The most commonly expressed justification across the contemporary literature was that the synthetic data being created was necessary to replace real data containing personally identifiable, sensitive or confidential information.

  2. Knowledge gathering: This step can involve a number of sub-steps assessing the requirements for the synthetic dataset being created. It usually begins with analysis of the data to be generated, identifying such things as the necessary fields to be generated, the scope, and any constraints or rules to be imposed.

  3. Develop the method or algorithm: It is not unusual for researchers to identify common solutions that have become preferred for a given research method or field; a method or algorithm that has drawn significant focused attention or is considered more reliable in producing a particular outcome. Many of these algorithms have operational steps or processes requiring focused attention, or for which data must be properly prepared. Developing the generation solution is as important as identifying the need, and the level of attention paid during this step has a direct relationship to the quality of the output.

  4. Generate the synthetic data: The process of generation involves presenting any seed data, conditional requirements, rules, and constraints to the generation algorithm that will perform the processes that output synthetic data.

This four-step approach represents a simple method, and simple methods are favoured for their usefulness, reduced complexity, and shorter experiment time, all of which reduce cost [54,55,56,57]. However, the approach suffers from the waterfall model weakness: flowing unidirectionally and lacking flexibility, so that any change in requirements, or issue identified, necessitates expensive and time-consuming redevelopment and retesting [58]. For this reason, a more adaptable and agile approach to SDG development should be encouraged. Pre-planning and preparation may mitigate the weaknesses of the generic SDG waterfall model, and this is where THOTH assists. THOTH encourages the synthetic data creator to perform decisive steps prior to engaging in the generation process. THOTH begins with characterisation, that is, identifying the level of synthetic-ness desired in the data to be generated. The synthetic-ness required of generated data can range from anonymisation of personally identifiable components in real data, through to truly synthetic data relying on no personally identifiable information during the creation process. The five primary characterisation types are shown in Table 3.

Table 3. Characteristics of synthetic data.

The characterisation level aids the second step, selection of the classification, or generation model, from the following five categories of synthetic generation methods: (i) data masking models that replace personally identifiable data fields with generated, constrained synthetic data [13, 43, 59]; (ii) those that embed synthetic target data into recorded user data in a method known as Signal and Noise [11, 18, 60]; (iii) network generation approaches that deliver relational or structured data [21, 41, 45]; (iv) truly random data generation approaches such as the Music Box Model [61]; and (v) probability-weighted random generation models such as the Monte Carlo [12], Markov chain [61], and Walker's Alias methods [62].
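To make category (v) concrete, the sketch below implements probability-weighted random selection using Walker's alias method. It is a minimal, self-contained Python illustration prepared for this chapter; the birth-mode labels and weights are invented, and the cited works may implement the method differently.

```python
import random

def build_alias(weights):
    """Preprocess weights into probability/alias tables (Walker's alias method)."""
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]
    prob, alias = [0.0] * n, [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] += scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:          # leftovers get probability 1
        prob[i] = 1.0
    return prob, alias

def sample(prob, alias):
    """Draw one index in O(1) time."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

# Hypothetical weights for birth modes, for illustration only.
modes = ["vaginal", "assisted", "elective_caesarean", "emergency_caesarean"]
prob, alias = build_alias([0.60, 0.12, 0.13, 0.15])
print(modes[sample(prob, alias)])
```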

When combined with the generic SDG approach discussed earlier, the result is the THOTH-enhanced generic approach shown in Fig. 2. With these steps complete, the synthetic data creator engages the remaining steps from the generic SDG approach described previously, but begins with an additional level of wisdom that comes from knowing where they are going (the level of synthetic-ness required of their efforts) and how they are going to get there (the informed selection of a generation model).

Fig. 2. THOTH integrated into the generic approach to SDG [52].

Summary of THOTH:

We found that a generic four-step waterfall approach is common to most SDG methods. This approach moves through identifying a need for synthetic data, gathering the knowledge necessary for its generation, and developing or customising an algorithm or generation method common to the domain or solution needs, before generating the synthetic data. Incorporating THOTH benefits the researcher, providing greater awareness of their requirements and guiding the direction of the overall synthetic data generation approach.

3.2 RA: Characterising Realism for SDG

RA provides a structured approach to identifying and characterising realism elements, or knowledge, for use in SDG. The RA process, including the steps of enhanced knowledge discovery, is shown in Fig. 3 and described in Table 4. RA identifies extrinsic and intrinsic knowledge following a logical progression of steps, with increased focus on elements drawn from [64,65,66,67]. The following subsections present the processes used within the KDD data mining in Step 5 of Table 4.

Fig. 3. Overview of the RA approach to realism in SDG [52].

Table 4. Enhanced KDD process following the RA approach [52].

RA: Extrinsic Knowledge

Extrinsic knowledge is the sum of quantitative and qualitative properties found in the real data to be synthesised. To be a suitable replacement, the synthetic data will need to adhere to these properties.

Quantitative Characteristics:

Real or observed data may in itself be statistical, and therefore quantitative, such as the patient demographic data shown later in Fig. 10. Even if it is not, it is often possible to identify quantitative knowledge. For example, consider generating a synthetic version of a spreadsheet of people who voted at a selection of polling booths, where the real data cannot be made public for privacy and confidentiality reasons. On the surface this may appear to be qualitative data; however, a number of statistical representations could be drawn from it, such as: (a) how many people of each genealogical nationality voted, (b) how many voted in each hour, (c) the percentage that were male, and (d) the percentage of the overall population, as found in census data, that voted at each polling booth, and so on.
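A minimal sketch of deriving such quantitative characteristics from record-level data is shown below. It is hypothetical Python written for this chapter; the field names and the four example records are invented and do not come from any real electoral dataset.

```python
from collections import Counter

# Hypothetical record-level data resembling the polling-booth example above.
votes = [
    {"booth": "A", "hour": 9,  "sex": "F", "nationality": "Maori"},
    {"booth": "A", "hour": 9,  "sex": "M", "nationality": "European"},
    {"booth": "B", "hour": 10, "sex": "M", "nationality": "Pacific"},
    {"booth": "B", "hour": 11, "sex": "F", "nationality": "European"},
]

# (a) voters per nationality, (b) voters per hour, (c) percentage male
by_nationality = Counter(v["nationality"] for v in votes)
by_hour = Counter(v["hour"] for v in votes)
pct_male = 100 * sum(v["sex"] == "M" for v in votes) / len(votes)

print(dict(by_nationality), dict(by_hour), f"{pct_male:.0f}% male")
```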

Qualitative Characteristics:

The qualitative characteristics of real or observational data should be identified and documented for any SDG project, but especially for projects seeking realistic synthetic data. One example of a qualitative characteristic is the database schema, which explains how the data is structured [68]. In the relational database example this includes expression of the tables, the fields within those tables, constraints such as those identifying the primary key or limiting field values, and any referential integrity constraints, or foreign keys [68].
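One way to document such schema knowledge so that it can later constrain generation and drive validation is sketched below. This is a hypothetical Python representation written for this chapter; the table, field names and constraints are invented and are not drawn from any real midwifery system.

```python
# Hypothetical, simplified schema description for one table of a midwifery EHR.
SCHEMA = {
    "patient": {
        "primary_key": "nhi_number",
        "fields": {
            "nhi_number": {"type": str, "required": True},
            "age":        {"type": int, "required": True, "min": 12, "max": 60},
            "ethnicity":  {"type": str, "required": True,
                           "allowed": {"European", "Maori", "Pacific", "Asian", "Other"}},
            "lead_maternity_carer": {"type": str, "required": False},
        },
        # Referential integrity: every delivery record must reference a patient.
        "referenced_by": [("delivery", "nhi_number")],
    },
}

def validate_record(table, record):
    """Return a list of schema violations for a (synthetic) record."""
    errors = []
    for name, rule in SCHEMA[table]["fields"].items():
        value = record.get(name)
        if value is None:
            if rule["required"]:
                errors.append(f"missing {name}")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: wrong type")
        if "min" in rule and value < rule["min"]:
            errors.append(f"{name}: below minimum")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{name}: above maximum")
        if "allowed" in rule and value not in rule["allowed"]:
            errors.append(f"{name}: value not permitted")
    return errors

print(validate_record("patient", {"nhi_number": "ABC1234", "age": 29, "ethnicity": "Maori"}))
```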

Summary: Extrinsic Knowledge:

These quantitative and qualitative observations of real data, once identified and documented, represent the characteristics that should be created and validated in synthetic data. This is especially true where authors assert a requirement for, or claim of, realism.

RA: Intrinsic Knowledge

Knowledge Discovery in Databases: Traditional methods of data mining often involved a manual process of scouring through databases in search of previously unknown and potentially useful information; these processes can be slow and an inefficient use of time [64, 66, 67]. Modern approaches, in which the human is augmented by machine learning or neural network algorithms, are considered more expedient for realising insights from today's extremely large datasets [64, 66, 67].

Concept Hierarchies:

Concept Hierarchies (CH) are a deduction of attribute-oriented quantitative rules drawn from large to very large datasets [69]. CH allow the researcher to infer general rules from a taxonomy, structured as general-to-specific hierarchical trees of relevant terms and phrases, for example: “bed in ward in hospital in health provider in health district” [67, 69, 70]. Developing a concept hierarchy involves organising the levels of concepts identified within the data into a structured taxonomy, reducing candidate rules to formulas with a particular vocabulary [69]. CH are used by RA to identify an entity type, the instances of that entity and how they relate to each other; they help to ensure identification of important relationships in the data that can be used to synthesise meaningful results [71].

Once the concept hierarchy tree is identified, a second pass across the source data is performed to provide an occurrence count for each concept. This second pass allows the researcher to enhance the concept hierarchy with statistical knowledge to improve accuracy of the generation model.
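The sketch below shows one way a statistics-enhanced concept hierarchy of this kind could be represented and populated by a second counting pass. It is a hypothetical Python fragment; the structure follows the childbirth example in Fig. 5, but the record list and resulting counts are invented for illustration.

```python
from collections import Counter

# A minimal sketch of a concept hierarchy, most general at the top,
# most specific at the leaves.
hierarchy = {
    "Childbirth": {
        "Caesarean": {"Elective": {}, "Emergency": {}},
        "Vaginal":   {"Unassisted": {}, "Assisted": {}},
    }
}

# Second pass over the source data: count occurrences of the leaf concepts.
records = ["Elective", "Unassisted", "Unassisted", "Emergency", "Assisted"]
leaf_counts = Counter(records)

def annotate(tree):
    """Attach an occurrence count to every node: leaves are counted directly,
    internal nodes sum the counts of their children."""
    out = {}
    for concept, children in tree.items():
        if children:
            sub = annotate(children)
            out[concept] = {"count": sum(c["count"] for c in sub.values()),
                            "children": sub}
        else:
            out[concept] = {"count": leaf_counts.get(concept, 0), "children": {}}
    return out

print(annotate(hierarchy))
```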

Formal Concept Analysis:

Formal Concept Analysis (FCA) is a method of representing information that allows the researcher to easily recognise concepts observed from instances of relationships between objects and attributes, for example, occurrences of different nosocomial infections across the wards of a hospital. FCA starts with a formal context represented as a triple (G, M, I), where G is a set of objects, M a set of attributes, and I the incidence relation between them [72]. A table is created displaying instances where a relationship exists between an object and its corresponding attribute(s).

Concepts, represented as rules, are then derived from the context table. For example, one might seek to identify the smallest or largest concept structures containing one particular object.

The second step of FCA involves creating the concept lattice. A concept lattice is a mapping of the formal context, or intersections of objects and attributes. The concept lattice allows easy identification of sets of objects with common attributes, as well as the order of specialisation of objects with respect to their attributes [73].
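The following hypothetical Python sketch illustrates the two derivation operators that underpin FCA and how they yield the smallest formal concept containing a given object. The ward and infection names are invented for this chapter and are not taken from any cited dataset.

```python
# A minimal sketch of a formal context (G, M, I): hypothetical wards (objects)
# and nosocomial infections observed in them (attributes).
context = {
    "Ward1": {"MRSA", "C.difficile"},
    "Ward2": {"MRSA"},
    "Ward3": {"C.difficile", "Norovirus"},
}
objects = set(context)

def intent(objs):
    """Attributes shared by every object in objs (the derivation operator on G)."""
    objs = list(objs)
    if not objs:
        return set.union(*context.values())
    return set.intersection(*(context[o] for o in objs))

def extent(attrs):
    """Objects possessing every attribute in attrs (the derivation operator on M)."""
    return {o for o in objects if attrs <= context[o]}

# The smallest formal concept containing a given object g is (extent(intent({g})), intent({g})).
g = "Ward2"
concept = (extent(intent({g})), intent({g}))
print(concept)   # extent {Ward1, Ward2}, intent {MRSA}
```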

Characteristic and Classification Rules:

[69] provides a set of strategies that can be used to learn characteristic and classification rules from within a dataset. These rules can be applied as constraints during generation, and later as tools to compare against the resulting synthetic data to validate its accuracy and realism.

Characterisation Rules:

The development of characteristic rules entails three steps. First, data relevant to the learning process is collected. All non-primitive data should be mapped to primitive data using the concept hierarchy trees as shown in Fig. 5 (e.g. Forceps would be mapped to Assisted, Elective would be mapped to Caesarean, and so on). Second, generalisation should be performed on components to minimise the number of concepts and attributes to only those necessary for the rule being created. In this way, the name attribute on a patient record would be considered too general and not characteristic of a set of data from which we could make rules about the treatment outcomes for a particular ethnicity. The final step transforms the resulting generalisation into a logical formula that identifies rules within the data. These rules are the sum of four elements, where if the values of any three of those elements are found to be consistent with the rule for a given instance in the dataset, the fourth element will always be true.

Classification Rules:

Classification knowledge discovery discriminates the concepts of a target class from those of a contrasting class. This provides weightings for the occurrence of a set of attributes for the target class in the source dataset, and accounts for occurrences of attributes that apply to both the target and contrasting classes. To develop a classification rule, first the classes to be contrasted, their attributes, and the relevant data must be identified. Attributes that overlap form part of the generalisation portion of the target class only. Attributes specific to a target class form the basis of classification rules.
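Both rule types can be represented as simple predicates over a record and reused, first as generation constraints and later as validation tests against the synthetic dataset. The sketch below is a hypothetical Python illustration prepared for this chapter; the attribute names and the specific rule instances (the low-risk monitoring rule and the repeat-caesarean rule, discussed concretely in Sect. 4) are assumptions rather than extracts from any cited implementation.

```python
def characteristic_rule(record):
    """Four-element rule: if any three elements hold for a record,
    the fourth must also hold; exactly three true means a violation."""
    elements = [
        record.get("sex") == "Female",
        record.get("pregnant") is True,
        record.get("pregnancy_status") == "Low Risk",
        record.get("fetal_heart_monitoring") == "Intermittent in Labour",
    ]
    return sum(elements) != 3

def classification_rule(record):
    """Target class: two or more previous caesareans => current birth caesarean."""
    if record.get("previous_caesareans", 0) >= 2:
        return record.get("current_birth") == "Caesarean"
    return True

def validate(dataset):
    """Return indices of synthetic records violating any rule."""
    rules = (characteristic_rule, classification_rule)
    return [i for i, r in enumerate(dataset) if not all(rule(r) for rule in rules)]

print(validate([{"sex": "Female", "pregnant": True,
                 "pregnancy_status": "Low Risk",
                 "fetal_heart_monitoring": "Continuous",
                 "previous_caesareans": 0,
                 "current_birth": "Vaginal"}]))   # -> [0], monitoring rule violated
```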

RA: Summary

The RA enhanced and extended KDD method identifies realistic properties from real data, providing improved input data quality and constraints that improve the output of generation algorithms used to create synthetic data. An obvious benefit is that generation methods using this knowledge should deliver data that is an accurate replacement for real data. Another benefit is a set of knowledge and conditions that can be used in validation of realism in the data created. Its use for this last purpose is discussed in the next section.

3.3 HORUS: An Approach to Validating Realism

One of ancient Egypt’s earliest precursor national gods, Horus, was revered as the god of the sky, that which contains both the sun and the moon. In the same way, the HORUS approach to realism validation draws on both THOTH’s enhanced generic SDG approach and RA’s enhanced KDD approach, effectively containing both the sun and moon as a means to validate realism in synthetic data.

The validation approach incorporates five steps that analyse separate elements of the SDG method and resulting synthetic data. These steps are identified as the smaller square boxes in Fig. 4, with their descriptions below. Collectively, the five steps provide the information necessary for confirmation of whether synthetic data is consistent with and compares realistically to real data that the SDG model seeks to emulate.

Fig. 4. HORUS approach to realism validation, showing touch points with THOTH and RA [52].

Input Validation:

Input validation concerns itself only with that knowledge coming from the generation specification in the form of data tables and statistics. The input validation process verifies each item, confirming that the right input data in the correct form is being presented to the generation engine, thus ensuring smooth operation of the data synthesis process [74]. Input validation is intended to prevent corruption of the SDG process [75].
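As a small illustration of this step, the sketch below checks that a set of input generation weights is complete, numeric, and sums to one before being handed to the generation engine. It is hypothetical Python written for this chapter; the ethnicity categories and weights are invented and are not the CMDHB statistics.

```python
import math

# Hypothetical input specification: ethnicity proportions used as generation weights.
ethnicity_weights = {"European": 0.22, "Maori": 0.28, "Pacific": 0.33,
                     "Asian": 0.14, "Other": 0.03}

def validate_inputs(weights, required_keys):
    """Return a list of problems found in the input statistics."""
    problems = []
    missing = required_keys - weights.keys()
    if missing:
        problems.append(f"missing categories: {sorted(missing)}")
    if any(not isinstance(v, (int, float)) or v < 0 for v in weights.values()):
        problems.append("weights must be non-negative numbers")
    elif not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-6):
        problems.append(f"weights sum to {sum(weights.values()):.4f}, not 1.0")
    return problems

print(validate_inputs(ethnicity_weights,
                      {"European", "Maori", "Pacific", "Asian", "Other"}))
```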

Realism Validation 1:

The first realism validation process verifies concepts and rules derived from the HCI-KDD process, along with any statistical knowledge that has been applied. Realism validation reviews and tests both the premise and accuracy of each rule to ensure consistency with the semantics of any data or guidelines used in their creation.

Method Validation:

Method validation reviews the efforts of others inside and outside of the research domain. Attention is drawn to methodological approaches common to that domain, as well as methods other domains have employed for similar types of SDG. Evaluating the entire scope of method application ensures that the method chosen is the most appropriate for the particular need and solution. Method validation also seeks to verify that the algorithm being used is correctly and completely constructed, and free of obvious defect [76].

Validation is not a search for absolute truth; more correctly, in this instance, it is a search to establish legitimacy [76]. Table 5 provides the six key questions that should be asked of any SDG methodology the researcher may propose to use.

Table 5. Method validation questions [77].

Output Validation:

Output validation evaluates the output data and verifies its basic statistical content. This step demonstrates the difference between the terms validation and verification: validation ensures the model is free from known or detectable flaws and is internally consistent [76], while verification establishes whether the output, or predictions, of the SDG model are consistent with observational data. The output validation step ensures that the synthetically generated data conforms to the quantitative and qualitative aspects derived during the knowledge discovery phase.
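A minimal sketch of such an output check is shown below: it compares category proportions in the synthetic output against the source statistics used to drive generation. The Python code, the field name, the percentages and the tolerance are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical source statistics (percentages) used to drive generation.
source_pct = {"European": 22.0, "Maori": 28.0, "Pacific": 33.0, "Asian": 14.0, "Other": 3.0}

def output_deviation(synthetic_records, field, source_pct):
    """Return the percentage-point deviation per category between the synthetic
    output and the source statistics."""
    counts = Counter(r[field] for r in synthetic_records)
    n = len(synthetic_records)
    return {k: 100 * counts.get(k, 0) / n - v for k, v in source_pct.items()}

# Tiny invented sample; in practice this would be the full synthetic dataset.
records = [{"ethnicity": "Maori"}, {"ethnicity": "European"},
           {"ethnicity": "Pacific"}, {"ethnicity": "Pacific"}]
deviations = output_deviation(records, "ethnicity", source_pct)
flagged = {k: d for k, d in deviations.items() if abs(d) > 2.0}  # assumed 2-point tolerance
print(flagged)
```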

Realism Validation 2:

The second realism validation process undertakes the same tests as the first, except that tests are now performed against the synthetic dataset. This test aims to ensure synthetic data is consistent with the knowledge (rules, constraints and concepts) previously derived from the input data and used in creation of the synthetic data. The second realism validation step is the most important for establishing, and justifying, any claim that this synthetic data presents as a realistic and proper substitute for the real data it was created to replace.

3.4 Summary: Benefits of the ATEN Framework

There are a number of ways that ATEN benefits those engaging in SDG. First, it is a complete SDG lifecycle that considers every element before, during and after data generation. Second, it encourages a more complete level of self-documentation than most presented in the SDG literature. The third benefit is cumulative from the first two: when applied during an SDG project, THOTH and RA provide the knowledge necessary to validate realism using HORUS. ATEN supports claims of success and realism, and enables repeatability, all of which are fundamental to the scientific method. Works found in the literature do not conform to the ATEN framework, as significant gaps are evident in most SDG literature. The framework provides for additional knowledge discovery and documentation processes, which could be automated; however, this is dependent on the type of data being analysed, the generation method, the synthetic data sought, and the use to which that data will be applied. The knowledge discovery component leads to greater accuracy and helps to support validation of realism.

4 Evaluating the ATEN Framework: The Labour and Birth EHR

This section evaluates ATEN by applying it to the domain of midwifery. While ATEN is intended to be generally applicable to any defined group of patients and chosen health problem or disease that has a Caremap, for the purposes of evaluating the ATEN framework this work focuses on the problem of generating the realistic synthetic EHR (RS-EHR) for only the delivery episodes of female patients giving birth in the Counties Manukau District Health Board (CMDHB) catchment area of Auckland in New Zealand. The practical advantages, to the authors, of focusing on delivery episodes for the purpose of this evaluation are that: (1) deliveries take relatively short periods of time; (2) comprehensive statistics are readily available that cover a long period of time; (3) clinical guidelines, as well as locally specified midwifery practice protocols derived from localisation of international clinical practice guidelines, are widely available; (4) the delivery episode can range from being very simple to very complex, with a wide variety of complicating factors that include the health of the mother and that of the baby; and (5) the authors had ready access to midwives on a regular basis throughout this research work. The rest of this section presents the prototype system, the results of evaluation, and a discussion of these results.

The labour and birth EHR contains a record of the labour and birth events, starting at the onset of labour and ending when delivery is complete and the newborn is presented to the parents. To generate the labour and birth EHR in such a way that realism is achieved, we apply the ATEN framework's components: THOTH, RA and HORUS. The next sections present this application, which leads to the synthetic labour and birth EHR with the realistic properties that are guaranteed by the ATEN framework.

THOTH combines the generic method for SDG with the pre-planning elements that characterise and classify the synthetic data being sought, in this case the synthetic labour and birth EHR. Table 6 summarises the application of THOTH to the labour and birth scenario, leading to the ingredients, method and context for the generation of the synthetic labour and birth EHR. In the context of the labour and birth EHR, the characterisation (truly synthetic data) was selected to meet the ideal that we do not rely on access to real EHRs in our generation approach. This ensures the highest degree of patient privacy as, unlike most other methods in this domain, no real patient records are necessary for this generation approach.

Table 6. Application of THOTH in the context of midwifery EHR generation.

Analysis of the SDG literature demonstrated that a probability-weighted random generation approach was more likely to generate the synthetic records required. Other methods, including the data masking and signal-and-noise models, required access to some amount of real (seed) EHR data, which ruled out their use in this example.

RA is the knowledge discovery and characterisation approach seeking to identify realistic elements of the data gathered during THOTH. Application of RA specifically to the Labour and Birth problem required identification of the care process (Caremap) for labour and birth, as well as its concepts and contexts.

Extrinsic Knowledge

Quantitative Properties: The quantitative properties in the domain of midwifery included a range of demographic statistics regarding the mother and baby. These were not as simple as looking at the examples in blue in Fig. 10, presented later in this section, and saying that 22% of mothers were European, or that 24% of mothers were aged between 20 and 24 years. There were interrelationships between these values that also needed to be modelled, for example that, of the 24% of mothers aged between 20 and 24 years, only 8% were identified as European. Other statistics included how many mothers delivered naturally versus by caesarean section, and the spread of clinical interventions across ethnicity, age, and gestation.

Qualitative Properties:

A range of qualitative properties were assessed within the knowledge gathered for generating midwifery EHRs. These included the structure of the source data being used, as well as the structure and appearance of the synthetic data to be presented on generation. A truncated example of how demographic data was structured in one midwifery EHR system is shown in Table 7. Other qualitative aspects might include: (a) logical internal consistency between the dates reported in different fields (last menstrual period, estimated due date, and so on), (b) whether fields have been misappropriated as placeholders for other data types, and (c) the completeness of fields within the dataset.

Table 7. Truncated example of demographic data structure in a midwifery EHR system [52].

Intrinsic Knowledge

Concept Hierarchy for Labour and Birth Domain: An extract, focusing on childbirth, from the concept hierarchy (CH) developed for the labour and birth domain is presented in Fig. 5. The general term Childbirth breaks down into the two modes by which birth occurs, Caesarean and Vaginal. As an example, Caesarean births break down further into the two specific types that occur: the elective, or requested/planned, caesarean and the emergency caesarean. In this way we move from the most general concept at the top to the most specific at the bottom. This is extended with the addition of quantitative statistics (in brackets) identified from the New Zealand Ministry of Health (MoH) source data.

Fig. 5. Concept hierarchy enhanced with statistics [52].

The CH provides structural understanding of primary or significant concepts, from most general to most specific, within the domain being modelled. In RA, the CH is also used to provide statistical understanding of the incidence of each concept. The CH provides constraints, or weights, that are applied during the generation phase, and also forms one of the components used to verify statistical accuracy, and in turn realism, in the resulting synthetic data.

Constraining Rules:

Characteristic Rule: Fetal heart monitoring is used in midwifery to assess the health of, and stress being suffered by, the baby. In the domain of midwifery, we found that only those pregnancies clinically described as low risk receive intermittent fetal heart monitoring; clinical practice guidelines (CPGs) necessitate continuous monitoring for a higher-risk pregnancy. Properties of this rule are expressed as the sum of four elements. The characteristic rule, expressed as a conditional formula, is shown in Fig. 6 and contains the values: Sex: Female; Pregnant: Yes; Pregnancy Status: Low Risk; Fetal Heart Monitoring: Intermittent in Labour. This rule was validated against, and found to be consistent with, the CPGs for several hospital birthing facilities in New Zealand.

Fig. 6. Example of a characteristic rule [52].

Classification Rule:

The CPGs for labour and birth provide that where an expectant mother has had a previous caesarean birth, she may still elect in a subsequent birth to (safely) attempt a vaginal birth (known in medical terms as a VBAC, vaginal birth after caesarean). However, where she has had two or more previous caesarean births, the obstetric team will counsel her to have only a caesarean birth, due to considerations of risk and safety for both mother and baby that result from the previous caesarean scars and potential stress on the uterus. Figure 7 provides an example of a classification rule showing that 100% of patients undergo a caesarean procedure for the current birth if two or more of their previous births have also been by caesarean section. This rule was successfully validated against the MoH labour and birth statistics, with the finding that it held across all births that occurred in New Zealand for that year.

Fig. 7. Example of a classification rule [52].

Characterisation rules describe reduced collections of generalised attributes for a class occurring together in the dataset, where for any query of the dataset specifying n−1 attributes from the rule, the remaining attribute is the only one that can be true.

Classification rules describe specific collections of attributes that differentiate one class from one or more remaining classes, where the target class is the only response to a query against the dataset specifying all of the attributes defined in the rule. These rules are used to constrain generation, ensuring consistency between the real world and the synthetic. They are also used during validation to identify instances where synthetic records may be inconsistent, for example, if the midwifery patient being generated was male.

Formalisation of Labour and Birth CPG into Labour and Birth Caremaps

The core set of constraints in the CoMSER Method [62] comprises CPGs, Health Incidence Statistics (HIS), patient demographic statistics and the Caremap, all formalised in an integrated way into a state transition machine (STM) following the process shown in Fig. 8 [51]. The STM is the constraint enforcement formalism for generating the RS-EHR entries satisfying the constraints.

Fig. 8. UML activity diagram: process of creating and integrating constraints into the State Transition Machine for the midwifery Caremap [62].

Figure 9 presents the UML State Diagram (USD) for the State Transition Machine (STM) that integrates the core constraints for generating the RS-EHR for delivery episodes within the Counties Manukau District Health Board (CMDHB) of Auckland, New Zealand (NZ).

Fig. 9. UML State Diagram that integrates constraints for generating the RS-EHR for delivery episodes within the CMDHB catchment area of NZ [62].

The transition from one state to the next is determined by pseudo-random selection of one state in the STM. Each state stores a health incident prevalence constraint formally specified as the 2-tuple <P, O>, where P is the total number of patients known to enter the state according to statistics for the CMDHB catchment area, and O is the number of patients expressed as a percentage of the immediately preceding parent state. The caremap formalised by the STM in Fig. 9 covers the midwifery delivery event, also referred to in this work as the delivery episode. The caremap begins temporally at the point where the pregnant patient is established as ‘in labour’. It follows the sequence of possible states, that is, clinical events or decisions or both, consistent with the locations, interventions and outcomes currently available to the patient or her treating clinicians, until the birth process concludes in one of the possible outcomes. Thus the Caremap, and hence its STM, form the basis of the integrated constraint framework and of the algorithm for RS-EHR generation.
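The following sketch illustrates, in hypothetical Python, how a constraint-driven walk over such an STM could generate the sequence of states for one delivery episode, with each transition chosen pseudo-randomly and weighted by the O component of the stored <P, O> tuples. The states, counts and percentages shown are invented and heavily truncated; they are not the CMDHB figures, and the code is not the CoMSER implementation.

```python
import random

# Hypothetical, heavily truncated fragment of a labour-and-birth STM.
# Each child entry stores (state name, P, O): P = patient count from the
# statistics, O = percentage of the immediately preceding parent state.
STM = {
    "In Labour":        [("Spontaneous Vaginal Birth", 4200, 62.0),
                         ("Assisted Vaginal Birth",     800, 12.0),
                         ("Emergency Caesarean",       1000, 15.0),
                         ("Elective Caesarean",         750, 11.0)],
    "Spontaneous Vaginal Birth": [("Normal Postnatal Care", 4000, 95.0),
                                  ("Postpartum Haemorrhage", 200,  5.0)],
}

def generate_episode(start="In Labour"):
    """Walk the STM, choosing each transition pseudo-randomly weighted by O,
    until a state with no outgoing transitions (an outcome) is reached."""
    state, episode = start, [start]
    while state in STM:
        children = STM[state]
        weights = [o for _, _, o in children]
        state = random.choices([name for name, _, _ in children], weights)[0]
        episode.append(state)
    return episode

print(generate_episode())
```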

In validating the midwifery RS-EHR, HORUS was applied, adhering to the steps as presented in Fig. 4. The following subsections describe the results observed.

Input Validation:

In creating the midwifery EHRs for the labour and birth event, we used CPGs along with treatment and outcome statistics. Input validation necessitated ensuring that statistics could be located or extracted that correctly applied to each part of the processes described in the CPGs. Cross-validation of those statistics was also performed through comparison against more than one source. Where any difference in terminology existed between input datasets, clinicians were involved to ensure these data were correctly linked together [62].

Realism Validation 1:

The first realism validation process verified both the premise and accuracy of each rule, ensuring consistency with the semantics of knowledge used in their creation, such as the CPGs discussed in the Input Validation example above.

The rules were tested in real circumstances to ensure they were not rendered irrelevant through interaction with the original source or observed data. Where any knowledge is at issue, the researcher should return to the knowledge discovery phase.

Method Validation:

Method validation for these midwifery EHRs concluded that the use of caremaps extended with descriptive rules and statistics, presented as State Transition Machines, and a probability weighted generation model were appropriate given the available input knowledge, purpose and output data required of the CoMSER model.

Output Validation:

As one example of output validation, statistical values from within the synthetic data were validated and verified against those identified in the knowledge gathered prior to generation. This comparison is shown as the orange line in Fig. 10, demonstrating that the values contained in the CoMSER synthetic midwifery records were consistent with the MoH statistics used in their production.

Fig. 10. A comparative quantitative example using patient demographics from the Ministry of Health (NZ) statistics with output validation from our prototype RS-EHR. (Color figure online)

Realism Validation 2:

In the example of the RS-EHR, if a synthetic patient were treated in a manner contradictory to the principles or application of a CPG, this could invalidate the entire dataset. Similarly, if seeking validation by clinicians, it may be necessary to present the synthetic EHR in a clinician-familiar manner.

Using the caremap STM in Fig. 9, the prototype system was used to generate midwifery RS-EHR for 1000 synthetic patients. Figure 11 presents a sample RS-EHR that has been generated by the CoMSER Method prototype. It should be noted that the column in the screenshot entitled “Node” indicates either the state or the transition in the STM from which the synthetic entry has been generated. The column has been inserted only for debugging purposes and may or may not be meaningful to the clinician.

Fig. 11. Sample realistic synthetic EHR generated by CoMSER.

A convenience survey of clinicians from New Zealand's midwifery discipline was conducted to assess the realism of synthetic records generated using the CoMSER prototype application. The survey instrument used a forced-choice Likert scale in which the clinician examined clinical and temporal notes independently and jointly. The realism survey questions posed to the midwife clinicians are found in Table 8.

Table 8. Realism survey [62].

A total of n = 45 randomly selected records were examined (15 records each by 3 clinician experts) to answer whether the synthetic EHR possessed the same qualities as the clinician would expect to find in an actual EHR. The results of this survey demonstrate that clinical and temporal notes, when examined independently, were judged to possess those qualities in 93% (Q1) and 93% (Q2) of the records respectively, while 87% (Q3) of the records were so judged when the notes were examined jointly. In assessing inter-rater reliability among the experts, inconsistencies between the RS-EHR and an actual EHR were identified in 0% (Expert 1), 7% (Expert 2), and 33% (Expert 3) of the records. This survey, involving practising midwife clinicians, indicates that realism is found in the majority of clinical notes and temporal information, examined both independently and jointly, in synthetic EHRs produced by the CoMSER prototype application. This analysis substantiates our claim that the characteristic of realism exists in the majority of RS-EHRs developed through the CoMSER method, demonstrating promising usefulness for secondary use.

Summary of Application of ATEN Framework

This section has presented the application of the ATEN framework to the generation of synthetic EHRs for the labour and birth domain. The most significant challenge in RS-EHR generation is ensuring that the generated RS-EHR is realistic. The prototype system for generating the RS-EHR for midwifery uses an integrated constraints framework, which is formalised using the State Transition Machine (STM). The guideline-based Caremap for the labour and birth domain for which the RS-EHR is to be generated is embodied within the STM. The computations that make up the RS-EHR are driven by STM execution using pseudo-random transition selection within defined frequency distributions based on local HIS. The quality of the generated RS-EHR is guaranteed by recognition and use of direct interaction with experienced and practising midwives. The development of methods and techniques for measuring the extent of the realistic properties of the generated RS-EHR was necessary. Generating RS-EHRs using publicly available health statistics and CPGs ensures patient privacy and confidentiality while also benefiting many uses, including research, software development and training. The ATEN framework provided a structured approach that ensured procedural steps and documentation were not overlooked, and that validation was a consideration from inception, through prototype production, to evaluation of the resulting synthetic EHR.

While all random number generation methods apply statistics, and can therefore be considered as applying in generation the statistics the researcher intends to find in the result, most still show some variation from the true values. Many set only one or two parameters (for example, heads or tails), which simplifies their models and limits potential variation in the expected result. Our method set a large number of constraints that all had to be within statistical limits, such as age, ethnicity, age at pregnancy, age at pregnancy by ethnicity, the type of birth, the incidence of each node in the caremap, and the overall patient outcomes. There were more than 15 variables, some interrelated, being handled by the SDG algorithm, each of which had to be statistically similar at the end of the generation cycle. Validation using HORUS has shown that the prototype system designed with THOTH and RA achieved the realism that the overall ATEN framework sets out to produce.

5 Discussion and Future Work

ATEN provides a comprehensive way to achieve realistic synthetic data through three inter-dependent approaches, THOTH, RA, and HORUS, that respectively cover (1) a generic approach to SDG with enhancement (THOTH); (2) knowledge discovery (RA); and (3) validation of realism in the resulting synthetic data (HORUS). To the best of our knowledge, no other work in this domain has produced a generic model for SDG, a framework for realism, or a unified approach to validation of synthetic data.

The main benefit of THOTH is that it guarantees the best possible plan for the generation method, as well as ensuring the best preparation of the knowledge elements and techniques to be used in creating synthetic data. The THOTH approach is easily implemented and comes with little resource overhead. A limitation of THOTH is the unidirectional, linear nature of its waterfall-type model; however, classification and characterisation may greatly mitigate the effects of this limitation.

With adherence to THOTH, the benefit of RA is assurance as to the quality of the synthetic data being created. This is achieved through establishment of the elements and characteristics that define realism for the generation project: the extrinsic quantitative and qualitative properties, and the intrinsic knowledge aspects that inhabit the input data. Another benefit of RA is that as additional items of input or seed data are introduced, the statistics, knowledge, constraints, and rules become further refined, increasing the potential accuracy and realism of the resulting synthetic data. A limitation is that RA is presently conducted manually, requiring the researcher to possess an eye for detail along with sound logic, analytical, and problem-solving skills.

The benefit of HORUS is that it is an inherently straightforward model for validation and verification of synthetic data. HORUS identifies rules, constraints, or data that may be causing issues and reducing the accuracy, realism, and utility of the synthetic data being delivered. It is possible that fewer SDG iterations may be required, significantly reducing the time taken to produce accurate and realistic synthetic data. No directly comparable works were located during this research; the closest relatable work encountered was that of [35], which presented four separate approaches to validation of synthetic data produced in the domain of computational modelling. Each of these approaches appears, even in that author's own summation, not to be representative of a single validation solution. The strength of HORUS is that it represents a single operational validation solution. HORUS has a significant limitation in that it is wholly dependent on having already engaged RA to identify the statistics, knowledge, and rules that will be significant in assuring that the synthetic data is suitably representative. Another limiting issue, identified in the case study conducted in this work, is that where the extrinsic quantitative aspects of the synthetic data are found wanting, continued engagement in the HORUS validation approach, looking at the intrinsic knowledge, rules and constraints, may be of little additional benefit until those extrinsic issues are resolved.

There are a number of avenues open for future work, including use of ATEN during the entire lifecycle of a significant real-life SDG project; such a project is warranted because every SDG project reviewed during this research was incomplete in some respect. It would necessitate the considered operation of a new SDG project in which every element was documented rigorously, and in which two streams were conducted concurrently. In the first, or normal, stream, the SDG project would operate in the manner that the majority do now, following the generic SDG approach described in Fig. 2; no input or other validation steps would be taken, and realism would be given no more consideration than it is in the majority of SDG cases reviewed. In the second stream, another researcher would collect the same input materials and documentation from the first and use them to follow the complete and validated SDG approach described in this work, ameliorating the input materials and generation method through operation of ATEN. Both synthetic datasets could then be validated using HORUS. Another avenue for future work would be the development of machine learning models to automate some or all of the KDD and validation.

6 Summary and Conclusion

This chapter has presented and demonstrated the ATEN framework, a triangle of three interdependent approaches: THOTH, RA and HORUS. The triangle is one of the strongest structures seen in engineering and nature. The components communicate with their adjacent neighbours, each enhanced through interaction with and engagement of its counterparts. THOTH provides framework and approach knowledge that improves RA; RA provides the extrinsic and intrinsic knowledge to seed HORUS; and the results of engaging HORUS either identify where an issue may exist in the first two, and therefore target where additional work is required, or confirm their successful operation and therefore justify the claim of realism in the synthetic data.

The approach proposed in this work, first, draws on, expands and enhances established methods to produce a complete end-to-end validation solution. This ensures a complete analysis of the source data leading to useful knowledge, which greatly improves the generation method and leads to better realism in the synthetic data. Second, the knowledge gathered prior to synthetic data generation provides a solid base with which to validate the synthetic data, ensuring its ability to actually replace real data. Third, the approach presented here is simple, intuitive and not overly burdensome, with many of the component steps being activities that data synthesisers may already be undertaking, albeit in an unstructured or unconsidered way.