Introduction

No significant differences in clinical outcomes such as infant mortality and morbidity have been reported among low-risk pregnant women giving birth at home, in a midwifery unit, or in an obstetrics unit1,2,3,4,5. However, childbirth care practices for women with low-risk pregnancy vary according to birth setting, medical professional, and organizational system. Low-risk women who plan a birth at home or in a midwifery unit are more likely to have a vaginal birth and to receive fewer unnecessary medical interventions than women with planned births in an obstetrics unit6. In addition, women receiving midwife-led continuous care from the same midwife or team of midwives from pregnancy until the early parenting period report greater satisfaction7. In all cases, it is critical to refrain from unnecessary interventions, such as caesarean sections and episiotomies8,9,10.

To improve quality of care, quality indicators have been widely used in many clinical fields. A quality indicator is defined as “a measurable element of practice performance for which there is evidence or consensus that it can be used to assess the quality, and hence change in the quality, of care provided”11,12. Quality indicators for maternal and perinatal hospital care have been developed mainly for high-risk pregnancy using the consensus method13,14,15,16.

In Japan, 98% of women give birth in hospitals17, where midwife-led continuous care for low-risk women is monitored by obstetricians. Among midwives, 87% work at hospitals and clinics18. Midwives in Japan are not legally allowed to perform interventions such as episiotomy, epidural anaesthesia, oxytocin infusion, and instrumental delivery; if necessary, obstetricians from the same hospital provide emergency care. Additionally, care for low-risk pregnancy and childbirth is not covered by insurance in Japan, so no healthcare claims are issued for these types of care. Clinical practices that are covered by the national insurance system can be administratively monitored using claims data; however, data for these low-risk pregnancies are neither publicly accumulated nor evaluated, and types of care that are not included in a claims database have not been adequately investigated with respect to quality improvement. To improve this situation, we focused on clinical data available from medical records as the most practical basis for quality improvement in each medical facility. Against this background, to assess the quality of childbirth care provided for women with low-risk pregnancy who give birth in a hospital, we developed and later updated care quality indicators using existing clinical practice guidelines and quality indicators19,20. In this study, we aimed to demonstrate the applicability of these care quality indicators for planned hospital births among women with low-risk pregnancies in Japan.

Methods

Study design

This was a retrospective study of medical records.

Study setting and participants

The study was conducted in one urban and one suburban hospital in Japan. Both hospitals have a perinatal medical centre, a key facility that provides perinatal and postnatal care to the surrounding area, with units and teams that can treat serious illness in an emergency. One hospital was affiliated with a university; the other was a private general hospital. As both hospitals have a midwifery unit and an obstetric unit in the same ward, low-risk pregnant women can select either midwife-led continuous care or obstetrician-led care from pregnancy through the postpartum period. Low-risk pregnancy has no widely accepted definition; in our previous articles, as here, we defined it as “a pregnant woman with no particular high-risk factors or complications”19,20. Women who plan to give birth in a midwifery unit receive at least three prenatal check-ups, one in each trimester, at which obstetricians check for abnormalities in the woman or the infant. If necessary, emergency care is provided by obstetricians in the same hospital.

Using a retrospective medical records review, we collected data on women admitted for delivery in the participating hospitals during the study period of April 1, 2015 to March 31, 2016. The inclusion criteria were as follows: determined to have a low-risk pregnancy during the second trimester, and selection of planned hospital birth with midwife-led continuous care in a midwifery unit from pregnancy until the early parenting period. The exclusion criteria were as follows: aspects or complications of high-risk pregnancy such as multiple pregnancy and premature birth < 37 weeks’ gestation, elective caesarean section before the onset of labour, no antenatal care, or declining to participate in this study. This study used only existing medical records. In accordance with the current Ethical Guidelines for Medical and Health Research Involving Human Subjects in Japan and the Declaration of Helsinki, we made the study open by posting study information in the participating hospitals. This study, including the waiver of individual informed consent, was approved by the Ethics Committee of Kyoto University Graduate School and Faculty of Medicine (No. R0442), the Ethics Committee of Morinomiya University of Medical Sciences (No. 2015-29), and the Ethics Committee of Nara Medical University (No. 1269).

Outcome and evaluation

Quality indicator scores

The quality indicators used in this study were developed in 2012 by a multidisciplinary team of healthcare professionals and lay mothers using the RAND/UCLA appropriateness method19. The set focuses on process and outcome indicators. Based on new or updated clinical practice guidelines, the quality indicators were updated using a modified Delphi method in 2016, resulting in 35 quality indicators20. The care quality indicators for women with low-risk pregnancies who planned to give birth in a hospital are listed in Table 1.

Table 1 List of original 35 care quality indicators.

We calculated individual indicator scores using a dichotomous variable with values of 0 or 1 for each participant and each indicator. We calculated the percentage of adherence for each indicator using the following equation:

$$ \frac{\text{Number of participants eligible for the indicator and receiving the recommended or non-recommended care}}{\text{Number of participants eligible for the indicator, excluding those with an obvious reason not to undergo the process, as defined by the indicator}} \times 100\,(\%) $$

We analysed the data for clinical assessment of each indicator at the participant level.
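
As an illustrative sketch (in Python; not the study’s actual analysis code, which used JMP), the following function implements this calculation. The example reproduces the score reported in the Results for indicator no. 9 (vaginal delivery, 276/347); the data structure is an assumption for illustration.

```python
# Minimal sketch of the per-indicator adherence calculation described above.
# The data structure is an illustrative assumption, not the study's format.
from typing import List, Optional

def adherence_percentage(received_care: List[Optional[bool]]) -> Optional[float]:
    """Percentage adherence for one indicator.

    Each element corresponds to one participant eligible for the indicator:
      True  -> received the care defined by the indicator
      False -> did not receive it
      None  -> excluded (an obvious reason not to undergo the process)
    """
    assessable = [r for r in received_care if r is not None]
    if not assessable:
        return None  # score is incalculable if no participant is assessable
    return 100.0 * sum(assessable) / len(assessable)

# Indicator no. 9 (vaginal delivery): 276 of 347 eligible women
print(round(adherence_percentage([True] * 276 + [False] * 71), 1))  # 79.5
```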

Evaluation criteria for applicability

We conducted a practical test of multifaceted applicability using the three criteria of feasibility, improvement potential, and reliability21,22,23. (1) Feasibility signifies the extent to which the required data are readily available or can be collected without burdening staff. (2) Improvement potential is the sensitivity of an indicator to detect changes in medical performance and to discriminate among and within subjects. (3) Reliability relates to how well the measure is defined and how precisely it is specified, so that it can be implemented consistently by the same or different data collectors. To assess the reliability of the quality indicators in this study, we examined inter- and intra-rater reproducibility.

(1) Feasibility

An indicator was considered “unfeasible” if > 25% of participants (denominator) for an indicator score could not be included because of missing data24.

(2) Improvement potential

An indicator was considered to have a “low opportunity for quality improvement” (or low sensitivity to change) if the indicator score was ≥ 90%24,25.

(3) Reliability

To assess intra-rater and inter-rater reliability, we randomly sampled the medical records of 20 mothers from each of the two hospitals (n = 40). A researcher (KU) explained the procedure to the two raters, the research assistants (MT, NN), who, after several training sessions, independently measured the quality indicators from the sampled records twice, with an interval of at least 1 month. In parallel, two midwives working at each hospital evaluated 10 records. We used the kappa coefficient as the primary measure of reliability and the positive and negative agreement scores26 as secondary measures (Supplementary information 1). The kappa coefficient criteria were as follows: < 0.40, poor; 0.40 ≤ κ ≤ 0.60, moderate; 0.60 < κ ≤ 0.80, good; and > 0.80, very good27. We also determined the percentage of positive and negative agreement for each indicator and calculated the median, minimum, and maximum scores for both agreement measures in terms of intra-rater and inter-rater reliability.
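
As a concrete sketch (in Python; illustrative, not the study’s actual analysis code, which used JMP), the following shows how Cohen’s kappa and the positive/negative agreement scores of de Vet et al.26 can be computed from the 2 × 2 table of two ratings, and why a skewed distribution of responses can produce a low kappa despite high raw agreement, the paradox noted in the Discussion.

```python
# Minimal sketch: Cohen's kappa and positive/negative agreement for one
# indicator rated twice (by the same rater or by two raters).
# a = both ratings "1", d = both "0", b and c = the two kinds of disagreement.
def kappa_and_agreement(a: int, b: int, c: int, d: int):
    n = a + b + c + d
    p_o = (a + d) / n                                     # observed agreement
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2  # chance agreement
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else None  # incalculable if p_e = 1
    pos = 2 * a / (2 * a + b + c) if (2 * a + b + c) else None  # positive agreement
    neg = 2 * d / (2 * d + b + c) if (2 * d + b + c) else None  # negative agreement
    return kappa, pos, neg

# Skewed example: the ratings agree on 18 of 20 records, yet kappa is below 0
# while positive agreement is high -- the "kappa paradox".
print(kappa_and_agreement(a=18, b=1, c=1, d=0))  # approx (-0.053, 0.947, 0.0)
```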

Data sources and measurement

We retrospectively identified eligible mothers from the clinical records using medical safety and management reports. One researcher (KU) and seven midwives (four of whom worked at the participating hospitals and three of whom were research assistants) collected the data. The midwives had more than 3 years’ work experience and had received training in data collection. They manually collected indicator-relevant data for women and infants from the records and entered them into an electronic data capture system (REDCap)28. We used these data to evaluate the performance of planned hospital birth care for women with low-risk pregnancy. The research assistants measured intra-rater reliability. Inter-rater reliability data for the research assistants and the midwives working in the participating hospitals were collected more than 1 month after the initial measurement.

Sample size

We assumed an indicator adherence of 50% (the value requiring the largest number of medical records or participants), a confidence level of 95%, and a precision estimate of 7.5%, yielding a target of 167 participants. Multiple facilities were recruited to obtain 167 participants per hospital29. To assess reliability, we randomly selected sample records amounting to over 10% of the total participant records24.
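
For reference, the rationale for assuming 50% adherence reflects the standard sample-size formula for estimating a proportion $p$ with absolute precision $d$ at confidence level $1-\alpha$:

$$ n \ge \frac{z_{1-\alpha/2}^{2}\;p\,(1-p)}{d^{2}} $$

Because $p(1-p)$ is maximized at $p = 0.5$, assuming 50% adherence yields the most conservative (largest) required sample size.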

Statistical analysis

We defined missing data as data not recorded in the clinical records. We performed statistical analysis using JMP® Pro, version 14.0 (SAS Institute, Cary, North Carolina, USA).

Results

Participants

Of 388 eligible participants, we analysed data for 347 mothers. A flow chart showing participant selection is shown in Fig. 1.

Figure 1
figure 1

Flow chart for selecting participants.

Table 2 shows the characteristics of participating women and infants. The median age for women was 31 years and the median gestational age was 39 weeks. There were 201 multiparous women (58%) and no foetal or neonatal deaths.

Table 2 Characteristics of the participating mothers and infants (n = 347).

Quality indicator scores

The scores for each quality indicator are shown in Table 3. Adherence to the indicators ranged from 0% to 95.7%. Of the 24 applicable indicators, the highest score (79.5%, 276/347) was found for no. 9 (vaginal delivery). No. 26 (staff peer review of severe adverse events), no. 34 (screening for antenatal or postnatal depression), and no. 35 (having complete medical records based on all quality indicators) had the lowest scores (0%). The mean score across all indicators was 32.6%.

Table 3 Scores for the 35 quality indicators: feasibility and improvement potential.

Feasibility

There were six indicators with feasibility concerns: no. 14 (neonatal respiratory support); no. 15 (necessary resuscitation in the first minutes after birth); no. 26 (staff peer review of severe adverse events); no. 28 (review of the childbirth experience with support from midwives); no. 30 (smoking cessation counselling for mothers who smoke or are exposed to passive smoking); and no. 31 (administration of vitamin K three times up to 1 month after birth). At the time of data extraction, > 25% of the participants had missing data for these indicators.

Improvement potential

Two indicators showed a low opportunity for improvement (indicator score ≥ 90%): no. 3 (receiving antibiotic prophylaxis during childbirth if the mother had a group B streptococcal infection) and no. 4 (initial assessment of labour risk on admission).

Reliability

Table 4 shows the reliability for each quality indicator.

Table 4 Results for intra-rater and inter-rater reliability.

Indicators with poor kappa scores (< 0.40) for intra-rater reliability were no. 17 (the most comfortable position during second-stage labour), no. 31, and no. 33 (mother or infant readmitted within 30 days of discharge). Intra-rater reliability kappa scores that were incalculable or 0 were found for ten indicators: no. 3, no. 4, no. 6 (assessment during second-stage labour), no. 12 (Apgar score less than 7 at 5 min after birth), no. 14, no. 15, no. 26, no. 27 (a fall during hospitalization), no. 34 (screening for antenatal or postnatal depression), and no. 35 (complete medical records based on all quality indicators). Indicators with poor kappa scores (< 0.40) for inter-rater reliability were no. 4 and no. 5 (assessment during first-stage labour). Inter-rater reliability kappa scores that were incalculable or 0 were found for eleven indicators: nos. 6, 12, 14, 20, 26, 27, 28, 30, 31, 34, and 35.

The median (range) positive agreement score for intra-rater reliability was 0.95 (0.33–1.00) and the median negative agreement score was 0.99 (0.67–1.00). The lowest positive agreement score (0) was found for no. 4, no. 6, no. 14, no. 15, and no. 33; the second-lowest positive score was 0.33, for no. 31. The lowest negative agreement score (0.33) was for no. 17. The median (range) positive agreement score for inter-rater reliability was 0.91 (0.25–1.00) and the median negative agreement score was 0.98 (0.57–1.00). The lowest positive agreement score (0) was found for no. 6, no. 14, and no. 31; the second-lowest positive score (0.25) was for no. 4. The lowest negative agreement score (0.57) was for no. 2 (birth plan) and no. 17.

The three indicators with poor intra-rater kappa scores (no. 17, no. 31, and no. 33) showed positive/negative agreement scores of 0.94/0.33, 0.33/0.95, and 0/0.97, respectively. The two indicators with poor inter-rater kappa scores (no. 4 and no. 5) showed positive/negative agreement scores of 0.25/0.92 and 0.68/0.61, respectively.

Discussion

By extracting the necessary information from 347 existing medical records for mothers and infants, we assessed the multifaceted applicability of 35 care quality indicators for planned hospital birth among women with low-risk pregnancies. Twenty-nine indicators showed high feasibility and 33 showed a high potential for improvement. Although some indicators showed low kappa scores, their high agreement scores indicated that their reliability was acceptable. With some caveats, the present practice test supported the applicability of these quality indicators, which were previously developed in Japan.

This is the first study to show the applicability of care quality indicators for planned hospital birth for women with low-risk pregnancy. The applicability of quality indicators to real-world practice needs to be fully tested before they are disseminated; however, no previous studies have tested the applicability of consensus-based care quality indicators for birth in low-risk women30,31,32. Moreover, previous studies that have tested the applicability of quality indicators in general have not shared a fully unified terminology24,25,33,34. In the present study, we tested quality indicator applicability in terms of feasibility, improvement potential, and reliability.

We found that most indicators were feasible: 29 indicators had less than 25% missing data for the indicator score. However, there was concern about the feasibility of the following six indicators owing to the high proportion (> 25%) of missing data: no. 14, no. 15, no. 26, no. 28, no. 30, and no. 31. These indicators showed low feasibility because the relevant data were not recorded in the medical charts. If data are prospectively collected in a defined format, missing or ambiguous data are less likely35,36. The present practice test also revealed that two indicators (no. 3 and no. 4) had low improvement potential (score ≥ 90%) because they were practised almost routinely. We consider this a “ceiling effect”: a phenomenon in which scores for a quality indicator are near their maximum value and thus cannot substantially increase. It is also very difficult to evaluate quality improvement, or to detect differences in measured scores over time and between hospitals, for such indicators. The present results were based on only two hospitals that actively cooperated with the practice test and may have been particularly conscious of care quality. At present, we cannot be certain that these two indicators truly show a ceiling effect and that their measurement is invalid. Accordingly, adherence to these indicators should be examined in a wider range of settings to determine whether they should be retained or rejected.

Some intra-rater and inter-rater reliability results were paradoxical, showing low kappa scores despite high levels of agreement26,37. Cohen’s kappa is generally used to evaluate reproducibility, and the aim of our reliability assessment was exactly that: inter- and intra-rater reproducibility. However, when the distribution of responses is skewed, there are paradoxical cases in which the kappa score is low even though the actual proportion of agreement is high. Therefore, we used kappa as the primary measure of reliability (inter- and intra-rater reproducibility) and, anticipating this paradox, secondarily used the positive/negative agreement scores proposed by de Vet et al.26. Considering both kappa and agreement scores, a low kappa score does not always indicate low reliability (reproducibility).

Kappa scores were 0 or incalculable when quality indicator scores were close to 0% or 100%. All indicators with low kappa scores had positive or negative agreement scores > 0.5 in this study, so a low kappa score did not necessarily reflect low reliability for a quality indicator. We therefore assessed reliability considering the positive and negative agreement scores together. Agreement scores of 0 were found for no. 4, no. 6, no. 14, no. 15, no. 31, and no. 33, reflecting the very small (or very large) number of participants to whom those indicators applied. Excluding indicators with an agreement score of 0, the indicators with the lowest intra-rater reliability agreement scores were no. 31 (positive) and no. 17 (negative). The lowest inter-rater reliability agreement scores were for no. 4 (positive) and for no. 2 and no. 17 (both negative), which may reflect the difficulty of identifying the relevant data in the medical records. Advance notification of clinical staff about quality indicator surveys, together with prospective data collection, might increase adherence to the indicators and onsite data recording, which would improve indicator reliability. An additional reason for low reliability was the composite nature of some indicators (e.g., no. 4). Indicators that comprise two or more individual component measures23,38 carry a greater risk of disagreement. However, such composite indicators are meaningful only when all individual components are satisfied; itemizing the components individually would reduce their significance. Therefore, these indicators need to be used as they are, with full knowledge of the risk of low reliability in retrospective record reviews.

We acknowledge several limitations. First, the practice test was conducted in only two hospitals with perinatal medical centres. Both hospitals had sufficient medical facilities and staff to provide onsite advanced obstetric care for high-risk problems as well as midwife-led continuous care. The high level of care in the two participating hospitals may have affected the present findings; records from lower-level hospitals might contain more missing data, resulting in lower feasibility according to our criteria. Indicators measured in lower-level hospitals may also fall short of the ≥ 90% adherence found in the present study, and may thus show greater improvement potential; likewise, reliability would be lower for low-quality medical records. Second, the applicability that we examined was limited to feasibility, improvement potential, and reliability. We did not test acceptability or predictive validity (i.e., whether indicators are related to clinical outcomes). However, our multidisciplinary panel evaluated and confirmed validity and acceptability during the development process19,20, and adverse maternal or perinatal outcomes are so rare among women with low-risk pregnancy that predictive validity is difficult to establish. Third, although the set of indicators was systematically developed based on existing international practice guidelines and quality indicators, it was tested only in Japan and may not be directly applicable to other countries in its present form. The process used in this study may be useful for testing applicability in other settings39.

To conclude, the present study showed that the 35 quality indicators for low-risk women planning hospital birth could, with some caveats, be applicable to real-world clinical practice.