Main Text

Introduction

Cardiotocography (CTG) is currently the main method of fetal monitoring in labour. Introduced in the 1960s to detect fetal heart rate (FHR) patterns thought to indicate hypoxia, its use increased rapidly, before evidence established either efficacy or safety. Various methods of classifying FHR abnormalities have been described, but none has shown adequate levels of both sensitivity and specificity. In the 1990s, the shortcomings of CTG were highlighted by a study showing its limited power to accurately predict cerebral palsy (CP).1 However, since only around 15% of cases of CP are attributable to intrapartum hypoxia-ischemia and CP is only one of the possible outcomes of birth asphyxia, differential rates of CP are not an ideal metric for the assessment of FHR monitoring.2 Some specific abnormalities, e.g. absent or minimal baseline variability and late or prolonged decelerations, can achieve higher power for the prediction of fetal asphyxia.3, 4 Nevertheless, the relative rarity of important adverse outcomes limits the positive predictive value of even the most specific FHR abnormalities. Meta-analysis has shown that the use of continuous CTG results in increased rates of caesarean section, marginal reductions in the incidence of neonatal seizures and no improvement in other neonatal outcomes.5 For fetal monitoring to help improve clinical outcomes, adverse outcomes must be predicted with acceptable accuracy before hypoxia results in neuronal injury.
Most studies of the diagnostic accuracy of intrapartum CTG monitoring have focused on relatively common outcomes such as low Apgar scores or umbilical cord blood acidosis. Few studies have tested the power of specific FHR patterns to predict neonatal encephalopathy. While rare, neonatal encephalopathy is an important clinical outcome. Therapeutic hypothermia has decreased rates of mortality, cerebral palsy, and intellectual impairment in childhood.6, 7 However, the risk of complications remains significant, especially among those most severely affected. Intellectual impairment is now recognised as a possible complication even in those babies who have been spared other sequelae. Pappas et al. found an IQ score <70 in 96% of survivors with cerebral palsy (CP) and 9% of those without CP, and IQ scores <84 in 52% of participants treated with hypothermia.8 As a group, babies with mild encephalopathy have lower cognitive test scores at 5 years than healthy babies.9

Design and Methodology

Aim

To determine the accuracy of intrapartum fetal heart rate abnormalities as defined by NICE guidelines for the prediction of moderate to severe hypoxic-ischemic neonatal encephalopathy.

Study Population

Subjects were identified from a case-control database used to identify risk factors for neonatal encephalopathy. Eligible subjects were born in the Rotunda Hospital between September 2006 and November 2017 at ≥35+0 weeks’ gestational age and had no major congenital anomalies. Cases were diagnosed with antenatally-acquired moderate or severe hypoxic-ischemic neonatal encephalopathy by the attending consultant neonatologist. In all cases, the timing of the injury was thought to be intrapartum. Controls were the first eligible babies born before and after each case who were not admitted to the neonatal unit and had Apgar scores ≥5 at 1 minute and ≥7 at 5 minutes. For this study, those women in the database who were admitted to the delivery suite in labour and had electronic fetal heart rate monitoring were included.

Data Handling

Maternal and Neonatal Clinical Details
Maternal and neonatal clinical details were collected from documentation made by the clinical teams and available in the patient records.
Fetal Heart Rate Pattern Analysis
Cardiotocograph (CTG) traces were exported from the hospital’s Athena archive (K2 Medical Systems Ltd, Plymouth, United Kingdom), stripped of any identifiers besides the date and time. The hospital’s guidelines state that CTG monitoring should be recommended to all mothers who have risks identified during the antenatal or intrapartum period. The electronic fetal monitors used in the delivery suite were GE Corometric 170 series models. Fetal heart rate pattern features were marked according to NICE-UK criteria and definitions.10 The traces were marked by Dr. Adam Reynolds, blind to all clinical data, in order of a randomly assigned database number, using a custom user interface developed within Matlab (MathWorks, Natick, Massachusetts, USA). With this interface, each recording was displayed in consecutive 15-minute segments with an additional 7.5-minute window visible on either side. For each segment, the following fetal heart rate features were manually marked: interpretable (yes or no), baseline, variability, each acceleration, each early, variable, deep variable, prolonged variable, late, and prolonged deceleration, as well as the presence of any bradycardia or sinusoidal pattern. Based on the NICE guideline criteria, the baseline, variability, and deceleration pattern for each 15-minute segment were automatically classed as reassuring, non-reassuring or abnormal. Based on this assessment, each segment was then categorised as normal, suspicious or pathological.
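The automatic categorisation step can be illustrated with a minimal Python sketch. It is not the Matlab code used in the study. It assumes the combination rule of the 2017 NICE intrapartum care guideline (all three features reassuring: normal; exactly one non-reassuring feature and two reassuring: suspicious; one abnormal feature or two or more non-reassuring features: pathological). The function name `categorise_segment` and the string labels are hypothetical.

```python
# Illustrative sketch only; the rule below is an assumption based on the
# 2017 NICE intrapartum care guideline, not the study's own code.
VALID_CLASSES = {"reassuring", "non-reassuring", "abnormal"}


def categorise_segment(baseline: str, variability: str, decelerations: str) -> str:
    """Combine three per-feature classes into one segment category."""
    features = (baseline, variability, decelerations)
    for f in features:
        if f not in VALID_CLASSES:
            raise ValueError(f"unknown feature class: {f!r}")

    n_abnormal = features.count("abnormal")
    n_non_reassuring = features.count("non-reassuring")

    # One abnormal feature, or two or more non-reassuring features.
    if n_abnormal >= 1 or n_non_reassuring >= 2:
        return "pathological"
    # Exactly one non-reassuring feature, the other two reassuring.
    if n_non_reassuring == 1:
        return "suspicious"
    # All three features reassuring.
    return "normal"
```

For example, a segment with a reassuring baseline, non-reassuring variability and reassuring decelerations would be categorised as suspicious under this assumed rule.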

Statistical Analysis

All calculations were performed in SPSS v26.0 (IBM Corp., Armonk, New York, U.S.) unless otherwise stated. Categorical and continuous variables were compared by Fisher’s exact test or Mann-Whitney U-test, respectively. For each variable and model, the area under the receiver operating characteristic curve (AUROCC) and its asymptotic 95% confidence interval were calculated. From the ROC analysis, the point with the maximum Youden index was selected as the split point.11 AUROCCs were compared using R (R Foundation for Statistical Computing, Vienna, Austria) and the pROC package.12 Multivariate logistic regression was used to estimate odds ratios (ORs) and 95% confidence intervals (CIs) for moderate or severe encephalopathy.
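The AUROCC and the Youden-index split point can be illustrated with a short, dependency-free Python sketch. It is not the SPSS or pROC implementation used in the study: the AUROCC is computed here through its equivalence with the Mann-Whitney statistic, and the Youden index J = sensitivity + specificity - 1 is maximised over all observed thresholds. The function names and the example data are hypothetical.

```python
# Illustrative sketch only; function names and data are hypothetical.
def auroc(scores, labels):
    """Probability that a random case scores higher than a random control.

    Ties count as one half (the Mann-Whitney formulation of the AUROCC).
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


def youden_cutoff(scores, labels):
    """Return (threshold, sensitivity, specificity) maximising Youden's J.

    A subject is predicted positive when score >= threshold.
    """
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        sens, spec = tp / n_cases, tn / n_controls
        if best is None or sens + spec - 1 > best[1] + best[2] - 1:
            best = (t, sens, spec)
    return best


# Hypothetical predictor values (e.g. number of suspicious segments)
# for four controls (label 0) and four cases (label 1).
scores = [0, 1, 2, 3, 4, 5, 6, 7]
labels = [0, 0, 0, 1, 1, 0, 1, 1]
print(auroc(scores, labels))          # 0.875
print(youden_cutoff(scores, labels))  # (3, 1.0, 0.75)
```

In this toy data set the split point is a threshold of 3, where sensitivity is 100% and specificity is 75%.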

Results

Description of Cohort

The total number of live births over the study period was 99,046. Eighty-eight cases and one hundred and seventy-six controls were included in the database. Seventy-one (81%) cases and one hundred and forty-six (83%) controls were in labour and admitted to the delivery suite (Chi-squared p=0.649). Of that group, 52 (73%) cases and 121 (83%) controls had intrapartum electronic fetal heart rate monitoring (Chi-squared p=0.098). Three controls were then excluded because the duration of monitoring was less than 15 minutes. Selected demographic and obstetrical characteristics are shown in Table 1.
Thirty-eight and fourteen cases had moderate and severe neonatal encephalopathy, respectively. In the cases, the median arterial pH was 7.16 (n=45, IQR: 7.02-7.21). The median 5-minute Apgar score was 6 (n=52, IQR: 2-7) in the cases and 10 (n=117, IQR: 10-10) in the controls. Forty-four (85%) of the cases had a 5- or 10-minute Apgar score ≤5, or an umbilical cord or early postnatal blood sample which showed a pH <7 or a base excess <-12. When the pH and base excess thresholds were adjusted to <7.1 and <-8 respectively, fifty-one (98%) of the cases met the criteria. The remaining single subject had normal cord gases but a history of a significant sentinel event, was intubated for apnoea and had an initially severely abnormal amplitude-integrated electroencephalogram (aEEG).

Fetal Heart Rate Analysis

Univariate Analysis
The main results of the univariate analysis of individual FHR features are presented in Table 2. The largest number of consecutive segments with the baseline FHR above a threshold was statistically significant for >160, >150, and >140bpm. However, following correction for the total number of segments, >160bpm was the only remaining statistically significant predictor.
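The predictor "largest number of consecutive segments with the baseline FHR above a threshold" can be derived as in the following Python sketch. The function name `longest_run_above` is hypothetical, and treating an uninterpretable segment (represented as `None`) as breaking a run is an assumption made for illustration, not a rule stated in the study.

```python
# Illustrative sketch only; name and None-handling are assumptions.
def longest_run_above(baselines, threshold):
    """Longest run of consecutive 15-minute segments with baseline > threshold.

    `baselines` holds one baseline FHR (bpm) per segment, or None where the
    segment was not interpretable; None is assumed to end the current run.
    """
    longest = current = 0
    for bpm in baselines:
        if bpm is not None and bpm > threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# Hypothetical recording of seven segments.
baselines = [145, 155, 165, 170, None, 162, 150]
print(longest_run_above(baselines, 160))  # 2
print(longest_run_above(baselines, 150))  # 3
print(longest_run_above(baselines, 140))  # 4
```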
Decelerations were analysed both in terms of total number and in terms of rate (Supplemental Material Table 1). The total number of variable decelerations showed a statistically significant difference between cases and controls, but the frequency of variable decelerations did not (p=0.076). There were no significant differences between the deceleration rate AUROCC and the deceleration number AUROCC for any other type of deceleration. However, logistic regression models which incorporated the rate of decelerations and the total length of tracing outperformed the respective univariate models based solely on the number of decelerations of that type.
The results of the univariate analysis of FHR categories are presented in Table 3. AUROCC was higher for the number of suspicious segments compared to whether any single suspicious segment was observed (p<0.001). The AUROCC for the number of suspicious segments was higher than for the number of pathological segments but the difference did not meet the level of statistical significance (p=0.088). The unadjusted odds ratio for the number of suspicious segments was 1.31 (95% CI: 1.17-1.47) compared to 1.47 (1.18-1.84) for the number of pathological segments.
Multivariate Analysis
The multivariate logistic regression models are detailed in Supplemental Material Table 2. The best performing multivariate model incorporated the total number of fifteen-minute segments, the percentage of segments classed as suspicious, and the percentage of segments classed as pathological (AUROCC: 0.782 [95% CI: 0.704-0.861], sensitivity: 69%, specificity: 80%). The AUROCC of this model was superior to that of the best univariate predictor (number of consecutive segments with baseline FHR >160bpm), but the difference did not meet the threshold for statistical significance (p=0.063). The best logistic regression model using FHR segment categories was essentially identical (p=0.9162) in overall performance to the best performing model which used individual FHR features. Figure 1 shows the ROC curve for the logistic regression model based on FHR categories along with the 95% confidence interval for sensitivity at a given specificity.
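The three inputs of the best-performing category-based model (total number of segments, percentage suspicious, percentage pathological) could be derived from a recording as in the following Python sketch. The function name `segment_summary` and the dictionary keys are hypothetical. Using all listed segments as the denominator is an assumption; the study does not state here how uninterpretable segments were handled.

```python
# Illustrative sketch only; names and denominator choice are assumptions.
from collections import Counter


def segment_summary(categories):
    """Summarise a recording given one category label per 15-minute segment."""
    total = len(categories)
    if total == 0:
        raise ValueError("recording has no segments")
    counts = Counter(categories)
    return {
        "n_segments": total,
        "pct_suspicious": 100 * counts["suspicious"] / total,
        "pct_pathological": 100 * counts["pathological"] / total,
    }


# Hypothetical two-hour recording (eight segments).
recording = ["normal"] * 4 + ["suspicious"] * 3 + ["pathological"]
print(segment_summary(recording))
# {'n_segments': 8, 'pct_suspicious': 37.5, 'pct_pathological': 12.5}
```

Expressing the categories as percentages alongside the total number of segments lets a model separate the duration of monitoring from the proportion of that time spent in each category.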

Fetal scalp blood sampling

Nine cases (17%) and 8 controls (7%) had one fetal blood sample taken for pH testing. Six cases (12%) and no controls had two or more samples taken. The overall Fisher’s exact test p value for the number of fetal scalp blood samples was <0.001. One case had a pH <7.2, but no subject had a pH <7.1.

Discussion

Main Findings

As expected, no FHR pattern feature or category achieved both high sensitivity and specificity. Multivariate models performed better than any single variable, but still did not achieve high accuracy. The best logistic regression model using FHR segments categorised according to NICE criteria was essentially identical in performance to the best performing multivariate model which used individual fetal heart rate features. This finding suggests that the current categories are appropriately capturing the predictive value inherent in the underlying features.

Strengths and Limitations

One of the strengths of this study is the CTG assessment method. The entire recording from delivery suite admission up to birth was analysed and each 15-minute interval was classified. Analysis of the entire length of the CTG rather than a specified period pre-delivery allowed determination of the duration of abnormalities. Features were identified blind to clinical outcome and segments were algorithmically categorised.
Another strength is the use of moderate-severe neonatal encephalopathy as the outcome. Most studies of FHR patterns have focused either on acidaemia or Apgar scores, neither of which is highly specific or sensitive, or on cerebral palsy, which is etiologically diverse and not usually associated with intrapartum anomalies. Not all of the babies included in this study had abnormal cord gases, but all had features consistent with peripartum hypoxia-ischemia. This is in keeping with published data. In a study published in 2012 and based on data from the Vermont Oxford Network, 54% of the babies diagnosed with neonatal encephalopathy who had cord blood sampling had a pH <7.09.13 In this study, diagnosis was made by the attending neonatologist based on history, examination and amplitude-integrated electroencephalography findings. The inter-rater reliability of clinical examination to assess neonatal encephalopathy is generally good.14 Nevertheless, it is possible that knowledge of the FHR patterns could have biased the attending clinicians towards a diagnosis of neonatal encephalopathy and therefore increased the observed predictive power.
We did not measure FHR deceleration area. In a cohort study employing manual assessment of FHR traces in the last two hours before delivery, Cahill et al. showed that the total deceleration area (AUROCC: 0.76 [95% CI: 0.72-0.80]) was more predictive of umbilical cord acidaemia than “always ACOG grade 2” (0.61 [0.56-0.65]), “any ACOG grade 3” (0.62 [0.57-0.66]), or the total number of decelerations (0.66 [0.62-0.71]).15 However, with regard to a composite measure of neonatal morbidity, the AUROCC for deceleration area was lower (0.66 [0.64-0.68]) and similar to the values for ACOG FHR categories. A 2014 study showed that, when used in isolation, manual estimation of the total area of decelerations in the hour before delivery has an AUROCC of 0.68 (0.56-0.79) for detection of babies with moderate-severe encephalopathy.16 The controls in that study were matched for mode of delivery, which may have resulted in a higher rate of fetal heart rate abnormalities and therefore lower specificity for a given sensitivity than would be found in the general population.
We did not employ fully automated analysis. Attempts to show a benefit to automated interpretation based on replication of existing classification schemes have been hampered by an unsatisfactory incidence of false positive alarms.17 Overall, existing methods of artificial interpretation of FHR traces have aimed to reproduce human methods and have therefore inherited the problems of poor agreement and unproven benefit for the reduction of neonatal acidaemia.18 Methods which train convolutional neural networks to predict adverse outcomes without resorting to existing classification schemes have shown promising accuracy in early studies.19 Such methods require large datasets to train and establish predictive accuracy before trials of clinical utility can be considered.
Owing to resource limitations, we were not able to analyse changes in the pattern of abnormalities over time. This may be of interest. Murray et al. identified three patterns of FHR abnormalities in babies with neonatal encephalopathy (group 1: abnormal CTG on admission; group 2: normal CTG on admission with gradual deterioration; group 3: normal CTG on admission with acute sentinel events).20 In that study, babies in group 3 had more severe encephalopathy. Our study only features women in the latter two groups. Due to sample size limitations, we were not able to establish the relationship between patterns of abnormality and the severity of encephalopathy.

Interpretation

To our knowledge, this is the first study to assess the accuracy of NICE criteria for the prediction of encephalopathy. NICE criteria have been shown to result in more traces classified as either normal or pathological and fewer classified as suspicious compared to ACOG, resulting in overall relatively higher sensitivity but lower specificity for the prediction of umbilical artery pH values ≤7.05.21
Suspicious and pathological segments were found in one half and one quarter of control labours, respectively. Despite the higher sensitivity of pathological segments, the overall predictive power of the number of suspicious segments was actually higher than that of the number of pathological segments, even if the difference did not quite reach the level of statistical significance. Furthermore, the unadjusted odds ratio for the number of pathological segments was only slightly higher than for the number of suspicious segments. This is a surprising finding, but it is important to consider the clinical context: pathological traces will usually prompt immediate delivery, potentially partially uncoupling the relationship between that classification and adverse outcomes. In contrast, suspicious traces do not usually prompt immediate delivery and therefore may persist for longer.
While most studies focus on the severity of the FHR abnormality and therefore the severity of insult, the duration of hypoxia-ischemia is also important. Frey et al. showed that ACOG category II traces in the last half hour before delivery are extremely common and do not help to identify labours associated with neonatal encephalopathy.22 Our data show that it is important to consider not only the presence of abnormalities but also the duration of those abnormalities. Data from animal models support this conclusion. In a rabbit model of preterm acute placental insufficiency, insults of ≥37 minutes produced increased rates of stillbirth and neuronal injury, but exposures of 30 minutes did not. In our study, 52% of cases and just 14% of controls had at least one hour of suspicious FHR traces. FHR pattern features which accurately predict HIE but require a long duration to produce fetal asphyxia may offer more opportunity to prevent injury than more severe and acute abnormalities. In the current NICE guidelines, the duration of abnormalities is considered as part of the feature assessment, e.g. the presence of late decelerations for 30 minutes is classed as abnormal. However, there is no mention of how the duration of suspicious FHR patterns affects assessment or management.
To be clinically useful, any intrapartum monitoring technique must not only predict neonatal encephalopathy but also help to prevent it. Despite extensive investigation, continuous FHR monitoring has never been proven to provide such a benefit. (However, it should be noted that, owing to a lack of equipoise, trials have only compared different forms of monitoring and have not included a control group without any monitoring.) The apparent failure of FHR monitoring to prevent injury is possibly due to the fact that the most predictive patterns are often associated with severe acute insults such as placental abruption, which would often be detected without continuous FHR monitoring and which, even with emergency delivery, can result in poor outcomes. In short, the risk is not alterable at the time of detection. The challenge in fetal monitoring is to recognise impending neurological injury early enough that it is still preventable, without an unacceptable proportion of false positives. Recent evidence from randomised controlled trials of intrapartum sildenafil suggests that it reduces the rate of abnormal FHR traces.23 It is possible that, by reducing the incidence of FHR abnormalities in labours which would have had normal outcomes anyway, sildenafil could increase the positive predictive value of persistent abnormalities.
Fetal acidaemia has been shown to have extremely poor sensitivity for adverse neonatal outcomes.24 Fetal scalp blood sampling was uncommon in this population. No fetuses in this study had an abnormal scalp blood pH. There was a strong relationship between the number of samples taken and the risk of neonatal encephalopathy. This relationship is dependent on clinical practice and could vary depending on the setting. Therefore, this result is not necessarily generalisable. Nevertheless, our data support the idea that a normal scalp pH does not ensure a normal outcome and that it is not generally prudent to use repeated sampling in an attempt to avoid interventions such as operative delivery.
It is unreasonable to expect any method of FHR interpretation used in isolation to have both high sensitivity and specificity for the diagnosis of neonatal encephalopathy. Neonatal encephalopathy is a heterogeneous condition which is influenced by diverse risk factors, some of which, e.g. chorioamnionitis, do not usually result in fetal heart rate changes.25 Models which incorporate multiple factors such as markers of placental function, labour progression, and uterine activity, in addition to FHR abnormalities and their durations, may improve our ability to predict and possibly even to prevent HIE.

Conclusion

While the CTG is a useful aid in intrapartum management, its limitations need to be appreciated and it should be interpreted in the light of all other fetal, maternal and partogram factors. In addition, the power of fetal heart rate abnormalities to predict HIE is not fixed or necessarily generalisable. For instance, it depends on the distribution of the aetiologies of HIE in the study population, i.e. it is likely to have less power to prevent HIE in populations where sentinel events account for higher proportions of the cases of HIE. Since the incidence of HIE is alterable, it is also dependent on the clinical context, e.g. the local caesarean delivery rate.