1 Introduction

Ultrasound imaging is widely used in clinical examination because it involves no ionizing radiation and costs less than computed tomography (CT) and magnetic resonance imaging (MRI), which makes it the first choice for prenatal care. Many clinical ultrasound diagnoses require clear and accurate measurement of anatomical structures. In particular, fetal head measurements can be used to estimate gestational age and monitor growth patterns [1]. These measurements are generally performed on ultrasound images by experienced clinical sonographers; they are operator-dependent and machine-specific [2], which leads to inter- and intra-observer variability. Automatic fetal biometric measurement can reduce this variability, removing intra-observer variability altogether, and lighten doctors' workloads [3]. Furthermore, many countries still face a severe shortage of well-trained sonographers, so an automated system may help inexperienced clinicians obtain accurate measurements.

Three standard fetal head biometric parameters are typically measured from two-dimensional ultrasound: head circumference (HC), biparietal diameter (BPD), and occipitofrontal diameter (OFD). The guidelines state that BPD and HC are measured in the transaxial plane at the widest portion of the skull, at the level of the thalami. When measuring BPD, the calipers are placed from the outer margin of the proximal skull to the inner margin of the distal skull, perpendicular to the middle cerebral line (the cerebral midline). The OFD should overlap the middle cerebral line as much as possible. HC is calculated by drawing an ellipse around the outline of the skull [4]. A detailed description of the measurement protocol is beyond the scope of this paper and can be found in [3, 4].

Because of the attenuation of ultrasound during transmission and the acquisition characteristics of ultrasound equipment, ultrasound images may contain speckle noise, discontinuous or ambiguous anatomical boundaries, and shadows, which makes measurement on ultrasound images one of the most difficult medical imaging tasks [5]. Examples of such artifacts are shown in Fig. 1, where fetal head ultrasound images from the first to the third trimester are shown in the top row and the corresponding manually labeled images in the bottom row. The boundary of the fetal head is often incomplete or indistinct, and heavy speckle noise makes it hard to separate from the surrounding tissue.

Fig. 1

Example fetal head ultrasound images without annotation (top) and with annotation in blue (bottom), for the first, second, and third trimesters from left to right. Note the speckle noise, incomplete anatomical boundaries, and other artifacts. The boundary in the red bounding box is relatively clear, whereas the boundary in the green box cannot be reliably inferred from local brightness

Previous methods fall mainly into two categories: those that fit the ellipse equation to a segmentation of the region of interest [6, 7], and those that fit the ellipse equation directly to the original image [8,9,10,11]. However, all of them obtain BPD and OFD by measuring the lengths of the minor and major axes of the ellipse; because they make no distinction between geometric and biological lengths, the final measurements contain errors. Moreover, despite considerable effort in many fields, traditional machine learning approaches based on hand-crafted features are hard to extend to more complex application scenarios. Recently, deep convolutional neural networks (CNNs), which extract useful features automatically, have become the dominant approach in many vision challenges. Wu et al. were the first to apply CNNs to fetal head region segmentation, by cascading three variants of the fully convolutional network (FCN) [7]. However, relying on single-scale information can make the prediction of incomplete boundaries ambiguous, because a network with a fixed receptive field can only infer a boundary from a region of fixed size. As the red bounding box in Fig. 1 shows, a small receptive field yields a good prediction where the boundary is clear, whereas a fixed receptive field applied to a large incomplete area yields a fuzzy boundary prediction.

In this work, we propose a novel automated measurement system based on deep neural networks. First, we propose a fetal head segmentation network that uses a very deep CNN with a large receptive field to extract the main area of the head and then uses layers with different receptive fields to infer discontinuous skull regions of different sizes. Instead of simple average fusion, we propose a scale attention-based multi-scale module to fuse information from the different scales. After the fetal head region is segmented from the ultrasound image, the ellipse closest to the skull is fitted by the least squares method. As mentioned earlier, most current methods treat geometric lengths as biological lengths, which causes measurement errors. After fitting the ellipse, we therefore use a regression network to correct this error by predicting the residual angle between the axes of the ellipse and the true OFD and BPD. We modify RoIAlign [12] into an ellipse pooling module that directly extracts the features inside the ellipse and sends them to fully connected layers for angle prediction. The ellipse pooling module is efficient because it avoids re-extracting features from the original image.

In summary, the contributions of this paper are in four aspects:

  1. Fetal head ultrasound images contain speckle noise, boundary occlusion, and other artifacts, so the segmentation algorithm must be able to segment both complete and occluded regions under heavy noise. We propose using a multi-scale method for targeted boundary recognition and fitting.

  2. We build a scale attention-based feature pyramid to fuse the information from layers with different receptive fields.

  3. We design a regression network for BPD and OFD prediction. Within it, an ellipse pooling module shares the feature map, which lets the regression network reuse low-level feature layers and reduces time consumption.

  4. We build a complete automatic fetal head measurement system that measures head circumference, OFD, and BPD.

2 Related work

Fetal head measurement.

Foi et al. presented a fully automatic method for fetal head measurement that uses signal processing and fits an ellipse directly by minimizing a cost function [8]. Ciurte et al. used a semi-supervised approach that formulates ultrasound segmentation as a graph cut problem and solves it with a min-cut and fast minimization algorithm [6]. Stebbing et al. first computed the position and direction of the fetal head boundary, then used a random forest to find the inner and outer contours of the skull, and finally fitted two ellipses to them [9]. Lu et al. and Satwika et al. used different Hough transform approaches to estimate the parameters of the ellipse function directly [10, 11]. Ponomarev et al. separated the skull from the background by multilevel thresholding and recognized the segmented objects with two shape-based descriptors that they introduced [13]. Machine learning techniques have also been applied to fetal head measurement: the methods in [14,15,16,17,18] used Haar-like features to detect the position of the skull. Among them, [18] first trained a random forest classifier with Haar-like features to locate the fetal head and then fitted the ellipse parameters by dynamic programming and the Hough transform.

Deep learning-based segmentation network.

Recent segmentation algorithms often convert an existing CNN architecture, such as AlexNet [19], VGGNet [20], or ResNet [21], into a fully convolutional neural network (FCNN). They replace the fully connected layers with 1 × 1 convolutions to generate a heatmap and apply deconvolution layers for pixel-wise labeling [22]. Skip connections are further included to improve performance [23]. These networks have shown excellent performance in natural image understanding and hold the best records on many datasets. In medical imaging, the U-shaped FCNN is a state-of-the-art model [23]. Such encoder-decoder architectures, which combine fine-grained and coarse-grained features, have proved effective on CT and MRI images [24,25,26]. Wu et al. were the first to introduce CNNs to fetal head segmentation in ultrasound images [7]. By cascading three variants of the FCN, they showed that an FCNN can filter out most of the speckle noise and non-skull regions. However, they use single-scale information, which can make the prediction of incomplete boundaries ambiguous, because a network with a fixed receptive field can only infer a boundary from a region of fixed size; they also make no distinction between biometric and geometric lengths.

3 Proposed methodology

The complete automatic measurement system is shown in Fig. 2. It consists of three parts: (1) a scale attention pyramid deep neural network (SAPNet) for head region segmentation; (2) a regression network for OFD and BPD prediction; and (3) a fusion module that combines the results of the two networks to produce the final output. Our approach achieves state-of-the-art performance on the publicly available HC18 dataset [18], and the complete system runs at about 30 FPS on an NVIDIA 1080 Ti GPU. The rest of this section describes the system in detail.

Fig. 2

Complete automatic system for fetal head measurement. An ultrasound image is input on the left, and the segmentation network outputs a bitmap mask of the head region. An ellipse is then fitted to this mask by the least squares method. The occipitofrontal diameter angle is predicted by the regression network after ellipse pooling. Finally, the ellipse equation and the occipitofrontal diameter angle are fused to obtain the final output

3.1 Dataset

The proposed method is based on deep learning, which depends heavily on the amount and quality of data, so we first describe the dataset. A total of 1334 two-dimensional fetal head ultrasound images were collected from the HC18 challenge. They were acquired from 551 pregnant women who received a routine ultrasound screening exam between May 2014 and May 2015. All of these fetuses are clinically healthy, which is important when the dataset is used to evaluate fetal development. The images were acquired by experienced sonographers using Voluson E8 and Voluson 730 ultrasound machines. Each image is 800 × 540 pixels, and the pixel size ranges from 0.052 to 0.326 mm. Ground truth is available for 999 of the 1334 images; results for the remaining 335 images are submitted to the HC18 challenge website to determine algorithm rankings. The ground truth was annotated manually by the sonographers, who drew the ellipse that best fits the circumference of the skull, as shown in Fig. 3.

Fig. 3

The parameterized ground truth: a and b are the major and minor axes of the ellipse, respectively, \((c_{x},c_{y})\) is the center of the ellipse, and θ is the rotation angle in degrees

With a and b denoting the semi-axis lengths along the major and minor axes, we express the ellipse in general (implicit) form as:

$$ \begin{array}{@{}rcl@{}} x^{2}+Axy+By^{2}+Cy+Dx+E&=&0,\\ A&=&\frac{(b^{2}-a^{2})\sin 2\theta}{a^{2}\sin^{2}\theta+b^{2}\cos^{2}\theta},\\ B&=&\frac{a^{2}\cos^{2}\theta+b^{2}\sin^{2}\theta}{a^{2}\sin^{2}\theta+b^{2}\cos^{2}\theta},\\ C&=&-2c_{y}B-c_{x}A,\\ D&=&-2c_{x}-c_{y}A,\\ E&=&{c_{x}^{2}}+{c_{y}^{2}}B+c_{x}c_{y} A-\frac{a^{2}b^{2}}{a^{2}\sin^{2}\theta+b^{2}\cos^{2}\theta}. \end{array} $$
(1)
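As a rough check of Eq. 1, the following sketch computes the coefficients A to E from the ellipse parameters and evaluates the implicit equation at points generated on the rotated ellipse. It assumes that a and b are semi-axis lengths and that the cross-term coefficient A carries the factor sin 2θ that results from expanding the rotated canonical form. The function names are illustrative and are not taken from the original implementation.

```python
import math


def ellipse_coefficients(a, b, cx, cy, theta_deg):
    """Return (A, B, C, D, E) of x^2 + A*x*y + B*y^2 + C*y + D*x + E = 0.

    a, b are semi-axis lengths (assumption); theta_deg is the rotation in degrees.
    """
    t = math.radians(theta_deg)
    s2, c2 = math.sin(t) ** 2, math.cos(t) ** 2
    den = a * a * s2 + b * b * c2
    A = (b * b - a * a) * math.sin(2 * t) / den
    B = (a * a * c2 + b * b * s2) / den
    C = -2 * cy * B - cx * A
    D = -2 * cx - cy * A
    E = cx * cx + cy * cy * B + cx * cy * A - (a * a * b * b) / den
    return A, B, C, D, E


def implicit_value(coeffs, x, y):
    """Left-hand side of the implicit equation; zero for points on the ellipse."""
    A, B, C, D, E = coeffs
    return x * x + A * x * y + B * y * y + C * y + D * x + E


def point_on_ellipse(a, b, cx, cy, theta_deg, phi):
    """Point at parameter phi on the ellipse rotated by theta_deg about its center."""
    t = math.radians(theta_deg)
    x = cx + a * math.cos(phi) * math.cos(t) - b * math.sin(phi) * math.sin(t)
    y = cy + a * math.cos(phi) * math.sin(t) + b * math.sin(phi) * math.cos(t)
    return x, y
```

Evaluating `implicit_value` at points returned by `point_on_ellipse` should give values close to zero, while the center of the ellipse gives a clearly negative value.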

The dataset covers fetuses of different gestational ages, as shown in Fig. 1. Because the images vary greatly with gestational age, we also evaluate our algorithm separately at different gestational ages. Table 1 shows the distribution of the entire dataset from the first to the third trimester.

Table 1 The distribution of dataset

3.2 Data augmentation

Data augmentation is a common way to increase the amount of training data and teach the network the desired invariance and robustness properties [23]. For fetal ultrasound images, we primarily need invariance to gray value, chrominance, contrast, and sharpness, as well as robustness to Gaussian blurring with different window sizes. In particular, random horizontal flipping of the training images appears to be the key augmentation for training a segmentation network with very little ground truth. We randomly perturb each image within a specified range during each training session, as detailed in Table 2.
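A minimal sketch of such a perturbation is given below, covering only random horizontal flipping and a brightness and contrast change. The ranges `max_brightness` and `contrast_range` are placeholder assumptions and are not the values of Table 2; Gaussian blurring and sharpness changes are omitted for brevity.

```python
import numpy as np


def augment(image, mask, rng, max_brightness=0.1, contrast_range=(0.9, 1.1)):
    """Randomly perturb one grayscale training pair.

    image: float array in [0, 1] with shape (H, W); mask: binary array, same shape.
    The default ranges are illustrative assumptions only.
    """
    # Horizontal flip, applied to image and mask together so labels stay aligned.
    if rng.random() < 0.5:
        image = image[:, ::-1]
        mask = mask[:, ::-1]
    # Contrast change about the mean gray value, then a brightness shift.
    gain = rng.uniform(*contrast_range)
    bias = rng.uniform(-max_brightness, max_brightness)
    mean = image.mean()
    image = np.clip((image - mean) * gain + mean + bias, 0.0, 1.0)
    return image.copy(), mask.copy()
```

Only the geometric transform is applied to the mask; photometric changes affect the image alone.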

Table 2 Data augmentation methods and their ranges

3.3 SAPNet architecture

SAPNet, illustrated in Fig. 4, is a segmentation model that combines information from different scales at the feature level. Our network is based on U-Net [23], a standard medical image segmentation network. To obtain a scale pyramid module with different receptive fields, we replace the convolution layers in the encoder of U-Net [23] with dilated convolutions. Dilated convolution introduces a new parameter, the dilation rate, which defines the spacing between the values sampled by the convolution kernel. This kind of convolution makes it possible to discard pooling layers and output a full-resolution feature map while still obtaining a large receptive field. We then use the outputs of different encoder layers to form the feature pyramid. High-level layers of the feature pyramid have larger receptive fields, and lower-level layers have smaller ones. Feature layers with small receptive fields perform well on the more continuous parts of the skull, whereas feature layers with larger receptive fields perform well where the skull appears discontinuous in the ultrasound image.
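To illustrate how dilation enlarges the receptive field without pooling, the sketch below computes the receptive field of a stack of stride-1 convolutions, where each layer adds (k - 1) × d pixels for kernel size k and dilation rate d. The dilation rates used in the example are illustrative assumptions, not the rates used in SAPNet.

```python
def receptive_field(layers):
    """Receptive field (in pixels, along one axis) of stacked stride-1 convolutions.

    layers: iterable of (kernel_size, dilation_rate) pairs.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf


# Three ordinary 3x3 convolutions versus three dilated ones (rates 1, 2, 4).
plain = receptive_field([(3, 1), (3, 1), (3, 1)])    # 7
dilated = receptive_field([(3, 1), (3, 2), (3, 4)])  # 15
```

With the same number of layers and parameters, the dilated stack sees a region roughly twice as wide, which is the property the feature pyramid relies on.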

Fig. 4

Overview of the scale attention feature pyramid network. We use a U-Net based on dilated convolution to extract features. The scale attention module is then applied to harvest representations at different scales, followed by upsampling, concatenation, and a 3 × 3 convolution to form the final module representation, which contains information from the different receptive fields. Finally, the pixel-wise prediction is obtained by connecting the fine-grained layers

To generate a pixel-wise segmentation, one can use an attention mechanism to generate a mask on the feature map [27]; here, a scale-level weight matrix obtained by convolution indicates which scale should be attended to. Inspired by [28], we propose a scale attention module (SAM) that uses a global context prior to select scale-wise features. The module fuses context information from three different scales by assigning each a scale-level attention value. To build it, we first use a dilated U-Net to extract the feature pyramid. As shown in Fig. 4, the encoder of SAPNet, like that of U-Net, consists of five large blocks. To better extract context from different layers, we build the feature pyramid from the feature maps of the last three blocks after convolution with different dilation rates. The feature maps in the scale pyramid module are 1/4 of the input size.

As shown in Fig. 5, the bottleneck layer of the dilated U-Net produces an attention feature through global average pooling and convolution. Global pooling provides a global context that guides the feature pyramid in selecting scale attention. Specifically, the attention feature is obtained from global average pooling and a 1 × 1 convolution with batch normalization and a sigmoid activation function. The final attention value of each pyramid layer is obtained by averaging the weights of the feature maps that belong to that layer. Each attention value is multiplied with the corresponding feature map, and the original input is added to give the final output.

Fig. 5

Scale attention module: the multi-scale pyramid is formed by connecting the feature maps of different receptive fields. The weight of each feature map is obtained through global average pooling and a 1 × 1 convolution. The weight of a pyramid layer is calculated by averaging the weights of its feature maps
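One possible reading of the scale attention module can be sketched in NumPy as follows. The sketch assumes that every pyramid level has the same number of channels, that the 1 × 1 convolution acts on the globally pooled bottleneck vector (so it reduces to a matrix product), and that batch normalization can be left out for clarity. The array shapes and the names `w` and `bias` are hypothetical; this is not the authors' implementation.

```python
import numpy as np


def scale_attention(pyramid, bottleneck, w, bias):
    """Fuse S pyramid levels with one attention value per level.

    pyramid:    list of S arrays, each of shape (C, H, W)
    bottleneck: array of shape (Cb, h, w), the deepest encoder feature
    w, bias:    1x1 convolution parameters, shapes (S*C, Cb) and (S*C,)
    """
    # Global average pooling gives one context value per bottleneck channel.
    context = bottleneck.mean(axis=(1, 2))                    # (Cb,)
    # 1x1 convolution on a pooled vector is a matrix product; then sigmoid.
    channel_w = 1.0 / (1.0 + np.exp(-(w @ context + bias)))   # (S*C,)
    # Average the channel weights that belong to each pyramid level.
    num_scales, channels = len(pyramid), pyramid[0].shape[0]
    scale_w = channel_w.reshape(num_scales, channels).mean(axis=1)  # (S,)
    # Weight each level by its attention value and add the original input.
    fused = [feat * sw + feat for feat, sw in zip(pyramid, scale_w)]
    return np.concatenate(fused, axis=0), scale_w
```

The residual addition means that a level with a low attention value is attenuated relative to the others but never removed entirely.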

3.4 Ellipse fitting

We use the least squares method to fit an elliptical boundary to the segmentation according to Eq. 1. For the N points on the edge of the segmentation result, the objective to be minimized is:

$$ Q=\sum\limits_{i=1}^{N}{({x_{i}^{2}}+B{y_{i}^{2}}+Ax_{i}y_{i}+Dx_{i}+Cy_{i}+E)^{2}} $$
(2)

where \((x_{i},y_{i})\) are the coordinates of the detected edge points. The minimum is obtained by setting the partial derivatives to zero:

$$ \frac{ \partial Q}{ \partial A}=\frac{ \partial Q}{ \partial B}=\frac{ \partial Q}{ \partial C}=\frac{ \partial Q}{ \partial D}=\frac{ \partial Q}{ \partial E}=0 $$
(3)

This minimization problem can be written as the following matrix equation:

$$ \begin{array}{@{}rcl@{}} &&\left[ \begin{array}{lllll} \widetilde{x^{2}y^{2}} & \widetilde{xy^{3}} & \widetilde{xy^{2}}&\widetilde{x^{2}y}&\widetilde{xy} \\ \widetilde{xy^{3}} & \widetilde{y^{4}} & \widetilde{y^{3}}&\widetilde{xy^{2}}&\widetilde{y^{2}} \\ \widetilde{xy^{2}} & \widetilde{y^{3}} & \widetilde{y^{2}}&\widetilde{xy}&\widetilde{y} \\ \widetilde{x^{2}y} & \widetilde{xy^{2}} & \widetilde{xy}&\widetilde{x^{2}}&\widetilde{x} \\ \widetilde{xy} & \widetilde{y^{2}} & \widetilde{y}&\widetilde{x}&1 \end{array} \right] \left[ \begin{array}{l} A \\ B \\ C \\ D \\ E \end{array} \right] = \left[ \begin{array}{l} -\widetilde{x^{3}y}\\ -\widetilde{x^{2}y^{2}}\\ -\widetilde{x^{2}y}\\ -\widetilde{x^{3}}\\ -\widetilde{x^{2}} \end{array} \right],\\ &&\widetilde{x^{p}y^{q}}=\frac{1}{N}\sum\limits_{i=1}^{N}{x_{i}^{p}y_{i}^{q}}, \end{array} $$

where each row corresponds to one of the conditions in Eq. 3, taken in the order A, B, C, D, E, and the tilde denotes the average of the indicated monomial over the N edge points.
(4)
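The fit can be sketched with NumPy as below. Instead of assembling the averaged moments explicitly, the sketch passes the per-point design matrix to `numpy.linalg.lstsq`, which minimizes the same objective Q of Eq. 2; the column order (xy, y², y, x, 1) matches the unknowns (A, B, C, D, E). The helper `ellipse_center`, which recovers the center by setting the gradient of the conic to zero, is an illustrative addition and is not described in the text.

```python
import numpy as np


def fit_ellipse_lsq(x, y):
    """Least-squares fit of x^2 + A*x*y + B*y^2 + C*y + D*x + E = 0 to edge points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # One row per edge point; unknowns ordered as (A, B, C, D, E).
    design = np.column_stack([x * y, y ** 2, y, x, np.ones_like(x)])
    rhs = -(x ** 2)
    coeffs, *_ = np.linalg.lstsq(design, rhs, rcond=None)
    return tuple(coeffs)


def ellipse_center(A, B, C, D):
    """Center of the fitted conic: solve 2x + A*y + D = 0 and A*x + 2B*y + C = 0."""
    lhs = np.array([[2.0, A], [A, 2.0 * B]])
    cx, cy = np.linalg.solve(lhs, np.array([-D, -C]))
    return cx, cy
```

At least five edge points in general position are needed for the system to have a unique solution; in practice the segmentation boundary provides many more.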

3.5 BPD and OFD prediction

According to the guideline, the first step in measuring BPD and OFD is to find the middle cerebral line. In practice, the major and minor axes of the ellipse fitted to the segmentation result are very close to the OFD and BPD. We therefore obtain the OFD by predicting an angular increment relative to the major axis, using a regression branch added to SAPNet, as shown in Fig. 4. Our experiments show that regressing the increment works better than directly predicting the absolute angle. The regression network predicts only the angle of the OFD; because the BPD is orthogonal to the OFD, its angle follows directly. To eliminate the influence of areas outside the fetal head on the prediction of the middle cerebral line, we design an ellipse pooling layer that accurately locates the features inside the skull, as shown in Fig. 6. After the bounding box of the ellipse is found, it is projected onto the convolutional feature map by RoIAlign [12]. RoIAlign is an operation widely used in object detection that converts proposals of different shapes into the fixed shape required by fully connected layers. Multiplying the feature map by the head mask removes the influence of the non-head area inside the bounding box. The OFD angle regression network consists of three fully connected layers whose final output passes through a \(\sigma \times \tanh \) activation function, where σ is the maximum clockwise or counterclockwise increment of the long axis; we set σ to 5 in our experiments. As shown in Fig. 7, rotating the original image coordinate system by the OFD angle gives a new coordinate system whose X-axis is parallel to the middle cerebral line. The binary image produced by the segmentation network is then projected onto this new coordinate system: the maximum of the projection onto the X-axis is the BPD, and the maximum of the projection onto the Y-axis is the OFD.

Fig. 6

An overview of the occipitofrontal diameter angle prediction module. The ellipse pooling layer uses max pooling to convert the features inside the bounding box of any ellipse into a small feature map with a fixed spatial extent of H × W (e.g., 7 × 7), where H and W are layer hyper-parameters that are independent of any particular RoI. To capture the elliptic region more accurately, the feature map is first multiplied by the head mask before sampling. The output of the ellipse pooling layer is sent to a regression network composed of three fully connected layers, whose final output passes through a \(\sigma \times \tanh \) activation function to give the increment of the long axis of the ellipse
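The pooling step described in the caption can be sketched as follows. The sketch multiplies the feature map by the head mask, crops the bounding box, and max-pools the crop into a fixed H × W grid. It uses integer bin boundaries for simplicity, whereas RoIAlign samples at fractional positions with bilinear interpolation, so it should be read as a simplified stand-in rather than the module itself. The box format `(x0, y0, x1, y1)` with exclusive upper bounds is an assumption.

```python
import numpy as np


def ellipse_pool(features, mask, box, out_hw=(7, 7)):
    """Masked max pooling of a bounding box into a fixed-size grid.

    features: (C, H, W) feature map; mask: (H, W) binary head mask;
    box: (x0, y0, x1, y1) integer bounding box of the fitted ellipse.
    """
    x0, y0, x1, y1 = box
    # Zero out everything outside the head, then crop the bounding box.
    crop = (features * mask[None])[:, y0:y1, x0:x1]
    channels, h, w = crop.shape
    out_h, out_w = out_hw
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    out = np.zeros((channels, out_h, out_w), dtype=crop.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Each bin covers at least one pixel, even for very small boxes.
            ya, yb = ys[i], max(ys[i + 1], ys[i] + 1)
            xa, xb = xs[j], max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = crop[:, ya:yb, xa:xb].max(axis=(1, 2))
    return out


def bounded_angle(z, sigma=5.0):
    """sigma * tanh activation: keeps the predicted increment within [-sigma, sigma]."""
    return sigma * np.tanh(z)
```

The pooled map would then be flattened and passed to the fully connected layers, whose raw output `z` is mapped to an angle increment by `bounded_angle`.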

Fig. 7

The binary mask image is projected onto the coordinate system aligned with the middle cerebral line; the maximum projection values on the two axes are the biparietal diameter and the occipitofrontal diameter
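The projection step can be approximated as in the sketch below. Rather than resampling the mask, it rotates the coordinates of the foreground pixels by the OFD angle and counts pixels per unit-width bin along each new axis; the largest count along the X-axis approximates the BPD and the largest count along the Y-axis approximates the OFD. Binning by rounding is only approximate for non-zero angles, and the parameter `pixel_size_mm` is a hypothetical name for the physical pixel spacing.

```python
import numpy as np


def bpd_ofd_from_mask(mask, ofd_angle_deg, pixel_size_mm=1.0):
    """Estimate (BPD, OFD) from a binary head mask and the predicted OFD angle."""
    ys, xs = np.nonzero(mask)
    t = np.radians(ofd_angle_deg)
    # New X-axis runs along the OFD direction, new Y-axis is perpendicular to it.
    u = xs * np.cos(t) + ys * np.sin(t)
    v = -xs * np.sin(t) + ys * np.cos(t)
    u_bins = np.round(u - u.min()).astype(int)
    v_bins = np.round(v - v.min()).astype(int)
    # Projection onto X: head thickness across the midline at each position.
    proj_on_x = np.bincount(u_bins)
    # Projection onto Y: head length along the midline at each position.
    proj_on_y = np.bincount(v_bins)
    bpd = proj_on_x.max() * pixel_size_mm
    ofd = proj_on_y.max() * pixel_size_mm
    return bpd, ofd
```

For an axis-aligned elliptical mask and an angle of zero, the two values reduce to the pixel counts of the central column and the central row.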

3.6 Network training

The networks are optimized with Adam [29]. We set the base learning rate to 0.0001 and reduce it by a factor of 0.8 whenever the training error jumps abruptly. The momentum and weight decay are set to 0.9 and 0.0001, respectively. Owing to hardware limitations, we resize the original 800 × 540 pixel images to 480 × 320 and set the batch size to 10 during training. For the validation experiments, all networks are trained for 100 epochs; for the final comparison on the test set, we train for 700 epochs. It is worth noting that the performance of both networks improves as the number of epochs increases. We train SAPNet by minimizing a cross-entropy loss:

$$ \begin{array}{@{}rcl@{}} L({\Gamma}_{\pmb{\theta}}(X),Y_{\text{truth}})&=&\frac{1}{m}\sum\limits_{i=1}^{m}\left( -Y_{\text{truth}}^{i}\log(P({\Gamma}_{\pmb{\theta}}^{i}(X)=1|\pmb{\theta}))\right.\\ &&\left.-(1-Y_{\text{truth}}^{i})\log(P({\Gamma}_{\pmb{\theta}}^{i}(X)=0|\pmb{\theta}))\right)+\lambda R(\pmb{\theta}) \end{array} $$
(5)

where \({\Gamma}_{\pmb{\theta}}(X)\) denotes the output of the network for input X, and \(Y_{\text{truth}}\) is the ground-truth image. θ denotes the network parameters to be learned, and R(θ) is the regularization term, for which we use the L2 norm of the network weights. As in Mask R-CNN, a segmentation region is considered positive if its IoU with the ground truth is at least 0.5, and negative otherwise. For positive samples, the segmentation result is fitted with an ellipse and the regression network is trained. The regression network uses the L1 loss.
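A small NumPy sketch of the two losses is given below. It assumes that the network outputs a per-pixel foreground probability, that the mean in Eq. 5 runs over the pixels of one image, and that R(θ) is the sum of squared weights; the default `lam` mirrors the weight decay stated above, and the clipping constant `eps` is an implementation detail added for numerical safety.

```python
import numpy as np


def segmentation_loss(prob_fg, y_true, weights=(), lam=1e-4, eps=1e-7):
    """Binary cross-entropy averaged over pixels plus an L2 penalty on the weights.

    prob_fg: predicted foreground probability per pixel; y_true: 0/1 ground truth;
    weights: iterable of weight arrays entering the regularization term.
    """
    p = np.clip(prob_fg, eps, 1.0 - eps)  # avoid log(0)
    cross_entropy = -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)).mean()
    l2_penalty = sum(float(np.sum(w ** 2)) for w in weights)
    return cross_entropy + lam * l2_penalty


def l1_loss(predicted, target):
    """Mean absolute error, as used for the angle regression branch."""
    return float(np.abs(np.asarray(predicted) - np.asarray(target)).mean())
```

For a prediction of 0.5 at every pixel, the cross-entropy term equals ln 2 regardless of the ground truth, which is a convenient sanity check.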

4 Experimental results

We perform four quantitative experiments to evaluate our approach on the HC18 dataset. First, we compare our system with the U-Net baseline [23] on segmentation metrics and, after fitting ellipses to the segmentation boundaries by least squares, on HC, BPD, and OFD. Second, we compare the effects of different components on the automated measurement. Finally, we compare our results with the best performers on the HC18 leaderboard. In the first three experiments, we split the annotated images into an 80% training set and a 20% test set (Table 3), stratified by the number of images in the three trimesters given in Table 1.

Table 3 The distribution of experimental dataset

4.1 Evaluation metrics

Segmentation performance is evaluated with three metrics: the mean pixel accuracy (mPA), the mean Intersection over Union (mIoU), and the Dice similarity coefficient (DSC).

The mPA measures the proportion of pixels in an image that are labeled correctly:

$$ \text{mPA}=\frac{N_{\text{TP}}+N_{\text{TN}}}{N_{\text{TP}}+N_{\text{TN}}+N_{\text{FP}}+N_{\text{FN}}}, $$
(6)

where \(N_{\text{TP}}\) (true positives) is the number of pixels correctly classified as fetal head, \(N_{\text{TN}}\) (true negatives) is the number of pixels correctly classified as background, and \(N_{\text{FP}}\) and \(N_{\text{FN}}\) are the numbers of pixels incorrectly labeled as fetal head and as background, respectively.

The mIoU is a common metric that calculates the ratio of intersection to union between two segmentation sets:

$$ \begin{array}{@{}rcl@{}} \text{mIoU}&=&\frac{\text{IoU}_{\text{fh}}+\text{IoU}_{\text{bg}}}{2},\\ \text{IoU}_{\text{fh}}&=&\frac{N_{\text{TP}}}{N_{\text{TP}}+N_{\text{FN}}+N_{\text{FP}}},\\ \text{IoU}_{\text{bg}}&=&\frac{N_{\text{TN}}}{N_{\text{TN}}+N_{\text{FN}}+N_{\text{FP}}}, \end{array} $$
(7)

where \(\text{IoU}_{\text{fh}}\) and \(\text{IoU}_{\text{bg}}\) are the Intersection over Union of the fetal head class and of the background class, respectively.

Like the IoU metric, the DSC indicates the degree of overlap between our segmentation and the ground truth:

$$ \text{DSC}=\frac{2|\text{Area}_{\mathrm{M}} \cap \mathrm{Area_{GT}}|}{|\mathrm{Area_{M}}|+|\mathrm{Area_{GT}}|}, $$
(8)

where \(\text{Area}_{\mathrm{M}}\) denotes the area segmented by our method and \(\mathrm{Area_{GT}}\) is the annotated area of the ground truth.
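The three segmentation metrics can be computed from a pair of binary masks as in the following sketch. The Dice coefficient is written in its usual form with the factor 2, expressed through the pixel counts as 2·TP / (2·TP + FP + FN). The function name and the returned dictionary keys are illustrative.

```python
import numpy as np


def segmentation_metrics(pred, gt):
    """Return mPA, mIoU and DSC for one predicted mask against its ground truth."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.sum(pred & gt)    # head pixels labeled as head
    tn = np.sum(~pred & ~gt)  # background pixels labeled as background
    fp = np.sum(pred & ~gt)   # background pixels labeled as head
    fn = np.sum(~pred & gt)   # head pixels labeled as background
    mpa = (tp + tn) / (tp + tn + fp + fn)
    iou_fh = tp / (tp + fn + fp)
    iou_bg = tn / (tn + fn + fp)
    dsc = 2 * tp / (2 * tp + fp + fn)
    return {"mPA": float(mpa), "mIoU": float((iou_fh + iou_bg) / 2), "DSC": float(dsc)}
```

The sketch does not guard against empty masks, for which some denominators become zero.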

The final result is the ellipse fitted by least squares to the segmentation. We evaluate it with the Hausdorff distance (HD), the difference in head circumference (DF), and the absolute difference in head circumference (ADF).

The Hausdorff distance measures the similarity between two sets of points. Let \(P_{\text{GT}}\) and \(P_{\text{OM}}\) be the sets of boundary points of the ground truth and of our method, with \(p_{\text{GT}}\in P_{\text{GT}}\) and \(p_{\text{OM}}\in P_{\text{OM}}\). The minimum distance from a point p to \(P_{\text{GT}}\) is defined as:

$$ d_{\min}(p,P_{\text{GT}})=\min_{p_{\text{GT}}\in P_{\text{GT}}}||p-p_{\text{GT}}||, $$
(9)

where \(\|\cdot\|\) denotes the Euclidean distance. The HD can then be expressed as:

$$ \begin{array}{@{}rcl@{}} \text{HD}(P_{\text{GT}},P_{\text{OM}})&=&\max\left( \max_{p_{\text{GT}}\in P_{\text{GT}}}d_{\min}(p_{\text{GT}},P_{\text{OM}}),\right.\\ &&\left. \max_{p_{\text{OM}}\in P_{\text{OM}}}d_{\min}(p_{\text{OM}},P_{\text{GT}}) \right). \end{array} $$
(10)
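A direct NumPy sketch of Eqs. 9 and 10 is given below. It builds the full pairwise distance matrix, which is simple but quadratic in memory, so it is suited to boundary point sets of moderate size. The function name is illustrative.

```python
import numpy as np


def hausdorff_distance(p_gt, p_om):
    """Symmetric Hausdorff distance between two (N, 2) arrays of boundary points."""
    p_gt = np.asarray(p_gt, dtype=float)
    p_om = np.asarray(p_om, dtype=float)
    # dist[i, j] is the Euclidean distance between p_gt[i] and p_om[j].
    dist = np.linalg.norm(p_gt[:, None, :] - p_om[None, :, :], axis=2)
    gt_to_om = dist.min(axis=1).max()  # farthest ground-truth point from P_OM
    om_to_gt = dist.min(axis=0).max()  # farthest predicted point from P_GT
    return float(max(gt_to_om, om_to_gt))
```

Because the larger of the two directed distances is taken, a single outlying point on either boundary determines the result.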

The DF was defined as:

$$ \text{DF}=\text{HC}_{\text{GT}}-\text{HC}_{\text{OM}}, $$
(11)

where \(\text{HC}_{\text{OM}}\) is the head circumference of the ellipse obtained by the proposed method and \(\text{HC}_{\text{GT}}\) is that of the ground truth.

The ADF was defined as:

$$ \text{ADF}=|\text{HC}_{\text{GT}}-\text{HC}_{\text{OM}}|. $$
(12)

To evaluate the regression network, the root mean square error (RMSE) is used to measure the error between the predicted and true values of the OFD angle or length.
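The circumference and regression metrics are simple enough to state in a few lines; the sketch below computes per-image DF and ADF values and the RMSE over a set of predictions. Summary statistics such as mean ± standard deviation can then be taken over the per-image arrays. The function names are illustrative.

```python
import numpy as np


def difference(hc_gt, hc_om):
    """Signed head circumference difference (DF), ground truth minus prediction."""
    return np.asarray(hc_gt, dtype=float) - np.asarray(hc_om, dtype=float)


def absolute_difference(hc_gt, hc_om):
    """Absolute head circumference difference (ADF)."""
    return np.abs(difference(hc_gt, hc_om))


def rmse(predicted, true):
    """Root mean square error between predicted and true values."""
    err = np.asarray(predicted, dtype=float) - np.asarray(true, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))
```

DF keeps the sign of the error and therefore reveals systematic over- or under-estimation, whereas ADF and RMSE measure its magnitude only.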

4.2 U-Net baseline comparison

To evaluate the proposed segmentation network, we conduct a series of experiments that compare the segmentation performance of the U-Net baseline with that of the proposed network under our best settings. We compare the algorithms both qualitatively and quantitatively: the qualitative comparison shows where the algorithm has improved, and the quantitative comparison shows by how much. Because fetal head segmentation is mostly used to assess fetal development, we need to know not only the average of each segmentation metric but also the worst and best cases of our segmentation network. We therefore report statistics over all predictions of our segmentation algorithm.

The qualitative comparison between the proposed network and the U-Net baseline is shown in Fig. 8. The results show that the proposed network handles the incomplete regions of the skull better while still producing a smooth segmentation in the complete regions. In the first trimester, the ultrasound images contain heavy noise, uncertain areas, and unclear skull boundaries; U-Net can get lost in these areas and may even identify other contours as the skull. In the second trimester, there is an abrupt jump in the fetal skull border, which confuses the U-Net segmentation. In the third trimester, the skull itself has large discontinuous areas in the ultrasound image, and the black masks that cover other fetal information in the dataset make the discontinuous areas larger and more irregular, so a U-Net that uses only a single scale often makes errors.

Fig. 8

Qualitative comparison of the segmentation performance of the networks on fetal head ultrasound images. The original images and the ground-truth region boundaries are shown in the first and second rows, respectively. The outputs obtained using the U-Net baseline and the proposed SAPNet are shown in the last row. SAPNet gives the best results

To compare U-Net and the proposed network quantitatively, we evaluate the outputs of the two segmentation networks without any post-processing. The outputs of both networks are resized to 320 × 480 and compared with the ground truth at the same size. Three metrics, mIoU, mPA, and DSC, are used for the quantitative comparison. The results of the two networks against the ground truth are shown in Table 4. The proposed SAPNet improves on the baseline by 3.05/1.83/2.57, reaching 96.46%/98.02%/97.26%.

Table 4 Quantitative comparison of segmentation results for the U-Net baseline and the proposed SAPNet from the first trimester to the third trimester

To obtain the ellipse closest to the fetal head, we fit the segmentation results of U-Net and SAPNet by least squares, as detailed in Section 3.4. The fitted ellipses are compared using four metrics: DF, ADF, DSC, and HD. All metric values are listed in Table 5. In Table 6, we compare our OFD angle regression network with the other networks combined with ellipse fitting. The results show that an ellipse fitted directly to the output of a segmentation network matches the ground truth only in circumference, while its OFD and BPD have large errors. In the first and second trimesters, the regression network clearly outperforms the other methods. In the third trimester, however, its performance is almost the same as that of the other methods, which we attribute to the fact that the middle cerebral line is not clearly visible in most ultrasound images at this stage.

Table 5 Comparison of the metric for ellipse after segmentation contours fitting
Table 6 Comparison of OFD, BPD, and OFD angle prediction of different networks

4.3 Ablation experiments

To show the effectiveness of the different components of SAPNet, we present an ablation experiment that quantitatively analyzes the following components described above: dilated convolution, the feature pyramid, and the scale attention module. As listed in Table 7, these experiments show that each factor affects the final result.

Table 7 Ablation analysis of our proposed SAPNet with different settings

Ablation study for segmentation network:

As shown in Table 8, we first test how the number of feature pyramid layers affects the final result, with the input feature map of each pyramid layer resized to the same size. The best results are achieved with three layers. To reduce the loss of detail caused by upsampling the small feature maps of the feature pyramid, we replace the convolutions in the last two layers of the encoder with dilated convolutions. Dilated convolution works better than ordinary convolution, as shown in Table 7. Furthermore, when the scale attention module is applied to the feature pyramid, the performance of the network improves further.

Table 8 Detailed analysis of our proposed SAPNet with different layers of feature pyramids
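The motivation for dilated convolution can be seen in a small one-dimensional sketch: the output keeps the input resolution while the receptive field grows with the dilation rate, so no upsampling is needed afterwards. The helper names and the zero-padding choice are assumptions for illustration and do not reproduce the SAPNet encoder layers.

```python
import numpy as np


def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation=1):
    """'Same'-size 1D dilated cross-correlation with zero padding.

    out[i] = sum_j kernel[j] * padded[i + j * dilation]
    """
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    k = kernel.size
    if k % 2 == 0:
        raise ValueError("an odd kernel size is expected")
    pad = dilation * (k - 1) // 2
    padded = np.pad(signal, pad)       # zeros on both sides
    out = np.zeros_like(signal)
    for j in range(k):
        start = j * dilation           # taps are spaced `dilation` samples apart
        out += kernel[j] * padded[start:start + signal.size]
    return out
```

For a kernel of size 3, a dilation of 2 covers a span of 5 samples with the same three weights, and the output has the same length as the input.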

Ablation study for regression network:

As shown in Table 9, we compare regression of the absolute OFD angle with regression of the incremental OFD angle. The absolute angle is the angle between the OFD and the X-axis of the image, and the incremental angle is the angle between the long axis of the fitted ellipse and the real OFD. For absolute angle prediction, σ in the activation function of the last layer of the regression network is set to 180. Compared with absolute angle prediction, incremental angle prediction is equivalent to reducing σ, which also reduces the search space of the deep neural network.

Table 9 Comparison of OFD angle with different prediction methods
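The relation between the two angle targets can be sketched as follows. The conventions are assumptions for this example: angles are in degrees, an axis is treated as an undirected line so that absolute angles lie in [0, 180), and the incremental angle is wrapped to [-90, 90). The function names are hypothetical.

```python
def incremental_angle(ofd_angle_deg, major_axis_angle_deg):
    """Signed angle from the fitted major axis to the OFD, wrapped to [-90, 90).

    Both inputs are absolute angles to the image X-axis. Because an axis has
    no direction, angles that differ by 180 degrees describe the same line.
    """
    diff = ofd_angle_deg - major_axis_angle_deg
    return (diff + 90.0) % 180.0 - 90.0


def absolute_from_incremental(major_axis_angle_deg, increment_deg):
    """Recover the absolute OFD angle in [0, 180) from a predicted increment."""
    return (major_axis_angle_deg + increment_deg) % 180.0
```

For example, an OFD at 5 degrees and a fitted major axis at 175 degrees differ by an increment of 10 degrees, not 170. When the fitted axis is already close to the OFD, the regression target stays in a narrow band around zero.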

To verify the performance of the ellipse pooling module, we add a new feature extraction branch with the same encoding structure as SAPNet, followed by the OFD angle regression network. This branch runs in parallel with SAPNet and is trained only through the gradients back-propagated from the regression network. The results are shown in Table 10.

Table 10 Comparison of the results with and without ellipse pooling

4.4 Results on the test set

Combining the best settings found above, we evaluate the automated fetal head measurement system on the HC18 test set. For this evaluation, we train with the best settings for 700 epochs using the Adam optimizer, so the results are expected to be better than those on our validation set. The final output of the entire system ranked first on the HC18 leaderboard (December 23, 2018, account name: shenzexu). Without the regression network, the result obtained from the segmentation network output with ellipse fitting alone ranks fourth. To keep the comparison authoritative, we selected only results from published papers, as shown in Table 11. Our best result with SAPNet achieves 1.81 ± 1.69, 97.94 ± 1.34, 0.59 ± 2.41, and 1.22 ± 0.77 in terms of ADF, DSC, DF, and HD, respectively.

Table 11 Results of the HC18 challenge
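The four reported metrics can be sketched as below. The definitions are assumptions based on their usual meaning in the HC18 setting: DF is the signed difference between predicted and reference head circumference, ADF its absolute value, DSC the Dice similarity coefficient of two binary masks, and HD the symmetric Hausdorff distance between two contour point sets. The function names are hypothetical, and this is not the official evaluation code.

```python
import numpy as np


def hc_differences(hc_pred, hc_ref):
    """Return (DF, ADF): signed and absolute head circumference difference."""
    df = float(hc_pred) - float(hc_ref)
    return df, abs(df)


def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient of two binary masks, in [0, 1]."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0                      # two empty masks are treated as identical
    return float(2.0 * np.logical_and(a, b).sum() / total)


def hausdorff_distance(points_a, points_b):
    """Symmetric Hausdorff distance between two (N, 2) point sets.

    Uses a dense pairwise distance matrix, which is fine for contours of a
    few thousand points but not for very large sets.
    """
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    a_to_b = dists.min(axis=1).max()    # farthest point of a from set b
    b_to_a = dists.min(axis=0).max()    # farthest point of b from set a
    return float(max(a_to_b, b_to_a))
```

Reporting DF next to ADF is useful because a mean DF near zero can hide large errors of opposite sign, which ADF exposes.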

5 Discussion

The most important observation from our experiments is that multi-scale information, which combines local and global context, helps identify the edges of the skull, while the regression network corrects the geometric axes of the ellipse into the biological OFD and BPD. The feature pyramid is built at the feature level to exploit the local and global information that corresponds to the different receptive field sizes of the feature layers. Our network structure therefore differs substantially from previous networks that segment the fetal head region at a single scale. Furthermore, we proposed a scale attention module for multi-scale information fusion, which yielded better performance in our experiments.

On the other hand, unlike previous approaches that treat biological and geometric lengths as equal, we add a regression network that obtains OFD and BPD by correcting the major and minor axes of the ellipse. The ellipse pooling module plays an important role in this regression network because it combines the ellipse parameters fitted to the segmentation results with visual features to correct the geometric lengths. Our experiments also demonstrate the effectiveness of the regression network.

Our proposed network and U-Net also share some problems: as shown in Table 5, both perform worse in the first trimester than in the second and third trimesters. This is because the skull tissue of a first-trimester fetus is softer and very similar to the tissue inside the skull, so there is no obvious change in appearance between the skull and the inside of the fetal head in the ultrasound images. This remains an open question for further advancing fetal head measurement. One of the simplest remedies is to design a network structure specifically for the first trimester.

6 Conclusion

In this work, we proposed a novel deep neural network that uses multi-scale information for fetal head segmentation and accurate BPD and OFD prediction in ultrasound images, and designed an automatic measurement system based on it. We designed SAPNet, which builds feature pyramids and uses an attention mechanism to select feature layers. By using scale information, SAPNet fuses local and global information to infer skull boundaries that are corrupted by speckle noise or discontinuities. Based on the segmentation results of SAPNet, we obtain the head circumference by least-squares ellipse fitting. Ellipse pooling projects the ellipse parameters onto the encoding feature layers of the segmentation network, and the regression network corrects the geometric axes of the ellipse to obtain more accurate BPD and OFD. Our experimental results show that the proposed approach achieves performance comparable with other models on the HC18 dataset. However, our results apply only to ultrasound images containing a single target. Future work should include multi-target data so that performance can be evaluated on the fetal head regions of twins.