1 Introduction

The human brain consists of two roughly symmetric cerebral hemispheres separated by the ideal midline (IML), a straight line connecting the most anterior and posterior visible points on the falx, as shown in Fig. 1. However, high or unbalanced intracranial pressure (ICP) can distort the IML into a deformed midline (DML), also presented in Fig. 1. Such midline shift (MLS) can be deadly and is usually caused by conditions such as traumatic brain injury (TBI), stroke, or hematoma. In fact, the degree of MLS serves as a quantitative indicator for neurosurgeons to monitor and control ICP.

Clinically, visual inspection of MLS on head CT images is widely practiced. However, comprehensive and quantitative MLS evaluation remains challenging and time-consuming for many health care providers (e.g., emergency physicians). Computer-aided methods are therefore of research value and practical significance, given that they could improve the accuracy and efficiency of this evaluation.

In the literature, only a few works [1, 5, 6] have tried to facilitate this clinical evaluation. Liao et al. [5] treated the midline as three isolated segments and fit mathematical curves to these segments using local intensity symmetry. Liu et al. [6] employed a landmark detection method to locate relatively stable key points (e.g., the falx and septum pellucidum) and build the DML. Chen et al. [1] proposed a method to transform the IML into the DML, using an enhanced Voigt model to simulate the shift process. However, most of the aforementioned methods lack robustness on CT images with largely deformed brains, for two reasons: (a) the midline is relatively difficult to distinguish in CT images given the low soft-tissue contrast, especially when affected by TBI or stroke; and (b) the shape of the DML varies considerably with the brain deformation.

To address these issues, we formulate midline delineation as a skeleton extraction task and propose a novel regression-based line detection network (RLDN) for robust delineation, especially in largely deformed brains. Our method enables end-to-end training and includes three parts: (1) multi-scale line detection; (2) weighted line integration; and (3) regression-based refinement. Specifically, we propose a novel multi-scale feature integration strategy over all scales to better capture high-level semantic and low-level detailed information and generate midline probability maps. A weighted line integration module then fuses these maps into a final map. Finally, a regression-based refinement module refines and thins the final map to produce the coordinates of the midline. Experimental results show that the proposed method improves line detection accuracy and achieves state-of-the-art performance. Moreover, to our knowledge, this is the first work to employ a fully convolutional approach for midline delineation on CT images.

Fig. 1. Examples of ideal midlines (IML, green lines) and deformed midlines (DML, red dotted lines) on head CT images. P1 and P2 denote the anterior and posterior falx points. (Color figure online)

2 Method

As mentioned above, the proposed RLDN consists of three parts: (1) multi-scale line detection; (2) weighted line integration; and (3) regression-based refinement, with the architecture shown in Fig. 2. First, the multi-scale line detection network (LDN) takes a 2D CT image \(I \in R^{H \times W}\) as input and outputs probability maps \(\hat{Y} \in R^{H \times W \times S}\) corresponding to \(S=5\) scale levels. The loss \(\mathcal {L} _{\hat{y}_i}\) for each output \(\hat{y}_i \in \hat{Y}\) is computed against the ground truth \(Y \in R^{H \times W}\). Second, \(\hat{Y}\) is merged by the weighted line integration module to yield an improved line detection result \(\hat{y}_{fuse} \in R^{H \times W}\), supervised by the loss \(\mathcal {L}_{fuse}\). Finally, \(\hat{Y}\) and \(\hat{y}_{fuse}\) are used to train the regression-based refinement module with the loss \(\mathcal {L}_{regress}\), which regresses \(\hat{Y}_R \in R^{(H+4) \times 1}\), the coordinates of the midline and its two falx points. The following subsections elaborate on each module of our method.

2.1 Multi-scale Line Detection Network

The proposed LDN includes a down-sampling branch and a multi-scale bidirectional integration (MSBI) module. Specifically, the down-sampling branch adopts the convolutional part of VGG [8] with two modifications: (1) the number of channels is halved, and (2) a pyramid pooling module [12] is attached to the last convolution layer of each of the first three conv-blocks. Moreover, following the side-output manner of HED [9], the branch outputs probability maps \(\hat{Y}\) built from features at five different scales. The MSBI module is then used to integrate these features, as sketched below.
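For illustration, the following is a minimal PyTorch sketch of such a down-sampling stage. The conv-block depths, the halved channel widths, and the pyramid-pooling bin sizes (1/2/3/6, in the spirit of PSPNet [12]) are our assumptions, not the authors' exact implementation; likewise, whether the pyramid-pooled features also feed the next block is our guess, and here they only shape the per-scale feature.

```python
# A hedged sketch of the down-sampling branch: VGG-style conv-blocks with
# halved channels, plus an assumed pyramid pooling module on the first
# three blocks. All widths and bin sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, channels, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(channels, channels // len(bins), 1),
                          nn.ReLU(inplace=True))
            for b in bins)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)  # concat back to input width

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [F.interpolate(s(x), size=(h, w), mode='bilinear',
                                align_corners=False) for s in self.stages]
        return self.fuse(torch.cat([x] + pooled, dim=1))

def conv_block(cin, cout, n_convs):
    layers = []
    for k in range(n_convs):
        layers += [nn.Conv2d(cin if k == 0 else cout, cout, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

# VGG-16 stage widths halved (64->32, ..., 512->256); five scales as in the text.
widths, depths = [32, 64, 128, 256, 256], [2, 2, 3, 3, 3]
blocks = nn.ModuleList(conv_block(1 if i == 0 else widths[i - 1], w, d)
                       for i, (w, d) in enumerate(zip(widths, depths)))
ppms = nn.ModuleList(PyramidPooling(w) for w in widths[:3])  # first three blocks

def down_branch(x):
    """Return the five per-scale feature maps f_1..f_5 fed to the MSBI module."""
    feats = []
    for i, block in enumerate(blocks):
        x = block(x)
        feats.append(ppms[i](x) if i < 3 else x)
        x = F.max_pool2d(x, 2)  # down-sample between blocks
    return feats
```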

As is well known, low-level features focus more on local detailed structures, while high-level features are rich in conceptual semantic information [13]. Our line detection task requires not only high-level semantic information but also high-resolution detail. We therefore propose the MSBI module, inspired by [10], to integrate the features from all scales and resolve this semantics-versus-resolution conflict. Specifically, the module takes the feature map \(f_i\) at scale i as input and produces a feature map \(\hat{y}_i \in \hat{Y}\) of the same resolution as output. The MSBI consists of two directional pathways (shallow-to-deep and deep-to-shallow), so that the features at each scale draw on both semantic and detailed information.

Fig. 2. Overview of the architecture of the proposed RLDN, which consists of three parts: (1) multi-scale line detection, where (A) denotes the multi-scale bidirectional integration module; (2) weighted line integration in (B); and (3) regression-based refinement in (C).

The shallow-to-deep pathway starts from the first conv-block and ends with the last one. Specifically, the feature \(f_i\) at the current scale updates itself by encoding the learned representation \(h_{f_{i-1}}^{SD}\) of the previous scale:

$$\begin{aligned} h_{f_i}^{SD}= \left\{ \begin{array}{lr} \mathcal {F}\left( f_i, R_i \left( h_{f_{i-1}}^{SD}\right) \right) , &{} i \ge 2 \\ f_i,&{} i=1, \\ \end{array} \right. \end{aligned}$$
(1)

where \(h_{f_i}^{SD}\) represents the updated feature map at the current scale, and \(R_i \left( \cdot \right) \) denotes bilinear interpolation, needed because feature sizes differ across scales. \(\mathcal {F} \left( \cdot \right) \) represents concatenation followed by a \(1 \times 1\) convolution that keeps the number of channels unchanged. An additional convolution and ReLU then encode \(h_{f_i}^{SD}\) into the learned representation passed on to the next scale.

On the other hand, the deep-to-shallow pathway gathers the multi-scale features from high-level to low-level. This process can be formulated as:

$$\begin{aligned} h_{f_i}^{DS}= \left\{ \begin{array}{lr} \mathcal {F}\left( f_i, R_i \left( h_{f_{i+1}}^{DS}\right) \right) , &{} i \le \left( S-1\right) \\ f_i,&{} i=S, \\ \end{array} \right. \end{aligned}$$
(2)

where \(h_{f_i}^{DS}\) denotes the updated feature map. The other operations are the same as those in the shallow-to-deep pathway.

Then, we merge the learned representations (\(h_{f_i}^{SD}\) and \(h_{f_i}^{DS}\)) of each scale to form the fused feature representation \(M_i\):

$$\begin{aligned} M_i = \sigma \left( \mathcal {F} \left( h_{f_i}^{SD}, h_{f_i}^{DS}\right) \right) , \end{aligned}$$
(3)

where \(\sigma \) denotes the ReLU non-linear activation, and the meanings of the other symbols are consistent with those in the two directional pathways. A \(1 \times 1\) convolution and interpolation are then applied to generate the single-channel side-output prediction \(\hat{y}_i\) at the current scale.
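As a concrete illustration, below is a minimal PyTorch sketch of the bidirectional integration in Eqs. (1)-(3). The channel widths, the exact form of the "additional convolution and ReLU", and the placement of the interpolation are our assumptions; only the overall data flow follows the text.

```python
# A hedged sketch of the MSBI module: a shallow-to-deep pass (Eq. 1), a
# deep-to-shallow pass (Eq. 2), a per-scale merge (Eq. 3), and 1x1 side-
# output heads. Feature maps f[0..S-1] are assumed ordered shallow to deep.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSBI(nn.Module):
    def __init__(self, widths):  # e.g. widths = [32, 64, 128, 256, 256]
        super().__init__()
        S = len(widths)
        # F(.,.) of Eqs. (1)-(2): concat, then 1x1 conv back to the scale's width.
        self.fuse_sd = nn.ModuleList(
            nn.Conv2d(widths[i] + widths[i - 1], widths[i], 1) for i in range(1, S))
        self.fuse_ds = nn.ModuleList(
            nn.Conv2d(widths[i] + widths[i + 1], widths[i], 1) for i in range(S - 1))
        # The "additional convolution and ReLU" that encodes the passed map.
        self.enc_sd = nn.ModuleList(nn.Conv2d(w, w, 3, padding=1) for w in widths)
        self.enc_ds = nn.ModuleList(nn.Conv2d(w, w, 3, padding=1) for w in widths)
        # Eq. (3) merge plus the per-scale 1x1 side-output head.
        self.merge = nn.ModuleList(nn.Conv2d(2 * w, w, 1) for w in widths)
        self.heads = nn.ModuleList(nn.Conv2d(w, 1, 1) for w in widths)

    def forward(self, f, out_size):
        S = len(f)
        h_sd, h_ds = [None] * S, [None] * S
        h_sd[0] = f[0]
        for i in range(1, S):                       # shallow-to-deep, Eq. (1)
            prev = F.interpolate(F.relu(self.enc_sd[i - 1](h_sd[i - 1])),
                                 size=f[i].shape[2:], mode='bilinear',
                                 align_corners=False)
            h_sd[i] = self.fuse_sd[i - 1](torch.cat([f[i], prev], dim=1))
        h_ds[S - 1] = f[S - 1]
        for i in range(S - 2, -1, -1):              # deep-to-shallow, Eq. (2)
            nxt = F.interpolate(F.relu(self.enc_ds[i + 1](h_ds[i + 1])),
                                size=f[i].shape[2:], mode='bilinear',
                                align_corners=False)
            h_ds[i] = self.fuse_ds[i](torch.cat([f[i], nxt], dim=1))
        side_outputs = []
        for i in range(S):                          # Eq. (3) and side outputs
            m = F.relu(self.merge[i](torch.cat([h_sd[i], h_ds[i]], dim=1)))
            y = F.interpolate(self.heads[i](m), size=out_size,
                              mode='bilinear', align_corners=False)
            side_outputs.append(torch.sigmoid(y))
        return side_outputs
```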

For each side-output \(\hat{y}_i\), the prediction error is computed using the weighted cross-entropy (WCE) [9] loss function:

$$\begin{aligned} \mathcal {L}_{\hat{y}_i}=-\gamma \sum \nolimits _{j \in Y_+} Y_j \log \hat{y}_{ij} -(1-\gamma ) \sum \nolimits _{j \in Y_-} (1-Y_j)\log (1-\hat{y}_{ij}), \end{aligned}$$
(4)

where \(Y_+\) and \(Y_-\) denote the object and background ground-truth label sets, respectively, and \(\gamma = |Y_-|/|Y|\) is the class weight used to balance object and background. \(Y_j\) and \(\hat{y}_{ij}\) denote the label and the predicted value at pixel \(j=1,\ldots ,|Y|\) for scale i.
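In code, Eq. (4) reduces to a few lines. A minimal sketch follows, assuming \(\hat{y}_i\) is a sigmoid probability map and Y a binary mask; the small \(\epsilon\) term is a numerical-stability assumption of ours.

```python
# Class-balanced cross-entropy of Eq. (4), as in HED [9]. y_hat and y are
# same-shaped tensors; gamma is the background fraction |Y-|/|Y|.
import torch

def weighted_cross_entropy(y_hat, y, eps=1e-6):
    gamma = (y < 0.5).float().mean()          # |Y-| / |Y|
    pos = -(gamma * y * torch.log(y_hat + eps)).sum()          # sum over Y+
    neg = -((1 - gamma) * (1 - y) * torch.log(1 - y_hat + eps)).sum()  # over Y-
    return pos + neg
```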

2.2 Weighted Line Integration

The weighted line integration module receives the \(\hat{Y}\) produced by the LDN and generates the fused probability map \(\hat{y}_{fuse}\). Specifically, we first construct a learnable weight \(W_H \in R^{H \times W \times S }\) and take its Hadamard product with the input to generate S channel maps. A \(1 \times 1\) convolution is then applied to produce the fused prediction \(\hat{y}_{fuse}\). Finally, the following loss computes the error against the ground truth Y:

$$\begin{aligned} \mathcal {L}_{fuse}=WCE\left( \hat{y}_{fuse},Y\right) . \end{aligned}$$
(5)
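A minimal sketch of this module follows, assuming the S side outputs are stacked as an \(S\)-channel tensor; the all-ones initialization of \(W_H\) is our assumption.

```python
# Weighted line integration (Sect. 2.2): a learnable per-pixel, per-scale
# weight W_H applied via Hadamard product, then a 1x1 conv collapsing the
# S scale maps into one fused probability map.
import torch
import torch.nn as nn

class WeightedLineIntegration(nn.Module):
    def __init__(self, height, width, num_scales=5):
        super().__init__()
        self.w_h = nn.Parameter(torch.ones(num_scales, height, width))
        self.fuse = nn.Conv2d(num_scales, 1, kernel_size=1)

    def forward(self, y_stack):          # y_stack: (B, S, H, W) side outputs
        weighted = y_stack * self.w_h    # Hadamard product, broadcast over batch
        return torch.sigmoid(self.fuse(weighted))   # (B, 1, H, W) fused map
```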

2.3 Regression-Based Refinement

The regression-based refinement module takes \(\hat{Y}\) and \(\hat{y}_{fuse}\) as inputs to regress the coordinates of the midline and its two falx points. Note that, since direct line regression focuses more on the overall trend of the line than on its endpoints, we adopt an additional network layer to regress the endpoints. The module consists of two branches: (1) a convolution branch and (2) a soft-argmax branch. The convolution branch includes three residual blocks [3] and four fully-connected layers. It takes \(\hat{Y}\) as input and produces a 1D vector \(\hat{Y}_R \in R^{(H+4) \times 1}\), whose middle sub-vector \(\hat{Y}_{R-midline} \in R^{H \times 1}\) denotes the column coordinates of the midline. The remaining top two and bottom two elements represent the coordinates of the two endpoints.

The soft-argmax branch consists of a single soft-argmax [11] layer and takes \(\hat{y}_{fuse}\) as input to generate an \(H \times 1\) vector, which is then used to update \(\hat{Y}_{R-midline}\) by element-wise addition:

$$\begin{aligned} \hat{Y}_{R-midline}(i) =\hat{Y}_{R-midline}(i)+\sum \nolimits _{j=1}^W \frac{\exp \left( \mu \cdot \hat{y}_{fuse}(i,j)\right) }{\sum \nolimits _{m=1}^W \exp \left( \mu \cdot \hat{y}_{fuse}\left( i,m\right) \right) }\cdot j, \end{aligned}$$
(6)

where \(\hat{y}_{fuse} (i,j)\) denotes the predicted value at location (i, j), and \(\mu =10\) is a hyper-parameter controlling the smoothness of the soft-argmax. Finally, the mean squared error (MSE) loss is employed to compute the error:

$$\begin{aligned} \mathcal {L}_{regress}=MSE(\hat{Y}_R,Y_R), \end{aligned}$$
(7)

where \(Y_R \in R^{(H+4) \times 1}\) denotes the ground truth of the regression task.
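To make the two branches concrete, here is a hedged PyTorch sketch. Plain strided convolutions and two linear layers stand in for the three residual blocks and four fully-connected layers (whose widths the paper does not specify), while the soft-argmax and the \((H+4)\)-element output layout follow Eqs. (6)-(7); the assignment of the first/last two elements to the two endpoints is our assumption.

```python
# Sketch of the refinement module (Sect. 2.3). Only the soft-argmax of
# Eq. (6) and the (H + 4)-value output layout follow the text; the branch
# widths are placeholders for the residual blocks and FC layers.
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_argmax_rows(y_fuse, mu=10.0):
    """Eq. (6): per-row expected column index of the fused map (B, H, W)."""
    weights = F.softmax(mu * y_fuse, dim=2)                  # softmax over columns
    cols = torch.arange(y_fuse.shape[2], dtype=y_fuse.dtype,
                        device=y_fuse.device)
    return (weights * cols).sum(dim=2)                       # (B, H)

class RefinementHead(nn.Module):
    def __init__(self, in_ch=5, height=400):
        super().__init__()
        self.height = height
        self.conv = nn.Sequential(                 # stands in for 3 residual blocks
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        self.fc = nn.Sequential(                   # stands in for 4 FC layers
            nn.Linear(64 * 8 * 8, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, height + 4),
        )

    def forward(self, y_stack, y_fuse):
        # y_stack: (B, S, H, W) side outputs; y_fuse: (B, H, W), i.e. the
        # fused map from Sect. 2.2 squeezed of its channel dimension.
        y_r = self.fc(torch.flatten(self.conv(y_stack), 1))   # (B, H + 4)
        midline = y_r[:, 2:2 + self.height] + soft_argmax_rows(y_fuse)  # Eq. (6)
        return midline, y_r[:, :2], y_r[:, -2:]   # midline columns, two endpoints
```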

2.4 Cost Function and Optimization

We simultaneously optimize all side-outputs \(\hat{y}_i\), the fused prediction \(\hat{y}_{fuse}\), and the regression output \(\hat{Y}_R\) in an end-to-end manner. The loss function of the whole framework is:

$$\begin{aligned} \mathcal {L}_{all}=\gamma \mathcal {L}_{fuse}+\xi \mathcal {L}_{regress}+\sum \nolimits _i^S \lambda _i \mathcal {L}_{\hat{y}_i}, \end{aligned}$$
(8)

where \(\gamma , \xi \), and \(\lambda _i\) are balance weights (this \(\gamma \) is distinct from the class weight in Eq. (4)).
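Assembled from the sketches above, Eq. (8) might look as follows; `weighted_cross_entropy` is the Eq. (4) sketch, and the default weights follow the values reported later in Sect. 3.2.

```python
# Total objective of Eq. (8): fused WCE + MSE regression + per-scale WCE.
import torch.nn.functional as F

def total_loss(side_outputs, y_fuse_hat, y_r_hat, y, y_r,
               gamma=1.0, xi=2.0, lambdas=(1.0,) * 5):
    l_side = sum(lam * weighted_cross_entropy(y_hat, y)        # side-output terms
                 for lam, y_hat in zip(lambdas, side_outputs))
    l_fuse = weighted_cross_entropy(y_fuse_hat, y)             # Eq. (5)
    l_regress = F.mse_loss(y_r_hat, y_r)                       # Eq. (7)
    return gamma * l_fuse + xi * l_regress + l_side
```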

In the inference phase, an input CT image I is simply forwarded through the aforementioned steps to generate the coordinates of the midline and its two falx points. The line is then drawn onto a zero-valued map to obtain the final full-resolution midline map, as shown in Fig. 2.

3 Experiments

3.1 Data

Our dataset was derived from the public CQ500 dataset [2]. We selected all 64 midline shift cases and the same number of healthy subjects for this study. For each subject, the 5 CT slices with the largest brain area were selected and the midline was manually delineated (128 subjects with 640 slices in total). At the subject level, we randomly selected 100 subjects as the training set and the remaining 28 as the testing set. For pre-processing, each CT slice was resampled to a uniform resolution of \(0.5\,\times \,0.5\,\mathrm{mm}^2\), intensity-normalized from the window (−100, 200) to (0, 1), and then cropped to a \(400 \times 288\) patch containing only the brain region, using a simple thresholding segmentation algorithm (a sketch follows below). Finally, we augmented the training set by random rotation, left-right flipping, and brightness changes.
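A minimal NumPy sketch of this pipeline (resampling omitted) is given below. The HU window, patch size, and spacing come from the text; the threshold value and the centering heuristic are our assumptions.

```python
# Pre-processing sketch for Sect. 3.1: HU-window normalization to (0, 1)
# and a thresholding-based 400 x 288 brain crop (assumes the resampled
# slice is at least 400 x 288).
import numpy as np

def normalize_hu(slice_hu, lo=-100.0, hi=200.0):
    """Map the HU window (-100, 200) linearly to (0, 1)."""
    return np.clip((slice_hu - lo) / (hi - lo), 0.0, 1.0)

def crop_to_brain(img, thresh=0.1, out_h=400, out_w=288):
    """Crop a patch centered on the thresholded foreground (assumed scheme)."""
    ys, xs = np.nonzero(img > thresh)
    cy, cx = int(ys.mean()), int(xs.mean())       # center of the brain mask
    y0 = int(np.clip(cy - out_h // 2, 0, img.shape[0] - out_h))
    x0 = int(np.clip(cx - out_w // 2, 0, img.shape[1] - out_w))
    return img[y0:y0 + out_h, x0:x0 + out_w]
```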

To evaluate the line detection task, we adopted two standard measures [9]: the F1-score (the harmonic mean of precision and recall) with the threshold chosen at the optimal dataset scale (ODS) or the optimal image scale (OIS). For the regression task, we defined the following distance-based metrics: line distance error (LDE), max shift distance error (MSDE), anterior falx point distance error (AFDE), and posterior falx point distance error (PFDE).
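The paper does not give closed-form definitions for these metrics. One plausible reading of the LDE, sketched below as an assumption of ours, is the mean per-row horizontal distance between predicted and ground-truth midline columns, converted to millimeters via the 0.5 mm pixel spacing.

```python
# Assumed definition of the line distance error (LDE); pred_cols and
# gt_cols are H-element arrays of column coordinates for one slice.
import numpy as np

def line_distance_error(pred_cols, gt_cols, spacing_mm=0.5):
    """Mean |column difference| over rows, in mm (assumed metric)."""
    return float(np.mean(np.abs(pred_cols - gt_cols)) * spacing_mm)
```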

3.2 Experimental Setting

The proposed method was implemented on the publicly available PyTorch platform. During training, the stochastic gradient descent (SGD) algorithm was used to optimize the whole network. The network weights were initialized by the Xavier algorithm and the weight decay was set to 1e−4. In Eq. (8), we set \(\lambda _i = \gamma =1\) and \(\xi =2\). The remaining hyper-parameters were: mini-batch size (32), base learning rate (1e−4), momentum (0.9), and maximal iteration (400). The learning rate was decreased by a factor of 0.1 every 200 iterations. The experiments were run on an NVIDIA Titan Xp GPU with 12 GB memory.
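A short sketch of this setup follows; the placeholder module stands in for the RLDN, and all hyper-parameter values come from the text.

```python
# Optimization setup of Sect. 3.2: Xavier init, SGD, and a step schedule.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1))  # placeholder for the RLDN

def init_xavier(m):
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model.apply(init_xavier)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                            momentum=0.9, weight_decay=1e-4)
# Decay the learning rate by 0.1 every 200 iterations (400 iterations total).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.1)
```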

Table 1. The ODS and OIS F1-score on the testing dataset by different methods. Higher is better.
Table 2. The mean and standard deviation of distance error (mm) on the testing dataset. Lower is better.

3.3 Results

Our proposed method performs two tasks: (1) line detection and (2) regression. For the line detection task, we compared the result \(\hat{y}_{fuse}\) of RLDN with five other leading CNN-based methods designed for similar skeleton extraction tasks in natural images: HED [9], SRN [4], RCF [7], HiFi [13], and MSB-FCN [10]. For the regression task, VGG-16 [8] was used as the baseline in our experiments.

Visual Inspection: Line detection probability maps are shown in Fig. 3. Our method achieved thinner and more accurate line detection results than all other methods on the severely deformed cases (especially those in the first row), demonstrating its superiority.

Quantitative Comparison: The line detection performance is reported in Table 1, where our RLDN achieved the best ODS and OIS, improving the best previous ODS to 0.78, mainly owing to its precise line detection, as shown in Fig. 3. Comparing LDN with RLDN in Table 1 further verifies the effectiveness of the regression task, which improves the line detection F1-score by 6% (ODS) and 5% (OIS). Likewise, Table 2 verifies the benefit of the line detection task to the regression. In summary, the proposed RLDN achieves state-of-the-art performance on DML delineation.

Fig. 3. Qualitative comparison of our proposed method with the other five state-of-the-art CNN-based methods on some challenging cases. Lines produced by the other methods are broken or quite thick, indicating unsatisfactory results.

4 Conclusion

In this paper, we have proposed a novel regression-based line detection network (RLDN) for the delineation and measurement of largely deformed brain midlines. The algorithm builds on multi-scale line detection and weighted line integration to capture the high-level semantic and low-level detailed information needed to extract the midline, which is finally obtained via regression-based refinement. Comparative results demonstrate that the proposed method achieves a clear performance boost in terms of both accuracy and robustness.