1 Introduction

According to the Global Stroke Fact Sheet 2022 published by the World Stroke Organization [1], stroke remains the world’s second leading cause of death and the third leading cause of death and disability combined. More than US$891 billion is estimated to be spent globally on the treatment and prevention of stroke, equivalent to 1.12% of global GDP [2]. Stroke has thus become one of the major threats to global public health. The lifetime risk of stroke has risen by 50% over the last 20 years [3]. In many countries, the incidence of stroke has grown rapidly alongside accelerating urbanization, population aging, and the prevalence of unhealthy lifestyles. The prevention and treatment of stroke therefore face great challenges.

Stroke is associated with high rates of mortality, recurrence, and disability [4], and can be classified by etiology into two categories: ischemic and hemorrhagic. Ischemic strokes account for over 80% of all strokes [5]. Common causes of ischemic stroke include atherosclerotic plaque and stenosis in the carotid region. Hence, the early detection of atherosclerosis and carotid stenosis is a fundamental approach to preventing ischemic stroke and assessing the risk of its development.

Image segmentation is an important tool for assisting in the diagnosis of stroke. Over the past few decades, numerous vessel segmentation methods have emerged [6,7,8,9,10,11]. In traditional vessel segmentation studies, researchers have proposed many methods [12,13,14,15,16], such as the hierarchical region growing algorithm [17], a semi-automatic carotid segmentation method based on a self-adaptive segmentation algorithm [18], a decision mechanism for venous bone separation points [19], the random walk algorithm [20], the shortest path faster algorithm [21], and a method based on the Hessian matrix [22]. However, most traditional image segmentation methods require manual initialization or interactive operations and therefore cannot be fully automated.

In recent years, deep learning-based segmentation methods have achieved good results in various medical tasks [23,24,25,26,27,28]. Examples include CarotidNet for 3D CTA carotid segmentation [29], 3D-UNet for coronary CCTA image segmentation [30], SC2Net for detecting COVID-19 in X-rays [31], the atlas-based organ segmentation network MTL-ABS3Net [32], a CT image segmentation algorithm for the liver [33], transformers for 3D medical image segmentation [34], a combo loss-based spatio-temporal feature fusion network for coronary artery segmentation [35], MSRF-Net for medical image segmentation [36], and SONNET for cell nucleus segmentation [37]. Nevertheless, applications of deep learning to carotid segmentation in 3D CTA images remain relatively few, and progress has been held back by privacy constraints and the difficulty of annotation.

With the growth of computing power [38], research on stroke risk prediction based on deep learning and machine learning has received increasing attention. Khosla et al. [39] proposed a new automatic feature selection algorithm for stroke risk prediction. Dritsas et al. [40] used machine learning techniques to predict stroke risk. Teoh et al. [41] predicted stroke risk from electronic health records. Arslan et al. [42] used data mining methods to predict stroke. However, these methods do not take full advantage of clinical text and image data; limited by the difficulty of data acquisition, they cannot reach higher accuracy.

Given the above problems, we propose a 3D carotid CTA image segmentation model called CA-UNet and an ischemic stroke risk prediction model. CA-UNet is based on an encoder-decoder structure and improves the down-sampling scheme, which reduces the number of model parameters and effectively accelerates convergence. Skip connections give the network access to information at every scale when decoding features at each scale. We also propose a new fusion loss function tailored to the characteristics of the task and introduce multi-scale training to balance the model’s learning direction. In addition, our ischemic stroke risk prediction model is a fusion network that predicts jointly from multiple data sources: a 3D image feature extraction network that uses carotid CTA images and a machine learning model that uses electronic medical records and medical history. The model predicts morbidity risk through weight fusion, making full use of clinical information, and achieves good results. We validated the effectiveness of our models through comparative tests on our dataset. Our three contributions are listed below:

  1. We propose a model that is more applicable to 3D CTA carotid segmentation.

  2. We propose a multi-scale loss function for joint training, which mitigates the loss of fine image details during down-sampling.

  3. The proposed ischemic stroke risk prediction model can effectively predict the risk of ischemic stroke in patients and could make a significant contribution to public health.

2 Materials and Methods

2.1 Dataset

2.1.1 Private Dataset

The image data in the private dataset comprise CTA images of 42 patients for the segmentation task and CTA images of 390 patients for the ischemic stroke risk prediction task. These data were provided by partner hospitals and de-identified for use in the study. The approximately 31,000 CTA images of the 42 patients used for the segmentation task were annotated by two or more radiologists. We randomly selected 25 sets of 3D CTA images for training and used the remaining 17 sets for testing. The CTA images of the 390 patients for the ischemic stroke risk prediction task comprise approximately 290,000 CTA images; we randomly chose 80% of them as training samples and the rest as test samples. Table 1 illustrates the sample distribution.

Table 1 Sample distribution of private datasets for risk prediction

The text data in the private dataset contains electronic medical records and the medical history of 390 patients used for the ischemic stroke risk prediction task. The text data categories are age, gender, blood glucose level, body mass index (BMI), smoking status, type of residence, type of work, marital status, history of heart disease and history of hypertension.

2.1.2 Public Dataset

The public dataset, which contains only text data, is the Stroke Prediction Dataset from Kaggle [43], comprising 4908 patient records. It assists in training the machine learning model, since the sample size of text data in the private dataset is relatively small. We randomly selected 80% of the samples for training and the rest for testing; the sample distribution is shown in Table 2. The data categories used from the public dataset are the same as those of the private dataset.

Table 2 Data distribution of public dataset

2.2 Data Preprocessing

2.2.1 Carotid Segmentation Task

In the carotid segmentation task, preprocessing of the dataset is divided into two steps. First, the image intensity values are clipped according to the target of interest. The pixel values in CTA images are expressed in Hounsfield Units (HU). CTA imaging has a wide dynamic range: HU values in the human body range from −1000 to +1000, about 2000 values in total, and humans cannot distinguish such small grayscale differences. Radiologists therefore adjust the window width and window center of a CTA image according to the actual condition to see the target better. Based on this idea, we eliminate the interference of irrelevant structures by limiting the range of HU values in the CTA images, i.e., setting two thresholds: a minimum HU and a maximum HU. If the HU value of a voxel is smaller than the minimum or larger than the maximum, it is truncated to the corresponding threshold. As shown in Fig. 1, this processing excludes the influence of most tissues outside the carotid HU range, which effectively reduces mis-segmentation and facilitates network training.
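
As a concrete illustration, the HU clipping step reduces to a few lines. The sketch below is ours, not code from the paper, and the thresholds (0 and 600 HU) are hypothetical placeholders, since the exact window used for the carotid region is not reported:

```python
import numpy as np

def clip_hu(volume: np.ndarray, hu_min: float = 0.0, hu_max: float = 600.0) -> np.ndarray:
    """Truncate voxel intensities to the window [hu_min, hu_max].

    Voxels below hu_min or above hu_max are set to the nearest threshold,
    suppressing tissues whose HU values lie outside the carotid range.
    The default thresholds are illustrative assumptions, not the paper's values.
    """
    return np.clip(volume, hu_min, hu_max)

# Usage: `volume` is a (depth, height, width) array of HU values
# volume = clip_hu(volume, hu_min=0, hu_max=600)
```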

Next, the CTA volumes were resampled and the cross-sections down-sampled. Since the CTA images used in this paper came from multiple scanning devices, the z-axis pixel spacing of the data is not uniform. When training a deep learning model on 3D CTA images, different slice thicknesses change the scale of the extracted features and degrade model performance. The input CTA images were therefore resampled by trilinear interpolation, keeping the center of each volume fixed, and remapped to a spatial coordinate system with a voxel pitch of (1, 1, 1) mm so that all samples share the same spacing. Because the 3D volumes used in this paper occupy hundreds of times more memory than 2D images, GPU memory limits make the model impossible to train at full resolution even with a batch size of one volume. The input CTA images were therefore down-sampled in the cross-section from \(512 \times 512\) to \(256 \times 256\), again by trilinear interpolation. Figure 1 shows a cross-sectional comparison of CTA images before and after preprocessing.
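
Both operations can be expressed as trilinear interpolation. The following is a minimal sketch under our own conventions (PyTorch, volumes stored as (D, H, W), spacing in mm); it is not the paper's code:

```python
import torch
import torch.nn.functional as F

def resample_isotropic(volume: torch.Tensor, spacing: tuple) -> torch.Tensor:
    """Trilinearly resample a (D, H, W) volume to (1, 1, 1) mm voxel pitch.

    `spacing` is the source pitch in mm as (z, y, x); the output size is
    scaled so the physical extent of the volume is preserved.
    """
    d, h, w = volume.shape
    size = (round(d * spacing[0]), round(h * spacing[1]), round(w * spacing[2]))
    v = volume[None, None].float()  # add batch and channel dims
    return F.interpolate(v, size=size, mode="trilinear", align_corners=False)[0, 0]

def downsample_cross_section(volume: torch.Tensor) -> torch.Tensor:
    """Reduce each 512x512 cross-section to 256x256, keeping the slice count."""
    v = volume[None, None].float()
    return F.interpolate(v, size=(volume.shape[0], 256, 256),
                         mode="trilinear", align_corners=False)[0, 0]
```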

Fig. 1 Data preprocessing in carotid segmentation task

Fig. 2 3D structure of the main region of the carotid artery

Fig. 3 CA-UNet. The input 3D CTA image size is \(256 \times 256 \times 32 \,(\hbox {H}\times \hbox {W}\times \hbox {D})\). The model first convolves the input image with a kernel of size \(3\times 3 \times 3\). Along the contracting path, the number of channels increases in the order 16, 64, 128, and 256 as the feature maps shrink; along the expanding path, the number of channels gradually decreases and the feature maps grow. Note that the first feature map of each layer of the expanding path has 64 channels, since each has four sources. (For example, the second layer of the expanding path also receives a skip connection from the fourth layer, which is not drawn in the figure for space reasons.)

2.2.2 Ischemic Stroke Risk Prediction Task

The two preprocessing steps described above were also used on CTA images in the ischemic stroke risk prediction task. The difference is that here we crop the carotid artery region as the input to the network model. The input to the image feature extraction network in the ischemic stroke risk prediction model is the segmented carotid region, so this can be regarded as a task downstream of carotid segmentation: the input is a 3D CTA image of the carotid region rather than the complete CTA volume. Following professional medical advice, we selected 2 cm upward and 3 cm downward of the carotid bifurcation as the target region, i.e., with the bifurcation as the center we took 20 slices upward and 30 slices downward in the section direction, 50 slice images in total. We then cropped the carotid artery region according to the image segmentation results and, to facilitate network training, resized the cropped region to \(40 \times 40 \times 50\).
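
A sketch of this cropping step is given below. It assumes 1 mm slice spacing (so 2 cm and 3 cm correspond to 20 and 30 slices), a known bifurcation slice index, and a binary segmentation mask from the previous task; the in-plane bounding-box crop is our illustrative choice, as the paper does not detail it:

```python
import torch
import torch.nn.functional as F

def crop_carotid_region(volume, mask, bifurcation_z):
    """Extract the 50-slice carotid block around the bifurcation and resize it.

    `volume` and `mask` are (D, H, W) tensors. The mask zeroes out voxels
    outside the segmented carotid region, a bounding box around the mask
    crops the cross-section, and the block is resized to the paper's
    40 x 40 x 50 (H x W x D) input size.
    """
    block = volume[bifurcation_z - 20 : bifurcation_z + 30].float()
    mblock = (mask[bifurcation_z - 20 : bifurcation_z + 30] > 0)
    block = block * mblock.float()                            # keep carotid voxels only
    ys, xs = torch.nonzero(mblock.any(dim=0), as_tuple=True)  # in-plane bounding box
    block = block[:, ys.min() : ys.max() + 1, xs.min() : xs.max() + 1]
    block = F.interpolate(block[None, None], size=(50, 40, 40),
                          mode="trilinear", align_corners=False)[0, 0]
    return block.permute(1, 2, 0)                             # (D, H, W) -> (H, W, D)
```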

Next, the public dataset was preprocessed. It contains some missing entries that need to be handled. We compared several treatments of the missing values, including deleting the incomplete records and filling the gaps with column averages; after this comparison, we used a decision tree to predict the missing values.
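
The paper does not name the imputation implementation; one common realization of decision-tree imputation is scikit-learn's IterativeImputer with a tree estimator, sketched below on a toy matrix:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the API)
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

# Toy numeric feature matrix (e.g. age, BMI, glucose); np.nan marks missing entries.
X = np.array([[63.0, 28.1, 105.0],
              [54.0, np.nan, 97.0],
              [71.0, 31.4, np.nan],
              [48.0, 24.9, 88.0]])

# Each column with gaps is modeled as a regression on the other columns,
# with a decision tree as the per-column predictor.
imputer = IterativeImputer(estimator=DecisionTreeRegressor(max_depth=5), random_state=0)
X_filled = imputer.fit_transform(X)
```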

2.3 CA-UNet Model

2.3.1 Main Structure

Compared with 2D segmentation tasks, the target carotid region has prominent overall 3D structural characteristics. The 3D structure of the central carotid region is shown in Fig. 2. The carotid artery can be roughly divided into the common carotid artery below the bifurcation and the internal and external carotid arteries above it. The common carotid artery is larger, while the internal and external carotid arteries become smaller as the vessels extend upward. To adequately parse the information in the CTA images, the CA-UNet model adopts an encoder-decoder structure, shown in Fig. 3. The left side is a contracting path and the right side is an expanding path. The primary function of the contracting path is to extract features from the 3D CTA images; the expanding path fuses multi-scale features and gradually restores the feature map to the same size as the input image.

In past work, image features at different scales within a model were not related to one another. We therefore use skip connections, which give the network information from every scale when decoding features at each scale: fusing feature maps from different scales through skip connections lets the network decode each layer’s features with information from all scales and improves performance. Take the feature map of the third layer as an example. Its input has two kinds of sources: skip connections from the first three layers of the contracting path, and the up-sampled feature map of the layer above it, i.e., the fourth layer of the expanding path. Fusing these four groups of feature maps requires solving two problems. The first is that feature maps at different scales have different sizes. We solve this with max pooling; for example, the first layer of the contracting path outputs a feature map of size \(256 \times 256 \times 32\), which is pooled by a 3D max pooling layer with a \(4 \times 4 \times 4\) window and a stride of 4. The second problem is that the number of channels differs across scales: deep feature maps can have tens of times more channels than shallow ones, so direct concatenation would leave shallow features a tiny fraction of the fused result. Here we apply a standard convolution module (3D convolution + batch normalization + ReLU activation) to each of the three pooled feature maps separately. The four sets of 3D feature maps are then combined along the channel dimension by concatenation. Finally, the combined features are fed through a convolution kernel that fuses them into the output feature map of that scale layer of the expanding path. Besides being up-sampled as input to subsequent layers, this output feature map also participates directly in the loss computation as one of the predictions in the joint training scheme of the multi-scale loss function.
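
The fusion just described can be condensed into a small module. The sketch below is our reading of the scheme, not the authors' code: the encoder channel widths follow Fig. 3, but the per-source width of 16 and the pooling-factor arithmetic are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Fuse encoder features from several scales into one decoder layer.

    Encoder maps from shallower scales are max-pooled to the target
    resolution, each is passed through Conv3d+BN+ReLU, then all sources
    (including the up-sampled deeper decoder map) are concatenated and
    fused by a final convolution into a 64-channel output, as in Fig. 3.
    """
    def __init__(self, enc_channels=(16, 64, 128), up_channels=256, width=16):
        super().__init__()
        self.reduce = nn.ModuleList(
            nn.Sequential(nn.Conv3d(c, width, 3, padding=1),
                          nn.BatchNorm3d(width), nn.ReLU(inplace=True))
            for c in enc_channels)
        self.fuse = nn.Sequential(
            nn.Conv3d(3 * width + up_channels, 64, 3, padding=1),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True))

    def forward(self, enc_feats, up_feat):
        # enc_feats: encoder maps ordered shallow -> deep; the deepest is
        # already at the target scale. up_feat: the deeper decoder map.
        target = enc_feats[-1].shape[2:]
        srcs = []
        for f, red in zip(enc_feats, self.reduce):
            if f.shape[2:] != target:
                k = f.shape[2] // target[0]        # pooling factor, e.g. 4
                f = F.max_pool3d(f, kernel_size=k, stride=k)
            srcs.append(red(f))
        srcs.append(F.interpolate(up_feat, size=target,
                                  mode="trilinear", align_corners=False))
        return self.fuse(torch.cat(srcs, dim=1))
```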

Fig. 4 Comparison of different size blood vessels in carotid CTA images

Furthermore, in the carotid 3D CTA segmentation task, the carotid region accounts for a small proportion of the whole volume and the vessels vary in thickness. As shown in Fig. 4, the difference in vessel size between the internal carotid, external carotid, and common carotid regions is evident. The four-stage down-sampling scheme of the original U-Net [44] is therefore excessive for this task. By removing the deepest down-sampling layer, we can increase the number of shallow convolutional layers and channels and reduce the model parameters, effectively accelerating training with no impact on performance.

2.3.2 Loss Function

Since the carotid region accounts for a small proportion of the whole 3D CTA volume, the segmentation task suffers from a severe imbalance between positive and negative samples: statistics over the dataset show a positive-to-negative ratio of about 0.003. Thus, based on the Dice distance, we design an improved loss function that effectively addresses this imbalance and makes the network perform better on hard samples. The single-scale loss function is given by Eq. (1).

$$\begin{aligned} \begin{aligned} L_{\text{ Hybrid } }&=\left( -\log \left( \frac{2 \cdot \sum _{i=0}^N\left( y_i \cdot \bar{y}_i\right) }{L_{H_{-} D}}\right) \right) ^\gamma \\ L_{H_{-} D}&=\sum _{i=0}^N\left( y_i \cdot \bar{y}_i\right) +\alpha \sum _{i=0}^N\left( \left( 1-y_i\right) \cdot \bar{y}_i\right) \\&\quad +\beta \sum _{i=0}^N\left( y_i \cdot \left( 1-\bar{y}_i\right) \right) \end{aligned} \end{aligned}$$
(1)

where \(\alpha\) controls the weight of false positive samples in the loss and \(\beta\) controls the weight of false negative samples; adjusting these two parameters lets us steer the model’s prediction bias. In this paper, \(\alpha\) and \(\beta\) are set to 0.4 and 0.6, which brings the balance and performance of the model to a better state, and \(\gamma\) is set to 0.3 to increase the nonlinearity of the function.

In addition, to address the unstable training and difficult convergence of the original Dice loss and to help the model escape local extrema, we add a binary cross-entropy loss term, which effectively smooths the gradient. The final single-scale loss function used in this paper is given by Eq. (2).

$$\begin{aligned} \begin{aligned} L_{\text{ Hybrid } }&=\left( -\log \left( \frac{2 \cdot \sum _{i=0}^N\left( y_i \cdot \bar{y}_i\right) }{L_{H_{-} D}}\right) \right) ^\gamma +\lambda L_{B C E} \\ L_{H_{-} D}&=\sum _{i=0}^N\left( y_i \cdot \bar{y}_i\right) +\alpha \sum _{i=0}^N\left( \left( 1-y_i\right) \cdot \bar{y}_i\right) \\&\quad +\beta \sum _{i=0}^N\left( y_i \cdot \left( 1-\bar{y}_i\right) \right) \end{aligned} \end{aligned}$$
(2)

where the parameter \(\lambda\) balances the contribution of the binary cross-entropy term. Setting \(\lambda\) to 1 at the beginning of training helps the training proceed more stably and accelerates convergence; in the later stages we gradually decrease \(\lambda\) so that the model’s performance improves according to the Dice distance.
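
For concreteness, a direct PyTorch transcription of Eq. (2) might look as follows. This is our sketch, not released code; the clamp on the log term is our own practical guard for the late-training regime, where the ratio inside the logarithm can exceed 1:

```python
import torch
import torch.nn.functional as F

def hybrid_loss(y_pred, y_true, alpha=0.4, beta=0.6, gamma=0.3, lam=1.0, eps=1e-6):
    """Single-scale fusion loss of Eq. (2).

    `y_pred` holds sigmoid probabilities and `y_true` binary labels.
    alpha/beta weight false positives/negatives and gamma is the
    exponent (0.4, 0.6, 0.3 in the paper); lam is the BCE weight that
    is scheduled from 1 downward during training.
    """
    y_pred = y_pred.reshape(-1)
    y_true = y_true.reshape(-1).float()
    tp = (y_true * y_pred).sum()
    fp = ((1 - y_true) * y_pred).sum()
    fn = (y_true * (1 - y_pred)).sum()
    ratio = 2 * tp / (tp + alpha * fp + beta * fn + eps)
    # Clamp keeps the base of the gamma power non-negative (our guard).
    dice_term = torch.clamp(-torch.log(ratio + eps), min=0.0) ** gamma
    bce = F.binary_cross_entropy(y_pred, y_true)
    return dice_term + lam * bce
```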

To better supervise the fusion of features at each scale of CA-UNet, we propose a multi-scale, multi-loss joint training scheme that makes good use of the image features extracted at every scale layer during training. An additional convolutional layer is attached after each of the four scale layers of the expanding path; it convolves that layer’s output with a single-channel kernel of size \(3 \times 3 \times 3\). We then use trilinear interpolation to restore the different feature maps to the input image size and compute the corresponding loss values with the Sigmoid function and the single-scale loss function above. Finally, we assign different weights to the losses of the scale layers and accumulate them into the final loss. The multi-scale loss function is given by Eq. (3).

$$\begin{aligned} \begin{aligned} \text{ Loss } =\left( L_{\text{ Hybrid1 } }+L_{\text{ Hybrid2 } }+L_{\text{ Hybrid3 } }\right) \cdot \omega +L_{\text{ Hybrid4 } } \end{aligned} \end{aligned}$$
(3)

where \(L_{\text{ Hybrid1 } }\), \(L_{\text{ Hybrid2 } }\), \(L_{\text{ Hybrid3 } }\) and \(L_{\text{ Hybrid4 } }\) denote the loss values computed from the four scale feature maps of the network, from deep to shallow; \(L_{\text{ Hybrid4 } }\) is the loss computed from the network’s output image, and each of the four losses is computed with the single-scale loss function designed above. \(\omega\) is the weight of the first three losses. During training, this weight is decreased by a fixed factor after every certain number of iterations, so that the proportion of \(L_{\text{ Hybrid4 } }\) in the overall loss gradually grows and the model output moves closer to the target in the later stages of training.
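
Putting the pieces together, the joint scheme of Eq. (3) can be sketched as below; the decoder channel widths are inferred from Fig. 3 and the deep-to-shallow ordering of the feature list is our assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleLoss(nn.Module):
    """Deep-supervision heads implementing Eq. (3).

    Each decoder scale gets a single-channel 3x3x3 convolutional head; its
    output is trilinearly restored to the input size, squashed by a sigmoid,
    and scored with `hybrid_loss` above. The three auxiliary losses share
    the decaying weight omega.
    """
    def __init__(self, channels=(256, 128, 64, 16)):
        super().__init__()
        self.heads = nn.ModuleList(nn.Conv3d(c, 1, 3, padding=1) for c in channels)

    def forward(self, feats, y_true, omega):
        # feats: decoder feature maps ordered deep -> shallow; the last one
        # produces the final prediction (L_Hybrid4). y_true: (N, 1, D, H, W).
        size = y_true.shape[2:]
        losses = []
        for f, head in zip(feats, self.heads):
            p = torch.sigmoid(F.interpolate(head(f), size=size,
                                            mode="trilinear", align_corners=False))
            losses.append(hybrid_loss(p, y_true))
        return omega * sum(losses[:3]) + losses[3]
```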

Fig. 5 3D dense connection module structure

2.4 Ischemic Stroke Risk Prediction Model

2.4.1 3D Image Feature Extraction Network

In this paper, to extract 3D features from carotid CTA images, we extend each convolutional layer of the conventional DenseNet from 2D to 3D convolution. Our 3D DenseNet uses dense connectivity to ensure that each network layer is connected to all preceding layers. First, we perform initial feature extraction on the input image with a convolutional kernel of size \(7 \times 7 \times 7\). The model then stacks several 3D densely connected modules and transition modules. The transition modules consist of 3D convolution and 3D pooling; the 3D densely connected modules are the core of image feature extraction in this paper, and their structure is presented in Fig. 5. Within a module, each feature map is connected to all subsequent layers by skip connections before being fed to the next convolutional layer, and the features of the preceding layers are fused by concatenation at the end of each connection. For the \(i\textrm{th}\) network layer within the module, the output \(x_i\) is given by Eq. (4).

$$\begin{aligned} \begin{aligned} x_i=H_i\Big (\left[ x_0, x_1, \ldots , x_{i-1}\right] \Big ) \end{aligned} \end{aligned}$$
(4)

where \([x_0, x_1, \ldots , x_{i-1}]\) denotes the concatenation of the feature maps of the preceding layers, and \(H(\cdot )\) is the nonlinear transformation function, a composite of 3DConv + ReLU + BN operations. 3DConv is a 3D convolution with a kernel of size \(3\times 3\times 3\), ReLU is the commonly used activation function, and BN is batch normalization. Since there is a direct connection between any two layers in the 3D DenseNet, the network obtains a larger receptive field while preserving the features of the lower layers. Moreover, owing to its bottleneck layers, the 3D DenseNet has fewer parameters than a plain 3D convolutional neural network; fewer parameters make the network easier to train when the 3D model is limited by GPU memory.
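
As a minimal illustration of Eq. (4), a 3D dense block can be written as follows. The growth rate and layer count are illustrative assumptions; only the concatenation pattern and the Conv+ReLU+BN composite follow the text:

```python
import torch
import torch.nn as nn

class DenseBlock3D(nn.Module):
    """Minimal 3D dense block: layer i consumes the channel-wise
    concatenation of all earlier feature maps, per Eq. (4)."""

    def __init__(self, in_channels, num_layers=4, growth=16):
        super().__init__()
        self.layers = nn.ModuleList()
        c = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv3d(c, growth, 3, padding=1),   # H_i: Conv + ReLU + BN
                nn.ReLU(inplace=True),
                nn.BatchNorm3d(growth)))
            c += growth                               # inputs grow with each layer

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # H_i([x_0, ..., x_{i-1}])
        return torch.cat(feats, dim=1)
```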

Fig. 6 3D deformable convolutional structure

Besides, to extract image features of the carotid region more effectively, we introduce deformable convolution into the 3D image feature extraction network. The 3D deformable convolution adds a learned offset to the convolution kernel, allowing the shape and size of the convolution window to adjust autonomously to the characteristics of the carotid region. This focuses the convolution window on the carotid artery and takes full advantage of the spatial structure of the data. The offsets and weights of the 3D deformable convolution are trained according to Eq. (5).

$$\begin{aligned} \begin{aligned} y\left( p_0\right) =\sum _{p_n \in R} w\left( p_n\right) x\left( p_0+p_n+\Delta p_n\right) \end{aligned} \end{aligned}$$
(5)

where \(p_0\) is the position of a pixel in the output feature map, \(y\left( p_0\right)\) is the feature value of the convolution layer at that position, and \(p_n\) is the \(n\textrm{th}\) position in the convolution receptive field R. For a 3D convolution kernel of size \(3\times 3 \times 3\), the receptive field is \(R\!=\!\{(-1,-1,-1),(-1,-1,0),\ldots ,(1,1,0),(1,1,1)\}\). \(w\left( p_n\right)\) is the kernel weight at the corresponding position, and \(\Delta p_n\) is the offset for the \(n\textrm{th}\) position of the deformable receptive field R; the exact sampling position is obtained by trilinear interpolation. The improved 3D deformable convolution structure is shown in Fig. 6.
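
PyTorch ships no native 3D deformable convolution, so the sketch below is a deliberately simplified variant: it learns one 3D offset per output voxel (rather than one per kernel tap \(p_n\), as in Eq. (5)), warps the input by trilinear sampling, and then applies a regular convolution. It conveys the mechanism rather than reproducing the exact operator:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableConv3dLite(nn.Module):
    """Simplified 3D deformable convolution: predict offsets, warp, convolve."""

    def __init__(self, cin, cout):
        super().__init__()
        self.offset = nn.Conv3d(cin, 3, 3, padding=1)  # one (dz, dy, dx) per voxel
        self.conv = nn.Conv3d(cin, cout, 3, padding=1)

    def forward(self, x):
        n, _, d, h, w = x.shape
        off = self.offset(x)  # voxel-unit offsets, shape (n, 3, d, h, w)
        # Base sampling grid in grid_sample's normalized [-1, 1] coordinates.
        gz, gy, gx = torch.meshgrid(
            torch.linspace(-1, 1, d, device=x.device),
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device), indexing="ij")
        base = torch.stack((gx, gy, gz), dim=-1).expand(n, -1, -1, -1, -1)
        # Convert voxel offsets to normalized units; grid_sample wants (x, y, z).
        scale = torch.tensor([2 / (w - 1), 2 / (h - 1), 2 / (d - 1)], device=x.device)
        delta = off.permute(0, 2, 3, 4, 1).flip(-1) * scale
        warped = F.grid_sample(x, base + delta, mode="bilinear",  # trilinear for 5D
                               align_corners=True)
        return self.conv(warped)
```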

2.4.2 Fusion Prediction Network

Fig. 7 Fusion prediction network. The input of the image feature extraction sub-model is the segmented 3D CTA image of the carotid region, which can reduce the influence of redundant image regions. The input of the machine learning sub-model is the electronic medical record. Ultimately, we obtain joint risk assessment results

The overall fusion prediction model is presented in Fig. 7. It consists mainly of an image feature extraction sub-model and a machine learning sub-model. Since the electronic medical record and medical history data in the private dataset are insufficient to train a machine learning model well, we introduce a large amount of data from the public dataset to assist in training the machine learning sub-model. Along this line, we perform transfer learning: a 3D image feature extraction network is trained on the CTA image dataset, and a machine learning model is trained on the public dataset of electronic medical record and medical history data. Through parameter transfer, the trained weights are migrated into the image feature extraction sub-model and the machine learning sub-model of the fusion prediction network. Finally, we derive the joint risk assessment result by weight fusion, with the optimal weights of the two sub-models found by grid search.

In the fusion prediction network, the outputs of the image feature extraction sub-model and the machine learning sub-model are fused according to the scale factors \(\lambda _1\) and \(\lambda _2\), yielding the final output prediction probability of the model, calculated by Eq. (6).

$$\begin{aligned} \begin{aligned} p_{\text{ out } }=p\left( \hat{y}=1 \mid x\right) =\lambda _1 \cdot p\left( \hat{y}_1=1 \mid x\right) +\lambda _2 \cdot p\left( \hat{y}_2=1 \mid x\right) \end{aligned} \end{aligned}$$
(6)

where x is the input, \(\hat{y}_1\) is the evaluation result of the image feature extraction sub-model, \(\hat{y}_2\) is that of the machine learning sub-model, and \(\lambda _1\) and \(\lambda _2\) are their weights in the fusion prediction network. We migrate the trained weight parameters of the sub-models to form the fusion prediction network. Although the two kinds of data are trained separately, the model obtains a joint prediction that combines the various types of information, making full use of the data.
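
The fusion of Eq. (6) and the accompanying grid search reduce to a few lines. The sketch below assumes, as the reported 0.5/0.5 optimum suggests, that the weights are constrained to sum to 1 and that accuracy on a validation split is the selection criterion; neither detail is stated explicitly in the paper:

```python
import numpy as np

def fuse(p_image, p_ml, lam1, lam2):
    """Eq. (6): weighted fusion of the two sub-models' output probabilities."""
    return lam1 * p_image + lam2 * p_ml

def grid_search_weights(p_image, p_ml, y_true, step=0.1):
    """Scan lam1 over [0, 1] (lam2 = 1 - lam1) and keep the most accurate pair."""
    best_lam1, best_acc = 0.5, -1.0
    for lam1 in np.arange(0.0, 1.0 + 1e-9, step):
        pred = fuse(p_image, p_ml, lam1, 1.0 - lam1) >= 0.5
        acc = (pred == y_true.astype(bool)).mean()
        if acc > best_acc:
            best_lam1, best_acc = lam1, acc
    return best_lam1, 1.0 - best_lam1
```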

3 Results and Discussion

3.1 Evaluation Indicators

Dice coefficient, Jaccard Index, False Negative Rate (FNR), and False Positive Rate (FPR) are commonly used evaluation metrics in image segmentation. Their formulas are given below, where \(R_{ g t}\) denotes the ground truth of the segmentation and \(R_{ s e g}\) denotes the segmentation predicted by the network.

The Dice coefficient is the ratio of the area of the intersection of two regions to their total area and is usually used to represent the degree of overlap of two sets. A higher Dice coefficient indicates a better segmentation result. It is calculated by Eq. (7).

$$\begin{aligned} \begin{aligned} \text{ Dice } =\frac{2 *\left( R_{ s e g} \cap R_{ g t}\right) }{R_{ s e g}+R_{ g t}} \end{aligned} \end{aligned}$$
(7)

The Jaccard Index is the ratio of the area where two regions intersect to the area of their union, and measures the similarity of the two regions. The larger the Jaccard Index, the more similar the two sets; it is calculated by Eq. (8).

$$\begin{aligned} \begin{aligned} \text{ Jaccard } =\frac{\left( R_{\text{ seg } } \cap R_{ g t}\right) }{R_{\text{ seg } } \cup R_{ g t}} \end{aligned} \end{aligned}$$
(8)

The False Negative Rate is the proportion of foreground pixels misclassified as background, normalized here by the union of the two regions. A higher FNR indicates that more of the target object is left unsegmented. It is calculated by Eq. (9).

$$\begin{aligned} \begin{aligned} { F N R}=\frac{{ F N}}{R_{ s e g} \cup R_{ g t}} \end{aligned} \end{aligned}$$
(9)

The False Positive Rate is the proportion of background pixels misclassified as foreground, again normalized by the union of the two regions. A higher FPR means the result contains more redundant parts that do not belong to the target object. It is calculated by Eq. (10).

$$\begin{aligned} \begin{aligned} { F P R}=\frac{{ F P}}{R_{ s e g} \cup R_{ g t}} \end{aligned} \end{aligned}$$
(10)
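
All four segmentation metrics can be computed directly from binary masks; note that, following Eqs. (9) and (10), FN and FP are normalized by the union of the two regions rather than by the usual confusion-matrix denominators. A small sketch:

```python
import numpy as np

def segmentation_metrics(seg: np.ndarray, gt: np.ndarray) -> dict:
    """Dice, Jaccard, FNR and FPR per Eqs. (7)-(10) for binary masks."""
    seg, gt = seg.astype(bool), gt.astype(bool)
    inter = np.logical_and(seg, gt).sum()
    union = np.logical_or(seg, gt).sum()
    fn = np.logical_and(~seg, gt).sum()  # foreground missed by the prediction
    fp = np.logical_and(seg, ~gt).sum()  # background wrongly predicted
    return {"Dice": 2 * inter / (seg.sum() + gt.sum()),
            "Jaccard": inter / union,
            "FNR": fn / union,
            "FPR": fp / union}
```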

Accuracy (Acc), sensitivity (Sen), and specificity (Spe) are commonly used evaluation metrics in medical image prediction and classification tasks. They are calculated from a two-dimensional confusion matrix.

Accuracy represents the proportion of all samples predicted correctly, as given by Eq. (11).

$$\begin{aligned} \begin{aligned} \text{ Accuracy } =\frac{{ T P}+{ T N}}{{ T P}+{ F P}+{ T N}+{ F N}} \end{aligned} \end{aligned}$$
(11)

Sensitivity represents the probability that an algorithm can correctly determine a positive sample and is calculated by Eq. (12).

$$\begin{aligned} \begin{aligned} \text{ Sensitivity } =\frac{{ T P}}{{ T P}+{ F N}} \end{aligned} \end{aligned}$$
(12)

Specificity represents the probability that an algorithm can correctly determine a negative sample and is calculated as shown in Eq. (13).

$$\begin{aligned} \begin{aligned} \text{ Specificity } =\frac{T N}{F P+T N} \end{aligned} \end{aligned}$$
(13)

3.2 Results and Discussion of the Two Tasks

3.2.1 Carotid Segmentation Task

First, we investigated the appropriate number of down-sampling stages for CA-UNet by testing whether the four down-sampling layers used in conventional encoder-decoder segmentation networks over-down-sample in this task. After constructing a network with four down-sampling layers, we removed the down-sampling modules and network layers one by one, starting from the bottom layer, while keeping the number of channels at the bottom layer constant. Table 3 shows the comparative results for different numbers of down-sampling layers.

Table 3 Comparison results of different down-sampling times
Table 4 Comparison results of different loss functions (Bold indicates that the best results were achieved in the current indicator)
Table 5 Experimental results of different carotid artery segmentation models

The experimental results show that, when training CA-UNet, removing the lowest down-sampling layer had little effect on segmentation performance, whereas removing further down-sampling layers degraded it. Removing the redundant down-sampling layer lets us increase the number of shallow convolutional layers and channels while decreasing the model parameters, effectively accelerating training.

The way loss values are calculated significantly affects the direction of learning during training. Table 4 compares the test results of the CA-UNet model using the fusion loss function against the Dice loss function.

The results show that, compared with the Dice loss function, the fusion loss function significantly improves the results and performs better on all indicators. The most pronounced improvement is the decrease in the False Negative Rate, which indicates that adjusting the tunable parameters of the fusion loss achieves the goal of balancing the model’s learning direction.

To test the performance of CA-UNet and the fusion loss function, we trained 3D U-Net, V-Net, and the models of Zhou [29] and Zhu [35] on the same training set and tested them under the same settings. The results are presented in Table 5.

Compared with 3D U-Net, V-Net, [29], and [35], our CA-UNet model combined with the fusion loss function achieves the best Dice coefficient, Jaccard Index, and False Negative Rate, at 90.49%, 82.90% and 9.96%, outperforming all other methods, and a False Positive Rate of 7.14%, better than two of the other methods. In addition, thanks to the optimized model structure and reduced number of down-sampling stages, CA-UNet has only half the parameters of the model of Zhu et al. [35].

Table 6 Segmentation results for each volume in the test set
Fig. 8 Comparison of CA-UNet segmentation results with ground truth

Table 6 shows the performance of the CA-UNet model on each group of CTA images in the test set. The best results were obtained for the sample numbered V-10, with a Dice coefficient, Jaccard Index, False Negative Rate, and False Positive Rate of 96.44%, 93.13%, 5.53% and 1.35%. Figure 8 compares the segmentation results of the proposed CA-UNet model with the ground truth.

Table 7 Comparison between machine learning model and data resampling method

3.2.2 Ischemic Stroke Risk Prediction Task

The ischemic stroke risk prediction model proposed in this paper consists of three parts: a 3D image feature extraction network for carotid CTA images, a machine learning model that predicts stroke risk from electronic medical records and medical history, and the fusion network. We designed comparison experiments for the machine learning model, the 3D image feature extraction model, and the fusion network model on their respective datasets.

First, we conducted comparative experiments on the machine learning models. SMOTE, random under-sampling (RUS), and the instance hardness threshold (IHT) were selected as methods for addressing the class imbalance problem. The results of the comparative experiments are presented in Table 7.

Table 8 Comparison of fusion prediction models, where the 3D image feature extraction network is abbreviated as 3D CTA network

In the comparison of resampling methods, the XGBoost model using SMOTE attains the best accuracy and specificity, 90.63% and 95.16%, but its sensitivity is only 11.32%. This is because SMOTE oversamples the minority class during training, which creates the misleading impression that the model classifies well; in fact, it remains poor at classifying minority-class samples. Compared with the other two resampling methods, the Instance Hardness Threshold performs better with every machine learning model, because it removes the samples that are frequently misclassified during training. Among the machine learning models, using the Instance Hardness Threshold as the resampling method, Logistic Regression performed best overall, with accuracy, specificity, and sensitivity of 79.53%, 79.25% and 79.55%.

Table 9 Comparative results of predictive models for ischemic stroke risk

Next, we used the above machine learning model and the 3D image feature extraction network to construct a fusion prediction network and conducted comparison experiments. The results are presented in Table 8.

The results indicate that the fusion prediction model with sub-model weights of 0.5 and 0.5 performs best, with accuracy, specificity, and sensitivity on the test set of 89.74%, 94.44%, and 85.71%. Its sensitivity is the highest, indicating that the model can correctly identify positive samples in the ischemic stroke risk prediction task.

To validate the performance of the proposed 3D image feature extraction network and the fusion prediction network, we trained and tested 3D-ResNet, 3D-CNN and 3D-DenseNet under the same settings. The results are presented in Table 9.

The results indicate that 3D-ResNet achieves the best specificity in the ischemic stroke risk prediction task, while the proposed model achieves the best overall results, with accuracy and sensitivity of 83.33% and 91.67%. When a machine learning model such as XGBoost is used alone for prediction, the results are relatively poor because the rich information in the CTA images is not exploited. In contrast, the fusion prediction model, with accuracy, specificity, and sensitivity of 89.74%, 94.44%, and 85.71% on the test set, achieves the best results on all three metrics. The fusion prediction network proposed in this paper therefore has significant advantages for the ischemic stroke risk prediction task.

4 Conclusion

We use the CA-UNet model to segment the carotid region and the fusion model to predict patients’ risk of ischemic stroke. Based on the characteristics of the carotid segmentation task, we proposed reducing the down-sampling layers and using skip connections, which lowers the cost of model training, and we apply a multi-scale loss function for joint training, which mitigates the loss of fine image details during down-sampling. These designs yielded a significant improvement in the evaluation metrics compared with other work. In addition, building on CA-UNet, we propose a fusion prediction network to predict the risk of ischemic stroke in patients, with Acc, Spe and Sen of 89.74%, 94.44% and 85.71%. Although we have not yet collected as much data as other vision tasks, our models can provide reliable diagnoses and outcomes, benefiting patients and healthcare professionals. In future research, we hope to collect more valuable data, improve the results, and investigate new ways to use more medical information, such as blood test results.