Introduction

According to the investigation in [1], lung cancer accounted for 12.9% of all cancer cases worldwide in 2012 and remained the most common cancer in the world. Moreover, because of its high case fatality, lung cancer accounted for up to 19.4% of cancer mortality. Studies show that tobacco smoking is the main cause of lung cancer [2, 3], and the lung cancer risk of a smoker is more than 15 times that of a non-smoker [4]. Another lethal factor is air pollution. In recent years, most first-tier cities and some medium-sized cities in China have suffered from fog and haze, which contain excessive amounts of carcinogenic particles [5]. Thus a serious hazard to lung health has gradually emerged with high-speed urban development. In this sense, people should improve their health management. Most importantly, an effective method for the early detection of lung nodules is urgently needed so that the best treatment opportunities are not missed.

With the development of medical imaging techniques, X-ray, ultrasonography, positron emission tomography (PET) and computerized tomography (CT) have been applied to lung nodule detection. Validation by many researchers has shown that CT is superior to the other modalities in sensitivity and accuracy [6, 7]. Lung nodules are small masses in the human lung, approximately 5 mm to 40 mm in size [8], and are usually spherical. However, these nodules can sometimes be distorted by surrounding anatomical structures [9]. In addition, structures such as ribs, blood vessels and airway walls can be mistaken for lung nodules, and these false positives make lung nodule detection challenging [10]. In general, compared with a computer aided detection (CAD) system, manual lung nodule detection is error-prone and time-consuming.

This paper proposes an efficient assisted diagnosis system for the detection of early pulmonary nodules in computed tomography images. We first used a novel two-dimensional OTSU’s curve thresholding method to segment the pulmonary parenchyma region. Then the suspected pulmonary nodules were extracted using the spatial fuzzy C-means (SFCM) method. Finally, a random forest (RF) classifier was used for classification and obtained favorable results. Moreover, we also discuss the information security issues that arise when the CAD system is applied to remote diagnosis.

Related works

An increasing number of radiologists have accepted the help of computer aided systems and recognized their significance, especially in pulmonary nodule detection, segmentation and diagnosis [11–13]. The main target of the CAD systems mentioned here is to precisely locate the pulmonary nodule. Many CAD methods for detecting lung nodules have been proposed to date [14]. Taşcı, E. et al proposed an automated detection method based on shape and texture features to detect juxta-pleural nodules [15], which achieved a detection accuracy of more than 90%. Penedo M G et al used a two-level artificial neural network (ANN) architecture to detect lung nodules on digitized chest radiographs, with a sensitivity of 89–96% at 5–7 false positives (FPs) per image [16]. Diego M. Peña et al used a minimal characteristics extraction technique to detect nodules larger than 4 mm [17], with a sensitivity, specificity and accuracy of 94.23%, 84.75% and 89.19%, respectively. To reduce false negatives, Akram Sheeraz et al used hybrid features to design a pulmonary nodule detection system based on the support vector machine (SVM) [18], which obtained a sensitivity of 95.31%. Focusing on small lung nodules between 2 mm and 10 mm in size, Alex Martins Santos et al developed an automated detection system using Gaussian mixture models. In addition, Tsallis’s and Shannon’s entropy, calculated as texture descriptors, were combined with an SVM classifier to classify suspected regions. The results demonstrated that small nodules were detected with a sensitivity of 90.6%, a specificity of 85% and an accuracy of 88.4% [19]. In response to the tremendous growth of data, Atsushi Teramoto et al developed a fast lung nodule detection method that uses a cylindrical shape filter to reduce computation cost. The correct detection rate was 80% and the detection speed was 4–36 times faster than other existing methods [20].
In addition, Gonçalves L et al used Hessian-based strategies for lung nodule segmentation in chest CT scans, combining the shape index (SI) and curvedness (CV) approach with the central adaptive medialness approach [21]. Jiang H Y et al designed a lung cancer risk assessment system with the help of CAD and obtained the expected results [22].

Materials

This study used the Lung Image Database Consortium (LIDC) data set, which was collected from five different sites in America [23]. CT scans of a total of 80 patients were obtained from LIDC for analysis, and the number of CT slices per patient varied between 130 and 270. The standard distance between two slices was 2.5 mm and the thickness of each slice was between 1.3 and 3.0 mm. Among these CT slices, images without pulmonary parenchyma can be easily detected and were regarded as invalid. The valid slices were thus reduced to approximately 20 to 100 per patient’s CT scan. The data set contained various lung nodules, which may appear in any part of the lung. Based on their positions in the lung, four types of lung nodules were distinguished: well-circumscribed, vascularized, juxta-pleural and pleural-tail, as named in [24]. Each CT scan was reviewed and diagnosed by four radiologists individually, and the diagnostic records were stored in XML files as the gold standard.

Methods

To precisely screen out the suspected nodules from numerous lung-CT slices and make predictions about them, this study presented a novel solution to pulmonary nodule detection. Considering speed of calculation and accuracy of detection, this paper designed an efficient CAD system that performed four fundamental tasks: pulmonary parenchyma segmentation, pulmonary nodule candidate detection, feature extraction and pulmonary nodule classification. Figure 1 displays the whole flow chart of this method.

Fig. 1 Flow chart of the proposed method

Pulmonary parenchyma segmentation

Most CAD systems preprocess raw CT slices to remove noise generated by the equipment or environment. However, given the characteristics of lung CT images, this study omitted this step to improve the processing speed. As the two pulmonary lobes lie approximately at the center of the thorax and can be clearly seen in the CT images, it is convenient to detect and delineate the lung region through differences in gray value. Generally, the contrast between the pulmonary lobes and their surroundings is quite large, so this study used a gray value statistics method to segment the pulmonary parenchyma.

This paper employed a novel two-dimensional OTSU’s curve thresholding segmentation method based on the gray feature and the non-local spatial gray feature, which was first proposed by Zhao Feng et al [25]. First, we obtained a two-dimensional threshold point I(a, b) in the given gray scale image. Here a is the gray value of threshold point I, calculated by the traditional OTSU method, and b is the non-local mean gray value of threshold point I, calculated by equation (1).

$$ b={\displaystyle \sum_{p\in {W}^r}{w}_p{i}_p}, $$
(1)
$$ 0\le {w}_p\le 1, $$
(2)
$$ {\displaystyle \sum_{p\in {W}^r}{w}_p=1}, $$
(3)

where W^r represents a calculation window of size r × r centered on point I, and w_p and i_p are the weight and gray value of point p in this window, respectively. The value of w_p is set according to the similarity between point I and point p, and this study used the Euclidean distance to measure this similarity, as shown in equation (4).

$$ {w}_p={\left\Vert v\left({N}_I\right)-v\left({N}_p\right)\right\Vert}_2, $$
(4)

where v(N_I) and v(N_p) are vectors composed of the gray values in the k × k windows centered on points I and p, respectively; this window is called the similarity window.
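The computation in equations (1)–(4) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Since a raw distance does not by itself satisfy constraints (2) and (3), the sketch assumes that the distance of equation (4) is mapped to a similarity by a Gaussian with a hypothetical width `h_param` and then normalized, as is common in non-local means filtering. The function names, the border clamping and the default window sizes are also assumptions.

```python
import math

def patch(img, r, c, k):
    """Flatten the k x k similarity window centred on (r, c), clamping at borders."""
    height, width = len(img), len(img[0])
    half = k // 2
    return [
        img[min(max(r + dr, 0), height - 1)][min(max(c + dc, 0), width - 1)]
        for dr in range(-half, half + 1)
        for dc in range(-half, half + 1)
    ]

def nonlocal_mean(img, r, c, search=5, k=3, h_param=10.0):
    """Non-local mean gray value at pixel (r, c) over a search x search window."""
    rows, cols = len(img), len(img[0])
    half = search // 2
    ref = patch(img, r, c, k)
    weights, values = [], []
    for pr in range(max(r - half, 0), min(r + half, rows - 1) + 1):
        for pc in range(max(c - half, 0), min(c + half, cols - 1) + 1):
            # Equation (4): Euclidean distance between the two similarity windows.
            dist = math.dist(ref, patch(img, pr, pc, k))
            # Assumption: turn the distance into a similarity with a Gaussian.
            weights.append(math.exp(-(dist ** 2) / (h_param ** 2)))
            values.append(img[pr][pc])
    total = sum(weights)  # normalisation so that constraints (2) and (3) hold
    # Equation (1): weighted sum of the gray values in the window.
    return sum(w * v for w, v in zip(weights, values)) / total
```

With this weighting, an isolated bright pixel contributes almost nothing to the non-local mean of its neighbours, which is consistent with the noise robustness claimed for the method.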

However, we did not use the single point I(a, b) alone to separate the target from the background. Second, a curve passing through the point I(a, b) was drawn across the two-dimensional histogram; this curve can be a straight line S(a, b) perpendicular to the diagonal, as shown in Fig. 2.

Fig. 2 Two-dimensional histogram with the threshold line

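Finally, we used the line S(a, b) to separate the target from the background with discriminant (5), where x and y denote the gray value and the non-local mean gray value of pixel (i, j), respectively.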

$$ z\left(i,j\right)=\left\{\begin{array}{lll} 0 & \mathrm{if} & x+y\le a+b \\ 255 & \mathrm{if} & x+y>a+b \end{array}\right. $$
(5)
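The thresholding step can be sketched as follows, assuming integer gray values in the range 0–255. The function `otsu_threshold` is a plain one-dimensional Otsu implementation that could supply the value a, and `line_threshold` applies the line S(a, b) with the second branch of the discriminant read as x + y > a + b. Both function names are hypothetical, and the values a and b are passed in rather than derived, because the text does not spell out every detail of how b is obtained.

```python
def otsu_threshold(values, levels=256):
    """Classical one-dimensional Otsu threshold for integer gray values."""
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(g * n for g, n in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of their gray values
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0, mean1 = sum0 / w0, (sum_all - sum0) / w1
        # Between-class variance up to a constant factor.
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t

def line_threshold(gray, nonlocal_gray, a, b):
    """Label each pixel 0 if x + y <= a + b and 255 otherwise."""
    return [
        [0 if x + y <= a + b else 255 for x, y in zip(row_x, row_y)]
        for row_x, row_y in zip(gray, nonlocal_gray)
    ]
```

Because the decision uses the sum x + y, a pixel whose own gray value is corrupted but whose non-local mean is not can still fall on the correct side of the line.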

With the help of the two-dimensional OTSU’s curve thresholding segmentation method, noise such as salt-and-pepper noise can be well eliminated. Moreover, through a series of logical operations, the pulmonary parenchyma was easily acquired, as shown in Fig. 3.

Fig. 3 Segmentation process of the two-dimensional OTSU’s curve thresholding method

Pulmonary nodule candidate detection

To correctly detect the pulmonary nodules, this paper used an improved fuzzy C-means (FCM) algorithm, which was proposed in our previous work [26]. Clustering is an unsupervised statistical method that assigns each data element to a class or cluster. The aim of clustering is to maximize the difference between classes while keeping the data elements within the same class as similar as possible. In the classical C-means algorithm, each element belongs to exactly one cluster. With the fuzzy C-means method, however, the same element may belong to several clusters with different degrees of membership, which has made it widely used for classifying complex multi-class images.

However, the basic FCM algorithm is sensitive to noise and does not take the spatial information of the image into account, so this study adopted a novel FCM method named spatial fuzzy C-means (SFCM) [26]. The spatial function h_ij of SFCM, which enhances the performance of the original algorithm, is defined as (6).

$$ {h}_{ij}={\displaystyle \sum_{k\in NB\left({x}_i\right)}{u}_{kj}} $$
(6)

where NB(x_i) is a square window centered on pixel x_i and u_kj is the degree of affiliation between a pixel x_k in this window and cluster j. The improved fuzzy affiliation with spatial information is expressed as (7).

$$ {\mu}_{ij}=\frac{u_{ij}^p{h}_{ij}^q}{{\displaystyle {\sum}_{j=1}^c{u}_{ij}^p{h}_{ij}^q}} $$
(7)

In equation (7), the parameters p and q weight the center pixel and the spatial information, respectively, so that the fuzzy affiliation depends on both the center pixel and the pixels around it. The SFCM method is implemented in two steps. First, we calculated the affiliation matrix and cluster centers with the classical FCM method. Second, the spatial information was integrated to obtain a new fuzzy affiliation matrix μ_ij.
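The second step can be sketched in a few lines of Python. The sketch assumes that the spatial function for pixel i and cluster j is the sum of the memberships in cluster j of all pixels in the square window around pixel i, and that an FCM membership map `u` is already available. The function name `spatial_membership`, the window `radius` and the default values of p and q are hypothetical.

```python
def spatial_membership(u, p=1, q=1, radius=1):
    """Combine FCM memberships with neighbourhood information.

    u[r][c][j] is the FCM membership of pixel (r, c) in cluster j.
    Returns a map of the same shape holding the spatially weighted memberships.
    """
    rows, cols, n_clusters = len(u), len(u[0]), len(u[0][0])
    mu = [[[0.0] * n_clusters for _ in range(cols)] for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Spatial function (6): sum memberships over the square window.
            h = [0.0] * n_clusters
            for nr in range(max(r - radius, 0), min(r + radius, rows - 1) + 1):
                for nc in range(max(c - radius, 0), min(c + radius, cols - 1) + 1):
                    for j in range(n_clusters):
                        h[j] += u[nr][nc][j]
            # Equation (7): weight pixel and neighbourhood evidence, then normalise.
            scores = [u[r][c][j] ** p * h[j] ** q for j in range(n_clusters)]
            total = sum(scores)
            mu[r][c] = [s / total for s in scores]
    return mu
```

A noisy pixel whose own membership weakly favours one cluster is pulled towards the cluster preferred by its neighbours, which is the intended noise suppression effect.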

To avoid the curse of dimensionality caused by a high-dimensional feature space, this study took advantage of a Gaussian kernel function k(x_i, y_j), which is shown as (8).

$$ k\left({x}_i,{y}_j\right)={e}^{\frac{-{\left\Vert {x}_i-{y}_j\right\Vert}^2}{\sigma^2}} $$
(8)

As a consequence, the traditional inner product was replaced by (8), which gives the new objective function (9), where φ denotes the implicit feature mapping and v_j the center of cluster j.

$$ \begin{array}{l}{J}_m={\displaystyle \sum_{i=1}^n{\displaystyle \sum_{j=1}^c{\mu}_{ij}^m{\left\Vert \phi \left({x}_i\right)-\phi \left({v}_j\right)\right\Vert}^2}},\\ {}{\left\Vert \phi \left({x}_i\right)-\phi \left({v}_j\right)\right\Vert}^2=k\left({x}_i,{x}_i\right)-2k\left({x}_i,{v}_j\right)+k\left({v}_j,{v}_j\right)\end{array} $$
(9)

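In addition, the fuzzy affiliation matrix and the cluster centers were updated as expressed in (10) and (11), respectively, where y_j in the kernel denotes the center v_j of cluster j. Because k(x, x) = 1 for the Gaussian kernel, the squared distance in (9) reduces to 2(1 − k(x_i, v_j)), which is why the term 1 − k appears in (10).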

$$ {u}_{ij}=\frac{{\left(1-k\left({x}_i,{y}_j\right)\right)}^{\frac{-1}{\left(m-1\right)}}}{{\displaystyle {\sum}_{j=1}^c{\left(1-k\left({x}_i,{y}_j\right)\right)}^{\frac{-1}{\left(m-1\right)}}}} $$
(10)
$$ {v}_j=\frac{{\displaystyle {\sum}_{i=1}^n{u}_{ij}^m\cdot k\left({x}_i,{y}_j\right)}\cdot {x}_i}{{\displaystyle {\sum}_{i=1}^n{u}_{ij}^m\cdot k\left({x}_i,{y}_j\right)}} $$
(11)
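The alternating updates (10) and (11) can be sketched for scalar gray values as follows. This is only an illustration under simplifying assumptions: the samples are one-dimensional, the kernel argument y_j is taken to be the current cluster center, the number of iterations is fixed, and a small floor on the kernel-induced distance avoids division by zero. The names `gaussian_kernel` and `kernel_fcm` and all default parameter values are hypothetical.

```python
import math

def gaussian_kernel(x, v, sigma):
    """Gaussian kernel of equation (8) for scalar samples."""
    return math.exp(-((x - v) ** 2) / sigma ** 2)

def kernel_fcm(data, centers, m=2.0, sigma=50.0, iterations=20):
    """Alternate the membership update (10) and the center update (11)."""
    centers = list(centers)
    u = []
    for _ in range(iterations):
        # Equation (10): memberships from the distances 1 - k(x_i, v_j).
        u = []
        for x in data:
            d = [max(1.0 - gaussian_kernel(x, v, sigma), 1e-12) for v in centers]
            inv = [di ** (-1.0 / (m - 1.0)) for di in d]
            s = sum(inv)
            u.append([value / s for value in inv])
        # Equation (11): kernel-weighted mean of the samples for each center.
        for j, v in enumerate(centers):
            w = [u[i][j] ** m * gaussian_kernel(x, v, sigma)
                 for i, x in enumerate(data)]
            centers[j] = sum(wi * x for wi, x in zip(w, data)) / sum(w)
    return u, centers
```

For example, `kernel_fcm([10, 12, 11, 200, 198, 202], [0, 255])` separates the dark and bright samples into two clusters whose centers move close to the two group means.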

By combining the novel SFCM method with morphological operations, this study segmented the suspected pulmonary nodules, which include both true nodules and blood vessels in the pulmonary parenchyma. However, the cross-section of a blood vessel is usually circular or oval, which is quite similar to the shape of a pulmonary nodule. It was therefore quite difficult for the computer to recognize true pulmonary nodules. In this study, we proposed enhancing the vascular imaging by exploiting the differences between nodules and blood vessels across adjacent slices. Figure 4 shows that a single slice was combined with the vascular information of the preceding and following slices, which can be used to reconstruct the blood vessels. Thereafter, using the improved method above, we segmented the suspected pulmonary nodules shown in Fig. 5. The regions marked in red represent the segmented suspected pulmonary nodules, and the regions labeled by blue circles are the true nodules labeled by experienced experts.

Fig. 4 Performance of blood vessel enhancement: (a) original images; (b) images with enhanced blood vessels

Fig. 5 Segmentation results of various forms of nodules. The regions marked in red represent the suspected pulmonary nodules, and the regions labeled by blue circles are the true nodules labeled by experienced experts. The four types of nodules are: (a) isolated nodule, (b) juxta-pleural nodule, (c) vascularized nodule and (d) ground-glass nodule

Feature extraction and classifiers

In this section, we designed a feature set that incorporates texture and shape information to describe pulmonary nodules. Radiologists widely consider that the features of true nodules and false positives, especially texture and shape features, are distinct and discriminative [27, 28]. In this study, we extracted 22 texture and shape features from all the nodule candidates for classification.

Following the texture features described in [29], we extracted seven invariant moment features, which remain invariant under translation, rotation, scaling and mirroring. Four gray-level statistics of the suspected nodule region were calculated: the mean gray value, the maximum gray value, the minimum gray value and the gray standard deviation. Furthermore, the average gray level of the neighboring region and the gray-level contrast between the nodule area and the surrounding area were used as texture properties. A total of thirteen texture features were obtained in this study.

It was also desirable to capture the shape characteristics of the lung nodule images in conjunction with the textures, so as to increase the identification precision. In this work, we extracted the length of the nodule (the long axis of the smallest fitting ellipse), the width of the nodule (the short axis of the smallest fitting ellipse), the nodule’s eccentricity, the area of the nodule expressed as the number of pixels, the ratio of the nodule’s pixels to those of its minimum enclosing convex polygon, the ratio of the nodule’s pixels to those of its minimum bounding rectangle, the circularity and the position of the nodule. Nine shape features in total were used in this article.
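Several of these shape descriptors can be computed directly from a binary candidate mask, as the following sketch suggests. It is not the authors' feature extractor: instead of the smallest fitting ellipse, it uses the ellipse with the same second central moments as the region, and it omits the convex-polygon ratio and the circularity. The function name `shape_features` and the dictionary keys are hypothetical.

```python
import math

def shape_features(mask):
    """Simple shape descriptors of a binary mask given as rows of 0/1 values."""
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    area = len(pixels)  # area as a pixel count
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    box_h = max(rows) - min(rows) + 1
    box_w = max(cols) - min(cols) + 1
    extent = area / (box_h * box_w)  # fill ratio of the bounding rectangle
    cy, cx = sum(rows) / area, sum(cols) / area  # centroid used as position
    # Second central moments of the region.
    myy = sum((r - cy) ** 2 for r in rows) / area
    mxx = sum((c - cx) ** 2 for c in cols) / area
    mxy = sum((r - cy) * (c - cx) for r, c in pixels) / area
    spread = math.sqrt((mxx - myy) ** 2 + 4 * mxy ** 2)
    lam_major = (mxx + myy + spread) / 2
    lam_minor = max((mxx + myy - spread) / 2, 0.0)
    eccentricity = math.sqrt(1 - lam_minor / lam_major) if lam_major > 0 else 0.0
    return {
        "area": area,
        "extent": extent,
        "centroid": (cy, cx),
        "length": 4 * math.sqrt(lam_major),  # full major axis of the moment ellipse
        "width": 4 * math.sqrt(lam_minor),   # full minor axis of the moment ellipse
        "eccentricity": eccentricity,
    }
```

A compact, roughly round candidate gives an eccentricity near 0, whereas an elongated vessel segment gives a value near 1, which is why such features help to separate nodules from vessels.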

As many researchers have verified, ensemble classifiers have emerged as a fast and robust classification method and perform better than single classifiers [30]. The generalization capability of an ensemble benefits from the diversity of its members, which can be designed for the specific application. Because the member classifiers differ, the ensemble can reduce the influence of any single classifier by averaging their votes and then select the best-supported hypothesis. Beyond that, ensemble classifiers can be executed in parallel, which takes less time when training on large data. In some circumstances, different weights can also be allocated to the classifiers according to their significance.
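The voting idea can be made concrete with a small sketch. It shows a generic weighted majority vote over the labels predicted by the ensemble members and is not specific to the classifier used in this study. The function name `weighted_vote` and the example labels are hypothetical.

```python
from collections import defaultdict

def weighted_vote(predictions, weights=None):
    """Return the label with the largest total weight among member predictions."""
    if weights is None:
        weights = [1.0] * len(predictions)  # plain majority vote
    tally = defaultdict(float)
    for label, weight in zip(predictions, weights):
        tally[label] += weight
    return max(tally, key=tally.get)
```

With equal weights, `weighted_vote(["nodule", "vessel", "nodule"])` returns `"nodule"`. Giving one member a larger weight lets it outvote several weaker members.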

In this study, the random forest (RF) approach, which consists of multiple decision trees, was used for classification. By means of sequential forward selection, valuable features formed the nodes of each tree, so that after the training stage each decision tree can make its judgment independently based on its features. Using a total of 22 features, this study experimented with different numbers of trees for comparison, and the results are shown in Fig. 6. The experiment demonstrated that an RF with more than 20 trees can achieve an accuracy of over 92%, and the optimal accuracy was obtained with 36 trees.

Fig. 6 Accuracy of RF with different numbers of trees

Experiments and results

The proposed system was implemented in MATLAB 2014 in a Windows environment. The development and test platform was a PC equipped with a quad-core 2.5 GHz processor and 2 GB of RAM. In our experiments, we used 6400 slices from the 80 CT scans of the LIDC data set, in which the four radiologists marked 978 true nodules in total. All the nodule candidates were divided into two categories (positive and negative samples) to train the nodule recognition model. Then we evaluated the performance of the pulmonary nodule candidate detection method by the recall ratio, the sensitivity and other indicators.

Performance evaluation of the pulmonary nodule candidate detection method

The novel pulmonary nodule segmentation method proposed in this study segmented a total of 12,267 pulmonary nodule candidates, comprising 888 true pulmonary nodules and 11,379 false pulmonary nodules. This study adopted the recall ratio, defined in (12), to reflect the performance of the segmentation method. The recall ratios of the four radiologists and of this system are shown in Table 1. The recall ratio of this system is 90.8%, which is better than that of Radiologist 1 (83.02%) and close to that of Radiologist 2 (91.1%). The result showed that the proposed pulmonary nodule segmentation method had a high sensitivity, nearly equal to that of the experienced experts.

Table 1 True nodules detection accuracy of each radiologist and proposed system
$$ Recall=\frac{number\ of\ detected\ true\ nodules}{number\ of\ true\ nodules\ in\ the\ database}\times 100 $$
(12)
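As a quick check of equation (12) against the counts reported above (888 detected true nodules out of 978 marked nodules, and 11,379 false candidates over 80 scans), the arithmetic can be reproduced as follows. The function name `recall_percent` is hypothetical, and the false-candidates-per-scan figure is derived here from the reported counts rather than stated in the text.

```python
def recall_percent(detected_true, total_true):
    """Recall ratio of equation (12), in percent."""
    return detected_true / total_true * 100

recall = recall_percent(888, 978)
print(round(recall, 1))  # prints 90.8

# Derived figure: false candidates per scan before classification.
false_candidates_per_scan = 11379 / 80
print(round(false_candidates_per_scan, 1))  # prints 142.2
```

The large number of false candidates per scan at this stage is the reason the subsequent classification step is needed.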

Performance evaluation of the true nodule identification method

To reduce the false positives as far as possible, 22 discriminating features (13 texture features and 9 shape features) were extracted, and an effective machine learning method was then designed to separate true pulmonary nodules from the pulmonary nodule candidates. To increase the discriminating power of the shape features, this work first remodeled and enhanced the blood vessel images, as shown in Fig. 4.

The red arrow indicates that the isolated suspected nodule was remodeled into a blood vessel, which can reduce the false positive rate effectively. In this study, three classifiers, namely RF, support vector machine (SVM) and back-propagation (BP) network, were used for comparison. The RF used 36 trees, the SVM used a polynomial kernel and the BP network had a three-layer structure. The performance indicators were calculated over 100 random runs of ten-fold cross-validation, and the results are shown in Table 2.

Table 2 Analysis result of RF, SVM and BP classifiers

Table 2 illustrates that the RF and SVM classifiers performed better than the BP network on all the indicators, with all values above 91%. Moreover, the RF achieved over 92% on each indicator. The accuracy of the RF was slightly lower than that of the SVM, but the sensitivity and AUC of the RF were 94.8% and 97.6%, respectively, which were superior to those of the SVM (91.3% and 94.7%). The sensitivity reflects the detection rate of positive samples: the higher its value, the fewer positive samples are missed, which makes it a valuable indicator in medical statistics.
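For reference, the indicators discussed here follow from the confusion counts of a binary classifier, as in the short sketch below. The function name `classification_metrics` is hypothetical, and the counts in the usage example are invented for illustration and are not results of this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of true nodules that are detected
    specificity = tn / (tn + fp)  # share of non-nodules that are rejected
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# Hypothetical counts, for illustration only.
sens, spec, acc = classification_metrics(tp=90, fp=20, tn=180, fn=10)
```

A missed nodule lowers the sensitivity, whereas a vessel accepted as a nodule lowers the specificity, so the two indicators capture different kinds of error.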

Figure 7 shows the receiver operating characteristic (ROC) curves of the three classifiers, RF, SVM and BP, for nodule recognition using the 22 basic features. The results demonstrated that the RF classifier was able to keep the number of false positive nodules within 10 per case, which was slightly better than the SVM in this study.

Fig. 7 ROC curves of three classifiers: RF, SVM and BP

Discussion

A CAD system aims to achieve a lower false positive rate while raising the detection rate of smaller nodules. As shown in Fig. 4, to reduce the false positive rate of pulmonary nodule candidates, we first enhanced and reconstructed the vessel shape using the vessel information of adjacent slices before the pulmonary nodule segmentation. This step was crucial for increasing the discriminability of the shape features. Table 3 lists the performance of recent pulmonary nodule detection schemes for comparison with our work. A common problem of these recent studies was that their data sets were small, which affects the reliability of their results. Moreover, the sizes of the nodules in their data sets lay within specified ranges, so it is difficult to generalize them to different cases. This work covered the recognition of both large and small nodules. The sensitivity achieved by this work lay within the range of the recent work. As a large number of nodules were included in this data set, the number of false positive (FP) nodules per scan was above the average level of other work. From Table 3, we can see that our method found more nodules than the other approaches because smaller nodules were included in this paper. However, this also increases the false positive rate, so the FP/Scan of this work was slightly higher than that of some schemes. Nevertheless, increasing the detection rate of smaller nodules is meaningful work. In addition, the FP/Scan of this work could be reduced by dividing the nodules into different size ranges, which is the next goal for improving our results.

Table 3 Performance comparison between the proposed work and other schemes from the past 5 years

As remote terminals, back-end servers and telecommunications have become more powerful, telemedicine has become increasingly accepted and popular. The approach proposed in this study can also be used in remote diagnosis, because the algorithm is easy to configure and implement on mobile or other terminals with well-managed software and hardware, e.g. higher resolution cameras or novel cloud processing units. A typical scenario is as follows: the patient takes a photograph of a CT or MRI image with his/her smart phone and uploads it to the Internet, and the analysis result is fed back after processing on a cloud server that executes the diagnosis algorithm [37]. Hence, security, e.g. user authentication and privacy protection, is another issue that must be addressed when this remote diagnosis system is deployed and becomes widespread. Wang et al [38] investigated several anonymous two-factor authentication schemes and pointed out their flaws in order to give practical suggestions, notably with an impressive list of security requirements, e.g. resistance to smart card loss attack. Subsequently, Wang et al [39] proposed an enhanced scheme with provable security defined by a formal model, which offers user un-traceability and is well suited to smart phone applications. Additionally, in our previous work [40, 41], a novel tri-factor mutual authentication scheme was proposed with the assistance of a biometric method for health care applications, and user access control policies were adopted in a practical hospital information system (HIS). Thus, as a part of the health care monitoring platform, the lung nodule detection system can be implemented properly and meet the security criteria.

Conclusion

In this study, an efficient and stable pulmonary nodule detection system was proposed, which mainly focused on handling a large amount of data with a small number of texture and shape features. A novel pulmonary nodule segmentation method was proposed to detect all the suspected nodules. Using the random forest classifier, we reduced many false positives and obtained favorable accuracy, sensitivity and specificity. With the rapid growth of medical image data, it is difficult for radiologists to finish all the work of detecting pulmonary nodules within a short time. This study can help radiologists screen a large number of nodule candidates and provide them with reference information. Moreover, we are convinced that this approach can be generalized and applied in the field of remote medicine. However, CAD systems are still at the development stage and their accuracy needs further improvement. Meanwhile, more attention should be paid in the future to early warning of the formation of pulmonary nodules.