Introduction

Deep learning (DL) is a widely used tool in research domains such as computer vision, speech analysis, and natural language processing (NLP). It is particularly suited to areas where large amounts of data need to be analyzed and human-like intelligence is required. The use of deep learning as a machine learning and pattern recognition tool is also becoming an important aspect of medical image analysis. This is evident from a recent special issue on the topic [1], which investigates the initial impact of deep learning in the medical imaging domain. According to the MIT Technology Review, deep learning was among the top ten breakthroughs of 2013 [2]. Medical imaging has long been a diagnostic method in clinical practice. Recent advancements in hardware design, safety procedures, computational resources, and data storage capabilities have greatly benefited the field of medical imaging. Currently, the major application areas of medical image analysis involve segmentation, classification, and abnormality detection using images generated from a wide spectrum of clinical imaging modalities.

Medical image analysis aims to aid radiologists and clinicians in making the diagnostic and treatment process more efficient. Computer-aided detection (CADe) and computer-aided diagnosis (CADx) rely on effective medical image analysis, making its performance crucial, since it directly affects the process of clinical diagnosis and treatment [3, 4]. Therefore, performance measures such as accuracy, F-measure, precision, recall, sensitivity, and specificity are crucial, and high values of these measures are generally desired in medical image analysis. As the availability of digital clinical images grows, methods suited to big data analysis are required. The state of the art in data-centric areas such as computer vision shows that deep learning methods could be the most suitable candidates for this purpose. Deep learning mimics the working of the human brain [5], with a deep architecture composed of multiple layers of transformations. This is similar to the way information is processed in the human brain [6].

Good knowledge of the underlying features in a data collection is required to extract the most relevant ones, which becomes tedious and difficult when a huge collection of data needs to be handled efficiently. A major advantage of deep learning methods is their inherent capability to learn complex features directly from the raw data. This allows us to define systems that do not rely on the hand-crafted features required by most other machine learning techniques. These properties have attracted attention to exploring the benefits of deep learning in medical image analysis. Future medical applications can benefit from the recent advances in deep learning techniques. There are multiple open source DL platforms available, such as Caffe, TensorFlow, Theano, Keras, and Torch, to name a few [7]. Challenges arise from the limited clinical knowledge of DL experts and the limited DL knowledge of clinical experts. A recent tutorial attempts to bridge this gap by providing step-by-step implementation details for applying DL to digital pathology images [8]. In [9], a high-level introduction to medical image segmentation using deep learning is presented along with code. In general, most work using DL techniques follows an open source model, with code made available on platforms such as GitHub. This allows researchers to arrive at a running model relatively quickly when applying these techniques to various medical image analysis tasks. The challenge remains to select an appropriate DL architecture depending on the number of available images and ground truth labels.

In this paper, a detailed review of the current state-of-the-art medical image analysis techniques based on deep convolutional neural networks is presented, along with a summary of the clinically significant key performance parameters achieved using deep learning methods. The rest of the paper is organized as follows. “Medical image analysis” presents a brief introduction to the field of medical image analysis. “Convolutional neural networks (CNNs)” and “Medical image analysis using CNN” present a summary of deep convolutional neural network methods and their applications to medical image analysis. In “Discussion”, the recent advances in deep learning methods for medical image analysis are analyzed. This is followed by the conclusions in “Conclusion”.

Medical image analysis

Medical imaging includes those processes that provide visual information about the human body. Its purpose is to aid radiologists and clinicians in making the diagnostic and treatment process more efficient. Medical imaging is a predominant part of the diagnosis and treatment of diseases and encompasses different imaging modalities. These include X-ray, computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET), and ultrasound, to name a few, as well as hybrid modalities [10]. These modalities play a vital role in capturing anatomical and functional information about different body organs for diagnosis as well as for research [11]. A typology of common medical imaging modalities used for different body parts, generated in radiology and laboratory settings, is shown in Fig. 1. Medical imaging is an essential aid in modern healthcare systems. Machine learning plays a vital role in CADx, with applications in tumor segmentation, cancer detection, classification, image-guided therapy, medical image annotation, and retrieval [12,13,14,15,16,17,18].

Fig. 1 Typology of medical imaging modalities

Segmentation

The process of segmentation divides an image into multiple non-overlapping regions using a set of rules or criteria, such as groups of similar pixels or intrinsic features such as color, contrast, and texture [19]. Segmentation reduces the search area in an image by dividing the original image into two classes, such as object and background. The key aspect of image segmentation is to represent the image in a meaningful form such that it can be conveniently utilized and analyzed. The meaningful information extracted from medical images using segmentation involves shape, volume, relative position of organs, and abnormalities [20, 21]. In [22], an iterative 3D multi-scale Otsu thresholding algorithm is presented for the segmentation of medical images, where the effects of noise and weak edges are eliminated by representing images at multiple levels. In [23], a hybrid algorithm is proposed for automatic segmentation of ultrasound images. The proposed method combines information from spatial-constraint-based kernel fuzzy clustering and distance regularized level set (DRLS) based edge features. Multiple experiments are conducted to evaluate the method on real as well as synthetically generated ultrasound images. A segmentation approach for 3D medical images is presented in [24], in which the system is capable of assessing and comparing the quality of segmentations. The approach is mainly based on statistical shape-based features coupled with an extended hierarchical clustering algorithm, and three different datasets of 3D medical images are used for experimentation. An expectation maximization approach is used for tumor segmentation on the brain tumor image segmentation (BRATS) 2013 dataset. The method achieves considerable performance, but is only tested on a few images from the dataset and is not shown to generalize to all images in the dataset [25].
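As a brief illustration of the thresholding idea behind [22], the following Python sketch performs multi-level Otsu segmentation; scikit-image's threshold_multiotsu is an assumed stand-in here, not the iterative 3D multi-scale algorithm itself, which is not reproduced.

```python
# A minimal sketch of multi-level Otsu thresholding for segmentation.
import numpy as np
from skimage.filters import threshold_multiotsu

def otsu_segment(image, classes=3):
    # Compute classes-1 intensity thresholds that maximize inter-class variance
    thresholds = threshold_multiotsu(image, classes=classes)
    # Assign each pixel a region label 0..classes-1 according to the thresholds
    return np.digitize(image, bins=thresholds)
```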

Detection and classification of abnormality

Abnormality detection in medical images is the process of identifying a certain type of disease, such as a tumor. Traditionally, clinical experts detect abnormalities, but this requires considerable human effort and is time consuming. Therefore, the development of automated systems for the detection of abnormalities is gaining importance. Different methods are presented in the literature for abnormality detection in medical images. In [26], an approach is presented for the detection of brain tumors using MRI segmentation fusion, namely potential field segmentation. The performance of this system is tested on a publicly available MRI benchmark known as brain tumor image segmentation (BRATS). A particle swarm optimization based algorithm for the detection and classification of abnormalities in mammography images is presented in [27], which uses texture features and a support vector machine (SVM) based classifier. In [28], a method is presented for the detection of myocardial abnormalities using cardiac magnetic resonance imaging.

Computer aided detection or diagnosis

A computer-aided diagnosis (CADx) system is used in radiology to assist radiologists and clinical practitioners in interpreting medical images. Such a system is based on algorithms that use machine learning, computer vision, and medical image processing. In clinical practice, a typical CADx system serves as a second reader in making decisions, providing more detailed information about the abnormal region. A typical CADx system consists of the following stages: pre-processing, feature extraction, feature selection, and classification [29]. In the literature, methods have been proposed for the diagnosis of diseases such as fatty liver [30], prostate cancer [29], dry eye [31], Alzheimer's disease [32], and breast cancer [33]. In [34], hybrid features are used for the detection of glaucoma in fundus images. The optic disc is localized by employing a support vector machine trained on local features extracted from the vessels [35]. A hybrid of clinical and image-based features is used for multi-class classification of Alzheimer's disease using the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset with reasonable accuracy [36].

Medical image retrieval

Recent years have witnessed a broad use of computers and digital information systems in hospitals. Picture archiving and communication systems (PACSs) are producing large collections of medical images [37,38,39]. Hospitals and radiology departments produce a large number of medical images, ultimately resulting in huge medical image repositories. An automatic medical image classification and retrieval system is required to deal efficiently with this big data. A specialized medical image retrieval system could assist clinical experts in making critical decisions in disease prognosis and diagnosis. A timely and accurate decision regarding the diagnosis of a patient's disease and its stage can be made by using similar cases retrieved by the retrieval system [40]. Text-based and content-based image retrieval (CBIR) methods have been commonly used for medical image retrieval. Text-based retrieval methods were initially proposed in the 1970s [37], where images were manually annotated with a text-based description. If the textual annotation is done well, the performance of such systems is fast and reliable. Their drawback is that they cannot perform well on un-annotated image databases; moreover, image annotation is not only a subjective matter but also a time-consuming process [41]. In CBIR methods, texture, color, and shape based features are used for searching and retrieving images from large collections of data [42].

A CBIR system based on the line edge singular value pattern (LESVP) is proposed in [43]. In [44], a CBIR system for skin lesion images using a reduced feature vector and classification and regression trees is presented. In [40], a bag of visual words (BoVW) approach is used along with the scale invariant feature transform (SIFT) for the diagnosis of Alzheimer's disease (AD). In [45], a supervised learning framework is presented for biomedical image retrieval, which uses the class label predicted by a classifier for retrieval. It also uses image filtering, similarity fusion, and a multi-class support vector machine classifier. The use of class prediction eliminates irrelevant images and reduces the search area for similarity measurement in large databases [46].

Evaluation metrics for medical image analysis system

A typical medical image analysis system is evaluated using different key performance measures such as accuracy, F1-score, precision, recall, sensitivity, specificity, and the Dice coefficient. Mathematically, these measures are calculated as,

$$ F1_{score} = 2 \times\frac{(Precision \times Recall)}{(Precision + Recall)}, $$
(1)

where,

$$ Precision= \frac{TP}{(TP+FP)}, $$
(2)

and

$$ Recall= \frac{TP}{(TP+FN)}, $$
(3)
$$ Accuracy = \frac{(TP+TN)}{(TP+TN+FP+FN)}, $$
(4)
$$ Sensitivity = \frac{TP}{(TP+FN)}, $$
(5)
$$ Specificity = \frac{TN}{(TN+FP)}, $$
(6)
$$ Dice\ Score =\frac{2 \times |P \cap GT|}{|P|+|GT|}, $$
(7)

where True Positive (TP) represents the number of cases correctly recognized as defected, False Positive (FP) the number of cases incorrectly recognized as defected, True Negative (TN) the number of cases correctly recognized as non-defected, and False Negative (FN) the number of cases incorrectly recognized as non-defected. Note that recall (Eq. 3) and sensitivity (Eq. 5) are identical for binary classification. In Eq. 7, P denotes the prediction given by the system being evaluated for a given testing sample and GT represents the ground truth of the corresponding testing sample.
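A minimal Python sketch of these metrics is given below; the function names and the binary encoding (1 for defected, 0 for non-defected) are illustrative assumptions.

```python
# Illustrative implementation of Eqs. (1)-(7) for binary labels.
import numpy as np

def confusion_counts(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp, fp, tn, fn

def evaluate(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision   = tp / (tp + fp)                      # Eq. (2)
    recall      = tp / (tp + fn)                      # Eq. (3); same as sensitivity, Eq. (5)
    accuracy    = (tp + tn) / (tp + tn + fp + fn)     # Eq. (4)
    specificity = tn / (tn + fp)                      # Eq. (6)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (1)
    return dict(precision=precision, recall=recall, accuracy=accuracy,
                specificity=specificity, f1=f1)

def dice_score(pred_mask, gt_mask):
    # Eq. (7): 2|P intersect GT| / (|P| + |GT|) on binary segmentation masks
    p, g = np.asarray(pred_mask, bool), np.asarray(gt_mask, bool)
    return 2 * np.logical_and(p, g).sum() / (p.sum() + g.sum())
```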

Convolutional Neural Networks (CNNs)

Deep learning is a tool for machine learning, where multiple linear as well as non-linear processing units are arranged in a deep architecture to model high-level abstractions in the data [47]. There are numerous deep learning techniques currently used in a variety of applications. These include auto-encoders, stacked auto-encoders, restricted Boltzmann machines (RBMs), deep belief networks (DBNs), and deep convolutional neural networks (CNNs). In recent years, CNN based methods have gained popularity in vision systems as well as in the medical image analysis domain [48,49,50].

CNNs are biologically inspired variants of multi-layer perceptrons. They recognize visual patterns directly from raw image pixels, in some cases after minimal pre-processing. These deep networks look at small patches of the input image, called receptive fields, through multiple layers of neurons and use shared weights in each convolutional layer. CNNs combine three architectural ideas to ensure, to some extent, invariance to scale, shift, and distortion. The first CNN model, LeNet-5, proposed for recognizing handwritten characters, is presented in [51]. The connections between neurons of adjacent CNN layers are local, i.e., the inputs of hidden units in layer m come from a subset of units in layer m − 1 that have spatially adjacent receptive fields, exploiting local spatial correlation. Additionally, in a CNN each filter h^k is replicated across the whole visual field. The replicated units share the same weight vector and bias to create a feature map, and the gradient of a shared weight is the sum of the gradients of the shared parameters. A feature map is obtained by convolving the input image or feature map with a linear filter, adding a bias, and then applying a non-linear activation function. The bias is added such that it is independent of the output of the previous layer. Bias values allow us to shift the activation function of a node to the left or right; for example, with a sigmoid function, the weights control the steepness of the output, whereas the bias offsets the curve to allow a better fit to the data. Bias values are learned during training and allow an independent variable to control the activation. At a given layer, the kth filter is denoted symbolically as h^k, and it is determined by the weights W^k and bias b_k. The mathematical expression for obtaining a feature map is given as,

$$ h_{ij}^{k} = tanh\left( \left( W^{k} * x\right)_{ij}+ b_{k}\right) , $$
(8)

where tanh represents the hyperbolic tangent function and ∗ denotes the convolution operation. Figure 2 illustrates two hidden layers in a CNN, where layers m − 1 and m have four and two feature maps respectively, i.e., h^0 and h^1, with weights w^1 and w^2. These are calculated from the pixels (neurons) of layer m − 1 by using a 2 × 2 window in the layer below, as shown in Fig. 2 by the colored squares. The weights of these filter maps are 3D tensors, where one dimension indexes the input feature maps, while the other two dimensions provide pixel coordinates. Putting it all together, \(W_{ij}^{kl}\) represents the weight connecting each pixel of the kth feature map at hidden layer m with the pixel at coordinates (i, j) of the lth feature map at hidden layer m − 1.
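The following Python sketch illustrates Eq. (8) for the toy configuration of Fig. 2 (four input feature maps, two output feature maps, 2 × 2 filters); PyTorch and the random tensor shapes are assumed choices for illustration, not part of the original formulation.

```python
# Eq. (8) as code: h^k_ij = tanh((W^k * x)_ij + b_k)
import torch
import torch.nn.functional as F

x = torch.randn(1, 4, 28, 28)   # layer m-1: four input feature maps
W = torch.randn(2, 4, 2, 2)     # two 2x2 filters -> two feature maps at layer m
b = torch.randn(2)              # one shared bias per filter / feature map

# Convolve, add bias, apply the tanh non-linearity
h = torch.tanh(F.conv2d(x, W, bias=b))
print(h.shape)                  # torch.Size([1, 2, 27, 27])
```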

Fig. 2 Hidden layers in a convolutional neural network

Each neuron or node in a deep network is governed by an activation function, which controls its output. Various activation functions are used in the deep learning literature, such as linear, sigmoid, tanh, and the rectified linear unit (ReLU); a broader classification distinguishes linear and non-linear activation functions. A linear function passes the input at a neuron to the output without any change. Since deep network architectures are designed to perform complex mathematical tasks, non-linear activation functions have found widespread success. ReLU and its variants, such as leaky ReLU and parametric ReLU, are non-linear activations used in many deep learning models due to their fast convergence characteristics. Pooling is another important concept in convolutional neural networks, which essentially performs non-linear down-sampling. Different types of pooling are used, such as stochastic, max, and mean pooling. Max pooling divides the input image into non-overlapping rectangular blocks and takes the local maximum of every sub-block as the output. Max pooling provides benefits in two ways: discarding non-maximal values reduces the computation for upper layers, and it provides translational invariance. Concisely, it provides robustness while efficiently reducing the dimension of intermediate feature maps. Mean pooling, on the other hand, replaces the underlying block with its mean value, while in stochastic pooling an activation within the pooling region is selected at random. In addition to down-sampling the feature maps, pooling layers allow learning features for translation and rotation invariant classification [52]. The pooling operation can also be performed on overlapping regions; in circumstances where weak spatial information surrounding the dominant regions of an image is also useful, fractional or overlapping pooling regions can be beneficial [53].
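A short sketch contrasting these activation functions and pooling variants is given below; PyTorch is an assumed choice, and the input shape is illustrative.

```python
# Activations and pooling variants described above, side by side.
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 8, 8)

relu_out  = F.relu(x)                            # max(0, x)
leaky_out = F.leaky_relu(x, negative_slope=0.01) # small slope for negative inputs
tanh_out  = torch.tanh(x)

max_pooled  = F.max_pool2d(x, kernel_size=2)     # non-overlapping 2x2 blocks -> local maxima
mean_pooled = F.avg_pool2d(x, kernel_size=2)     # each block replaced by its mean
overlapping = F.max_pool2d(x, kernel_size=3, stride=2)  # overlapping pooling regions
```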

There are various techniques used in deep learning to make models learn and generalize better, including L1 and L2 regularization, dropout, and batch normalization, to name a few. A major issue when using a deep convolutional neural network (DCNN) is over-fitting of the model during training. It has been shown that dropout can be used successfully to avoid over-fitting [54]. A dropout layer drops randomly selected unit connections and is widely used for regularization. In addition to dropout, batch normalization has also been used successfully for regularization: the input data is divided into mini batches, and it has been shown that batch normalization not only speeds up training but, in some cases, performs regularization that eliminates the need for dropout layers [55]. The performance of a deep learning method is highly dependent on the data. In cases where the availability of data is limited, various augmentation techniques are utilized [56]. These may include random cropping, color jittering, image flipping, and random rotation [57].
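The hedged sketch below combines the tools discussed above: a convolutional block with batch normalization and dropout, and a simple augmentation pipeline; the layer sizes and the chosen torchvision transforms are illustrative assumptions.

```python
# Regularization and augmentation building blocks (illustrative sizes).
import torch.nn as nn
from torchvision import transforms

block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),    # normalizes each mini batch; speeds up training
    nn.ReLU(),
    nn.Dropout2d(p=0.5),   # randomly drops unit connections to curb over-fitting
)

augment = transforms.Compose([
    transforms.RandomCrop(224),                            # random cropping
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # color jittering
    transforms.RandomHorizontalFlip(),                     # image flipping
    transforms.RandomRotation(degrees=10),                 # random rotation
])
```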

Medical image analysis using CNN

There is a wide variety of medical imaging modalities used for clinical prognosis and diagnosis, and in most cases images from different classes look similar, which makes discrimination difficult. Deep learning addresses this problem with network architectures that can learn such difficult information. Hand-crafted features work when expert knowledge about the field is available and generally rest on strict assumptions, which may not hold for certain tasks such as medical image analysis; with hand-crafted features it can therefore be difficult to differentiate between a healthy and a non-healthy image. Moreover, a classifier such as an SVM does not provide an end-to-end solution. Features extracted with techniques such as the scale invariant feature transform (SIFT) are independent of the task or objective function at hand. A sample representation is then built in terms of a bag of words (BoW), a Fisher vector, or some other mechanism, and a classifier such as an SVM is applied to this representation; there is no mechanism for the loss to improve the local features, as feature extraction and classification are decoupled from each other.

On the other hand, a DCNN learns features from the underlying data. These features are data driven and learned in an end-to-end fashion. The strength of a DCNN is that the error signal obtained from the loss function is propagated back to improve the feature extraction part (the CNN filters learned in the initial layers), and hence a DCNN yields better representations. The other advantage is that in the initial layers a DCNN captures edges, blobs, and local structure, whereas the neurons in the higher layers focus more on different parts of human organs, and some of the neurons in the final layers can consider whole organs.

Figure 3 shows a CNN architecture, similar to LeNet-5, for the classification of medical images with N classes, accepting a patch of 32 × 32 pixels from an original 2D medical image. The network has convolutional, max pooling, and fully connected layers. Each convolutional layer generates feature maps of a different size, and the pooling layers reduce the size of the feature maps transferred to the following layers. The fully connected layers at the output produce the required class prediction. The number of parameters required to define a network depends upon the number of layers, the number of neurons in each layer, and the connections between neurons. The training phase ensures that the best possible weights are learned to give high performance on the problem at hand. The advancement in deep learning methods and computational resources has inspired medical imaging researchers to incorporate deep learning in medical image analysis. Recent studies have shown that deep learning algorithms are successfully used for medical image segmentation [58], computer-aided diagnosis [59,60,61], disease detection and classification [62,63,64,65], and medical image retrieval [66, 67].

Fig. 3 A typical convolutional neural network architecture for medical image classification
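As a concrete companion to Fig. 3, the following is a minimal sketch of such a LeNet-5-like classifier for 32 × 32 patches; the channel counts and layer sizes are illustrative assumptions rather than a prescribed architecture.

```python
# A LeNet-5-like CNN: conv + pooling layers followed by fully connected layers.
import torch.nn as nn

def make_cnn(num_classes):
    return nn.Sequential(
        nn.Conv2d(1, 6, kernel_size=5),   # 32x32 input -> 28x28 feature maps
        nn.ReLU(),
        nn.MaxPool2d(2),                  # 28x28 -> 14x14
        nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10
        nn.ReLU(),
        nn.MaxPool2d(2),                  # 10x10 -> 5x5
        nn.Flatten(),
        nn.Linear(16 * 5 * 5, 120),       # fully connected layers
        nn.ReLU(),
        nn.Linear(120, num_classes),      # class prediction for N classes
    )
```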

A deep learning based approach is presented in [68], in which the network uses a convolutional layer in place of a fully connected layer to speed up the segmentation process. A cascaded architecture is utilized, which concatenates the output of the first network with the input of the succeeding network. The network presented in [69] uses small kernels to classify pixels in MR images. The use of small kernels decreases the number of network parameters, allowing deeper networks to be built with less danger of over-fitting. Data augmentation and intensity normalization are performed in a pre-processing step to facilitate training. Another CNN for brain tumor segmentation is presented in [70]. The architecture uses a dropout regularizer to deal with over-fitting, while a max-out layer is used as the activation function. A two-path, eleven-layer deep convolutional neural network is presented in [71] for brain lesion segmentation. The network is trained with a dense training method on 3D patches, and a 3D fully connected conditional random field is used to remove false positives as well as to perform multiple predictions. The CNN based method presented in [72] deals with the problem of contextual information by using a global approach, where an entire MRI slice is taken into account, in contrast to patch based approaches. A re-weighting training procedure is used to deal with the data imbalance problem. A 3D convolutional network for brain tumor segmentation for the BRATS challenge is presented in [73]. The network uses a two-path approach to classify each pixel in an MR image. In [58], a deep convolutional neural network is presented for brain tumor segmentation, where a patch based approach with inception modules is used for training. Dropout, batch normalization, and inception modules are utilized to build the proposed ILinear nexus architecture. The problem of over-fitting, which arises due to scarcity of data, is mitigated by using the dropout regularizer. Table 1 highlights the usage of CNN based architectures for segmentation of medical images.

Table 1 The application of CNN based methods for medical image segmentation

A method for the classification of lung disease using a convolutional neural network is presented in [62], which uses two databases of interstitial lung disease (ILD) CT scans, each scan having a dimension of 512 × 512. A total of 14,696 image patches are derived from the original CT scans and used to train the network. A method based on a convolutional classification restricted Boltzmann machine for lung CT image analysis is presented in [63]. Two different datasets containing lung CT scans are used for the classification of lung tissue and the detection of airway center lines. The network is trained on 32 × 32 image patches selected along a grid with a 16-voxel overlap, and a patch is retained if at least 75% of its voxels belong to the same class. In [64], a framework for body organ recognition is presented based on two-stage multiple-instance deep learning. In the first stage, discriminative and non-informative patches are extracted using a CNN; in the second stage, fine-tuning of the network parameters is performed on the extracted discriminative patches. Experiments are conducted on the classification of a synthetic dataset as well as body part classification of 2D CT slices. In [65], a locality-sensitive deep learning algorithm called spatially constrained convolutional neural networks is presented for the detection and classification of nuclei in histological images of colon cancer. A novel neighboring ensemble predictor is proposed for accurate classification of nuclei and is coupled with the CNN. A large dataset with 20,000 annotated nuclei of four classes of colorectal adenocarcinoma images is used for evaluation. In [66], a deep convolutional neural network is proposed to retrieve multimodal images. An intermodal dataset with five modalities and twenty-four classes is used to train the network for classification. Three fully connected layers at the end of the network are used to extract the features used for retrieval. A content based medical image retrieval (CBMIR) system based on a CNN for radiographic images is proposed in [67], and the image retrieval in medical applications (IRMA) database is used for its evaluation. In [60], a hybrid thyroid nodule diagnosis system is proposed using two pre-trained CNNs, which differ in the number of convolutional and fully connected layers. A soft-max classifier is used for diagnosis, and the results are validated on 15,000 ultrasound images. A semi-supervised deep CNN based learning scheme is proposed for the diagnosis of breast cancer [61] and is trained on a small set of labeled data. In [66], a CNN based approach is proposed for diabetic retinopathy using colored fundus images. The network classifies the images into three classes, i.e., aneurysms, exudates, and haemorrhages, and also provides the diagnosis. The proposed architecture is tested on a dataset comprising 80,000 images. In [74, 75], deep neural networks including GoogLeNet and ResNet are successfully used for multi-class classification of Alzheimer's disease patients using the ADNI dataset. An accuracy of 98.88% is achieved, which is higher than that of traditional machine learning approaches used for Alzheimer's disease detection.

Table 2 highlights CNN applications for detection and classification tasks, computer-aided diagnosis, and medical image retrieval. CNN based networks are successful in application areas dealing with multiple modalities for various tasks in medical image analysis and provide promising results in almost every case, although results can vary with the number of images used, the number of classes, and the choice of DCNN model. Given these successes, it seems that convolutional networks will play a crucial role in the development of future medical image analysis systems. Deep convolutional neural networks have proven to give high performance in the medical image analysis domain compared with other techniques applied in similar application areas. Table 3 summarizes the results of different techniques used for lung pattern classification in ILD disease; the CNN based method outperforms the other methods on the major performance indicators. Table 4 shows a comparison of the performance of a CNN based method and other state-of-the-art computer vision based methods for body organ recognition. It is evident that the CNN based method achieves significant improvement in the key performance indicators.

Table 2 Some recent clinical applications of CNN based methods
Table 3 A comparison of methods used for ILD classification
Table 4 A comparison of CNN based method with other state-of-the-art methods for body organ recognition

Discussion

In this section, various considerations for adopting deep learning methods in medical image analysis are discussed. A roadmap for the future of artificial intelligence in medical image analysis is also drawn in light of the recent success of deep learning in these tasks.

Various deep learning architectures for medical image analysis

The success of convolutional neural networks in medical image analysis is evident from the wide spectrum of recently available literature [79]. Multiple CNN architectures have been reported to deal with the different imaging modalities and tasks involved in medical image analysis [58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74]. These include conventional CNNs, multi-layer networks, cascaded networks, semi- and fully supervised training models, and transfer learning. In most cases, the available data is limited and expert annotations are scarce. In general, shallower networks have been preferred in medical image analysis compared with the very deep CNNs employed in computer vision applications [80, 81]. In [82], a U-shaped network is used for semi-automated segmentation of sparsely annotated volumetric data. This architecture introduces skip connections and uses convolution and deconvolution in a structured manner, as shown in the sketch after this paragraph. A modification to U-Net is proposed in [83] and applied to a variety of medical datasets for segmentation tasks. In [84], a W-shaped network is proposed for the 2D medical image segmentation task. In [85], a volumetric solution is proposed for end-to-end segmentation of prostate cancer. A convolutional-deconvolutional network based on a capsule architecture is proposed in [86] for lung image segmentation and is shown to substantially reduce the number of parameters required compared with the U-Net architecture. This analysis shows that different DCNN architectures are adopted or proposed for medical image analysis, focusing on reducing the parameter space, improving computation time, and handling 3D data. It is generally found that DCNN based architectures have had wider success with medical image data compared with other deep learning frameworks.
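The sketch below illustrates the encoder-decoder-with-skip-connection idea in its simplest form; it is a toy reduction under assumed layer sizes, not the U-Net of [82].

```python
# Minimal encoder-decoder with one skip connection (U-Net-style idea).
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc  = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid  = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up   = nn.ConvTranspose2d(32, 16, 2, stride=2)  # deconvolution
        self.dec  = nn.Conv2d(32, 1, 3, padding=1)           # 16 skip + 16 upsampled channels

    def forward(self, x):
        e = self.enc(x)                     # encoder features, kept for the skip
        m = self.mid(self.down(e))          # bottleneck at half resolution
        u = self.up(m)                      # upsample back to input resolution
        return self.dec(torch.cat([u, e], dim=1))  # skip connection by concatenation

mask = TinyUNet()(torch.randn(1, 1, 64, 64))  # -> (1, 1, 64, 64) segmentation map
```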

3D imaging modalities

A large amount of the data produced in the medical domain contains 3-dimensional information. This is particularly true for volumetric imaging modalities such as CT and MRI, and medical image analysis can benefit from this enriched information. Deep learning methods adopt different strategies to handle 3D information, such as converting 3D volumes into 2D slices and combining features from 2D and multi-view planes to benefit from contextual information [87, 88]. Recent techniques use 3D CNNs to benefit fully from the available information [89, 90]. In [91], a fully 3D DCNN is used for the classification of dysmaturation in neonatal MRI data. In [92], a two-stage network is used for the detection of lacunes of presumed vascular origin, with a fully 3D CNN in the second stage; the performance of the system is close to that of trained raters. In [93], a 3D CNN is used for the segmentation of the cerebral vasculature using 4D CT data. In [94], brain lesion segmentation is performed using a 3D CNN, with a 3D fully connected conditional random field (CRF) used for post-processing. A geometric CNN is proposed in [95] to deal with geometric shapes in medical imaging, particularly targeting brain data. The utilization of 3D CNNs has been limited in the literature due to the network size and the number of parameters involved, which also leads to slow inference because of the 3D convolutions. Hybrid 2D/3D networks and the availability of more compute power are encouraging the use of fully automated 3D network architectures.
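The following brief sketch shows the basic difference a 3D CNN makes: convolution and pooling operate over depth as well as height and width; the volume shape and layer sizes are assumptions for illustration.

```python
# A 3D convolutional block applied to a volumetric scan.
import torch
import torch.nn as nn

volume = torch.randn(1, 1, 32, 64, 64)  # (batch, channel, depth, height, width)

conv3d = nn.Sequential(
    nn.Conv3d(1, 8, kernel_size=3, padding=1),  # convolves across all three spatial axes
    nn.ReLU(),
    nn.MaxPool3d(2),                            # downsamples depth as well as height/width
)
print(conv3d(volume).shape)              # torch.Size([1, 8, 16, 32, 32])
```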

Limitation of deep learning and future prospects

Despite the ability of deep learning methods to give better or higher performance, they have some limitations, which could limit their application in the clinical domain. Deep learning architectures require large amounts of training data and computational power; a lack of computational power leads to longer training times, which depend on the size of the training data used. Most deep learning techniques, such as convolutional neural networks, require labelled data for supervised learning, and manually labelling medical images is a difficult task. These limitations are being overcome with every passing day due to the availability of more computational power, improved data storage facilities, an increasing number of digitally stored medical images, and improved deep network architectures. The application of deep learning in medical image analysis also suffers from the black box problem in AI, where the inputs and outputs are known but the internal representations are not well understood. These methods are also affected by the noise and illumination problems inherent in medical images; noise can be removed with pre-processing steps to improve performance [58].

A possible way to deal with these limitations is transfer learning, where a network pre-trained on a large dataset (such as ImageNet) is used as a starting point for training on medical data. This typically involves reducing the learning rate by one or two orders of magnitude (i.e., if a typical learning rate is 1e−2, reduce it to 1e−3 or 1e−4) and increasing the local learning rate of the newly introduced layers by a factor of 10. Alternatively, the DCNN model can be pre-trained on ImageNet data converted into gray scale images, although it may require more computational resources (such as GPUs) to train on the whole ImageNet dataset. The best option would be to train the DCNN model on large-scale annotated medical image data. The underlying task for such pre-training can be as simple as organ classification [66] or a binary classification of benign versus malignant images, and different modalities, e.g., X-ray, MRI, and CT, can be combined for this task. The resulting pre-trained model can then be used in transfer learning for fine-tuning a network for the particular problem at hand.
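A hedged sketch of this fine-tuning recipe is shown below, assuming PyTorch/torchvision: an ImageNet-pretrained backbone, a newly introduced classification head for a hypothetical benign/malignant task, and a 10× higher local learning rate for the new layer. The exact rates are the illustrative values from the text.

```python
# Transfer learning with per-layer learning rates (illustrative values).
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Linear(model.fc.in_features, 2)  # new head: benign vs. malignant

head_params = list(model.fc.parameters())
base_params = [p for n, p in model.named_parameters() if not n.startswith("fc")]

optimizer = torch.optim.SGD(
    [
        {"params": base_params, "lr": 1e-3},  # reduced rate for pretrained layers
        {"params": head_params, "lr": 1e-2},  # 10x local rate for the new layer
    ],
    lr=1e-3, momentum=0.9,
)
```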

In general, shallow networks are used in situations where data is scarce. One of the most important factors in deep learning is the training data, and its scarcity is partially addressed by transfer learning; even so, more data in the target domain gives better performance. The use of generative adversarial networks (GANs) [96] can also be explored in the medical imaging field in cases where data is scarce. One of the main advantages of transfer learning is that it enables the use of deeper models on relatively small datasets; in general, a deeper DCNN architecture gives better performance.

Conclusion

A comprehensive review of deep learning techniques and their application in the field of medical image analysis is presented. It is concluded that convolutional neural network based deep learning methods are finding greater acceptability in all sub-fields of medical image analysis, including classification, detection, and segmentation. The problems associated with deep learning techniques due to scarce data and limited labels are addressed by using techniques such as data augmentation and transfer learning. For larger datasets, the availability of more compute power and better DL architectures is paving the way for higher performance. This success will ultimately translate into improved computer-aided diagnosis and detection systems, and further research is required to adopt these methods for those imaging modalities where they are not currently applied. The recent success indicates that deep learning techniques would greatly benefit the advancement of medical image analysis.