Introduction

Deep convolutional neural networks (CNNs) have been shown to substantially improve common image analysis tasks in computer vision and (bio-)medical imaging. They have in particular advanced research in automatic segmentation and image classification. Dense prediction based on fully convolutional network (FCN) architectures [20] enables very accurate voxel-wise segmentation by a single forward pass of the input image through a trained CNN architecture [6]. However, FCNs also come with a tremendous demand for memory and computational resources that can rarely be satisfied in clinical scenarios—in particular when envisioning a mobile application of computer-assisted diagnosis and interventions. Furthermore, the translation of deep learning into interactive clinical workflows will require processing times of a few seconds, which to date have only been achievable using power-demanding GPUs. Surprisingly little research has been undertaken in deep learning for medical image analysis that attempts to limit model complexity. In this work, we address these challenges and present a new technique to advance state-of-the-art CNN and FCN approaches by introducing the TernaryNet—a versatile end-to-end trainable deep learning architecture that drastically reduces the computational and memory demand for inference. We achieve this goal by replacing floating-point matrix multiplications with ternary convolutions (based on sparse binary kernels), with both activations and weights restricted to values of \(\{-1,0,+1\}\). These can be calculated using a masked Hamming distance (an XOR/XNOR operation followed by a popcount) and reduce computational demand by up to a factor of 16. Our approach is not merely motivated by gains in computational performance, but also by the theoretical advantages of explicit sparsity promotion, which reduces the risk of overfitting (as detailed in the following subsection) and leads to more plausible neural network models. Our work extends recent approaches from computer vision that relied on binary convolutions [25], ternary weight networks [18], hashing by continuation [2] and our initial work on sparse binary convolutions [10]. The presented approach is to the best of our knowledge the first to use binary convolutions for semantic segmentation and the very first to propose ternary convolutions (and not only ternary weights, since activations are also restricted) based on masked Hamming distances.

The TernaryNet can be employed for any given image analysis task, e.g. landmark regression or image-level classification, but we chose to demonstrate its applicability to medical imaging for the automatic voxel-accurate segmentation of the pancreas in CT scans, which is a particularly demanding task. Pancreas segmentation is very important for computer-assisted diagnosis of inflammation (pancreatitis) or cancer and furthermore for providing image-based navigational guidance for interventions, including endoscopy [6]. In the following, we motivate the use of sparse binary kernels in deep convolutional networks and discuss related work on quantisation in image analysis, in particular in deep networks. Section 2 contains the detailed explanation of ternary quantisation and convolutions. Starting with a short discussion of current work on CT pancreas segmentation, we describe our experimental set-up in Sect. 3 and compare different strategies and choices for model complexity reduction. We discuss our results, potentials for further research and future implications of our novel ternary convolution concept in Sect. 4 and end with some concluding remarks.

Motivation for sparse binary kernels: Convolutional neural networks excel in image recognition tasks by mimicking the visual cortex of mammals. The visual information is detected by photoreceptor cells and transmitted and processed using multiple layers of neurons interconnected by synapses. Computational models have the capacity to replicate these mechanisms and can furthermore represent neural activations with extremely high numerical precision (up to 8 decimal digits). However, in nature the simple structure of neural cells and environmental influences severely limit the accuracy of subtle changes in activation, and in addition the need to conserve energy encourages as sparse a use of neural activity as possible. Olshausen and Field [24] and Lee et al. [17] therefore established the idea of sparse coding for pattern recognition and neural networks. Those works demonstrate that powerful convolutional filters can be learned using few nonzero values by means of sparsity-inducing L1 norms and a feature-sign search algorithm. Furthermore, we observe that the nonzero elements of these synthetic models of V1 cells tend to be close to the values +1 and \({-1}\). Therefore, a ternary approximation of weights leads to only minor degradation of representational power (see Fig. 1).

Fig. 1

Left: Visual example of learned synthetic receptive fields (reproducing the results of [17]) using sparse coding techniques. Right: Ternarisation of weights demonstrates the low approximation error for these naturally inspired sparse filters

Related work: Due to their computational efficiency, binary codes and their comparison using the Hamming distance (which counts the number of dissimilar bits in a long binary vector) are becoming increasingly popular for demanding image analysis tasks. They have been employed for hashing-based large-scale image retrieval [3, 35], nearest-neighbour-based segmentation [7] and image registration [8]. In computer vision, binary descriptors are frequently used for real-time applications, e.g. tracking using BRIEF features [1]. There are, however, also cases where binarisation led to inadequate loss in representation quality as, e.g. reported for lung nodule classification in [5].

In our recent prior work [10], we proposed the use of sparse binary kernels with very large receptive fields inspired by BRIEF features and dilated convolutions [30, 34] that enabled highly accurate segmentations without complex network architectures. Similarly and concurrently, [14] proposed local binary convolutions that are derived from local binary patterns. A key limitation of these works is, however, that their design does not allow the nonzero elements within binary kernels to be trained automatically. Instead, they have to be chosen once at random (with a similar manual design as proposed in [1]). We also did not realise binary or ternary activations; thus, efficient computation without floating-point arithmetic was not possible. An alternative solution that has recently been proposed is the use of trained ternary filter weights [18, 37]. In particular, ternary weight networks [18] use a very simple, yet powerful, approximation and learning strategy based on the mild assumption of Gaussian statistics. They generalise the earlier ideas of [4, 25] for learning binary weights and clearly demonstrate that ternarisation drastically reduces the accuracy gap to high-precision weights. Another related approach by Liu et al. [19] employs decomposition methods for sparsification of convolution filters and proposes a new implementation for fast sparse matrix multiplication.

While weight quantisation has quickly matured, another important aspect that has so far been only insufficiently addressed is the quantisation or sparsification of activations. Setting approximately half of the activations to zero using a rectifying linear unit (ReLU) is common practice in deep learning. Yet more drastic quantisation, e.g. using the sign function

$$\begin{aligned} \mathop {\mathrm {sgn}}(x):=(x\ge 0\rightarrow 1)\wedge (x<0\rightarrow -1) \end{aligned}$$
(1)

as nonlinear activation leads to strong artefacts during forward passes and no gradient for backpropagation. Courbariaux et al. [4] therefore propose an ad hoc solution that employs a rectangle (boxcar) function

$$\begin{aligned} \partial \mathop {\mathrm {sgn}}/\partial x\approx (\vert x\vert \le 1\rightarrow 1)\wedge (\vert x\vert >1\rightarrow 0) \end{aligned}$$
(2)

as a replacement, which was later also used in [25]. The downside of this approach is the fact that since two different functions are used during forward and backward propagation the training behaviour is ill-defined and potentially unstable. Cao et al. [2] propose a more justifiable approach based on the continuation of the hyperbolic tangent, which approaches the sign function with increasing slope \(\beta \) in its limit:

$$\begin{aligned} \lim _{\beta \rightarrow \infty }\tanh (\beta x)=\mathop {\mathrm {sgn}}(x) \end{aligned}$$
(3)

They prove the convergence of this optimisation when employing a sequence of increasing values of \(\beta \) during training. They limit the use of this function to the final layer within a framework for supervised hashing. In our work, we extend this concept to a ternary hyperbolic tangent as explained in detail in the following section and apply this function as nonlinearity throughout—for every activation—in our deep network models.
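To make the difference between the two training schemes concrete, the following minimal PyTorch sketch implements the ad hoc straight-through estimator of Eqs. 1 and 2 (hard sign in the forward pass, boxcar gradient in the backward pass); the class name and interface are illustrative and not taken from [4]:

```python
import torch


class SignSTE(torch.autograd.Function):
    """Sign activation with the ad hoc boxcar gradient of Eq. 2."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # forward pass: sgn(x), i.e. +1 for x >= 0 and -1 otherwise (Eq. 1)
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # backward pass: pass gradients through only where |x| <= 1 (Eq. 2)
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)


x = torch.randn(4, requires_grad=True)
SignSTE.apply(x).sum().backward()   # gradient is the boxcar mask rather than zero everywhere
```

In contrast, the continuation of Eq. 3 (and our ternary extension in the next section) uses one and the same differentiable function in both the forward and the backward pass.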

Method

We aim to automatically segment the pancreas in regions of interest extracted from CT volumes. For this purpose, a fully convolutional U-Net architecture [26] is chosen. However, a V-Net [21] or multi-path network would most likely lead to similarly good segmentations and would also support our findings. The U-Net model can contain several million free parameters, rendering it computationally demanding and prone to overfitting. Furthermore, as is common for FCN architectures, efficient inference requires an unexpectedly large amount of memory due to the use of im2col operations. They are necessary to perform multichannel convolutions of all elements in the feature maps in parallel using matrix multiplications between activations of preceding layers and a current filter bank [13]. We propose a ternary quantisation of weights and activations that is generic and therefore applicable to reduce complexity for any (convolutional) neural network architecture including FCNs.
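As an illustration of why im2col dominates the memory footprint, the short sketch below (with arbitrarily chosen tensor sizes, not those of our network) unfolds a feature map with torch.nn.functional.unfold so that a multichannel convolution becomes a single matrix multiplication; for a \(3\times 3\) kernel the unfolded tensor is nine times larger than the feature map itself:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 128, 128)              # feature map with 64 channels (example size)
w = torch.randn(128, 64, 3, 3)                # filter bank with 128 output channels

# im2col: every 3x3 patch is copied into one column of a large matrix
cols = F.unfold(x, kernel_size=3, padding=1)  # shape (1, 64*3*3, 128*128)
out = (w.view(128, -1) @ cols).reshape(1, 128, 128, 128)   # convolution as matrix multiplication

print(cols.numel() / x.numel())               # -> 9.0, i.e. a ninefold memory increase
```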

Fig. 2

Visualisation of proposed ternary hyperbolic tangent as defined in Eq. 5 showing varying \(\beta \) values for increasing steepness of slopes. The analytical derivative of our new nonlinearity is shown for \(\beta =3\) on the right

Ternary weights: In order to limit the memory demand, reduce model complexity and enable inference of CNNs in practical clinical environments, it is desirable to reduce the precision of both activations and weights. Following the recent work of Li et al. [18], we aim to find the best approximation to the filter weights \(\mathbf {W}\approx \alpha {\tilde{\mathbf {W}}}\) where \(\alpha \) describes a (floating-point) scaling parameter and \({\tilde{\mathbf {W}}}\) consists of only ternary values \(\{-1,0,1\}\). It is shown in [18] that the minimal quantisation error can be obtained by calculating:

$$\begin{aligned} {\tilde{\mathbf {W}}}_i={\left\{ \begin{array}{ll} +1&{}\text { if }W_i>\Delta \\ 0&{}\text { if }\vert W_i\vert \le \Delta \\ -1&{}\text {else} \end{array}\right. } \text { with }\Delta =\frac{0.7}{n}\sum _{i=1}^n\vert W_i\vert \end{aligned}$$
(4)

and \(\alpha =\frac{1}{n_{\Delta }}\sum _i\vert \tilde{W}_i\vert \vert W_i\vert \) with \(n_{\Delta }=\sum _i\vert \tilde{W}_i\vert \). When employing quantised weights during the training of a network using stochastic gradient descent with mini-batches (i.e. in virtually any case of deep learning), it is strongly advisable [4] to accumulate gradient updates with full precision (while using \({\tilde{\mathbf {W}}}\) for both forward and backward passes); otherwise, they would usually not exceed the threshold (according to Eq. 4) necessary to flip individual bits. This simple and straightforward ternary weight approximation already yields excellent accuracies for classification tasks (only 3.6% lower top-1 scores for ImageNet compared to full-precision networks [18]).
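A short PyTorch sketch of this ternary weight approximation (Eq. 4 together with the scaling \(\alpha \)) could look as follows; for brevity the threshold is computed over a whole filter bank, whereas a per-filter variant is analogous:

```python
import torch


def ternarise_weights(W):
    """Approximate W by alpha * W_t with W_t in {-1, 0, +1} following Eq. 4."""
    delta = 0.7 * W.abs().mean()                      # threshold Delta = 0.7/n * sum |W_i|
    W_t = (W > delta).float() - (W < -delta).float()  # +1 / -1 outside the dead zone, 0 inside
    n_delta = W_t.abs().sum().clamp(min=1)            # number of nonzero ternary entries
    alpha = (W_t.abs() * W.abs()).sum() / n_delta     # optimal scaling factor
    return W_t, alpha


W = torch.randn(16, 8, 3, 3)                          # example filter bank
W_t, alpha = ternarise_weights(W)
print((W - alpha * W_t).abs().mean().item())          # mean quantisation error
```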

Ternary activations: The use of ternary weight approximations alone, however, cannot reduce the huge memory and computational demand required to store and process intermediate feature maps, since the resulting activations will still be full precision. The key contribution of our work is therefore the introduction of a new activation function that enables an accurate ternarisation of intermediate features in a neural network, which we coin ternary hyperbolic tangent. This proposed function \(\mathop {\mathrm {ternTanh}}(x)\) combines two hyperbolic tangents to form plateaus around zero and beyond +1 and -1:

$$\begin{aligned} \mathop {\mathrm {ternTanh}}(x) = \frac{1}{2}\tanh (2\beta x-\beta ) - \frac{1}{2}\tanh (-2\beta x-\beta ) \end{aligned}$$
(5)

In contrast to a sign function, the ternary hyperbolic tangent is fully differentiable and can therefore be used without custom changes to the learning procedure of deep networks. The parameter \(\beta \) controls the slope and can be varied throughout the process of learning. In early iterations, it is beneficial to use smaller values for \(\beta \) to enable sufficient gradient flow and avoid “dying” neurons. Eventually, we aim for a discrete step function \(\mathop {\mathrm {tern}}(x)\) that can be defined as:

$$\begin{aligned} \mathop {\mathrm {tern}}(x)={\left\{ \begin{array}{ll} +1&{}\text { if }x>0.5\\ 0&{}\text { if }\vert x\vert \le 0.5\\ -1&{}\text {else} \end{array}\right. } \end{aligned}$$
(6)

Similarly to the binary case covered in [2] above, the following continuation holds (see Fig. 2 for a visual example):

$$\begin{aligned} \lim _{\beta \rightarrow \infty }\mathop {\mathrm {ternTanh}}(x)=\mathop {\mathrm {tern}}(x) \end{aligned}$$
(7)
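Both functions are straightforward to express in PyTorch; the short sketch below (our own illustration, not the released code) also shows numerically how the smooth surrogate approaches the hard ternarisation as \(\beta \) grows:

```python
import torch


def tern_tanh(x, beta):
    """Differentiable ternary hyperbolic tangent of Eq. 5."""
    return 0.5 * torch.tanh(2 * beta * x - beta) - 0.5 * torch.tanh(-2 * beta * x - beta)


def tern(x):
    """Hard ternary step of Eq. 6, used at inference time."""
    return torch.where(x > 0.5, torch.ones_like(x),
                       torch.where(x < -0.5, -torch.ones_like(x), torch.zeros_like(x)))


x = torch.linspace(-2, 2, 11)                 # sample points away from the +-0.5 thresholds
for beta in (3.0, 8.0, 50.0):
    gap = (tern_tanh(x, beta) - tern(x)).abs().max().item()
    print(f"beta={beta}: max |ternTanh - tern| = {gap:.3f}")   # shrinks as beta increases
```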
Fig. 3

Visual example for the computation of ternary convolutions without floating-point operations. Ternary values are encoded by a sign bit and a value bit, corresponding to the tensors \(\mathbf {I}^s, \mathbf {I}^v\) and \(\mathbf {W}^s, \mathbf {W}^v\) introduced in the text. The approximation of a ternary filter bank provides scaling parameters \(\alpha \) (see Eq. 4). Ternary convolutions can be computed by masked XOR and XNOR operators followed by a bit-count according to Eq. 9. The output is batch normalised and passed on to the nonlinearity visualised in Fig. 2

Ternary convolutions and complexity analysis: By combining ternary weights and ternary activations, we can avoid the time-consuming floating-point multiplications that are at the core of classical deep learning architectures. In [4, 25], the idea of replacing full-precision inner products of an input tensor \(\mathbf {I}\) and a filter bank \(\mathbf {W}\) by Boolean operations and bit counting (population count) was explored for binary valued operands, i.e. \(\mathbf {I},\mathbf {W}\in \{-1,+1\}^c\), where c denotes the size of a kernel (including both spatial extent and number of features). It is straightforward to show that a matrix multiplication and its inner products can be efficiently calculated in the Hamming space:

$$\begin{aligned} \mathbf {I}_i\mathbf {W}_j = c-2\varXi \{\mathbf {I}_i\oplus \mathbf {W}_j\} \end{aligned}$$
(8)

where \(\oplus \) defines an exclusive OR (XOR) operator and \(\varXi \) a bit-count over the c bits in the rows of \(\mathbf {I}\) and \(\mathbf {W}\). Modern CPUs, FPGAs or embedded SoCs all contain instructions for efficiently calculating population counts of 64-bit strings in few cycles (using AVX extensions, Intel CPUs achieve a throughput of 0.5 cycles [23]). This means that each bit-count replaces 64 floating-point multiplications and additions. Even when considering the highly optimised fused multiply-add (FMA) instructions on 256-bit-wide registers (_mm256_fmadd_ps), which are employed on modern Intel CPUs and can process 8 packed FMAs in parallel in 0.5 cycles, we can gain a speed-up by a factor of 8. When considering equal power consumption (floating-point operations require more complex logic), the improvements are higher still.
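The following small sketch (plain Python on 64-bit words, with bin(x).count("1") standing in for the hardware popcount instruction) verifies Eq. 8 against an ordinary inner product:

```python
import random

c = 64                                              # kernel size in bits (one machine word)
a = [random.choice((-1, 1)) for _ in range(c)]      # binarised activations
w = [random.choice((-1, 1)) for _ in range(c)]      # binarised weights

# pack {-1,+1} vectors into bit strings: bit 1 encodes -1, bit 0 encodes +1
pack = lambda v: sum(1 << i for i, x in enumerate(v) if x < 0)
A, W = pack(a), pack(w)
popcount = lambda x: bin(x).count("1")              # a single instruction on modern CPUs

# Eq. 8: equal bits contribute +1 to the dot product, differing bits contribute -1
assert sum(x * y for x, y in zip(a, w)) == c - 2 * popcount(A ^ W)
```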

Since previous work on binary quantisation of deep learning architectures [4, 25] has led to severe accuracy reductions of 12–20% for image classification tasks, we aim to extend the concept of bit counting as a replacement for matrix multiplications to ternary valued networks with \(\mathbf {I},\mathbf {W}\in \{-1,0,+1\}^c\). As shown in Fig. 3, we can store ternary tensors using 2 bits per entry that encode the sign and value, respectively. We denote these two tensors as \(\mathbf {I}^s, \mathbf {I}^v\in \{0,1\}^c\) and \(\mathbf {W}^s, \mathbf {W}^v\in \{0,1\}^c\). The inner product calculation can then be realised using two bit-counts in Hamming space:

$$ \begin{aligned}&\mathbf {I}_i\mathbf {W}_j = \varXi \left\{ \overline{(\mathbf {I}_i^s\oplus \mathbf {W}_j^s)} \& (\mathbf {I}_i^v \& \mathbf {W}_j^v)\right\} \nonumber \\&\quad - \varXi \left\{ (\mathbf {I}_i^s\oplus \mathbf {W}_j^s) \& (\mathbf {I}_i^v \& \mathbf {W}_j^v)\right\} \end{aligned}$$
(9)

Here, \& defines a bitwise AND operator, \(\oplus \) the exclusive OR (XOR) and \(\overline{A\oplus B}\) the negated XOR (XNOR). A more intuitive interpretation is that all operations involving a zero value are excluded (the value bits of both operands must be set) and the first part of the equation counts all positive elements of a dot product, i.e. \(+1\cdot +1\) and \(-1\cdot -1\), while the second part subtracts the number of times an opposing-sign multiplication occurs. The complete concept of an individual building block for ternary convolutions in deep networks is shown in Fig. 3. In practice, further speed-ups (halving the number of bit-counts) are possible when training the weight quantisation to follow a specified degree of sparsity, e.g. by replacing the rule derived in Eq. 4 and specifying \(\Delta \) so that in each kernel exactly 50% of entries are zero.
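A plain-Python sketch of the ternary inner product of Eq. 9, again with a software popcount and our own 2-bit packing convention (sign bit set for \(-1\), value bit set for nonzero entries):

```python
import random

c = 64
a = [random.choice((-1, 0, 1)) for _ in range(c)]   # ternarised activations
w = [random.choice((-1, 0, 1)) for _ in range(c)]   # ternarised weights

# 2-bit encoding per entry: sign bit (1 for negative) and value bit (1 for nonzero)
pack = lambda v: (sum(1 << i for i, x in enumerate(v) if x < 0),
                  sum(1 << i for i, x in enumerate(v) if x != 0))
(As, Av), (Ws, Wv) = pack(a), pack(w)
popcount = lambda x: bin(x).count("1")

mask = Av & Wv                                      # keep only positions where both are nonzero
positives = popcount(~(As ^ Ws) & mask)             # +1*+1 and -1*-1 products (XNOR of signs)
negatives = popcount((As ^ Ws) & mask)              # +1*-1 and -1*+1 products (XOR of signs)

# Eq. 9: the dot product is the difference of the two masked bit-counts
assert sum(x * y for x, y in zip(a, w)) == positives - negatives
```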

In summary, each module in our proposed TernaryNet architecture comprises a ternary approximation of filter weights together with a ternarisation of activations to enable low-power, high-speed ternary convolutions without floating-point operations. During training, both the weight updates for mini-batch optimisation and the activations using the new ternary hyperbolic tangent \(\mathop {\mathrm {ternTanh}}\) are kept at full precision to enable gradient flow and precise learning. By extending the strategy of [2] to ternary activations and applying a continuously increasing slope \(\beta \) during training, the network learns to cope with sparse and quantised activations, which is vital in order to avoid diverging objectives between training and testing. Batch-normalisation layers [12] are inserted between ternary convolutions and activations to accelerate the learning process and keep a zero mean of feature responses as well as an approximately unit normal distribution to ensure the nonlinearity is not easily saturated. A trained model can be stored using only 2 bits per weight and one (full-precision) scalar weighting value per feature channel—reducing the required memory by more than an order of magnitude. During model inference on unseen data, we employ the hard quantisation of Eq. 6 and thereby enable the use of Hamming distances for ternary convolutions. It is important to note that all common architectural design choices of modern deep networks, such as skip connections [26], dilated kernels [10, 34] or dense feature concatenation [6, 11], are usable with ternary convolutions.
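Putting the pieces together, one TernaryNet building block (ternary convolution, batch normalisation, ternary activation as in Fig. 3) might be sketched in PyTorch as follows. This is a simplified illustration rather than our released implementation: weight ternarisation uses a straight-through trick instead of the copy/restore scheme described in the training details, inference falls back to an ordinary convolution rather than the popcount kernels of Eq. 9, and class and argument names are our own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TernaryConvBlock(nn.Module):
    """Ternary convolution -> batch norm -> ternary hyperbolic tangent (cf. Fig. 3)."""

    def __init__(self, in_ch, out_ch, beta=3.0):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.beta = beta                              # slope, increased during training

    def forward(self, x):
        W = self.conv.weight
        delta = 0.7 * W.abs().mean()                  # Eq. 4: threshold and ternary weights
        W_t = (W > delta).float() - (W < -delta).float()
        alpha = (W_t.abs() * W.abs()).sum() / W_t.abs().sum().clamp(min=1)
        W_q = W + (alpha * W_t - W).detach()          # gradients still reach the full-precision W
        x = self.bn(F.conv2d(x, W_q, padding=1))
        if self.training:                             # smooth surrogate (Eq. 5)
            b = self.beta
            return 0.5 * torch.tanh(2 * b * x - b) - 0.5 * torch.tanh(-2 * b * x - b)
        return torch.where(x > 0.5, torch.ones_like(x),   # hard ternarisation (Eq. 6)
                           torch.where(x < -0.5, -torch.ones_like(x), torch.zeros_like(x)))
```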

Experiments

To demonstrate the usefulness of TernaryNets for highly efficient medical image analysis, we explore the dense prediction (semantic segmentation) of the pancreas in CT. The extension of our model to multi-organ labelling is straightforward. Providing image guidance for interventional tasks relies on fast inference executed on common clinical workstations or even mobile devices. We therefore also analyse in detail the computational operations and memory requirements in our experiments. The highly variable shape and a relatively poor contrast of the pancreas as well as confusable neighbouring abdominal anatomies make this segmentation very difficult. Therefore, networks with large receptive fields are required to robustly capture sufficient regional context, while at the same time an automatic method should delineate local object boundaries accurately and avoid oversegmentation of similar neighbouring structures within the field of view. Our experiments are based on the public NIH dataset described in [27]. It comprises 82 high-resolution CT scans along with accurate manual segmentations for training and validation.

Comparison to state of the art: Several approaches have been evaluated in the last few years on the NIH dataset and a similar corpus of abdominal CT scans (the BCV challenge data described in [32]). Accuracies for pancreas segmentation without CNNs are often relatively low, e.g. overlap scores of 40 and 49% have been reported for two different multi-atlas techniques in [31]. Employing discrete registration within multi-atlas label fusion [9] improved accuracies for pancreas segmentation to 74% Dice, ranking first at the MICCAI 2015 BCV challenge. The approach of [16] reached 60% overlap within the same challenge by combining registration-based localisation with deep CNNs. Roth et al. achieved a Dice score of 71% [27] on the NIH dataset when combining supervoxel-based deep region regression with CNN patch classification and could further improve their accuracy to 78% [28] using holistically nested networks together with random forest classifiers. Very recently, Zhou et al. [36] achieved an astonishing performance of 82% on the NIH data by training an iterative sequence of multiple (coarse-to-fine) deep CNNs. The use of densely connected layers within a V-Net architecture (Dense V-Net [6]) resulted in a Dice overlap of 66% (on both NIH and BCV datasets); this is also the only one of the mentioned deep learning approaches that does not rely on a combination of classifiers or registration. In our own previous work [10], we reached 65% Dice for the BCV dataset using (untrained) sparse binary convolutions that enabled huge receptive fields but no binary (or ternary) convolutions.

Baseline model: Our aim is not necessarily to surpass current state-of-the-art accuracies, but to demonstrate and analyse the effects of network model quantisation. We therefore employ a four-level fully convolutional U-Net architecture [26] as an exemplary baseline. To fairly assess the influence of binarisation and ternarisation, we employ the same number of channels and convolution filters for all compared models and use hyperbolic tangents (except for the final prediction layer) as the baseline activation function. Table 1 lists the details of the chosen architecture, including the number of floating-point operations (FMAs) required per layer. The resulting receptive field of our networks is 36 voxels. Using floating-point precision, the network requires 2.6 million weights and thus 10.6 MBytes of storage for the model weights. During training, the model requires more than 5 GBytes of memory (using a mini-batch size of 10). For inference, this can be reduced to approximately 1 GByte.

Table 1 Description of baseline U-Net model

Compared models: We have analysed in total seven variants of our baseline network to explore the effect of sparsity and quantisation on both activations and filter weights. Starting from the same baseline model, we define our TernaryNet by approximating weights using the ternary quantisation of Eq. 4 as proposed in [18]. The first layer always acts on continuous input and, similarly to previous work on binarisation [4, 25], we perform no weight quantisation for it. As evident from the layer details in Table 1, the computational demand of this layer, with 1.76% of total MFlops and 0.17% of all weights, is negligible. During training we varied the value of \(\beta \) in Eq. 7 linearly (and evenly with epochs) from 3.0 to 8.0 following the principle of continuation of [2]. The variant no continuation uses a fixed \(\beta =3\) for all epochs. To quantify whether our approach successfully reduces quantisation loss, we also compare a variant without quantisation that does not realise ternary convolutions. For binary convolutional networks (termed XNORnet [25], see Eq. 8), we explore the ad hoc gradient approximation according to the seminal work in [4]. As an alternative, we adopt the continuation (see Eq. 3) for a classical \(\tanh \) nonlinearity. Finally, the full-precision network is compared with ReLU activations for completeness.

Data processing: We resampled the original scans of the NIH dataset, which had axial dimensions of \(512 \times 512\) and 181–466 slices with thicknesses between 0.5 and 1.0 mm, to isotropic voxel sizes of 1.0 mm\(^3\). We then performed a region-of-interest cropping with bounding boxes of dimensions \(194 \times 122 \times 138\) around the pancreas, yielding an approximate density of 2% for organ voxels (and 98% background). There exist several accurate algorithms that automatically predict bounding boxes and/or organ locations, e.g. [29, 33], which could be employed for this task, so it was considered out of the scope of our study. Subsequently, we applied a zero-mean unit-variance transformation to the cropped CT volumes. Following related work on pancreas segmentation using CNNs [27, 36], we employ only 2D convolutions, but provide a stack of several neighbouring slices (15 in our experiments) to each network. The output for each stack is a probabilistic map of foreground and background probabilities for the given central slice. No form of post-processing is employed, which could potentially further increase accuracy, but would also influence the assessment of differences across methods.
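A hedged sketch of this preprocessing (resampling to 1 mm isotropic voxels, intensity normalisation and slicing into 15-slice input stacks); function and variable names are our own and the bounding-box cropping is assumed to have happened beforehand:

```python
import torch
import torch.nn.functional as F


def preprocess(volume, spacing_mm, n_slices=15):
    """Resample a cropped CT volume to 1 mm isotropic voxels, normalise it and
    cut it into stacks of neighbouring slices (one stack per central slice)."""
    vol = torch.as_tensor(volume, dtype=torch.float32)[None, None]           # (1, 1, D, H, W)
    target = [int(round(d * s)) for d, s in zip(vol.shape[2:], spacing_mm)]  # size at 1 mm spacing
    vol = F.interpolate(vol, size=target, mode="trilinear", align_corners=False)[0, 0]
    vol = (vol - vol.mean()) / vol.std()                                     # zero mean, unit variance

    half = n_slices // 2
    stacks = [vol[z - half:z + half + 1] for z in range(half, vol.shape[0] - half)]
    return torch.stack(stacks)             # (num_stacks, n_slices, H, W): slices act as channels
```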

Table 2 Dice overlap measures of pancreas for 82 CT scans (fivefold cross-validation)

Training and implementation details: We use a mini-batch size of 10 and Adam as optimiser with an initial learning rate of 0.0025. Each network is trained for 40 epochs with 150 iterations (1500 3D input stacks) without early stopping. The hyperparameters are not specifically optimised and are kept the same for all variants. Since we encountered a huge class imbalance, we use a weighted cross-entropy loss with 0.5 for background and 2.5 for organ pixels; alternatively, a Dice loss function [21] could deal with this automatically. We trained 5 separate folds of training and validation splits using 65–66 scans for training and 16–17 for testing. The derivatives of our ternary activation and the equivalent binary \(\tanh (x)\) can be found analytically (using automatic differentiation); for the ad hoc approximation of binary activations in Eq. 2, we implemented a custom forward and backward pass. When approximating filter kernels, we keep a copy of the full-precision weights, perform the quantisation before the forward pass and restore the original values after the backward pass and before calling the optimiser that performs a gradient step. To enable a reproduction of our results and further research, our pytorch implementation and pre-trained models will be made publicly available after submission at https://github.com/mattiaspaul/TernaryNet.
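For concreteness, the quantise/restore procedure and the continuation schedule described above can be sketched as the following training loop; `model`, `loader`, the `set_beta` helper and the layer-name check are placeholders of our own and not part of the released implementation:

```python
import torch
import torch.nn.functional as F

optimiser = torch.optim.Adam(model.parameters(), lr=0.0025)
class_weights = torch.tensor([0.5, 2.5])                    # background vs. pancreas weighting

for epoch in range(40):
    beta = 3.0 + (8.0 - 3.0) * epoch / 39                   # linear continuation of the slope
    model.set_beta(beta)                                    # assumed helper on the network
    for stacks, labels in loader:                           # 15-slice stacks, 2D label maps
        backup = {n: p.detach().clone() for n, p in model.named_parameters()}
        for name, p in model.named_parameters():            # quantise before the forward pass
            if "conv" in name and not name.startswith("first"):  # first layer stays full precision
                W_t, alpha = ternarise_weights(p.data)      # Eq. 4, as sketched above
                p.data = alpha * W_t
        loss = F.cross_entropy(model(stacks), labels, weight=class_weights)
        optimiser.zero_grad()
        loss.backward()                                     # gradients w.r.t. quantised weights
        for name, p in model.named_parameters():            # restore full-precision weights ...
            p.data = backup[name]
        optimiser.step()                                    # ... which accumulate the update
```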

Fig. 4

Top row: A visual comparison of case #12 of the NIH dataset demonstrates small but significant advantages of the ternary quantisation (middle) over the better-performing ad hoc binary activation and quantisation, which oversegments a neighbouring structure (left). Our approach better matches the manual segmentation (right). Bottom row: 3D visualisation of our segmentation shows a very smooth surface (left). Ranked (sorted) Dice scores compared across methods demonstrate that the full-precision model is not significantly better than our heavily quantised TernaryNet. Both binary XNORnet variants perform worse

Results and discussion

The performance of the seven compared models is evaluated quantitatively in terms of Dice overlap between automatic prediction (without further post-processing) and manual annotation. Average Dice values (and standard deviations) are compared in Table 2 alongside with statistical significance tests and memory usage for model parameters.

Fig. 5

Left: By employing the continuation technique with increasing \(\beta \) values during training epochs, we can significantly improve the outcome of our trained networks. The ternary quantisation thereby no longer noticeably affects segmentation quality measured in Dice overlap and approaches the quality of a full-precision U-Net. When comparing these validation curves with training accuracies, only a moderate gap is visible and no overfitting occurs. Right: The observed sparsity (fraction of zero values) in the trained ternary weights increases throughout the training process. This effect is more pronounced for deeper layers with high parameter counts

It can be seen that our proposed ternary convolutions perform on par with full-precision networks, reaching an average Dice of 71.0%. This demonstrates the robustness and high accuracy of our proposed ternary quantisation scheme. The results are also comparable to a number of recent deep learning approaches that all relied on full precision and thus much larger and more complex models. When replacing the \(\tanh (x)\) nonlinearity with a ReLU in the full-precision model, its accuracy can be further improved by 3.8%. However, the presumption that symmetric activations are nowadays unsuitable for reaching high accuracy has been refuted, possibly because the U-Net and similar architectures enable a very good backward flow of gradients through their skip connections. The performance of binary quantisation is significantly lower than that of our approach. This is in particular evident for the variant that uses an analytically differentiable activation (see Fig. 4). We assert that this underlines the importance of sparse activations, which can contain a larger number of zero values—a key feature of our new nonlinearity. Sparse intermediate feature maps enable the network to adapt certain filter banks to specific subproblems while being unaffected by perturbations of unrelated data.

Training one entire model (within 40 epochs) requires about 15 min on an NVIDIA Titan Xp. Inference of the full-precision network on a CPU takes about 80 s. When employing a customised OpenCL implementation for Hamming distance calculation (used for ternary convolutions in Eq. 9), we estimated inference times of 5–7 s using a dual-core mobile CPU. This represents a more than 10\(\times \) speed-up through our contributions. Further speed-ups can be gained by reducing the number of parameters in the expanding path and skipping every other slice in a 3D volume (and interpolating in between) or adjusting the ternary weight quantisation to increase sparsity and reduce the number of population counts.

When analysing the sparsity of filter weights learned by our model across epochs, shown in Fig. 5, one can see a tendency towards an increase in zero values in later layers and later epochs. These findings are supported by [22], which reports more sparsity in deeper layers together with increased accuracy. In comparison with the number of trainable weights in Table 1, it is notable that layers with increased sparsity at the end of training also contain the most free parameters. This indicates that the model automatically avoids overfitting and that sparsity acts as a regulariser. The importance of adapting the slope in our ternary hyperbolic tangent nonlinearity during training is clearly shown in Fig. 5, where the average Dice is plotted across training epochs. Note that the evaluation on validation cases always employs ternary convolutions and accordingly quantises activations using Eq. 6.

Limitations and potential for further extensions: While the results of a TernaryNet come close to those of a full-precision U-Net with hyperbolic tangent activation, there is a loss in accuracy of 3.9% compared to the more common ReLU variant. We empirically found that using a ReLU6 (which caps its output at 6) [15] performs as well (75.7% avg. Dice). Therefore, the performance gap could most likely be closed by increasing the expressiveness of the quantised activation.

Conclusion

We have presented a pioneering approach for ternary convolutions in deep neural networks that relies on both ternarised activations and filter weights. Our work goes beyond previous efforts of binarisation, which have often led to severe model degradation. In our experiments, we demonstrated that the TernaryNet maintains the high segmentation quality of the corresponding full-precision U-Net (around 71% Dice for pancreas CT with further potential for improvements), while realising 10\(\times \) speed improvements and 15\(\times \) lower memory requirements. This is in particular important when executing model inference for image-guided interventions on clinical or mobile computing hardware. We believe that the detailed description, publicly available implementation and convincing empirical findings along with the generality of our approach will help transfer the concept of ternary convolutions to other deep learning applications. We have seen a clear importance of designing a ternary activation that is analytically differentiable based on the underlying hyperbolic tangent nonlinearity, as well as of using a continuous adaptation of its slope during training. This eases the complex training process and results in a high sparsity that is desirable for generalisation and supported by theoretical analysis in the literature. Once proven in other related fields of computer vision, we strongly believe that quantised networks will have an increasing impact and potentially lead to a wider adoption of their underlying computational blocks (population counts) in mobile processors.