
1 Introduction

Due to the rapid growth of wireless communication technology, the number of Internet of Things (IoT) devices has increased dramatically in recent years. According to Cisco, 50 billion IoT devices will be connected to the Internet by 2020 [3]. In addition, it was estimated that the data volume generated yearly by those devices will be more than 40 times larger than current global traffic, amounting to 850 Zettabytes [7]. Thus, the existing cloud computing infrastructure would be unable to serve the analysis of this new data well, as it simply does not have enough computing power and network capacity for such a large number of computation tasks. In addition, many AI applications (e.g., autonomous driving) have strict requirements on computation latency.

Therefore, it makes more sense to locate computation closer to the data sources, which is referred to as edge computing or on-device computing. On-device computing has better characteristics in terms of privacy, latency, scalability, reliability, diversity, and cost compared with traditional cloud computing [8, 14, 18, 19].

Deep learning (DL) tasks are usually computationally intensive and require large memory footprints. At the same time, end-devices have limited computing power and memory that is too small to support raw large-scale DL models. Therefore, original DL models are optimized, compressed, distilled, and quantized to reduce their resource cost. There are already many methods for optimizing DL models [1, 6, 12], but most of them address the quantization of matrix multiplication operations, while the quantization of activations (built as nonlinear functions (NLFs)) has not been studied enough. The softmax layer is one of the most popular and important NLFs, but the complexity of its implementation on platforms with limited hardware resources can become a bottleneck for application performance. Thus, we focus on the use of the softmax layer in computer vision tasks as the main application.

In this paper we propose a lightweight method for efficient computation of the softmax layer on devices with limited computational power. The method approximates softmax by taking the reciprocal of the natural exponential function, implemented as a 1-dimensional look-up table (1-D LUT). In Sect. 2, we cover the preliminaries needed to understand softmax and the drawbacks of existing approximation methods, and then propose our method. Section 3 presents the experimental validation of the proposed method on a human segmentation task. Section 4 describes a plan for further extension of our research, and Sect. 5 concludes the paper.

2 Softmax Approximation

2.1 Preliminaries

In mathematics, softmax is a function that takes a vector x of n real numbers as input and normalizes it into a probability distribution P(x) consisting of n probabilities proportional to the exponential of each input number. Thus, after applying softmax, each component lies in the interval σ(x_i) ∈ (0, 1), and the components add up to 1, so that they can be interpreted as probabilities. Softmax is often used in neural networks to map the non-normalized output of a network to a probability distribution over predicted output classes. There are different representations of the softmax function depending on the application [4], but the most well-known and widely accepted version is as follows [11]:

$$\displaystyle \begin{aligned} \sigma(x_i)= \frac{e^{x_i}}{\sum_{j}{e^{x_j}}} {} \end{aligned} $$
(1)

In real hardware the range of number representation is limited, so the e^x computations can easily overflow or underflow. Therefore, for a numerically stable practical implementation, the input is normalized by \(x^*_i = x_i - max(x)\) [11] as shown below:

$$\displaystyle \begin{aligned} \sigma(x_i)= \frac{e^{x_i-max(x)}}{\sum_{j}{e^{x_j-max(x)}}} {} \end{aligned} $$
(2)
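For illustration, here is a minimal NumPy sketch of the numerically stable form in Eq. (2); the function name and the example inputs are ours, not taken from any referenced implementation:

```python
import numpy as np

def softmax_stable(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax, Eq. (2): shift by max(x) before exponentiating."""
    z = x - np.max(x)        # x*_i = x_i - max(x), so all values are <= 0
    e = np.exp(z)            # cannot overflow, since z <= 0
    return e / np.sum(e)     # normalize into a probability distribution

# Example with large inputs that would overflow a naive exp(x)
x = np.array([1000.0, 1001.0, 1003.0])
print(softmax_stable(x))     # ~[0.042, 0.114, 0.844], sums to 1
```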

2.2 Previous Arts

The softmax layer is one of the most important and widely used NLFs in modern AI models due to its smooth properties. However, many modern neural processing unit (NPU) architectures focus on the acceleration of matrix multiplications only, as these account for the majority of computation in a DL model. As a result, to compute a complex NLF (e.g., the softmax layer), the data must be sent out of the NPU to an external host (CPU, GPU, or DSP), which complicates software development and can also negatively impact overall performance and power consumption. In addition, since a general-purpose interface is used for the data transfer, the internal data can be exposed to a malicious user, which raises data privacy concerns.

To avoid involving the host in softmax computation, some NPUs provide dedicated HW accelerators. In many of those implementations, the numerator and the denominator of Eq. (1) are computed first, and then a division operation is performed, e.g., [10, 15, 17]. In that case, the HW accelerator must contain a divider circuit (fixed-point, or even floating-point), which adds HW cost. To avoid the large area cost of traditional dividers, the authors of [5] propose to replace the denominator with the closest 2^b value. Division can then be implemented as a simple bit-shift operation. Although this method decreases the hardware complexity of softmax computation, it still relies on the division operation, which is not always feasible for end-devices with limited computational power.

2.3 Proposed Method

In the general case, we can consider an alternative softmax function σ*(x_i) of the form

$$\displaystyle \begin{aligned} \sigma^*(x_i)= \frac{score(x_i)}{norm(x)} {} \end{aligned} $$
(3)

where score(x_i) and norm(x) are scoring and normalization factors, respectively (for the original function, \(score(x_i) = e^{x_i}\) and \(norm(x) = \sum_{j}{e^{x_j}}\)).

As described in [9], we can list some desirable properties of the alternative softmax function as below:

  • Nonlinearity: for better selectivity of the scored values.

  • Numerical stability: to avoid overflow, or underflow during computation.

  • Positivity: all output values should be positive, so that they can be used for scoring.

  • Boundedness: output values should be bounded by some constant, ideally σ*(x_i) ∈ (0, 1).

  • Computational complexity: should be low enough to be feasible for implementation on a platform with limited HW resources.

Since we consider softmax approximation for inference in computer vision applications, where the softmax layer is mostly used for scoring the outputs for classification, the requirements on the normalization factor norm(x) can be relaxed compared with the original formula in Eq. (1), where inputs are mapped to the corresponding probabilities. Thus, a wider variety of normalization factors can be used in σ*(x_i).

We have experimented with different approximations for the score(x_i) and norm(x) factors, and summarize some of the methods and their properties in Table 1.

Table 1 Softmax approximation methods and their properties

First, we started with simple approximations that ignore the normalization factor entirely, i.e., norm(x) = 1. This gives the identity function (σ*(x_i) = x_i, method 1 in Table 1) and the natural exponential function (\(\sigma ^*(x_i)=e^{x_i}\), method 2 in Table 1). However, despite their low computational complexity, their numerical stability was poor because the output of the function is not bounded. To counter this issue, we applied several normalization factors (see methods 3 to 5 in Table 1), but numerical stability was still poor. At the same time, we noticed that method 2 (exponentiation) shows good selectivity due to its nonlinearity (refer to the corresponding image in Table 3 in the Appendix) and can be a good candidate if normalized appropriately. For this purpose we performed the transformations shown in Eq. (4) below:

$$\displaystyle \begin{aligned} \frac{e^{x_i}}{max{(e^{x})}} = \frac{e^{x_i}}{e^{max(x)}} = e^{x_i - max(x)} = e^{-(max(x)-x_i)} = \frac{1}{e^{max(x)-x_i}} {} \end{aligned} $$
(4)

First, we kept \(e^{x_i}\) as the scoring factor score(x_i) and used max(e^x) as the normalization factor to bound the output by 1, which yields method 6. However, in that case the input values to the exponential function e^x are all non-positive due to the x_i − max(x) term, and if e^x is implemented by a LUT (a common approach for HW with limited computational resources), an additional affine transformation is required to compensate for the negative input values. To avoid this drawback, we propose to apply max-normalization in the inverse way, \(x^*_i = max(x) - x_i\), so that the input values to e^x are all non-negative and can be used directly as indices into the LUT. Second, to compensate for the inverse max-normalization, we use the reciprocal of the exponential function, e^x → 1∕e^x, as shown in Eq. (4).

In this case neither a divider nor a multiplier is needed, and only a 1-D LUT is required to compute the approximated value of softmax. As a result, the computational complexity is reduced significantly, and the method becomes feasible for implementation in HW with limited computational power.

Thus, we propose to substitute the original softmax computation with inverse max-normalization combined with the reciprocal of the exponential function:

$$\displaystyle \begin{aligned} \sigma(x_i)= \frac{e^{x_i}}{\sum_{j}{e^{x_j}}} = \frac{e^{x_i-max(x)}}{\sum_{j}{e^{x_j-max(x)}}} \: \rightarrow \: \sigma^*(x_i)= \frac{1}{e^{max(x)-x_i}} {} \end{aligned} $$
(5)
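As a quick sanity check, here is a minimal NumPy sketch of Eq. (5) in floating point, before any LUT quantization (function and variable names are ours). Because e^x is monotone, the ordering of the outputs, and hence the argmax, matches the exact softmax:

```python
import numpy as np

def softmax_approx(x: np.ndarray) -> np.ndarray:
    """Proposed approximation, Eq. (5): sigma*(x_i) = 1 / e^(max(x) - x_i).

    Outputs lie in (0, 1]; the maximum element always maps to exactly 1,
    and no division by a data-dependent sum is needed.
    """
    d = np.max(x) - x        # inverse max-normalization, d >= 0
    return 1.0 / np.exp(d)   # equivalently np.exp(-d)

x = np.array([2.0, 0.5, -1.0])
print(softmax_approx(x))                              # [1.0, ~0.223, ~0.050]
print(np.argmax(softmax_approx(x)) == np.argmax(x))   # True: ranking is preserved
```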

The properties of the proposed method \(\frac {1}{e^{max(x)-x_i}}\) are as below:

  • Nonlinearity is satisfied by the reciprocal of the exponential function 1∕e^x: \(\frac {1}{e^{\alpha x}} \neq \alpha \frac {1}{e^{x}}\)

  • Numerical stability is satisfied by the max(x) − x_i term: \((max(x)-x_i) \in [0, max(x)-min(x)] \: \rightarrow \: \frac {1}{e^{max(x) - x_i}} \in (0, 1]\)

  • Positive output values are guaranteed by the exponential function: \(\frac {1}{e^{x}} > 0 \: \forall x \in (-\infty , +\infty )\)

  • Boundedness, σ*(x_i) ∈ (0, 1], follows from the inverse normalization term max(x) − x_i used together with the reciprocal of the exponential function 1∕e^x: \((max(x)-x_i) \geq 0\: \forall x \: \rightarrow \: \frac {1}{e^{max(x) - x_i}} \in (0, 1]\)

  • Computational complexity is low, as 1∕e^x can be implemented with a LUT-based method, where the size of the LUT is small.

Indices into the LUT can be calculated directly by a rounding operation as i = ⌊max(x) − x_i⌉. When the input data are quantized with w bits, the efficient quantization boundary x_q can be defined as

$$\displaystyle \begin{aligned} e^{-x_q} &= \frac{1}{2^w-1} \\ ln(e^{-x_q}) &= ln(\frac{1}{2^w-1}) \\ -x_q &= ln(1) - ln(2^w-1) \\ x_q &= ln(2^w-1) \\ x_q &= \lceil ln(2^w-1) \rceil \end{aligned} $$
(6)
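For example, with w = 8 bits, x_q = ⌈ln(2^8 − 1)⌉ = ⌈ln 255⌉ = ⌈5.54⌉ = 6, so LUT entries are needed only for indices 0 through x_q + 1 = 7.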

Content of LUT is computed as shown below:

$$\displaystyle \begin{aligned} LUT_{1/e} [i] = \bigg \lfloor \frac{1}{e^i} \cdot (2^w-1) \bigg \rceil , \forall i = 0, 1, \ldots , x_q + 1 {} \end{aligned} $$
(7)

Note that LUT[i] = 0 for all i > x_q due to quantization, as no nonzero value can be encoded with w bits beyond the efficient quantization boundary x_q.

If the selectivity (precision of computation) is not sufficient, the LUT can be scaled linearly by a factor α as

$$\displaystyle \begin{aligned} LUT_{1/e} [i] = \bigg \lfloor \frac{1}{e^{i/\alpha}} \cdot (2^w-1) \bigg \rceil , \forall i = 0, 1, 2, \ldots , \alpha \cdot (x_q+1) {} \end{aligned} $$
(8)
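Below is a minimal NumPy sketch of the LUT construction in Eqs. (7) and (8) and of the resulting lookup-only evaluation. The index formula used with the scaling factor α, i = ⌊α · (max(x) − x_i)⌉, is our reading of Eq. (8) rather than something stated explicitly above:

```python
import numpy as np

def build_lut(w: int, alpha: int = 1) -> np.ndarray:
    """LUT_{1/e}[i] = round((2^w - 1) / e^(i/alpha)), Eqs. (7)-(8)."""
    x_q = int(np.ceil(np.log(2 ** w - 1)))            # efficient quantization boundary, Eq. (6)
    idx = np.arange(alpha * (x_q + 1) + 1)            # i = 0, 1, ..., alpha * (x_q + 1)
    return np.rint((2 ** w - 1) / np.exp(idx / alpha)).astype(int)

def softmax_lut(x: np.ndarray, lut: np.ndarray, alpha: int = 1) -> np.ndarray:
    """Approximate softmax scores using only table lookups (no divider/multiplier)."""
    i = np.rint(alpha * (np.max(x) - x)).astype(int)  # i = round(alpha * (max(x) - x_i))
    i = np.minimum(i, len(lut) - 1)                   # indices past the boundary map to the last (zero) entry
    return lut[i]

lut8 = build_lut(w=8)                # [255, 94, 35, 13, 5, 2, 1, 0], 8 entries
x = np.array([2.0, 0.5, -1.0])
print(softmax_lut(x, lut8))          # quantized scores; the maximum element scores 255
```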

3 Experimental Validation

To validate the proposed method we used a pre-trained Unet model for human segmentation and an internally prepared dataset of 1000 images. In this model, softmax is used to predict the class of every pixel, which requires 307,200 computations for a typical 640 × 480 image. The model takes an image as input and produces a predicted grey-scale segmentation image, where each pixel P_i is in uint8 precision (with values from 0 to 255). To get the binary segmentation mask, every pixel in those images was binarized into two classes (class 0 for “background,” and class 1 for “human”) using the threshold thr = 127, as follows:

$$\displaystyle \begin{aligned} B^m_i = \begin{cases} 1 \ (\text{``human''}), & P_i > thr \\ 0 \ (\text{``background''}), & P_i \leq thr \end{cases} {} \end{aligned} $$
(9)

In the model, we substituted the conventional softmax layer with the computation method described above in Sect. 2. For the practical implementation, we selected three different precisions (uint8, uint4, uint2) and prepared the LUTs according to Eq. (7). To evaluate segmentation accuracy, we used the well-known bit-wise intersection-over-union metric [13, 20] as shown below:

$$\displaystyle \begin{aligned} IoU = \frac{area(B^m_{gt, i} \bigcap B^m_{p, i})}{area(B^m_{gt, i} \bigcup B^m_{p, i})} {} \end{aligned} $$
(10)

where \(B^m_{gt, i}\) is a pixel group of the ground-truth image, and \(B^m_{p, i}\) is that of the predicted segmentation image. The mIoU value was computed as the mean over the two classes.
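For reference, here is a minimal sketch of the evaluation pipeline as we understand it: threshold binarization per Eq. (9) followed by the bit-wise IoU of Eq. (10), averaged over the two classes. The helper names and the exact per-class averaging are our assumptions:

```python
import numpy as np

def binarize(img: np.ndarray, thr: int = 127) -> np.ndarray:
    """Eq. (9): class 1 ("human") if the pixel value exceeds thr, else class 0 ("background")."""
    return (img > thr).astype(np.uint8)

def iou(mask_gt: np.ndarray, mask_pred: np.ndarray) -> float:
    """Bit-wise intersection-over-union for a single class, Eq. (10)."""
    inter = np.logical_and(mask_gt, mask_pred).sum()
    union = np.logical_or(mask_gt, mask_pred).sum()
    return 1.0 if union == 0 else float(inter) / float(union)

def mean_iou(gt: np.ndarray, pred: np.ndarray) -> float:
    """mIoU: mean of the per-class IoU for 'human' (1) and 'background' (0)."""
    gt_b, pred_b = binarize(gt), binarize(pred)
    iou_human = iou(gt_b == 1, pred_b == 1)
    iou_background = iou(gt_b == 0, pred_b == 0)
    return (iou_human + iou_background) / 2.0
```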

Table 2 shows the results of our experiments for the different approximation methods and the selected precisions (for the LUT-based method). As the table shows, the accuracy of the human segmentation task based on the proposed softmax approximation is as high as the FP32 reference. There is no accuracy drop, or a negligibly small one (< 0.1% for 2-bit quantization), even for a very small LUT size (3 to 8 bytes).

Table 2 Accuracy of different approximation methods. Full test over 1000 images
Table 3 Accuracy of different approximation methods. Initial test over 100 images

4 Future Work

Despite its extremely low computational complexity, the current version of the softmax approximation can only be applied to applications where the softmax layer is used for scoring (typically the last layer in CNN models) and is computed within a single input tensor. Thus, it cannot be directly applied to more complicated and softmax-intensive applications such as Natural Language Processing (NLP) tasks, where cross-tensor probabilities must be computed more often (e.g., the multi-head attention block in the Transformer [16] and BERT [2] models). Therefore, our future work will extend the proposed method to other classes of AI applications.

5 Conclusion

In this paper we have proposed an efficient method for softmax approximation that can be implemented on platforms with limited hardware resources (mobile, IoT, edge devices) for AI inference tasks. We apply max-normalization to the input data in the inverse way, which, together with a LUT-based computation of the reciprocal exponential function 1∕e^x, significantly reduces the complexity of softmax layer computation. It also offers the following additional benefits:

  • does not require any additional multiplier, divider, or adder.

  • scalable in terms of accuracy and precision (appropriate LUTs can be pre-computed off-line).

  • fixed latency of computation, which depends only on the size of the tensor.

Thus, the proposed approach provides a good alternative for HW accelerator design, simplifying the overall process of computing the softmax layer while maintaining accuracy on par with conventional FP32-based computation.