
1 Introduction

Due to the rapid growth of wireless communication technology, the number of Internet of Things (IoT) devices has increased dramatically in recent years. According to Cisco, 50 billion IoT devices will be connected to the Internet by 2020 [3]. In addition, it was estimated that the data volume generated yearly by those devices will be more than 40 times larger than current global traffic, amounting to 850 Zettabytes [7]. Thus, the existing cloud computing infrastructure would be unable to serve the analysis of this new data well, as it simply does not have enough computing power and network capacity for such a large number of computation tasks. In addition, many AI applications (e.g., autonomous driving) have strict requirements on computation latency.

Therefore, it makes more sense to locate computation closer to the data sources, which is referred to as edge computing or on-device computing. On-device computing has better characteristics in terms of privacy, latency, scalability, reliability, diversity, and cost compared with traditional cloud computing [8, 14, 18, 19].

Deep learning (DL) tasks are usually computationally intensive and require large memory footprints. At the same time, end-devices have limited computing power and memory that is too small to support raw large-scale DL models. Therefore, original DL models are optimized, compressed, distilled, and quantized to reduce their resource cost. There are already many methods for optimizing DL models [1, 6, 12], but most of them address the quantization of matrix multiplication operations, while the quantization of activations (built as nonlinear functions (NLFs)) has not been studied enough. The softmax layer is one of the most popular and important NLFs, but the complexity of its implementation on platforms with limited hardware resources can become a bottleneck for application performance. Thus, we focus on the use of the softmax layer in computer vision tasks as the main application.

In this paper we propose a lightweight method for efficient computation of the softmax layer on devices with limited computational power. The method approximates softmax by taking the reciprocal of the natural exponential function, implemented as a 1-dimensional look-up table (1-D LUT). In Sect. 2, we cover the preliminaries needed to understand softmax and the drawbacks of existing approximation methods, and then propose our method. Section 3 presents the experimental validation of the proposed method on a human segmentation task. Section 4 describes a plan for further extension of our research, and Sect. 5 concludes the paper.

2 Softmax Approximation

2.1 Preliminaries

In mathematics, softmax is a function that takes a vector x of n real numbers as input and normalizes it into a probability distribution P(x) consisting of n probabilities proportional to the exponential of each input number. Thus, after applying softmax, each component lies in the interval σ(x_i) ∈ (0, 1), and the components add up to 1, so that they can be interpreted as probabilities. Softmax is often used in neural networks to map the non-normalized output of a network to a probability distribution over predicted output classes. There are different representations of the softmax function depending on the application [4], but the most well-known and widely accepted version is as follows [11]:

$$\displaystyle \begin{aligned} \sigma(x_i)= \frac{e^{x_i}}{\sum_{j}{e^{x_j}}} {} \end{aligned} $$
(1)

In real hardware the range of number representation is limited, so the e^x computations can easily overflow or underflow. Therefore, for a numerically stable practical implementation, the input is normalized by \(x^*_i = x_i - max(x)\) [11] as shown below:

$$\displaystyle \begin{aligned} \sigma(x_i)= \frac{e^{x_i-max(x)}}{\sum_{j}{e^{x_j-max(x)}}} {} \end{aligned} $$
(2)
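For illustration, here is a minimal NumPy sketch of the numerically stable form in Eq. (2); the function name and the example inputs are ours, not taken from any referenced implementation:

```python
import numpy as np

def softmax_stable(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax, Eq. (2): shift by max(x) before exponentiating."""
    z = x - np.max(x)        # x*_i = x_i - max(x), so all values are <= 0
    e = np.exp(z)            # cannot overflow, since z <= 0
    return e / np.sum(e)     # normalize into a probability distribution

# Example with large inputs that would overflow a naive exp(x)
x = np.array([1000.0, 1001.0, 1003.0])
print(softmax_stable(x))     # ~[0.042, 0.114, 0.844], sums to 1
```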

2.2 Previous Arts

The softmax layer is one of the most important and widely used NLFs in modern AI models due to its smooth properties. However, many modern neural processing unit (NPU) architectures focus on the acceleration of matrix multiplications only, as these account for the majority of computation in a DL model. As a result, to compute a complex NLF (e.g., the softmax layer), the data must be sent out of the NPU to an external host (CPU, GPU, or DSP), which complicates software development and can also negatively impact overall performance and power consumption. In addition, since a general-purpose interface is used for the data transfer, the internal data can be exposed to a malicious user, which raises data privacy concerns.

To avoid involving the host in softmax computation, some NPUs provide dedicated HW accelerators. In many of those implementations, the numerator and the denominator of Eq. (1) are computed first, and then a division operation is performed, e.g., [10, 15, 17]. In that case, the HW accelerator must contain a divider circuit (fixed-point, or even floating-point), which adds HW cost. To avoid the large area cost of traditional dividers, the authors of [5] propose to replace the denominator with the closest 2^b value. Division can then be implemented as a simple bit-shift operation. Although this method decreases the hardware complexity of softmax computation, it still relies on the division operation, which is not always feasible for end-devices with limited computational power.

2.3 Proposed Method

In the general case, we can consider an alternative softmax function σ*(x_i) of the form

$$\displaystyle \begin{aligned} \sigma^*(x_i)= \frac{score(x_i)}{norm(x)} {} \end{aligned} $$
(3)

where score(x_i) and norm(x) are scoring and normalization factors, respectively (for the original function, \(score(x_i) = e^{x_i}\) and \(norm(x) = \sum_{j}{e^{x_j}}\)).

As described in [9], we can list some desirable properties of the alternative softmax function as below:

  • Nonlinearity: for better selectivity of the scored values.

  • Numerical stability: to avoid overflow, or underflow during computation.

  • Positivity: all output values should be positive, so that they can be used for scoring.

  • Boundedness: output values should be bounded by some constant, ideally σ*(x_i) ∈ (0, 1).

  • Computational complexity: should be low enough to be feasible for implementation on a platform with limited HW resources.

Since we consider softmax approximation for inference in computer vision applications, where the softmax layer is mostly used for scoring the outputs for classification, the requirements on the normalization factor norm(x) can be relaxed compared with the original formula in Eq. (1), where inputs are mapped to the corresponding probabilities. Thus, a wider variety of normalization factors can be used in σ*(x_i).

We have experimented with different approximations for the score(x_i) and norm(x) factors, and summarize some of the methods and their properties in Table 1.

Table 1 Softmax approximation methods and their properties

First, we started with simple approximations that ignore the normalization factor entirely, i.e., norm(x) = 1. This gives the identity function (σ*(x_i) = x_i, method 1 in Table 1) and the natural exponential function (\(\sigma ^*(x_i)=e^{x_i}\), method 2 in Table 1). However, despite their low computational complexity, their numerical stability was poor because the output of the function is not bounded. To counter this issue, we applied several normalization factors (see methods 3 to 5 in Table 1), but numerical stability was still poor. At the same time, we noticed that method 2 (exponentiation) shows good selectivity due to its nonlinearity (refer to the corresponding image in Table 3 in the Appendix) and can be a good candidate if normalized appropriately. For this purpose we performed the transformations shown in Eq. (4) below:

$$\displaystyle \begin{aligned} \frac{e^{x_i}}{max{(e^{x})}} = \frac{e^{x_i}}{e^{max(x)}} = e^{x_i - max(x)} = e^{-(max(x)-x_i)} = \frac{1}{e^{max(x)-x_i}} {} \end{aligned} $$
(4)

First, we kept \(e^{x_i}\) as the scoring factor score(x_i) and used max(e^x) as the normalization factor to bound the output by 1, which yields method 6. However, in that case the input values to the exponential function e^x are all non-positive due to the x_i − max(x) term, and if e^x is implemented by a LUT (a common approach for HW with limited computational resources), an additional affine transformation is required to compensate for the negative input values. To avoid this drawback, we propose to apply max-normalization in the inverse way, \(x^*_i = max(x) - x_i\), so that the input values to e^x are all non-negative and can be used directly as indices into the LUT. Second, to compensate for the inverse max-normalization, we use the reciprocal of the exponential function, e^x → 1∕e^x, as shown in Eq. (4).

In this case neither a divider nor a multiplier is needed, and only a 1-D LUT is required to compute the approximated value of softmax. As a result, the computational complexity is reduced significantly, and the method becomes feasible for implementation in HW with limited computational power.

Thus, we propose to substitute the original softmax computation with inverse max-normalization combined with the reciprocal of the exponential function:

$$\displaystyle \begin{aligned} \sigma(x_i)= \frac{e^{x_i}}{\sum_{j}{e^{x_j}}} = \frac{e^{x_i-max(x)}}{\sum_{j}{e^{x_j-max(x)}}} \: \rightarrow \: \sigma^*(x_i)= \frac{1}{e^{max(x)-x_i}} {} \end{aligned} $$
(5)
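As a quick sanity check, here is a minimal NumPy sketch of Eq. (5) in floating point, before any LUT quantization (function and variable names are ours). Because e^x is monotone, the ordering of the outputs, and hence the argmax, matches the exact softmax:

```python
import numpy as np

def softmax_approx(x: np.ndarray) -> np.ndarray:
    """Proposed approximation, Eq. (5): sigma*(x_i) = 1 / e^(max(x) - x_i).

    Outputs lie in (0, 1]; the maximum element always maps to exactly 1,
    and no division by a data-dependent sum is needed.
    """
    d = np.max(x) - x        # inverse max-normalization, d >= 0
    return 1.0 / np.exp(d)   # equivalently np.exp(-d)

x = np.array([2.0, 0.5, -1.0])
print(softmax_approx(x))                              # [1.0, ~0.223, ~0.050]
print(np.argmax(softmax_approx(x)) == np.argmax(x))   # True: ranking is preserved
```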

The properties of the proposed method \(\frac {1}{e^{max(x)-x_i}}\) are as below:

  • Nonlinearity is satisfied by the reciprocal of the exponential function 1∕e^x: \(\frac {1}{e^{\alpha x}} \neq \alpha \frac {1}{e^{x}}\)

  • Numerical stability is satisfied by the max(x) − x_i term: \((max(x)-x_i) \in [0, max(x)-min(x)] \: \rightarrow \: \frac {1}{e^{max(x) - x_i}} \in (0, 1]\)

  • Positive output values are guaranteed by the exponential function: \(\frac {1}{e^{x}} > 0 \: \forall x \in (-\infty , +\infty )\)

  • Boundedness, σ*(x_i) ∈ (0, 1], follows from the inverse normalization term max(x) − x_i used together with the reciprocal of the exponential function 1∕e^x: \((max(x)-x_i) \geq 0\: \forall x \: \rightarrow \: \frac {1}{e^{max(x) - x_i}} \in (0, 1]\)

  • Computational complexity is low, as 1∕e^x can be implemented with a LUT-based method, where the size of the LUT is small.

Indices into the LUT can be calculated directly by a rounding operation as i = ⌊max(x) − x_i⌉. When the input data are quantized with w bits, the efficient quantization boundary x_q can be defined as

$$\displaystyle \begin{aligned} e^{-x_q} &= \frac{1}{2^w-1} \\ ln(e^{-x_q}) &= ln(\frac{1}{2^w-1}) \\ -x_q &= ln(1) - ln(2^w-1) \\ x_q &= ln(2^w-1) \\ x_q &= \lceil ln(2^w-1) \rceil \end{aligned} $$
(6)
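For example, with w = 8 bits, x_q = ⌈ln(2^8 − 1)⌉ = ⌈ln 255⌉ = ⌈5.54⌉ = 6, so LUT entries are needed only for indices 0 through x_q + 1 = 7.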

Content of LUT is computed as shown below:

$$\displaystyle \begin{aligned} LUT_{1/e} [i] = \bigg \lfloor \frac{1}{e^i} \cdot (2^w-1) \bigg \rceil , \forall i = 0, 1, \ldots , x_q + 1 {} \end{aligned} $$
(7)

Note that LUT[i] = 0 for all i > x_q due to quantization, as no nonzero value can be encoded with w bits beyond the efficient quantization boundary x_q.

If the selectivity (precision of computation) is not sufficient, the LUT can be scaled linearly by a factor α as

$$\displaystyle \begin{aligned} LUT_{1/e} [i] = \bigg \lfloor \frac{1}{e^{i/\alpha}} \cdot (2^w-1) \bigg \rceil , \forall i = 0, 1, 2, \ldots , \alpha \cdot (x_q+1) {} \end{aligned} $$
(8)
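Below is a minimal NumPy sketch of the LUT construction in Eqs. (7) and (8) and of the resulting lookup-only evaluation. The index formula used with the scaling factor α, i = ⌊α · (max(x) − x_i)⌉, is our reading of Eq. (8) rather than something stated explicitly above:

```python
import numpy as np

def build_lut(w: int, alpha: int = 1) -> np.ndarray:
    """LUT_{1/e}[i] = round((2^w - 1) / e^(i/alpha)), Eqs. (7)-(8)."""
    x_q = int(np.ceil(np.log(2 ** w - 1)))            # efficient quantization boundary, Eq. (6)
    idx = np.arange(alpha * (x_q + 1) + 1)            # i = 0, 1, ..., alpha * (x_q + 1)
    return np.rint((2 ** w - 1) / np.exp(idx / alpha)).astype(int)

def softmax_lut(x: np.ndarray, lut: np.ndarray, alpha: int = 1) -> np.ndarray:
    """Approximate softmax scores using only table lookups (no divider/multiplier)."""
    i = np.rint(alpha * (np.max(x) - x)).astype(int)  # i = round(alpha * (max(x) - x_i))
    i = np.minimum(i, len(lut) - 1)                   # indices past the boundary map to the last (zero) entry
    return lut[i]

lut8 = build_lut(w=8)                # [255, 94, 35, 13, 5, 2, 1, 0], 8 entries
x = np.array([2.0, 0.5, -1.0])
print(softmax_lut(x, lut8))          # quantized scores; the maximum element scores 255
```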

3 Experimental Validation

To validate the proposed method we used a pre-trained Unet model for human segmentation and an internally prepared dataset of 1000 images. In this model, softmax is used to predict the class of every pixel, which requires 307,200 computations for a typical 640 × 480 image. The model takes an image as input and produces a predicted grey-scale segmentation image, where each pixel P_i is in uint8 precision (with values from 0 to 255). To get the binary segmentation mask, every pixel in those images was binarized into two classes (class 0 for “background,” and class 1 for “human”) using the threshold thr = 127, as follows:

$$\displaystyle \begin{aligned} B^m_i = \begin{cases} 1 \ (\text{``human''}), & P_i > thr \\ 0 \ (\text{``background''}), & P_i \leq thr \end{cases} {} \end{aligned} $$
(9)

In the model, we substituted the conventional softmax layer with the computation method described above in Sect. 2. For the practical implementation, we selected three different precisions (uint8, uint4, uint2) and prepared the LUTs according to Eq. (7). To evaluate segmentation accuracy, we used the well-known bit-wise intersection-over-union metric [13, 20] as shown below:

$$\displaystyle \begin{aligned} IoU = \frac{area(B^m_{gt, i} \bigcap B^m_{p, i})}{area(B^m_{gt, i} \bigcup B^m_{p, i})} {} \end{aligned} $$
(10)

where \(B^m_{gt, i}\) is a pixel group of the ground-truth image, and \(B^m_{p, i}\) is that of the predicted segmentation image. The mIoU value was computed as the mean over the two classes.
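For reference, here is a minimal sketch of the evaluation pipeline as we understand it: threshold binarization per Eq. (9) followed by the bit-wise IoU of Eq. (10), averaged over the two classes. The helper names and the exact per-class averaging are our assumptions:

```python
import numpy as np

def binarize(img: np.ndarray, thr: int = 127) -> np.ndarray:
    """Eq. (9): class 1 ("human") if the pixel value exceeds thr, else class 0 ("background")."""
    return (img > thr).astype(np.uint8)

def iou(mask_gt: np.ndarray, mask_pred: np.ndarray) -> float:
    """Bit-wise intersection-over-union for a single class, Eq. (10)."""
    inter = np.logical_and(mask_gt, mask_pred).sum()
    union = np.logical_or(mask_gt, mask_pred).sum()
    return 1.0 if union == 0 else float(inter) / float(union)

def mean_iou(gt: np.ndarray, pred: np.ndarray) -> float:
    """mIoU: mean of the per-class IoU for 'human' (1) and 'background' (0)."""
    gt_b, pred_b = binarize(gt), binarize(pred)
    iou_human = iou(gt_b == 1, pred_b == 1)
    iou_background = iou(gt_b == 0, pred_b == 0)
    return (iou_human + iou_background) / 2.0
```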

Table 2 shows the results of our experiments for the different approximation methods and the selected precisions (for the LUT-based method). As the table shows, the accuracy of the human segmentation task based on the proposed softmax approximation is as high as the FP32 reference. There is no accuracy drop, or a negligibly small one (< 0.1% for 2-bit quantization), even for a very small LUT size (3 to 8 bytes).

Table 2 Accuracy of different approximation methods. Full test over 1000 images
Table 3 Accuracy of different approximation methods. Initial test over 100 images

4 Future Work

Despite its extremely low computational complexity, the current version of the softmax approximation can only be applied to applications where the softmax layer is used for scoring (typically the last layer in CNN models) and is computed within a single input tensor. Thus, it cannot be directly applied to more complicated and softmax-intensive applications such as Natural Language Processing (NLP) tasks, where cross-tensor probabilities must be computed more often (e.g., the multi-head attention block in the Transformer [16] and BERT [2] models). Therefore, our future work will extend the proposed method to other classes of AI applications.

5 Conclusion

In this paper we have proposed an efficient method for softmax approximation that can be implemented on platforms with limited hardware resources (mobile, IoT, edge devices) for AI inference tasks. We apply max-normalization to the input data in the inverse way, which, together with a LUT-based computation of the reciprocal exponential function 1∕e^x, significantly reduces the complexity of softmax layer computation. It also offers the following additional benefits:

  • does not require any additional multiplier, divider, or adder.

  • scalable in terms of accuracy and precision (appropriate LUTs can be pre-computed off-line).

  • fixed latency of computation, which depends only on the size of the tensor.

Thus, the proposed approach provides a good alternative for HW accelerator design, simplifying the overall process of computing the softmax layer while maintaining accuracy on par with conventional FP32-based computation.