1 Introduction

Chip density is no longer doubling every two years, having lost the fast-growing pace it once held. Accordingly, the growth in processing power of general-purpose central processing units (CPUs) cannot keep up with the continuously increasing demand for high-performance computing. Moreover, wire delay lengthens as more transistors are integrated into a processing core, prolonging memory latency and further reducing CPU performance [3]. Several strategies have been proposed to address this issue, such as adding more processor cores or developing heterogeneous computing architectures (HeCAs). Of these two, the latter appears more promising because it embodies the innovative idea of combining different kinds of processing elements.

Figure 1 illustrates a conventional homogeneous computing architecture (HoCA) and a simple HeCA. The HeCA comprises CPUs, graphics processing units (GPUs), and field-programmable gate arrays (FPGAs), whereas the HoCA consists only of CPUs. Therefore, the HeCA can offload complex functions, such as deep neural network inference and image filtering, onto its GPUs and FPGAs, respectively, reserving its CPUs for critical tasks. The smart network interface card (SmartNIC) [15] is an excellent example of how HeCAs can be utilized in practice. This device is a successor to the traditional NIC and is equipped with FPGA accelerators to support hardware offloading. As a result, it can free up the system resources allocated to complex tasks such as match-action, tunneling, and load balancing, thereby assisting the CPU in handling heavy network traffic of up to 40 Gbps per NIC port. Additionally, SmartNICs often come with high-level programming abstractions for their FPGA engines to provide users with great flexibility. Instead of offloading only a fixed set of functions, users can customize the SmartNIC's engines to handle their own functions, such as pattern matching operations.

Fig. 1
figure 1

Illustration of two common computing architectures. a Homogeneous computing architecture (HoCA). b Heterogeneous computing architecture (HeCA). In HoCAs, the system’s processing cores are responsible for handling every task. Conversely, HeCAs enable offloading potentially complex functions onto specialized accelerators, thereby conserving the computing power of processing cores

Alongside the development of HeCAs and hardware offloading, the demand for software-to-hardware conversion has attracted growing interest. The conversion is particularly challenging for 2-D spatial image filters, which repeat a particular operation across the entire image in a raster-scan order. For example, the median filter, a robust means of suppressing impulse noise, is computationally intensive and has motivated a series of recent studies [4, 14, 17, 20] on its hardware implementation.

In this sense, it is also necessary to facilitate the hardware offloading of the entropy filter, a fundamental operation in computer vision that imposes a heavy computational burden on the system’s CPU. Hence, we propose a new method for local entropy calculation and present a corresponding hardware accelerator. In the remainder of this paper, we first introduce the related work on hardware offloading of entropy calculation. After that, we present the entropy concept widely adopted in computer vision and discuss a few of its applications. We then describe the proposed method and the hardware architecture. Finally, we provide and analyze implementation results, concluding the paper with our findings.

2 Related work

The adoption of hardware offloading techniques is prevalent in computer vision and network traffic monitoring, two areas demanding high-performance computing capabilities. Modern imaging systems capture high-resolution images (up to 4K or 8K), creating a challenge to process large volumes of data in a limited time frame. Similarly, network systems handle millions of packets per second, requiring real-time processing to avoid bottlenecks. Moreover, real-time applications in computer vision, such as autonomous driving and robotic-assisted surgery, necessitate immediate feedback due to their life-critical nature. Likewise, many network applications, including VoIP, video streaming, and online gaming, are latency-sensitive, as delays can significantly degrade user experience. Consequently, high-performance computing platforms like FPGAs are well-suited for these research areas.

In the field of network traffic monitoring, entropy is commonly used to analyze traffic. By computing the entropy of a packet sequence over a specific time interval, traffic anomalies such as DDoS attacks can be detected. Hardware accelerators are typically employed for entropy calculation due to the high throughput requirements of high-speed links. For example, FlexSkethMon [21], a framework for traffic monitoring based on estimated network flow entropy, was implemented on a NetFPGA-SUME board, achieving a processing speed of 96 Gbps. Another entropy estimation algorithm, implemented on a NetFPGA-10G board, processed network traffic at 30 Gbps [13]. Recently, a hardware accelerator for entropy estimation using the top-k most frequent elements demonstrated a processing speed of 181 Gbps on a Zynq UltraScale+ ZCU102 FPGA board [19]. These studies underscore the extensive use of FPGAs for the computationally intensive task of entropy estimation.

Conversely, in computer vision, there is less research on hardware accelerators for entropy calculation, partly due to the complexity of spatial relationships between image pixels. In [12], an FPGA accelerator for a small binarized neural network was presented, utilizing frame entropy to devise an adaptive filter. This filter controls the number of frames the network processes in inference mode to enhance the processing rate. The entire system was synthesized using a high-level synthesis (HLS) tool, which mapped the entropy calculation and filtering operations to the processing system (PS) rather than the high-performance programmable logic (PL). While HLS tools can synthesize FPGA accelerators from high-level programming languages (such as C/C++ and Python), the resulting accelerators are not always optimal. This trend was also observed in many algorithms for entropy calculation presented in [5].

In this paper, we leverage pipeline parallelism to design an efficient architecture for an entropy filter and implement it using Verilog HDL [1] (IEEE Standard 1364-2005). This approach fully exploits the FPGA’s high-performance computing hardware available on the PL. Detailed implementation is presented in Sect. 5.

3 Preliminaries

3.1 Overview of entropy filter

Entropy is a concept adopted in diverse fields, and it is generally associated with a state of randomness or uncertainty. In computer vision, entropy is often considered in the context of information theory and is defined as follows:

$$\begin{aligned} h(X) = -\sum _{i=1}^{n} p(x_i)\log p(x_i), \end{aligned}$$
(1)

where h denotes the entropy value, X a discrete random variable taking values in the set \(\{x_1, x_2,\ldots , x_n\}\), and p the probability density function (PDF) of X. Depending on whether \(\textrm{log}(\cdot )\) denotes a binary or natural logarithm, h is measured in bits or nats (natural units of information) [9]. Hereinafter, we focus solely on the case of the binary logarithm. Also, in the context of image filtering, (1) is restricted to a sliding window of pixels; hence the name local entropy.
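To make the definition concrete, the following Python snippet (illustrative only; the accelerator itself is written in Verilog HDL) evaluates (1) for a flat list of window pixels, estimating each \(p(x_i)\) as a relative frequency:

```python
from collections import Counter
from math import log2

def local_entropy(window):
    """Entropy in bits of a list of pixel intensities, per Eq. (1).

    p(x_i) is estimated as the relative frequency of intensity x_i
    within the window; empty bins (p = 0) contribute nothing.
    """
    total = len(window)
    return sum(-(c / total) * log2(c / total)
               for c in Counter(window).values())

# A uniform window carries no information, while nine distinct
# intensities yield the maximum of log2(9) bits for a 3x3 window.
print(local_entropy([128] * 9))                  # 0.0
print(round(local_entropy(list(range(9))), 4))   # 3.1699
```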

Moreover, entropy is usually interpreted as a measure of information: the larger the entropy, the more informative the image content. Figure 2 shows a grayscale image and its corresponding entropy profile, calculated using a \(25\times 25\) sliding window. Image patches with rich texture are associated with high entropy values, whereas those with fewer details possess low entropy values. As a result, entropy can be used to characterize the texture information of a grayscale image. This property renders entropy highly relevant in many computer vision applications, as discussed in the following.

Fig. 2
figure 2

An example image and its corresponding entropy profile (calculated using a \(25\times 25\) sliding window). An upper patch with low texture information has an entropy value of 3.342, whereas a lower patch with high texture information has an entropy value of 6.665

3.2 Applications in computer vision

Image sharpening (or, equivalently, sharpness enhancement) is a fundamental low-level operation in digital image processing, in which an image and its scaled Laplacian are fused to accentuate sharpness. As digital images are represented by a finite discrete set of intensities, the fusion is prone to out-of-range and edge-ringing problems. Accordingly, image entropy was adopted as an attention map to control the fusion, lest the original image be over-enhanced, as demonstrated in an automatic sharpening algorithm [7] (owned by Apple Inc.).

Entropy also finds applications in the fast-expanding field of pattern recognition. For example, an image separation algorithm (owned by Qualcomm) [2] utilizes entropy to identify and remove unwanted regions (such as those associated with trees or sky) from the image, facilitating object recognition algorithms. In [6], entropy serves as one of twelve image features to form a feature space, in which Mahalanobis distance is adopted to predict the image’s perceptual visibility. Similarly, in [10], entropy is one of four image features fed to a Bayes classifier for detecting salient objects.

Given the diverse applications of entropy in computer vision, it is surprising that the literature has been silent on a computationally efficient implementation of the entropy filter. We, therefore, devise a novel method for calculating local entropy and present a corresponding pipelined architecture in the upcoming sections.

4 Proposed method

4.1 Overview

Given a grayscale image \(I\in \mathbb {R}^{M\times N}\) of size \(M\times N\), an entropy filter with a square window of size \(S\times S\) traverses the image in a raster-scan order to create a corresponding entropy profile \(H\in \mathbb {R}^{M\times N}\). At every pixel location, the filter calculates the entropy of the pixel neighborhood covered by the window and assigns that value to the corresponding location in the profile. If boundary padding is adopted, the profile has the same size as the input; otherwise, the height and width each shrink by \(\lfloor S/2 \rfloor\) pixels on both sides. An illustrative description of this mechanism can be found in Fig. 3.

Fig. 3
figure 3

Illustration of the filtering mechanism

Let \(\textbf{u}=\left[ u_0,u_1,\ldots ,u_{S^2-1}\right]\) be a vector containing all pixels covered by the window. Each pixel can take on an intensity value in the discrete set \(\{x_0,x_1,\ldots ,x_{n-1}\}\), with n depending on the number of bits used to represent the image. For example, \(n=256\) and \(x_0=0,x_1=1,\ldots ,x_{n-1}=255\) in a common 8-bit representation. The problem herein is to design an efficient means to calculate the local entropy while ensuring the compactness and efficacy of the corresponding hardware accelerator. In the following subsections, we first introduce a conventional method and discuss its limitations. Thereafter, we present our solution to remold the conventional way of calculating local entropy.

4.2 Conventional method

Given \(\textbf{u}\), its entropy value h can be calculated by (i) constructing a histogram to obtain the PDF and then (ii) applying (1), as shown in Algorithm 1. Let \(\textbf{p}=[p_0,p_1,\ldots ,p_{n-1}]\) denote the histogram of \(\textbf{u}\). We first initialize \(\textbf{p}\) with zeros and then iterate through all pixels in \(\textbf{u}\). At each iteration, we utilize the pixel intensity \(u_i\) as an index (\(j\leftarrow u_i\)) to increment the corresponding bin \(p_j\). When the for loop terminates, we obtain the histogram \(\textbf{p}\), which represents the number of occurrences of each pixel intensity. Therefore, in the second loop, we divide \(p_j\) by \(S^2\) (the total number of pixels in \(\textbf{u}\)) to get the density. To calculate the entropy value h, we also initialize it with zero and iterate through all bins in \(\textbf{p}\).
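For reference, the two loops of Algorithm 1 can be sketched in Python as follows (an illustrative rendering of the text above, not the original listing; the window contents come from the worked example in Sect. 4.3):

```python
from math import log2

def entropy_histogram(u, n=256):
    """Conventional local entropy: build a histogram, then apply Eq. (1)."""
    S2 = len(u)                 # S^2, the number of pixels in the window
    p = [0] * n                 # histogram, one bin per intensity
    for ui in u:                # (i) count occurrences, indexing by intensity
        p[ui] += 1
    h = 0.0
    for j in range(n):          # (ii) accumulate -p*log2(p) over all n bins
        pj = p[j] / S2          # normalize the count to a density
        if pj > 0:              # log2(0) is undefined; skip empty bins
            h -= pj * log2(pj)
    return h

u = [0, 51, 62, 33, 100, 0, 51, 100, 51]   # 3x3 window from Sect. 4.3
print(round(entropy_histogram(u), 3))      # 2.197
```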

The above-mentioned algorithm is deceptively simple, but an efficient hardware implementation is not straightforward. Firstly, rebuilding the histogram from scratch for each vector \(\textbf{u}\) returned by the filter as it traverses the image is highly inefficient. Secondly, the histogram size grows exponentially with the pixel's word length (the number of bits used to represent pixel intensities), leading to a significant consumption of hardware resources. For example, an 8-bit representation requires 256 registers (or memory locations) to hold the histogram bins, and this number doubles with every additional bit of word length. Consequently, in this case, the cost of filtering an image is \(\mathcal {O}(M\cdot N\cdot S^2)\) for histogram construction plus \(\mathcal {O}(M\cdot N\cdot n)\) for entropy calculation.

Regarding these two issues, the former can be resolved by exploiting the sliding mechanism: we update the histogram by simply decrementing/incrementing the bins corresponding to pixel intensities that leave/enter the kernel. As a result, the cost of histogram construction reduces to \(\mathcal {O}(M\cdot N\cdot 2\,S)\). The sole hardware accelerator [18] we found in the literature was designed using this approach.
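A minimal sketch of this sliding update is given below; a 1-D slide is shown for brevity (in the 2-D case, the S pixels of the leaving and entering columns play the same roles), and all names are illustrative:

```python
def slide_histogram(p, leaving, entering):
    """Update histogram p in place as the window slides one position:
    'leaving' holds the pixels that exit the kernel, 'entering' those
    that enter it. Cost per step is O(2S) instead of O(S^2)."""
    for x in leaving:
        p[x] -= 1
    for x in entering:
        p[x] += 1
    return p

# Toy 4-level image row, window of width 3 sliding right by one pixel.
row = [0, 1, 1, 2, 3]
p = [0, 0, 0, 0]
for x in row[0:3]:            # build the first window's histogram once
    p[x] += 1
print(p)                      # [1, 2, 0, 0]
slide_histogram(p, leaving=row[0:1], entering=row[3:4])
print(p)                      # [0, 2, 1, 0] -> window is now [1, 1, 2]
```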

However, resolving the latter issue without remolding the entropy calculation to eliminate the histogram construction remains challenging. We will present our efforts in addressing this in the following subsection.

figure a

Conventional entropy calculation

4.3 Our proposed solution

We observe that it is possible to obtain the number of occurrences of pixel intensities without constructing a histogram. By sorting the pixels and then taking adjacent differences, we can determine how many pixel intensities there are and how many times each of them occurs. Let us consider a \(3\times 3\) window, and assume that \(\textbf{u}=[0,51,62,33,100,0,51,100,51]\) is the vector of pixels covered underneath. The result obtained after sorting and taking adjacent differences is \(\textbf{d}=[0,33,18,0,0,11,38,0]\). We can then easily determine the number of pixel intensities by counting the total transitions from “zero \(\rightarrow\) non-zero” and “non-zero \(\rightarrow\) non-zero.” In our example, there are four of those transitions (\(0\rightarrow 33\), \(33\rightarrow 18\), \(0\rightarrow 11\), and \(11\rightarrow 38\)), signifying that there are five different pixel intensities (0, 33, 51, 62, and 100).

Also, the first zero in \(\textbf{d}\) means that a particular pixel intensity occurs twice. The following non-zero (33) signifies the occurrence of a different intensity which appears once because of the next non-zero (18). This non-zero is followed by two zeros, signifying a three-time occurrence of another intensity. Similarly, the remaining patterns of \(\textbf{d}\) (\(11\rightarrow 38\rightarrow 0\)) signify a sole occurrence followed by a two-time occurrence of two different intensities, respectively. Hence, we obtain the histogram \(\textbf{p}=[2,1,3,1,2]\) without explicitly constructing it. With this information, we can easily calculate the entropy value.
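The observation can be verified in a few lines of Python; the run lengths recovered from the sorted vector are exactly the occurrence counts \(\textbf{p}=[2,1,3,1,2]\) discussed above:

```python
def runs_from_sorted(u):
    """Sort u, then derive the occurrence count of each distinct
    intensity from the positions where adjacent sorted values differ."""
    v = sorted(u)
    runs, length = [], 1
    for a, b in zip(v, v[1:]):
        if a == b:            # adjacent difference is zero: same run
            length += 1
        else:                 # non-zero difference: a new intensity starts
            runs.append(length)
            length = 1
    runs.append(length)       # close the final run
    return runs

u = [0, 51, 62, 33, 100, 0, 51, 100, 51]
d = [b - a for a, b in zip(sorted(u), sorted(u)[1:])]
print(d)                    # [0, 33, 18, 0, 0, 11, 38, 0]
print(runs_from_sorted(u))  # [2, 1, 3, 1, 2] -- the histogram p from the text
```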

In Algorithm 2, we present a new method for entropy calculation that facilitates the implementation of a pipelined hardware accelerator, which will be discussed later in Sect. 5. For ease of description, we adopt two auxiliary functions: \(\textrm{Sort}(\textbf{u})\), which sorts the elements of \(\textbf{u}\) in ascending order, and the comparison function \(\delta (a,b)\), which returns zero if \(a=b\) and one otherwise (i.e., the complement of the Kronecker delta).

In the first step, we declare a vector \(\textbf{v}\) to hold the sorted elements of \(\textbf{u}\). After that, we apply \(\delta (\cdot ,\cdot )\) to every adjacent pair in \(\textbf{v}\) and store the \((S^2-1)\)-element result in \(\textbf{d}\). Each element of this vector is either zero or one and thus requires only a single bit for representation. Consequently, the vector \(\textbf{d}\) can be expressed compactly using \((S^2-1)\) bits, which is highly beneficial in the hardware implementation phase.

In the last step, the pipelined entropy calculation, we first prepare a look-up table (LUT) holding the value of each term of the summation in (1). Theoretically, an \(S^2\)-entry LUT is sufficient for an \(S\times S\) entropy filter. However, we utilize an \((S^2+1)\)-entry LUT whose zero-indexed entry equals zero to retain the simplicity of our method. As shown in Algorithm 2, we use a variable called \(\textrm{addr}\) to retrieve values from the LUT. Since the logarithm of zero is undefined, special handling is required when \(\textrm{addr}\) becomes zero. Instead of using multiplexers to resolve the case \(\textrm{addr}=0\) when calculating the entropy, we add an entry \(l_\textrm{addr}=0\) for \(\textrm{addr}=0\) to the LUT. This approach simplifies the design and saves hardware resources (Table 1).

figure b

Our proposed entropy calculation

Table 1 Step-by-step procedure for calculating entropy using the proposed method

After initializing the LUT, we iterate through \(\textbf{d}\) to calculate the entropy. We adopt two variables: \(\textrm{addr}\) to access the LUT’s entries and \(\textrm{cnt}\) to track the number of pixels that have been processed. In the first iteration (\(i=0\)), \(d_0\) is used to initialize three variables: h, \(\textrm{addr}\), and \(\textrm{cnt}\). The case of \(d_0\) being zero means that a particular pixel intensity occurs twice; thus, \(\textrm{addr}\leftarrow 2\) and \(\textrm{cnt}\leftarrow 2\), whereas h remains zero. Otherwise, that pixel intensity only occurs once, and \(\textrm{addr}\leftarrow 1\), \(\textrm{cnt}\leftarrow 1\), and \(h\leftarrow l_1\) correspondingly. In subsequent iterations (\(1\le i\le S^2-2\)), if \(d_i\) is zero, h remains unchanged, and \(\textrm{addr}\) is updated to track the number of occurrences of the previous pixel intensity. If \(d_i\) is one, it means a new pixel intensity occurs (hence, the term \(l_1\)), and \(\textrm{addr}\) currently points to an LUT entry corresponding to the density value of the last pixel intensity (\(l_{\textrm{addr}}\)). Therefore, the term (\(l_{\textrm{addr}}+l_1\)) is added to h to update the entropy value. During that course of calculation, \(\textrm{cnt}\) is also updated correspondingly. Finally, in the last iteration (\(i=S^2-1\)), \(\textrm{cnt}\) is used to compensate for the entropy value.

To facilitate the understanding of our proposed method, we present a step-by-step procedure for calculating entropy using Algorithm 2 in Table 1. The normal case corresponds to the example discussed at the beginning of this subsection, while the extreme case addresses the scenario where all input values are equal.
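Algorithm 2 interleaves these steps bit by bit to suit a pipeline. The sketch below is functionally equivalent (it reproduces the LUT construction, including the zero-valued entry at index 0, and the per-run accumulation) but does not mimic the exact \(\textrm{addr}\)/\(\textrm{cnt}\) register bookkeeping; all names are illustrative:

```python
from math import log2

def entropy_lut(u):
    """LUT-based local entropy: sorting, adjacent comparison, and a
    precomputed term per run length. Functionally equivalent to the
    proposed method, though not stage-accurate to Algorithm 2."""
    S2 = len(u)
    # (S^2+1)-entry LUT: lut[k] is the Eq. (1) term for a k-time
    # occurrence; lut[0] = 0 stands in for the undefined log2(0).
    lut = [0.0] + [-(k / S2) * log2(k / S2) for k in range(1, S2 + 1)]
    v = sorted(u)
    h, run = 0.0, 1
    for a, b in zip(v, v[1:]):
        if a == b:
            run += 1          # zero adjacent difference: run continues
        else:
            h += lut[run]     # run closed: add its precomputed term
            run = 1
    return h + lut[run]       # close the final run

print(round(entropy_lut([0, 51, 62, 33, 100, 0, 51, 100, 51]), 3))  # 2.197
print(entropy_lut([7] * 9))   # 0.0  (extreme case: all pixels equal)
```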

As our proposed method is hardware-design-oriented, it is not easy to compare with the conventional method from a software perspective. For example, if we utilize Batcher’s parallel sorting algorithm [11] as the \(\textrm{Sort}(\cdot )\) function, the total cost is \(\mathcal {O}\left( M\cdot N\cdot \left( \textrm{log}S^2\right) ^2\right)\) for sorting plus \(\mathcal {O}(M\cdot N\cdot S^2)\) for accumulation. Unlike the conventional method, ours does not depend on n; however, it is still challenging to determine which one possesses a lower cost. Therefore, we present a pipelined hardware architecture in the next section and compare it with a conventional design [18], whose details are available in Appendix 8.
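To give a feel for the two cost expressions, they can be evaluated for hypothetical but typical parameters (a 1080p 8-bit image, \(S=5\)). The naive operation count below favors our method for small S and large n, but it ignores the constant factors hidden by the \(\mathcal {O}\)-notation, which is precisely why a hardware comparison is still required:

```python
from math import ceil, log2

M, N, S, n = 1080, 1920, 5, 256   # illustrative image and filter sizes
S2 = S * S

# Conventional method with the sliding-histogram optimization:
# O(2S) histogram update plus O(n) bin scan per pixel.
conventional = M * N * (2 * S + n)

# Proposed method: O((log2 S^2)^2) sorting-network depth plus
# O(S^2) run accumulation per pixel.
proposed = M * N * (ceil(log2(S2)) ** 2 + S2)

print(conventional)   # 551577600
print(proposed)       # 103680000
```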

5 Pipelined architecture

Figure 4 illustrates the overall block diagram of the proposed hardware accelerator, in which the first two blocks (line memories and register banks) are common in 2-D spatial image filters. They help handle the input stream to obtain pixels covered underneath the window. Hence, an \(S\times S\) window requires \((S-1)\) line memories (each can store an image line, i.e., N pixels) and \((S-1)\) register banks (each consists of S registers).

Fig. 4
figure 4

Block diagram of the proposed hardware architecture

Fig. 5
figure 5

Block diagram of the pipelined entropy calculation

The following three blocks (sorting network, adjacent difference calculation, and pipelined entropy calculation) realize our proposed method. Concerning the first, we adopt the optimized merging-sorting network [16] for a fast and compact implementation. After that, each consecutive pair of \(\textbf{v}\) is compared using a bitwise exclusive OR followed by a reduction OR gate; the one-bit result is one when the pair differs and zero otherwise, exactly the adjacent-comparison flag required by Algorithm 2. Given the vector \(\textbf{d}\), we can calculate the entropy with the pipelined architecture in Fig. 5.

This architecture implements the for loop in section (iii.b) of Algorithm 2, where results after each iteration are stored in pipelined registers. For example, in the first iteration, we utilize the first bit \(d_0\) of \(\textbf{d}\) to initialize three variables \(\textrm{addr}\), \(\textrm{cnt}\), and h. Accordingly, in the first stage of the pipelined architecture, \(d_0\) controls three multiplexers to initialize the three corresponding registers \(\mathrm {addr_0}\), \(\mathrm {cnt_0}\), and \(\mathrm {ht_0}\). Meanwhile, we use another register (\(\textbf{d}_1^{S^2-2}\)) to store the remaining bits of \(\textbf{d}\), i.e., \(\textbf{d}_1^{S^2-2}=\left[ d_1,d_2,\ldots ,d_{S^2-2}\right]\). In the following stages, which correspond to iterations from the second to the \((S^2-1)\)-th, we extract every bit of \(\textbf{d}\) to update the intermediate results stored in \(\mathrm {addr_i}\), \(\mathrm {cnt_i}\), and \(\mathrm {ht_i}\) registers, with \(1\le i\le S^2-2\). Finally, we carry out the calculation in the last iteration to obtain the entropy value in the h register.

As mentioned earlier in Sect. 2, we do not utilize the HLS tool due to its suboptimal performance. Instead, we design the proposed hardware accelerator using Verilog HDL [1] (IEEE Standard 1364-2005) and use Xilinx Vivado v2019.1 to obtain the implementation results, which are presented in the next section. We also perform a quantitative comparison with a conventional design in [18]. The target device for both designs is an XCZU7EV-2FFVC1156 MPSoC [22] on a Zynq UltraScale+ MPSoC ZCU106 Evaluation Kit, equipped with 460,800 registers, 230,400 LUTs, and 312 block RAMs.

6 Implementation results

Table 2 summarizes the implementation results of two hardware accelerators (the conventional one in [18] and ours) of a \(5\times 5\) entropy filter. Concerning memory usage, the two designs consume the same number of block RAMs because both need the first two blocks (line memories and register banks) to obtain \(S^2=25\) image pixels. However, our design is significantly smaller than the conventional one. It requires only 7849 registers and 5748 LUTs, occupying about \(1.70\%\) and \(2.49\%\) of the available hardware resources, respectively. Compared with the conventional design, we achieve approximately \(2.4\times\) and \(2.9\times\) reductions in register and LUT consumption. This reduction in resource utilization in turn yields a corresponding (\(1.1\times\)) drop in power consumption.

Table 2 Implementation results of the entropy filter’s hardware accelerators

Finally, considering the maximum frequency, our design is slightly faster than the conventional one. Notably, the ability to process 764.526 megapixels per second far exceeds the requirements of computer vision edge devices.

7 Conclusions

In this paper, we reviewed the growing trend toward hardware offloading to increase computing power and pointed out that the literature lacks an efficient implementation of the entropy filter. We argued that the conventional entropy calculation is not hardware-friendly and proposed a hardware-oriented alternative, remolding the conventional approach into a novel three-step method of sorting, adjacent difference calculation, and pipelined entropy calculation. We then presented a corresponding hardware accelerator and demonstrated its advantages over the conventional design. Implementation results on a Zynq UltraScale+ XCZU7EV-2FFVC1156 MPSoC device showed \(2.4\times\) and \(2.9\times\) reductions in register and LUT utilization, respectively, and a \(1.1\times\) reduction in power consumption. With this compactness and efficiency, the proposed architecture still delivers a high throughput of 764.526 megapixels per second, rendering it highly beneficial for computer vision edge devices.