Keywords

1 Introduction

Computational intelligence methods (see e.g. [24, 610, 1315, 17, 2932, 34, 3840, 42, 4854, 6064]) offer suitable properties for modeling the nonlinear dynamics of various types of real objects. A different types of neural networks (see e.g. [16, 55]) or fuzzy systems (see e.g. [1827, 41, 43, 5659, 6874]) have a number of useful features such as the ability to approximate any continuous non-linearity or the ability to interpret the accumulated knowledge. However, from a practical point of view, the other features are also important. For example, the ability to implement in a hardware (e.g. FPGA) to obtain the operation model working in a real time. Moreover, the implementation should be relatively simple and economically justified.

In recent years, a large number of projects have used FPGAs to perform the control and modeling of dynamical systems. In many cases, these projects utilize neural networks [5], fuzzy systems or neuro-fuzzy systems [11, 33]. However, in some cases, the degree of complexity of used algorithms is very high and the economy of this solution is questionable.

This is due to the fact that these algorithms are mainly based on arithmetic operations for floating point numbers. In particular floating-point operations such as divide numbers [37], exponential and trigonometric functions are characterized as they have the high complexity and low performance when they are implemented on FPGA devices. FPGA hardware resources are not adapted to the efficient implementation of this type of calculations. FPGAs are well suited for the implementation of fixed-point calculations, such as addition, subtraction and multiplication. Implementation of complex arithmetic operations based on the floating-point numbers consumes a lot of resources of the FPGA hardware. For this reason, in most cases fully parallel implementation of floating-point calculations becomes economically unjustified.

In order to reduce the high consumption of resources a serial or semi-parallel data processing algorithms are used, including the recursive implementations [37]. In this case, the demand for hardware resources significantly decreases. However, computing efficiency drops significantly - which is an obvious drawback of such a solution. It should be noted that in some cases this approach is highly justified. For example, consider the control system whose duty cycle is limited by the limit frequency of operation of actuator, for example about 20 kHz. In this case, the hardware implementation of the complete control algorithm working with the cycle less than 50 \({\upmu }\)s is pointless, because the generated data are not used earlier than the mentioned time 50 \({\upmu }\)s elapses. It should be noted that there are a number of applications which drew significant benefits if the processing time is as short as possible. Examples are hardware emulators of various types of real objects used for hardware-in-the-loop (HIL) systems.

As noted in the work [46] there are existing commercial digital real-time HIL simulators that are characterized by 50 \({\upmu }\)s to 100 \(\upmu \)s time steps and computational latency, and therefore they are not able to simulate accurately the very fast dynamics of power electronics systems. The authors suggest that simulation with a time step with value of 1 \(\upmu \) or less is much more appropriate solution. In order to obtain high processing speeds various techniques are used. They cover both the structure of the implemented algorithm and methods of their implementation.

The vast majority of practical implementation on FPGA widely use triangular or trapezoidal fuzzy sets. Such sets are easier to be realized in FPGA than the ones based on a Gaussian functions [1, 47]. While many theoretically developed algorithms are based on Gaussian fuzzy sets, which sometimes are considered to be more appropriate to represent fuzzy knowledge. Moreover, if the input variables are represented by complementary membership functions of the fuzzy sets it follows another benefit, namely processing technique is applied only for activated rules. How was indicated in the paper [47], elimination of verification of the activation degree of all fuzzy rules allows to accelerate inference process. One of the possible techniques used in this field is the odd-even method [28].

The results presented in various papers show that in many cases relatively high processing speed is achieved, however, at the expense of low resolution of processed signals. This is due to the applied binary encoding using an average of 6 to 8 bits. Unfortunately, the specificity of many proposed solutions is that increasing resolution of processed words, eg. to 12-bits, causes a significant increase in the consumption of hardware resources.

In this paper we propose a new method for the implementation of fuzzy system on FPGA. This method offers good performance and accuracy with relatively low consumption of hardware resources.

This paper is organized into 4 sections. Section 2 contains an idea of designing the neuro-fuzzy structure to the limitations arising from the implementation in hardware in FPGA devices. Implementation results are presented in Sect. 3. Conclusions are drawn in Sect. 4.

2 The Method of Designing the Fuzzy Structure to the Limitations Arising from the Hardware Implementation

In this work will be considered systems using fuzzy rules of the following form

$$\begin{aligned} \mathbf{IF }~ (x_{1} ~is~ {\overline{x_{1}^{j}}}) ~\mathbf{AND }~ ...~ \mathbf{AND } ~(x_{N} ~is~ {\overline{x_{N}}})~ \mathbf{THEN } (y ~is~ {\overline{y}}), \end{aligned}$$

where \(x_i\) indicates the input to the system (\(i=1..N\)), y is the output, \( \overline{x}_i^{j}\) are input fuzzy sets for the j-th fuzzy rule (\(j=1..M\)) and \(\overline{y}^j\) are output fuzzy sets. In the considered systems Gaussian input fuzzy sets are used. The algebraic product is used as a T-norm operator. Rules consequents are a singleton type and the method of centre of gravity for singletons (COGS) is used for defuzzification. For the sake of clarity of description we will present a system with one output. It should be noted that such simplification does not constitute the loss of generality for the general idea presented in this paper.

According to the theory of fuzzy logic and common practice the implementation of fuzzy system is followed in three stages: 1. fuzzification, 2. inference, 3. defuzzification. However, because of the investigated class of fuzzy systems are functionally equivalent to a radial basis function networks [36] (it will be explained in detail in the later in the paper), in the current paper it is proposed a more appropriate method of hardware implementation.

The main features of the proposed method are: 1. operations are implemented in hardware based on fixed-point and simplified floating-point arithmetic, 2. fuzzification and inference is carried out together on the basis of functional similarity to radial basis function networks.

The proposed method is scalable and allows adjustment of the obtained processing speed and the use of hardware resources for a specific application. The next part of the work will present a detailed description of the proposed method of hardware implementation for the considered class of fuzzy structures.

2.1 The Method of Hardware Implementation of the Fuzzification and the Inference Processes

As pointed out in [66] and cited for this statement [35] the most important advantage of using fuzzy basis functions, rather than polynomials or radial basis functions, etc., is that a linguistic fuzzy IF-THEN rule is naturally related to a fuzzy basis function. It should be noted that the way of the design of the system and its implementation not necessarily have to be identical. The design method should be intuitive to the man, while the implementation should be characterized by high efficiency and low cost. Therefore, let’s look closer at the class of fuzzy systems presented in the previous section which are functionally equivalent to a radial basis function networks.

In the considered group of systems we assume that we are dealing with a Gaussian input fuzzy sets and every j-th rule uses of separate input fuzzy sets that are unshared with other rules. For each i-th input it exist a degree of membership to the i-th input fuzzy set of j-th fuzzy rule as follows:

$$\begin{aligned} { \mu _{i}^j= \exp \left( { - \left( \frac{ x_i - \overline{x}_i^{j} }{ \sigma _{i}^j } \right) ^2 } \right) , } \end{aligned}$$
(1)

where \( \overline{x}_i^{j}\) and \(\sigma _{i}^j\) are center and width of input fuzzy set. If we use the product as the T-norm, then the degree of activity of the j-th rule is

$$\begin{aligned} { {\mu ^{j}} = {\mu _{1}^j} \cdot {\mu _{2}^j} \cdot ... \cdot {\mu _{N}^j} = exp \left( - {{ \left( {\frac{ { x_{1} - \overline{x} _1^{j} } }{ \sigma _{1}^j } } \right) }^2 } - ... - {{ \left( {\frac{ { x_{n} - \overline{x} _n^{j} } }{ \sigma _{N}^j } } \right) }^2 } \right) }. \end{aligned}$$
(2)

Note that the action outlined above is similar to the way of calculating the values of the radial basis function of the following form

$$\begin{aligned} { \mu ^j= \exp { \left( - { { \Vert \mathbf {x} - \overline{\mathbf {x}}^{j} \Vert }^2 } \right) }, } \end{aligned}$$
(3)

in which the distance of the input \(\mathbf {x}\) from the center of the radial fuzzy set \(\mathbf {\overline{x}^j}\) for j-th rule is defined as follows

$$\begin{aligned} { { \Vert \mathbf {x} - \mathbf {\overline{x}}^{j} \Vert } = \sqrt{ {{ \left( {\frac{ { x_1 - \overline{x}_1 } }{ \sigma _{1}^j } } \right) }^2 } + ... + {{ \left( {\frac{ { x_N - \overline{x}_N^{j} } }{ \sigma _{N}^j } } \right) }^2 } } }. \end{aligned}$$
(4)

This phenomenon has been observed and described in the work [36] as a functional equivalence between radial basis function networks and fuzzy inference systems. The specific form, for which each of the inputs has individually defined width \(\sigma _i^j\) of the set is called Hyper Radial Basis Function (HRBF) [45]. In the later part of this work the term fuzzy system (FS) refers to the category of systems that are functionally equivalent to a radial basis function networks.

The Eq. (4) can be rewritten as

$$\begin{aligned} { { \Vert \mathbf {x} - \mathbf {\overline{x}}^{j} \Vert }^2 = w_0^j + { w_{1,1}^j x_1 } + { w_{1,2}^j \left( x_1 \right) ^2 } + ... + { w_{N,1}^j x_N } + { w_{N,2}^j \left( x_N \right) ^2 } }, \end{aligned}$$
(5)

where

$$\begin{aligned} { w_0^j = { \left( \frac{ \overline{x}_1^{j}}{\sigma _{1}^j} \right) ^2 } +...+ { \left( \frac{ \overline{x}_N^{j}}{\sigma _{N}^j} \right) ^2 } }; {w_{i,1}^j= -\frac{ 2 \overline{x}_i^{j}}{\sigma _{i}^j}}; {w_{i,2}^j= \left( \frac{ 1}{\sigma _{i}^j} \right) ^2}. \end{aligned}$$
(6)

The approach represented by formulas (3) and (5) offers noticeable benefits for the implementation, namely:

  1. 1.

    The Gaussian function is determined only once, in contrast to the classical approach, demanding the use of this function to determine the degree of membership of each input separately (1). The proposed solution is therefore beneficial because the Gaussian function is a troublesome operation in the implementation on FPGA.

  2. 2.

    All other actions necessary to determine the degree of activity of fuzzy rule are based on repetitive and simple activities such as multiplication and addition. These actions are easy to implement on FPGA, they are carried at high speed and consume relatively small hardware resources.

2.2 The Method of Implementation of Defuzzification Process

As it was mentioned earlier in the paper, the singleton membership functions with centers of \({\overline{y}^{j}}\) are used on the outputs of the rules. In the defuzzification stage the centre of gravity for singletons (7) is used because of the following features of the method: defuzzified values tend to move smoothly, have good sensitivity to change on inputs and are easy to calculate. According to the paper [12] the centre of gravity for singletons (COGSs) is the most realistic and widely used method of defuzzification in many applications.

$$\begin{aligned} y = \frac{ \sum \limits _{j = 1}^M { \mu ^j \cdot \overline{y}^{j} }}{\sum \limits _{j = 1}^M { \mu ^j }} = \frac{n}{d} \end{aligned}$$
(7)

However, from a practical point of view, it should be noted that this method is difficult to implement, because of used arithmetic division of real numbers. This operation can be avoided if fuzzy system is designed in such a way that the denominator in the formula (7) is equal to one, i.e. \({d=1}\). This approach is very comfortable and quite often used in practice. However, in some situations it may be regarded as too restrictive limitation. In the general case (eg. when using Gaussian input fuzzy sets) such a requirement is not met and the operation of real numbers division at the output is required, as shown in Eq. (7).

In many publications this issue was analyzed and various solutions have been proposed. For example, the paper [28] proposes the implementation of a division operation based on method of look-up-table (LUT) and addressing with the 6-bit word. Similarly, in the work [44] it was proposed division in which the divisor was rounded to the 8-bit number. As you can easily guess in both cases this resulted in a very low accuracy of the result.

In another work [37] the implementation of this operation on the basis of single precision floating point arithmetic was used. The result is a high accuracy but achieved at the expense of rather low performance. How it was indicated in the cited reference the obtained floating-point divider needs 26 clock cycles to establish division result. Others floating-point operations like multiplication, addition and subtraction take 5 clock cycles, while similar fixed-point operations takes only one cycle in a typical case. This indicates that the floating-point operations are much less efficient than a fixed-point ones in general. It is also important to note that the floating-point operations consume a lot of hardware resources.

In this paper it is proposed that the division operation required in (7) is performed as in some other works which uses fixed-point arithmetic. In such a case a multiplication by the inverse of the denominator is used instead of the division of two numbers. Determination of the inverse of the denominator is made on the basis of the method of look-up table (LUT). The disadvantage of such a method is that it is necessary to use a large amount of memory to store data in the table with an acceptable accuracy.

In this paper it is proposed to use the simplified 18-bit floating-point numbers to store data in the table. This approach reduces the memory consumption. FPGAs usually have a dedicated Block RAM memory, which are organized as 512 locations of 18-bits words [67]. The proposed simplified floating-point arithmetic is therefore well suited to the optimum utilization of hardware resources.

3 Implementation Results

In our investigation it was considered a problem of hardware implementation of particular parts of a fuzzy structure (FS). Considered structure has four inputs, eight rules and one output. The FS was implemented in the Spartan XC6SLX45-3C FPGA from Xilinx by means of Altium Designer and Xilinx ISE software. To encode the values of the real numbers a 32-bit fixed-point arithmetic were used. Widely known and biggest drawback of fixed-point arithmetic is the limited range and the need for continuous scaling of processed numbers. However, the use of 32-bits width words made it possible to obtain a relatively wide range at the same time fairly good accuracy. Thus, in this case this defect was somewhat minimized. Because of the necessary scaling is closely related to a specific application, this issue will be omitted for the sake of readability of the presentation. It will be presented in detail only in places that are important from the point of view of the presented algorithm.

3.1 Fuzyfication and Inference

According to the method proposed in the previous section operations of fuzzification and inference were carried out in the overall processing of input data. As a result, determination of the output value of the formula (5) requires a series of operations such as multiplication and addition. It is possible to perform these actions both in parallel, series and the in series-parallel mode. Fully parallel implementation of calculations allows us to achieve high performance at the expense of high demand on hardware resources. Serial implementation allows us to reduce the use of hardware resources, but with a significant loss of obtained processing efficiency. While a semi-serial or a semi-parallel implementation allows for compromise.

In the presented example the semi-serial implementation was used. However, the use of high-performance fixed-point calculations made it possible to achieve high-speed processing.

Elementary operations for the input of the rule (5) are executed in parallel as shown in Fig. 1. The elementary function has three 32-bits width inputs. First two inputs (W1 and W2) are the weights coefficients \({w_{i,1}^j}\) and \({w_{i,2}^j}\) respectively, the third input (X) is the input to the FS, i.e. \({x_i}\) as defined in (5). As a result this function performs several operations in one cycle. The register shown in Fig. 1 acts as a component partial sum according to the formula (5). Initial value of the register is equal to weight \(w_0^j\) and it is set in the first clock cycle. Three 32-bit fixed point multipliers (FP_MULTIPLIER_1, 2 and 3) and one 32-bit adder (ADDER_1) generates the output within the second cycle. Using the second 32-bit adder (ADDER_2) and one register the whole squared weighted sum (5) for one rule with four inputs is obtained in the fifth cycle. In the sixth cycle the LUT block indicated as EXP_FUNCTION is used to determine the output value of the nonlinear exponential function. The LUT consists of 1024 words each 12-bits width to store the shape of gausoid function with a reasonable accuracy. Summing up, in the general case the whole process of calculation of rule activation degree requires the following number of clock cycles

$$\begin{aligned} c_r = 2+N \end{aligned}$$
(8)
Fig. 1.
figure 1

The hardware implementation of the elementary function.

Fig. 2.
figure 2

The hardware implementation of the algorithm used to determine the consequents of fuzzy rules.

Fig. 3.
figure 3

The block diagram of the implemented fuzzy system with one output.

To calculate the output value y of the FS we need to perform the above described processing for all M rules. This can be done in sequence (serial method) or in parallel (for example with the use of pipelining) to obtain a different speed processing of the implemented system. As mentioned earlier in this paper was carried out the semi-serial implementation method.

3.2 The Defuzification Proces

In the proposed method Fig. 2 shows how all the rules are indicated in order to determine their activation degree \({ \mu ^j }\) and their consequent \({\mu ^j \cdot \overline{y}^{j}}\). While, Fig. 3 shows the module used for sequentially processing all rules. Two adders and two registers are used to accumulate values of activation degrees and consequents of all rules. The results, i.e. the nominator and the denominator of the Eq. (7) are obtained within the following number of clock cycles

$$\begin{aligned} c_{rms} = M \cdot c_r. \end{aligned}$$
(9)

After the nominator n and denominator d are determined the one extra clock cycle is necessary to calculate the current output value of the FS according to used method of defuzzification (7). First of these two values is multiplied by the reciprocal of the other (Fig. 3) in order to obtain the value of y according to the following formula \({ y = {n} \cdot RECIPROCAL \left( d \right) }\).

Fig. 4.
figure 4

The method of hardware implementation of the reciprocal function.

The RECIPROCAL module used for this purpose has one input and one output which are 12-bits and 27-bits width unsigned words respectively. The input has a fixed point 3.9 bit representation, i.e. three bits for the integer value and nine bits for fraction. This gives the useful range of \(d \in { ( 0; 8)}\). Since every single rule has the activation degree with a range of \(\mu ^j \in { \langle 0; 1\rangle }\) this allows to store the information about the sum of activation degree values for many rules. The upper limit for the used fixed point representation for the reciprocal input is, for example, when eight rules have activation coefficient close to unity which is rather unrealistic in properly designed system.

The RECIPROCAL module is implemented as an look-up table (LUT) located in a read-only memory and it stories 4096 words (Fig. 4). Value of the input of the module is treated as a 12-bit address, which indexes the table. Each indexed word location stores the result of operation \( RO = 1/ X \) in a 18-bit simplified floating-point (SFP) format.

The SFP is proposed in this paper a nonstandard format of encoding floating-point numbers. The SFP is an encoding format tailored to a specific application. It allows to reduce the number of bits of a binary word and to simplify their processing. The general idea is derived from the standard IEEE754 but limited to the processing of positive numbers and with a limited range. In the SFP format the 18-bit word is divided into two bitfields: 4-bits for exponent and 14-bits for mantissa. The exponent is a positive number with range of ( 0; 15). It gives the useful range for floating-point values of \({\langle 0; 32768)}\) with an acceptable accuracy. For example, the accuracy is about 0.01 % for numbers with a value close to unity or larger. In the paper the exponent is limited to the range of \({\langle 0; 8 \rangle }\) for practical reasons.

The detailed method of processing the SFP numbers is shown in Fig. 4. Two LUTs are used. First (ROMS_18x4096) stores the 18-bit words in SFP format. The second one (U_LUT_9x16) together with the fixed-point multiplier is used to change the SFP format to the fixed-point one. The U_LUT_9x16 is a binary decoder which converts the 4-bits binary number (input) to the 1-of-9 output bits. The fixed-point format which is used on the output of the RECIPROCAL module is compatible with the rest of the system. While the SFP format is used only in the RECIPROCAL module to store the data table. Such approach has allowed to reduce memory consumption by more than 50 %, while maintaining the accuracy and the processing performance at the same level.

3.3 Results

The timing analysis shows that the exemplary FS implemented in the FPGA device is able to work with clock frequency above 50MHz, which gives the reaction time below 1 \(\upmu \)s. This allows us to build a FS system, that could be useful for some kind of applications, e.g. hardware emulators.

The implementation results presented in the Table 1 are valid for a system with one output. However, for a system with multiple outputs the resource consumption will be almost the same when the serial implementation is used. Obviously, the response time will be many times larger (proportional to the number of outputs) compared to the system with one output.

Table 1. Performance and the FPGA resource usage of the exemplary fuzzy system implemented with the use of the proposed method.

4 Summary

In this paper a method of implementation of fuzzy system on FPGA devices was presented. The method applies to a class of fuzzy systems which are functionally equivalent to the radial basis function networks. Thanks to this similarity it was possible to propose the effective methods of such fuzzy systems implementation in FPGA-type programmable systems. For a demonstration of the method the results of the implementation of an exemplary fuzzy system in the FPGA was presented. The results show that the FS system with four inputs, eight rules and one output can work with the processing cycle of less than one microsecond. It makes the proposed solution useful in practice.

Presented solution is highly scalable, because depending on the requirements it is possible to shortening response time at the expense of increase the hardware resources. Similarly, it is possible to increase the number of inputs and outputs and the number of rules of the system.