Keywords

1 Introduction

Discrete Fourier transforms (DFTs) [1, 2] are used extensively in a range of computing applications like audio processing, image processing, medical imaging etc.

A number of parallel algorithms developed for DFT/FFT in different architectures can be found in the literature [27].

Using an extension of the common factor algorithm authors in [3] have shown that a simple planar 2-dimensional systolic array having N = M 2 processing elements can be used to compute DFT of size N in 2 M + 1 steps of pipelined operations, achieving the area-time complexity AT 2 = O (N 2 log 3 N).

In this paper we present an entirely different but simple direct matrix–vector multiplication approach for computing DFT in Multi-Mesh architecture. The time complexity of the algorithm is O (√N) for N point DFT. The algorithm is implemented in Multi-Mesh (MM) architecture [79] having N 2 processors where each processor is of 4 degree. As shown in Fig. 1, there are N meshes of size √N × √N each, which themselves are again arranged in √N rows and √N columns so that there will be N 2 processors in the MM network.

Fig. 1
figure 1

A simple n × n multi-mesh network with n = 4 (all links are not shown)

The important aspect of the proposed parallel algorithm is that it requires constant time multiplication. The order of time complexity is mainly governed by the data communication and addition times. The Fourier components are generated in place in parallel for each processor using the processor position co-ordinates and require three multiplication steps and two addition steps. The input of the vector to be transformed and their propagation requires (2n + 1) communication steps. The partial products are generated by a single multiplication step. After which 2(n  1) additions and (n + 1) data communication steps are required to get the transformed vector.

2 Discrete Fourier Transform

The DFT [1, 2] of a continuous function a (t) is given by

$$ A_{f} = \sum\nolimits_{0}^{N - 1} {a_{t} e^{{\frac{2\pi ifk}{N}}} } ,\;0 \le f \le N - 1 $$

where, \( i = \sqrt { - 1} \). The variable t is used to represent time, and f is used to represent frequency. The Fourier transform is used to convert a function of time into a function of frequency.

3 The Multi Mesh (MM) Network

First, the MM network proposed by the authors [7] is described. The basic building block of the MM network is n × n mesh.

A processor inside a given block can be uniquely identified by two coordinates. Again blocks are organized as matrix form so each block can be identified by two coordinates, say α and β as M (α, β). Thus, each of the n4 processors in MM can be uniquely identified using a 4-tuple. If the boundary processors of P (α, β 1 ) are connected to boundary processors of P (α, β 2 ) for all values of α, 0 ≤ α ≤ n  1, we denote these sets of links by an interconnection between the sets P (*, β 1 ) and P (*, β 2 ). Inter-block vertical connection is identified by following rule.

β, 0 ≤ β ≤ n  1, P (α, β, 0, y) are connected to P (y, β, n  1, α), where 0 ≤ y, α ≤ n  1, and inter block horizontal connection is identified by following rule.

α, 0 ≤ α ≤ n  1, P (α, β, x, 0) are connected to P (α, x, β, n  1), where 0 ≤ x, β ≤ n  1.

4 Fourier Transformation Using Matrix–Vector Multiplication

Transformation of a matrix A of size N × N by a vector B of size N is given by C = A × B, where C is the transformed vector of length N.

4.1 Parallel Implementation of Fourier Transform Using MM Network

The given matrix A is partitioned into n × n sub-matrices A α, β where 0 ≤ α, β ≤ n  1. In this way there are n 2 sub-matrices which are initially mapped to M (α, β), 0 ≤ α, β ≤ n  1, each of which contains n 2 processors P (α, β, *, *). This ‘*’ indicates all possible values from 0 to n  1. Here each processor holds a single component of matrix A and all these values are generated in place according to the position of the node P (α, β, x, y), 0 ≤ α, β, x, y ≤ n  1. For example, ω (αn + x)(βn + y) is the value of A at the node position P (α, β, x, y) and is calculated as \( \cos \frac{{\left( {\alpha n + x} \right)\left( {\beta n + y} \right)2\pi }}{n} - i \sin \frac{{\left( {\alpha n + x} \right)\left( {\beta n + y} \right)2\pi }}{n} \) using the built in complex functionality \( \omega = e^{{\frac{2\pi i}{n}}} = \cos \frac{2\pi }{n} - i \sin \frac{2\pi }{n} \). Next B values are input through the upper boundary of the MM network and are moved to all the other positions in the MM network. The movement of the B values are shown in Fig. 2 where only first four components of B vector are shown for the first column of blocks of a 4 × 4 Multi-Mesh.

Fig. 2
figure 2

Data movement steps of vector B in blocks M (*, 0) of a 4 × 4 MM: a Input through the upper boundary of the block M (0, 0). b Input values are moved to other blocks through vertical inter block links. c Propagation along row in each block. d Last row of data is moved along vertical inter block links. e Propagation along column direction within block

Once the B values are populated and A values are calculated at the corresponding node, the next step towards completion of Fourier Transform is to perform multiplication at each node.

To calculate the \( C_{i } \sum\nolimits_{j = 0}^{{n^{2} - 1}} {c_{ij} } \) for each i, where \( c_{ij} = a_{ij} \times b_{j} \), all cij ’s are to be brought in a single block M (i/n, i%n), as they are now scattered in ith rows of n different blocks. To achieve this n shift is performed along the horizontal inter block links of the MM network as shown in Fig. 3 given below.

Fig. 3
figure 3

Contents of blocks M (0, *) after n (=4) steps data movement along the horizontal inter block links in a 4 × 4 MM

Now the n 2 values of each mesh are summed along the column direction then horizontally along the first row of each block to make an element of the result vector C. After this step one more single step data movement is performed along the horizontal inter block links to bring the values of all the blocks M (α, 0), 0 ≤ α ≤ n  1, at the left boundaries as shown in and Fig. 4. The final output is obtained at the left boundary of the MM network.

Fig. 4
figure 4

Column sum followed by 0th row sum in each block of MM for n = 4 (solid lines with arrowhead indicates the direction of additions)

5 Parallel Discrete Fourier Transform Using MM Network

5.1 Algorithm PDFT

  1. A.

    Initialization step

    This step is subdivided into two parts. Register R1 s contain the initial values of matrix A, whereas registers R2 s contain the initial values of vector B.

    Step 1::

    α, β, x and y, 0 ≤ α, β, x, y ≤ n  1 do in parallel

    R1 (α, β, x, y)←

    Step 2::

    β and y, 0 ≤ α, β, y ≤ n  1 do in parallel

    R2 (0, β, 0, y)←

  2. B.

    Propagation of B vector to the other rows of MM network

    1. I.

      β and y, 0 ≤ β, y ≤ n  1 do in parallel

      R2 (y, β, n, 0) ← R2 (0, β, 0, y);

    2. II.

      α, β, 0 ≤ α, β ≤ n  1 do in parallel

      for i = 1 to n  1 do

      R2 (α, β, n  1, i) ← R1 (α, β, n  1, i  1);

    3. III.

      α, β, i, 0 ≤ α, β, i ≤ n  1 do in parallel

      R2 (i, β, 0, α) ← R2 (α, β, n  1, i);

    4. IV.

      α, β, j, 0 ≤ α, β, j ≤ n  1 do in parallel

      for i = 1 to n  1 do

      R1 (α, β, i, j) ← R2(α, β, i − 1, j);

  3. C.

    Multiplication Step

    • α, β, x and y, 0 ≤ α, β, x, y ≤ n  1 do in parallel

    • R1 (α, β, x, y) ← R1 (α, β, x, y) × R2 (α, β, x, y);

  4. D.

    Data Movement Step

    In the MM network horizontal inter block links form cycles of length 2n between the k th row of the block M (i, j) and the j th row of the block M (i, k) for j ≠ k, 0 ≤ i ≤ n  1. For a given α, if data elements in M(α, *) are shifted through n positions along the horizontal cycles, then the ith row elements of M(α, j) will also be shifted to the jth row of B(α, i), 0 ≤ α ≤ n  1.

    /* Here, `*’ indicates all possible values from 0 to n  1, but the same value for it must be used on both sides of the assignment operator */

6 Time Complexity of Parallel DFT

The number of steps required for initialization stage A is 6 and for propagation stage is 2n. Generation of partial products needs single step. Data movement stage D requires n steps. Addition stage E requires 2(n  1). Data arrangement stage F is done in single step. So, total number of steps required for the entire process is 5n + 6 or 5 N 1/2 + 6 where N 2 = n 4 which implies time complexity of this algorithm is O (n) or O (N 1/2).

7 Comparative Study of Time Complexities of Parallel DFT in Multi-Mesh and Other Architectures

The time complexities of parallel DFT computation in various architectures have been tabulated in Table 1.

Table 1 Time complexities of PDFT computation in various architectures are given below

For the purpose of comparison, we are considering same mesh size (n × n) and same total number of nodes (n 4) for both MM and Multi-dimensional mesh. The diameters of MM and 4-D mesh are 2n and 4 (n  1) respectively, and degree of each node in MM is uniformly 4, whereas the node degree of each internal node is 8 for 4-D mesh. If smaller meshes are used to construct the multi-dimensional mesh, the node degree will increase. Moreover, the time complexity of proposed DFT computation in MM outperforms multi-dimensional mesh for practical sizes of meshes, i.e., mesh sizes less than 2048 × 2048.

In Hypercube better time complexity is achieved through its high node degree and number of links with increasing dimensions.

8 Conclusion and Future Work

A parallel algorithm for DFT of vector of length N (=n 2) has been proposed with constant multiplication time and O (√N) addition and data communication time. Each processor of the MM network having N 2 (=n 4) processors, generates a single component of the Fourier matrix depending on the processor position in MM network i.e. each of the components is generated in place.

The authors in [7] have implemented n 1.42 × n 1.42 matrix-by-matrix multiplication in O (n) time using n 4 processors (same as that used in the proposed work) of MM and indicated that DFT of vector length n 1.42 can be computed in the same way. In the present work, DFT of vector length n 2 is computed as linear transformation in O (n) time using n 4 processors of MM. The Fourier matrix is calculated in-place at each node of the MM, so there is no need to input the Fourier matrix through the left boundary of each mesh (block) as was done in [7], and vector to be transformed was input only through the upper boundary of the MM whereas the input was made through the upper boundary of each mesh (block) in [7].

As a future work, we may consider a generalized MM network where total number of processors in MM network is m 2 × n 2 having m × n number of processors in each block of the MM network for m, n ≥ 3.