
1 Introduction

In previous work the authors proposed a performance model for analysing parallel algorithms. The model assumes that the (parallel) algorithm is represented as a set of operators related to each other by a dependence rule. Furthermore, the model has a parameterized formulation intended to capture the different characteristics of computing machines, such as reconfigurable hardware devices [13].

Here we consider the matrix multiplication (MM) algorithm and apply the performance model to it. The algorithm is simple and makes no claim of optimality (much effort has been spent in the field of linear algebra, and recent examples can be found in [1, 9, 11, 12]); instead, our aim is to discuss how easily some implementation choices can be addressed, giving rise to different performance results. The focus is on the “opportunity” of implementing the algorithm in hybrid distributed/shared-memory computing environments, obtaining the most important information before the implementation. The implementations of the MM algorithm are composed of multiplications of submatrices: the general MM algorithm is decomposed into multiple calls to smaller matrix multiplications, which are themselves decomposed into multiple calls to inner kernels. The key observation is that if these lowest-level kernels attain high performance, then so will the MM algorithm. This paper attempts to describe how to apply the performance model that the authors have developed, so as to make it accessible to a broad audience.

2 Matrix Multiplication

Given two matrices A, B \(\in \mathfrak {R}^{n \times n}\) and the computational problem

$$\begin{aligned} \mathcal {B}_{n^2}\equiv MM_{n\times n}:=A\cdot B, \end{aligned}$$
(1)

we introduce the subproblems \(matmul^i_{\frac{n}{3}\times \frac{n}{3}}\), for \(i=0,\ldots ,26\), which are defined as follows:

$$\begin{aligned} \mathcal {B}_{\frac{n}{3}\times \frac{n}{3}}\equiv matmul^i_{\frac{n}{3}\times \frac{n}{3}}:= C_i+A_i \cdot B_i, \end{aligned}$$
(2)

with \(A_i \in \mathfrak {R}^{\frac{n}{3}\times \frac{n}{3}}\), \(B_i\in \mathfrak {R}^{\frac{n}{3}\times \frac{n}{3} }\) and \(C_i\in \mathfrak {R}^{\frac{n}{3}\times \frac{n}{3} }\) blocks of A, B and C, respectively. Finally, we introduce the decomposition

$$\begin{aligned} D_{27}(MM_{n\times n}):=\{matmul^i_{\frac{n}{3}\times \frac{n}{3}}\}_{0\le i<27}. \end{aligned}$$
(3)

From (1)–(3), the decomposition matrix is:

$$\begin{aligned} M_D= \begin{bmatrix} matmul^0_{\frac{n}{3}\times \frac{n}{3}}&matmul^1_{\frac{n}{3}\times \frac{n}{3}}&matmul^2_{\frac{n}{3}\times \frac{n}{3}}&\cdots&matmul^8_{\frac{n}{3}\times \frac{n}{3}}\\ matmul^9_{\frac{n}{3}\times \frac{n}{3}}&matmul^{10}_{\frac{n}{3}\times \frac{n}{3}}&matmul^{11}_{\frac{n}{3}\times \frac{n}{3}}&\cdots&matmul^{17}_{\frac{n}{3}\times \frac{n}{3}}\\ matmul^{18}_{\frac{n}{3}\times \frac{n}{3}}&matmul^{19}_{\frac{n}{3}\times \frac{n}{3}}&matmul^{20}_{\frac{n}{3}\times \frac{n}{3}}&\cdots&matmul^{26}_{\frac{n}{3}\times \frac{n}{3}}\\ \end{bmatrix} \end{aligned}$$
(4)

The set \(D_{27}(MM_{n\times n})\) is made of the 27 subproblems \(matmul^{i}_{\frac{n}{3}\times \frac{n}{3}}\in D_{27}\); the problem \(MM_{n\times n}\) has concurrence degree \(r_D=9\) (the number of columns of \(M_D\), i.e. the subproblems that can be solved concurrently) and dependence degree \(c_D=3\) (the number of rows of \(M_D\)).
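
For concreteness, each subproblem can be identified with a triple of block indices: under the (illustrative, not prescribed by the model) ordering \(i=9k+3r+c\), subproblem \(matmul^{i}_{\frac{n}{3}\times \frac{n}{3}}\) performs

$$\begin{aligned} C_{rc} \leftarrow C_{rc} + A_{rk}\cdot B_{kc}, \qquad r,c,k\in \{0,1,2\}, \end{aligned}$$

so each row of \(M_D\) updates the nine blocks \(C_{rc}\) concurrently, while the three rows correspond to the three sequential accumulation steps \(k=0,1,2\) of each block.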

Suppose that the computing environment can be represented by means of the machine \(\mathcal {M}_{1,1}\) which has

  • \(P=1\),

  • \(Op_{\mathcal {M}_{1,1}}\,=\,\{\otimes ,...\}\) where \(\otimes :=\) matrix-matrix multiply,

  • \(L=2\) two memory levels,

  • \(rmem_i\) (read) and \(wmem_j\) (write) are the memory access operators on blocks of size \(\frac{n}{3}\times \frac{n}{3}\),

  • \(tmem_1:=tblock_{mem}\) is the time to move one such block between the two memory levels,

  • for each \(\otimes \), 1 read (before the execution) and 1 write (after the execution) are needed.

According to \(D_{27}\), the sequential algorithm \(A_{D_{27},\mathcal {M}_{1,1}}\) on \(\mathcal {M}_{1,1}\) is made of the 27 operators \(\otimes \) corresponding to the 27 subproblems. The execution matrix of \(A_{D_{27},\mathcal {M}_{1,1}}\) has \(r_{E}=27\) rows and only one column, i.e. \(c_{E}=1\). It is the following matrix:

$$\begin{aligned} M_E= \begin{bmatrix} \otimes _0 \\ \otimes _1 \\ \vdots \\ \otimes _{26} \end{bmatrix} \end{aligned}$$
(5)

while the memory matrix \(AM_{A_{D_{27},\mathcal {M}_{1,1}}}\) has \(r_{MEM}=54\) rows and \(c_{MEM}=1\) column, and it can be described in the following way:

$$\begin{aligned} AM_{A_{D_{27},\mathcal {M}_{1,1}}}= \begin{bmatrix} rmem_0(\cdot ) \\ wmem_0(\cdot ) \\ rmem_1(\cdot ) \\ wmem_1(\cdot ) \\ \vdots \\ rmem_{26}(\cdot )\\ wmem_{26}(\cdot ) \end{bmatrix} \end{aligned}$$
(6)

The execution time of the algorithm \(A_{D_{27},\mathcal {M}_{1,1}}\) is

$$\begin{aligned} T(A_{D_{27},\mathcal {M}_{1,1}}) = r_{E}\cdot T_r \end{aligned}$$
(7)

where \(T_r\) is the execution time of row r of the matrix in (5); it equals the execution time of the \(\otimes \) operator, since all the rows are identical.

Let \(C(\otimes )\) denote the complexity of the \(\otimes \) operator; then (7) becomes:

$$\begin{aligned} T(A_{D_{27},\mathcal {M}_{1,1}}) = 27\cdot C(\otimes )\cdot tcalc \end{aligned}$$
(8)

The memory access time of the software corresponding to \(A_{D_{27},\mathcal {M}_{1,1}}\) is

$$\begin{aligned} T_M(SW(A_{D_{27},\mathcal {M}_{1,1}}))=r_{MEM}\cdot tblock_{mem}=54\cdot tblock_{mem}, \end{aligned}$$
(9)

and its execution time is

$$\begin{aligned} \begin{aligned} T(SW(A_{D_{27},\mathcal {M}_{1,1}}))&= T(A_{D_{27},\mathcal {M}_{1,1}})+T_M(SW(A_{D_{27},\mathcal {M}_{1,1}}))\\&=27\cdot C(\otimes )\cdot tcalc + 54\cdot tblock_{mem} \end{aligned} \end{aligned}$$
(10)
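
To make the counting above concrete, the following minimal sketch (an illustration in Python, not the authors' implementation; the function name and block layout are our own) simulates \(A_{D_{27},\mathcal {M}_{1,1}}\), counting one \(\otimes \) per subproblem plus one block read and one block write around each \(\otimes \):

```python
import numpy as np

def sequential_block_mm(A, B, q=3):
    """A_{D27,M_{1,1}}: 27 block multiplies, each preceded by one block
    read (rmem) and followed by one block write (wmem)."""
    n = A.shape[0]
    nb = n // q                                  # block size n/3
    C = np.zeros((n, n))
    ops = mem = 0
    for i in range(q):
        for j in range(q):
            for k in range(q):
                r = slice(i * nb, (i + 1) * nb)
                c = slice(j * nb, (j + 1) * nb)
                s = slice(k * nb, (k + 1) * nb)
                Cij = C[r, c].copy()             # rmem: read block C_i
                mem += 1
                Cij += A[r, s] @ B[s, c]         # one ⊗ operator
                ops += 1
                C[r, c] = Cij                    # wmem: write block back
                mem += 1
    return C, ops, mem

A, B = np.random.rand(6, 6), np.random.rand(6, 6)
C, ops, mem = sequential_block_mm(A, B)
assert np.allclose(C, A @ B)
print(ops, mem)                                  # 27 operators, 54 block accesses
```

The counters reproduce \(r_{E}=27\) of (8) and \(r_{MEM}=54\) of (9).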

2.1 The Algorithm at the First Level of Decomposition

We consider the machine \(\mathcal {M}_{9,9}\) such that

  • \(P=9\) processors (which we call nodes), organized in a \(3\times 3\) logical grid,

  • \(Op_{\mathcal {M}_{9,9}}=\{\otimes ,...\}\) where \(\otimes \) = matrix-matrix multiply,

  • \(L=3\) (two memory levels plus one level for communications),

  • \(trans_i\) denotes the memory access operator which moves a block of size \(\frac{n}{3}\times \frac{n}{3}\) in time \(tblock_{com}\),

  • each node can transfer a single block at a time, so the machine can transfer 9 blocks concurrently,

  • in a broadcast step each node performs one transfer (one node sends, the other eight receive),

  • in a rolling step each node performs two transfers (it sends one block and receives one block).

Initially each node holds one \(\frac{n}{3}\times \frac{n}{3}\) block of each matrix (Fig. 1). If \(matmul(p \cdot i)\) denotes the subproblem \(matmul^{p\cdot i}_{\frac{n}{3}\times \frac{n}{3}}\in D_{27}\), the algorithm \(A_{D_{27},\mathcal {M}_{9,9}}\) is the so-called Broadcast Multiply Rolling (BMR) algorithm [10]:

[Algorithm a: the BMR algorithm \(A_{D_{27},\mathcal {M}_{9,9}}\)]

Fig. 1. The initial distribution of the matrix blocks among the nodes.
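
Although the original listing is not reproduced here, the structure of BMR can be sketched as follows. This is a minimal single-process Python simulation of the \(3\times 3\) grid (our own illustration; in a distributed implementation the broadcast and rolling steps become actual inter-node communications):

```python
import numpy as np

def bmr_multiply(A, B, q=3):
    """Single-process simulation of the BMR algorithm on a q x q grid:
    at each step, broadcast one A block along each grid row, perform the
    q*q concurrent block multiplies, then roll the B blocks upward."""
    n = A.shape[0]
    nb = n // q                              # block size n/3
    Ab = [[A[i*nb:(i+1)*nb, j*nb:(j+1)*nb] for j in range(q)] for i in range(q)]
    Bb = [[B[i*nb:(i+1)*nb, j*nb:(j+1)*nb] for j in range(q)] for i in range(q)]
    Cb = [[np.zeros((nb, nb)) for _ in range(q)] for _ in range(q)]
    for k in range(q):                       # the 3 rows of the execution matrix (11)
        for i in range(q):
            Abcast = Ab[i][(i + k) % q]      # broadcast step: one transfer per node
            for j in range(q):
                Cb[i][j] += Abcast @ Bb[i][j]   # multiply step: one ⊗ per node
        if k < q - 1:                        # rolling step: two transfers per node
            Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return np.block(Cb)

n = 6
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(bmr_multiply(A, B), A @ B)
```

With 3 broadcast steps (one transfer per node) and 2 rolling steps (two transfers per node), the simulation matches the 7 transfer rows of the memory matrix (12).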

The execution matrix of \(A_{D_{27},\mathcal {M}_{9,9}}\) is

$$\begin{aligned} M_E= \begin{bmatrix} \otimes _0&\otimes _1&\otimes _2&\otimes _3&\otimes _4&\otimes _5&\otimes _6&\otimes _7&\otimes _8\\ \otimes _9&\otimes _{10}&\otimes _{11}&\otimes _{12}&\otimes _{13}&\otimes _{14}&\otimes _{15}&\otimes _{16}&\otimes _{17}\\ \otimes _{18}&\otimes _{19}&\otimes _{20}&\otimes _{21}&\otimes _{22}&\otimes _{23}&\otimes _{24}&\otimes _{25}&\otimes _{26} \end{bmatrix} \end{aligned}$$
(11)

and it is perfectly parallel. The memory matrix is

$$\begin{aligned} AM_{A_{D_{27},\mathcal {M}_{9,9}}}= \begin{bmatrix} trans_0(\cdot )&trans_1(\cdot )&trans_2(\cdot )&...&trans_8(\cdot ) \\ trans_9(\cdot )&trans_{10}(\cdot )&trans_{11}(\cdot )&...&trans_{17}(\cdot ) \\ trans_{18}(\cdot )&trans_{19}(\cdot )&trans_{20}(\cdot )&...&trans_{26}(\cdot ) \\ trans_{27}(\cdot )&trans_{28}(\cdot )&trans_{29}(\cdot )&...&trans_{35}(\cdot ) \\ trans_{36}(\cdot )&trans_{37}(\cdot )&trans_{38}(\cdot )&...&trans_{44}(\cdot ) \\ trans_{45}(\cdot )&trans_{46}(\cdot )&trans_{47}(\cdot )&...&trans_{53}(\cdot ) \\ trans_{54}(\cdot )&trans_{55}(\cdot )&trans_{56}(\cdot )&...&trans_{62}(\cdot ) \end{bmatrix} \end{aligned}$$
(12)

The execution time of each row of \(M_E\) is the execution time of the \(\otimes \) operator. If \(r_{E}=3\) is the number of rows of \(E_{A_{D_{27},\mathcal {M}_{9,9}}}\), the execution time of the BMR algorithm \(A_{D_{27},\mathcal {M}_{9,9}}\) is

$$\begin{aligned} T(A_{D_{27},\mathcal {M}_{9,9}}) = r_{E}\cdot T_{r} = 3\cdot C(\otimes )\cdot tcalc \end{aligned}$$
(13)

Since \(r_{MEM}=7\) (3 broadcast steps with one transfer each, plus 2 rolling steps with two transfers each; the final rolling step is not needed), the memory access time of the software \(SW(A_{D_{27},\mathcal {M}_{9,9}})\) is

$$\begin{aligned} T_M(SW(A_{D_{27},\mathcal {M}_{9,9}}))=r_{MEM}\cdot tblock_{com}=7\cdot tblock_{com} \end{aligned}$$
(14)

and its execution time is

$$\begin{aligned} \begin{aligned} T(SW(A_{D_{27},\mathcal {M}_{9,9}}))&= T(A_{D_{27},\mathcal {M}_{9,9}})+T_M(SW(A_{D_{27},\mathcal {M}_{9,9}}))\\&=3\cdot C(\otimes )\cdot tcalc + 7\cdot tblock_{com}. \end{aligned} \end{aligned}$$
(15)

Finally, the speed up of the software \(SW(A_{D_{27},\mathcal {M}_{9,9}})\) is

$$\begin{aligned} Sp(SW(A_{D_{27},\mathcal {M}_{9,9}}))=\frac{T(SW(A_{D_{27},\mathcal {M}_{1,1}}))}{T(SW(A_{D_{27},\mathcal {M}_{9,9}}))}=\frac{27\cdot C(\otimes )\cdot tcalc + 54\cdot tblock_{mem}}{3\cdot C(\otimes )\cdot tcalc + 7\cdot tblock_{com}} \end{aligned}$$
(16)
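
As a sanity check, when computation dominates the memory and communication terms, (16) tends to the number of nodes:

$$\begin{aligned} Sp(SW(A_{D_{27},\mathcal {M}_{9,9}}))\approx \frac{27\cdot C(\otimes )\cdot tcalc}{3\cdot C(\otimes )\cdot tcalc}=9. \end{aligned}$$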

2.2 The Sequential Algorithm at the Second Level of Decomposition

Consider the subproblem \(matmul^{i}_{\frac{n}{3}\times \frac{n}{3}}\) and the decomposition

$$\begin{aligned} D^\prime _{\frac{n}{3}-1}=\{matvec^i_{\frac{n}{3}\times \frac{n}{3}}\}_{0\le i<\frac{n}{3}} \end{aligned}$$
(17)

where

$$\begin{aligned} \begin{aligned} matvec^i_{\frac{n}{3}\times \frac{n}{3}}:=&\,\text {the product of an}\ \frac{n}{3}\times \frac{n}{3}\ \text {block}\ A_i \\&\text { and a vector } B_i \text { of } \frac{n}{3} \text { elements.} \end{aligned} \end{aligned}$$
(18)

All the subproblems are independent, so the decomposition matrix of \(matmul^{i}_{\frac{n}{3}\times \frac{n}{3}}\) is

$$\begin{aligned} M_{D^\prime _{\frac{n}{3}-1}}= \begin{bmatrix} matvec^0_{\frac{n}{3}\times \frac{n}{3}}&matvec^1_{\frac{n}{3}\times \frac{n}{3}}&...&matvec^{\frac{n}{3}-1}_{\frac{n}{3}\times \frac{n}{3}} \end{bmatrix} \end{aligned}$$
(19)

and \(matmul^{i}_{\frac{n}{3}\times \frac{n}{3}}\) has concurrence degree \(\frac{n}{3}\) and dependence degree 1.

Let us introduce the machine \(\mathcal {M}^\prime _{1,1}\) corresponding to a generic node of \(\mathcal {M}_{9,9}\). Suppose that \(\mathcal {M}^\prime _{1,1}\) is such that

  • \(P=1\),

  • \(Op_{\mathcal {M}^\prime _{1,1}}=\{\boxtimes ,...\}\) where \(\boxtimes \) = matrix-vector multiply,

  • \(L=2\),

  • \(rmemv_i\) (read) and \(wmemv_j\) (write) denote the memory access operators moving a vector of size \(\frac{n}{3}\) in time \(tmem:=tvec_{mem}\).

Since all the subproblems must be solved one after another, the execution matrix of \(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1,1}}\) is

$$\begin{aligned} M_E= \begin{bmatrix} \boxtimes _0 \\ \boxtimes _1\\ \vdots \\ \boxtimes _{\frac{n}{3}-1} \end{bmatrix} \end{aligned}$$
(20)

Since we assume that the execution of each operator requires reading \(\frac{n}{3}+1\) vectors (the \(\frac{n}{3}\) rows of the block \(A_i\) and the vector \(B_i\)) before the execution and writing one vector after the execution, i.e. \(\frac{n}{3}+2\) vector accesses per operator, the memory matrix has

$$r_{mem,D^\prime _{\frac{n}{3}-1}}=\left( \frac{n}{3}+2\right) \cdot \frac{n}{3}$$

rows. The execution time of each row of the matrix in (20) is the execution time of the \(\boxtimes \) operator. If

$$r_{E,D^\prime _{\frac{n}{3}-1}}=\frac{n}{3}$$

is the number of rows of \(E_{A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1,1}}}\), the execution time of the algorithm \(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1,1}}\) is

$$\begin{aligned} \begin{aligned} T(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1,1}})&= r_{E,D^\prime _{\frac{n}{3}-1}}\cdot T_{r} =\frac{n}{3}\cdot C(\boxtimes )\cdot tcalc \\&=\frac{n}{3}\cdot 2\cdot \left( \frac{n}{3}\right) ^2 \cdot tcalc = 2\cdot \left( \frac{n}{3}\right) ^3 \cdot tcalc = C(\otimes )\cdot tcalc \end{aligned} \end{aligned}$$
(21)

and the memory access time of the software \(SW(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1,1}})\) is

$$\begin{aligned} T_M(SW(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1,1}}))=r_{mem,D^\prime _{\frac{n}{3}-1}}\cdot tvec_{mem}=\left( \frac{n}{3}+2\right) \cdot \frac{n}{3}\cdot tvec_{mem}. \end{aligned}$$
(22)

Finally, the execution time of the software \(SW(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1,1}})\) is

$$\begin{aligned} \begin{aligned} T(SW(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1,1}}))&= T(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1,1}})+T_M(SW(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1,1}})) \\&=2\cdot \left( \frac{n}{3}\right) ^3 \cdot tcalc + \left( \frac{n}{3}+2\right) \cdot \frac{n}{3}\cdot tvec_{mem} \end{aligned} \end{aligned}$$
(23)
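
As an illustration, the block multiply at this level is a loop of \(\boxtimes \) operators. The sketch below (Python, under the model's access-counting assumption of \(\frac{n}{3}+2\) vector accesses per operator; names are our own) also verifies the count in (22):

```python
import numpy as np

def block_mm_by_matvec(A_blk, B_blk):
    """One block ⊗ realized as n/3 sequential ⊠ (matrix-vector) operators."""
    m = A_blk.shape[0]                     # m = n/3
    C_blk = np.zeros((m, m))
    vec_accesses = 0
    for j in range(m):                     # one ⊠ per column of B_blk
        # reads: the m rows of A_blk and the vector B_blk[:, j]  -> m + 1
        # write: the result vector C_blk[:, j]                   -> 1
        C_blk[:, j] = A_blk @ B_blk[:, j]
        vec_accesses += (m + 1) + 1
    return C_blk, vec_accesses             # vec_accesses = (m + 2) * m, as in (22)

A_blk, B_blk = np.random.rand(4, 4), np.random.rand(4, 4)
C_blk, acc = block_mm_by_matvec(A_blk, B_blk)
assert np.allclose(C_blk, A_blk @ B_blk) and acc == (4 + 2) * 4
```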

2.3 The Parallel Algorithm at the Second Level of Decomposition

We consider the machine \(\mathcal {M}^\prime _{1 \cdot 8}\) made of 8 cores/threads for each node of \(\mathcal {M}_{9,9}\). Let us assume that \(\mathcal {M}^\prime _{1 \cdot 8}\) is such that

  • \(P=8\),

  • \(Op_{\mathcal {M}^\prime _{1 \cdot 8}}=\{\boxtimes ,...\}\) where \(\boxtimes \)  = matrix-vector multiply,

  • \(L=2\),

  • \(rmemv_i\) (read) and \(wmemv_j\) (write) denote the memory access operators that move a vector of \(\frac{n}{3}\) elements between the memory levels in time \(tvec_{mem}\); the cores can perform these accesses concurrently. Note that \(tvec_{mem}\le tblock_{mem}\).

Then, if \(matvec(t\cdot i)\) denotes the subproblem \(matvec^{t\cdot i}_{\frac{n}{3}\times \frac{n}{3}}\in D^\prime _{\frac{n}{3}-1}\), we get the Multi Thread Matrix multiply algorithm \(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1 \cdot 8}}\):

[Algorithm b: the multi-thread matrix multiply algorithm \(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1 \cdot 8}}\)]

The first 8 of the \(\frac{n}{3}\) subproblems can be solved independently by the 8 cores, and so on until all of them are completed. Hence, the execution matrix of the algorithm \(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1 \cdot 8}}\) has \(r_{E}=\frac{n}{3\cdot 8}=\frac{n}{24}\) rows and, if we assume that \(\frac{n}{3}\) is a multiple of 8, the algorithm is perfectly parallel.
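
A minimal shared-memory sketch of this scheme (illustrative Python; a real implementation would more likely use OpenMP threads or similar) dispatches the \(\frac{n}{3}\) independent \(\boxtimes \) operators to a pool of 8 workers:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def multithread_block_mm(A_blk, B_blk, n_cores=8):
    """The n/3 independent ⊠ operators dispatched to a pool of 8 cores."""
    m = A_blk.shape[0]                     # m = n/3, assumed a multiple of 8
    C_blk = np.empty((m, m))

    def matvec(j):                         # one ⊠ operator: column j of the result
        C_blk[:, j] = A_blk @ B_blk[:, j]

    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        list(pool.map(matvec, range(m)))   # at most 8 ⊠ run concurrently
    return C_blk

A_blk, B_blk = np.random.rand(24, 24), np.random.rand(24, 24)
assert np.allclose(multithread_block_mm(A_blk, B_blk), A_blk @ B_blk)
```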

Assuming that the execution of each operator requires reading \(\frac{n}{3}+1\) vectors (before the execution) and writing one vector (after the execution), and that the cores can transfer their vectors concurrently, i.e. the machine can transfer 8 vectors at the same time, the memory matrix \(AM_{A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1 \cdot 8}}}\) has

$$r_{mem,D^\prime _{\frac{n}{3}-1}}=\left( \frac{n}{3}+2\right) \cdot \frac{n}{24}$$

rows. The execution time of each row of the execution matrix is the execution time of the \(\boxtimes \) operator. If \(r_{E}=\frac{n}{24}\) is the number of rows of \(E_{A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1 \cdot 8}}}\), the execution time of the algorithm \(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1 \cdot 8}}\) is

$$\begin{aligned} \begin{aligned} T(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1 \cdot 8}})&= r_{E}\cdot T_{r} = \frac{n}{24}\cdot C(\boxtimes )\cdot tcalc =\frac{n}{24}\cdot 2\cdot \left( \frac{n}{3}\right) ^2 \cdot tcalc. \end{aligned} \end{aligned}$$
(24)

Given \(r_{mem,D^\prime _{\frac{n}{3}-1}}=\left( \frac{n}{3}+2\right) \cdot \frac{n}{24}\) rows of the memory access matrix, the memory access time of the software \(SW(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1 \cdot 8}})\) we are going to implement is

$$\begin{aligned} T_M(SW(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1 \cdot 8}}))=r_{mem,D^\prime _{\frac{n}{3}-1}}\cdot tvec_{mem}=\left( \frac{n}{3}+2\right) \cdot \frac{n}{24}\cdot tvec_{mem} \end{aligned}$$
(25)

and the execution time of the software \(SW(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1 \cdot 8}})\) is

$$\begin{aligned} \begin{aligned} T(SW(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1 \cdot 8}}))&= T(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1 \cdot 8}})+T_M(SW(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1 \cdot 8}}))\\&=\frac{n}{24}\cdot 2\cdot \left( \frac{n}{3}\right) ^2 \cdot tcalc + \left( \frac{n}{3}+2\right) \cdot \frac{n}{24}\cdot tvec_{mem}. \end{aligned} \end{aligned}$$
(26)

Finally, the speed up is

$$\begin{aligned} \begin{aligned} Sp(SW(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1 \cdot 8}}))&=\frac{T(SW(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1,1}}))}{T(SW(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1 \cdot 8}}))}\\&=\frac{2\cdot \left( \frac{n}{3}\right) ^3 \cdot tcalc + \left( \frac{n}{3}+2\right) \cdot \frac{n}{3}\cdot tvec_{mem}}{\frac{n}{24}\cdot 2\cdot \left( \frac{n}{3}\right) ^2 \cdot tcalc + \left( \frac{n}{3}+2\right) \cdot \frac{n}{24}\cdot tvec_{mem}}>1 \end{aligned} \end{aligned}$$
(27)
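
Indeed, both the computation term and the memory term of the numerator are exactly 8 times the corresponding terms of the denominator, so under the model's assumption of perfectly concurrent computation and vector transfers (27) evaluates to

$$\begin{aligned} Sp(SW(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}^\prime _{1 \cdot 8}}))=8, \end{aligned}$$

the number of cores.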

Let \(A^\prime _{D_{\frac{n}{3}-1},\mathcal {M}_{9\cdot 8}}\) denote the algorithm that uses 9 nodes and 8 cores per node. We get the following expression for the speed up of the algorithm that uses 1 level of parallelism on \(\mathcal {M}_{9,9}\):

$$\begin{aligned} \begin{aligned} Sp(SW(A_{D_{27},\mathcal {M}_{9,9}}))&= \frac{T(SW(A_{D_{27},\mathcal {M}_{1,1}}))}{T(SW(A_{D_{27},\mathcal {M}_{9,9}}))} \\&= \frac{27\cdot C(\otimes )\cdot tcalc + 54\cdot tblock_{mem}}{3\cdot C(\otimes ) \cdot tcalc + 7\cdot tblock_{com}} \end{aligned} \end{aligned}$$
(28)

which should be compared to the speed up of the algorithm that uses 2 levels of parallelism on \(\mathcal {M}_{9\cdot 8}\):

$$\begin{aligned} \begin{aligned} Sp(SW(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}_{9\cdot 8}}))&= \frac{T(SW(A_{D_{27},\mathcal {M}_{1,1}}))}{T(SW(A_{D^\prime _{\frac{n}{3}-1},\mathcal {M}_{9\cdot 8}}))}\\&= \frac{27\cdot \frac{n}{3}\cdot \left( C(\boxtimes ) \cdot tcalc + \left( \frac{n}{3} +2\right) \cdot tvec_{mem}\right) +54 \cdot tblock_{mem} }{3\cdot \frac{n}{24}\cdot \left( C(\boxtimes ) \cdot tcalc + \left( \frac{n}{3}+2\right) \cdot tvec_{mem} \right) + 7\cdot tblock_{com}} \end{aligned} \end{aligned}$$
(29)

By specializing the parameters we can estimate the performance gain obtained by using two levels of parallelism instead of one.
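
For instance, (28) and (29) can be evaluated numerically; in the sketch below every parameter value is an assumption chosen purely for illustration:

```python
# Illustrative evaluation of the speed ups (28) and (29).
# All parameter values are assumptions, not measurements.
n = 2400
m = n // 3                                # block size n/3
tcalc      = 1.0                          # time of one floating point operation
tvec_mem   = 0.5 * m                      # time to move a vector of n/3 elements
tblock_mem = 0.5 * m * m                  # time to move a block between memory levels
tblock_com = 2.0 * m * m                  # time to move a block between nodes
C_otimes   = 2 * m**3                     # C(⊗): complexity of one block multiply
C_boxtimes = 2 * m**2                     # C(⊠): complexity of one matrix-vector multiply

sp_one = (27 * C_otimes * tcalc + 54 * tblock_mem) / \
         (3 * C_otimes * tcalc + 7 * tblock_com)                       # eq. (28)
sp_two = (27 * m * (C_boxtimes * tcalc + (m + 2) * tvec_mem) + 54 * tblock_mem) / \
         (3 * (n // 24) * (C_boxtimes * tcalc + (m + 2) * tvec_mem) + 7 * tblock_com)  # eq. (29)
print(f"one level:  {sp_one:.2f}\ntwo levels: {sp_two:.2f}")
```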

3 Conclusion

Matrix multiplication is one of the fundamental kernels in numerical linear algebra, underlying almost all matrix problems, such as least squares problems, eigenvalue problems and data assimilation problems [5–8, 14]. Future designs of microprocessors and large HPC systems will be heterogeneous in nature, relying on the integration of two major types of components. On the one hand, multi/many-core CPU technology has been developed, and the number of cores will continue to escalate because of the desire to pack more and more components on a chip while avoiding the power wall, the instruction-level parallelism wall, and the memory wall. On the other hand, special purpose hardware and accelerators, especially Graphics Processing Units (GPUs), are in commodity production; they have outpaced standard CPUs in floating point performance in recent years, and have become as easy, if not easier, to program than multi-core CPUs. Finally, reconfigurable architectures such as Field Programmable Gate Arrays (FPGAs) offer several design parameters, such as operating frequency, precision, amount of memory and number of computation units. These parameters define a large design space that must be explored to find efficient solutions.

To address this scenario, the performance analysis of the MM algorithm should undoubtedly be re-evaluated in order to find the best-practice algorithm on novel architectures. This motivated our work: to investigate the performance of the standard MM algorithm by means of the new modelling framework that the authors have introduced.

The model exploits knowledge of the algorithm and of the target architecture, and it can help researchers design optimized implementations on emerging computing architectures, such as the one developed in [3, 4].