1 Introduction

Hardware/software partitioning divides a complex heterogeneous system into hardware co-processor functions and compatible software programs. It is a prominent practice in system-on-chip (SoC) design that can achieve results beyond software-only or hardware-only solutions, improving system performance [1] and reducing total energy consumption [2]. The proposed partial dynamic reconfiguration method does not depend on any particular tool. It uses a set of algorithms to detect crucial code regions, compile/synthesize hardware/software modules, and update the communication logic. Hence, it can tune the system for full efficiency without disrupting other SoC-related operations. Here, a genetic algorithm (GA) is used for the optimization process. This is essential in system-level design, since the decision-making process affects the total performance of the system. This paper presents a novel system partitioning technique with in-depth analysis. The paper is organized as follows. Section 2 reviews previous work in this field. Section 3 presents the proposed system model for the partitioning problem. Section 4 gives the results and their analysis. Section 5 concludes the paper and discusses future work.

2 Related Work

Compared to dynamic partitioning using standard software, run-time (partial) dynamically reconfigurable systems have attained superior performance with manually specified, predetermined hardware regions. Multiple preplanned reconfiguration choices were rapidly executed in run-time reconfigurable systems using the PipeRench architecture [3] and dynamically programmable gate arrays (DPGAs) [4]. The binary-level partitioning technique [5] provided a better solution than source-level partitioning methods because it works with any high-level language and software compiler. However, since performance satisfaction was not considered in this system's cost function, it may become trapped in local minima. A mapping technique for nodes and hardware/software components, the GCLP algorithm, was developed in [6]. The hardware cost was minimized by combining a hill-climbing heuristic with the hardware/software partitioning algorithm [7].

3 System Model for Partitioning

Resolving the problem requires defining a system model that represents the important issues in hardware/software co-design for a specific problem [8]. The system partitioning problem is modeled as a task graph (TG) flow diagram. A TG is a directed acyclic graph (DAG) with weight vectors. Formally, it is defined as \(G=(V, E)\), where ‘V’ represents the nodes and ‘E’ represents the edges. Each edge indicates the flow direction. To reduce the complexity of the TG, it can be normalized to a single start node and a single end node. Figure 1 gives an overview of the partitioning procedure. Design constraints and design specifications are given as input to the partitioning process in a high-level specification language. Nodes can represent large units of work such as tasks and processes (coarse granularity) or small units such as instructions and operations (fine granularity).

Fig. 1 System model for partitioning
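As a concrete (and purely illustrative) rendering of this model, the following Python sketch stores a TG as a DAG and normalizes it to a single start node and a single end node; the class and method names are hypothetical, not from the paper.

```python
from collections import defaultdict

class TaskGraph:
    """Hypothetical TG representation: a DAG G = (V, E)."""

    def __init__(self):
        self.succ = defaultdict(set)   # node -> set of successors
        self.nodes = set()

    def add_edge(self, u, v):
        self.nodes.update((u, v))
        self.succ[u].add(v)            # edge direction = flow direction

    def normalize(self):
        """Reduce complexity: one start node feeding every source,
        one end node fed by every sink."""
        has_pred = {v for succs in self.succ.values() for v in succs}
        sources = [n for n in self.nodes if n not in has_pred]
        sinks = [n for n in self.nodes if not self.succ[n]]
        for s in sources:
            self.add_edge("start", s)
        for t in sinks:
            self.add_edge(t, "end")

g = TaskGraph()
g.add_edge("v1", "v2"); g.add_edge("v1", "v3"); g.add_edge("v2", "v4")
g.normalize()   # the graph now runs from "start" to "end"
```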

After the system space estimation, every node is tagged with attributes. Each node \((V_\mathrm{i,j} )\) carries five attributes, as follows:

  1. Hardware area \((\text {HA}_\mathrm{i,j} )\)

  2. Hardware implementation time \((\text {HT}_\mathrm{i,j} )\)

  3. Software memory size \((\text {SS}_\mathrm{i,j} )\)

  4. Software execution time \((\text {ST}_\mathrm{i,j} )\)

  5. Average number of executions \((N_\mathrm{i,j} )\)

In short,

  • Hardware module \(\left( {\text {HM}_\mathrm{i,j} } \right) =\left( {\text {HA}_\mathrm{i,j} } \right) +\left( {\text {HT}_\mathrm{i,j} } \right) +(N_\mathrm{i,j} )\)

  • Software module \(\left( {\text {SM}_\mathrm{i,j} } \right) =\left( {\text {SS}_\mathrm{i,j} } \right) +\left( {\text {ST}_\mathrm{i,j} } \right) +(N_\mathrm{i,j} )\)

Communication values \((C_\mathrm{i,j} )\) of every node are represented by three components, as follows:

  1. Transfer time \((\text {TT}_\mathrm{i,j} )\)

  2. Synchronization time \((\text {SynT}_\mathrm{i,j} )\)

  3. Average number of communications \((M_\mathrm{i,j} )\)

In short,

Communication value of node \(\left( {C_\mathrm{i,j} } \right) =\left( {\text {TT}_\mathrm{i,j} } \right) +\left( {\text {SynT}_\mathrm{i,j} } \right) +({M}_\mathrm{i,j} )\)

$$\begin{aligned} C_\mathrm{i,j} =\frac{\left( {N_{i} *\Delta \text {TT}_{i} } \right) +\left( {{N}_{j} *\Delta \text {TT}_{j} } \right) +(\text {SynT}_\mathrm{i,j} )}{\left( {\text {HT}_{i} } \right) +(\text {HT}_{j} )} \end{aligned}$$

where \((\Delta \text {TT}_{i} )=\left( {\text {ST}_{i} } \right) -\left( {\text {HT}_{i} } \right) \) and \((\Delta \text {TT}_{j} )=\left( {\text {ST}_{j} } \right) -\left( {\text {HT}_{j} } \right) .\)
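For clarity, the communication-value formula translates directly into code; the numeric inputs below are invented for illustration only.

```python
def comm_value(N_i, N_j, ST_i, HT_i, ST_j, HT_j, SynT_ij):
    """C_ij from the formula above: execution-count-weighted software/
    hardware time differences plus synchronization time, normalized by
    the combined hardware implementation time."""
    dTT_i = ST_i - HT_i            # delta-TT_i = ST_i - HT_i
    dTT_j = ST_j - HT_j            # delta-TT_j = ST_j - HT_j
    return (N_i * dTT_i + N_j * dTT_j + SynT_ij) / (HT_i + HT_j)

# Illustrative numbers only (not taken from the paper):
c = comm_value(N_i=10, N_j=4, ST_i=8.0, HT_i=2.0, ST_j=6.0, HT_j=1.5,
               SynT_ij=0.7)
```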

The efficiency of the hardware/software system partitioning process depends on the target architecture and its mapping technique. Hence, this work considers the ‘Dynamically Reconfigurable Architecture for Mobile Systems’ (DReAM) as the target architecture. Hardware and software processes execute concurrently on the standard processor and the application-specific co-processor. The partitioning process determines the assignment of modules to hardware and software implementation stages, the implementation schedule (timing), and the communication interface between software and hardware modules. In general, a partitioning solution can be validated by measuring salient attributes such as performance and cost parameters. Hence, this paper uses three quality attributes related to design elements, as follows:

  1. The estimated hardware area is \({A}_{E}\), and the maximum available area is A.

  2. The estimated design latency is \(T_{E}\), and the maximum allowed latency is T.

  3. The estimated software (memory) space is \({M}_{E}\), and the maximum available space is M.

The static-list scheduling method is used for the scheduling process [9]. It is a subtype of resource-constrained scheduling algorithms. The scheduler considers the timing estimate of every vertex and its interconnections, and it provides the design latency (\({T}_{E}\)) and the communication cost of the hardware–software co-design. Based on the hardware and software implementations, another four parameters, listed after the scheduling sketch below, are considered for the co-design realization.
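The paper does not list the scheduler's steps, so the following Python sketch shows one common form of resource-constrained static-list scheduling: the highest-priority ready vertex is repeatedly placed on the earliest-available unit. The priority function and the identical-unit resource model are assumptions.

```python
def list_schedule(tasks, preds, delay, priority, num_units):
    """Resource-constrained static-list scheduling. Returns per-task
    start times and the resulting design latency T_E."""
    remaining, start, finish = set(tasks), {}, {}
    busy_until = [0] * num_units            # availability of each unit
    while remaining:
        # among vertices whose predecessors are all scheduled,
        # pick the one with the highest static priority
        ready = [v for v in remaining if all(p in finish for p in preds[v])]
        v = max(ready, key=lambda x: priority[x])
        earliest = max((finish[p] for p in preds[v]), default=0)
        u = min(range(num_units), key=lambda k: busy_until[k])
        start[v] = max(earliest, busy_until[u])
        finish[v] = start[v] + delay[v]
        busy_until[u] = finish[v]
        remaining.remove(v)
    return start, max(finish.values(), default=0)

# Toy example (all values are illustrative):
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
delay = {"a": 2, "b": 3, "c": 1, "d": 2}
prio  = {"a": 4, "b": 3, "c": 3, "d": 1}   # e.g. longest-path priorities
starts, T_E = list_schedule(preds.keys(), preds, delay, prio, num_units=2)
```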

When the entire system is implemented in hardware,

  1. The minimum design latency is MinT.

  2. The maximum hardware area is MaxA.

When the entire system is implemented in software,

  1. The maximum design latency is MaxT.

  2. The maximum memory space is MaxM.

These parameters are used to create the bounding constraints for the design space.

\(0\le {A} \le \) MaxA; \(0\le {M} \le \) MaxM; MinT \(\le {T} \le \) MaxT.
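These bounds can be encoded as a simple feasibility check on a candidate partition (a sketch; the estimates \(A_E\), \(M_E\), \(T_E\) come from the estimators of Sect. 3.2).

```python
def within_design_space(A_E, M_E, T_E, MaxA, MaxM, MinT, MaxT):
    """Bounding constraints of the design space:
    0 <= A <= MaxA, 0 <= M <= MaxM, MinT <= T <= MaxT."""
    return (0 <= A_E <= MaxA) and (0 <= M_E <= MaxM) and (MinT <= T_E <= MaxT)
```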

3.1 System Operations

The design specifications are given as circuit netlists in the ISPD98 benchmark suite format [10]. The partitioning process has three stages.

In the first stage, the processing of the design specifications is divided into three subtasks. The first subtask separates the hardware (\(\text {HA}_{i}\) and \(\text {HT}_{i}\)) and software (\(\text {SS}_{i}\) and \(\text {ST}_{i}\)) estimations from the design specifications. The second subtask translates the design specifications into a hypergraph-based control data flow graph (CDFG) representation \({G}=({V}, {E})\). The third subtask schedules (\({N}_{i}\) and \({N}_\mathrm{i,j}\)) each operation in the CDFG while satisfying the design constraints and the priority of operations.

In the second stage, the outputs of these three subtasks are fed into the system-level partitioning module through registers. The module has three functions. The first is operational-level analysis, which classifies whether each task is suitable for hardware realization or software execution. Next, the allocation process allocates the required supporting entities, such as functional units, interconnections, and storage elements, for the scheduled hardware and software systems. This allocation is based on the speed constraint (i.e., parallel processing) and the area constraint (i.e., dynamic partial reconfiguration). Finally, an absolute data path is generated by integrating components on the basis of the hardware and software partitions. The partitioning data are then given to the specific hardware (\(\text {HM}_{i}\)) and software (\(\text {SM}_{i}\)) models.
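The paper does not spell out the GA encoding, so the sketch below uses common assumptions: a chromosome is a binary vector (gene i = 1 maps node i to hardware, 0 to software), and the cost adds heavy penalties for violating the area, memory, and latency bounds of Sect. 3. The serialized-latency estimate and all parameter values are illustrative; a real latency would come from the scheduler, which accounts for parallelism.

```python
import random

def cost(chrom, HA, SS, HT, ST, MaxA, MaxM, MaxT):
    """Hypothetical GA cost: serialized latency plus penalties for
    exceeding the area, memory, and latency bounds."""
    area = sum(HA[i] for i, g in enumerate(chrom) if g == 1)
    mem  = sum(SS[i] for i, g in enumerate(chrom) if g == 0)
    lat  = sum(HT[i] if g == 1 else ST[i] for i, g in enumerate(chrom))
    penalty = max(0, area - MaxA) + max(0, mem - MaxM) + max(0, lat - MaxT)
    return lat + 1e3 * penalty

def ga_partition(HA, SS, HT, ST, MaxA, MaxM, MaxT, pop=40, gens=200):
    n = len(HA)
    key = lambda c: cost(c, HA, SS, HT, ST, MaxA, MaxM, MaxT)
    P = [[random.randint(0, 1) for _ in range(n)] for _ in range(pop)]
    for _ in range(gens):
        P.sort(key=key)
        elite = P[:pop // 2]                       # truncation selection
        while len(elite) < pop:
            a, b = random.sample(elite[:pop // 2], 2)
            cut = random.randrange(1, n)           # one-point crossover
            child = a[:cut] + b[cut:]
            child[random.randrange(n)] ^= 1        # point mutation
            elite.append(child)
        P = elite
    return min(P, key=key)

# Example with made-up estimates for four nodes:
best = ga_partition(HA=[4, 2, 6, 3], SS=[5, 3, 7, 4], HT=[1, 2, 1, 2],
                    ST=[4, 6, 5, 7], MaxA=10, MaxM=12, MaxT=18)
```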

In the third stage, the hardware and software models are executed separately, and the outcomes are compared with their estimated values from the first stage. If any discrepancy arises, feedback is given to the second-stage process. This loop continues until all criteria are satisfied.

Next, the performance (\({C}_{\mathrm{{i,j}}}\)) of the hardware–software co-design is estimated and compared with the target performance metrics. If any mismatch arises, feedback is sent to the system-level partitioning stage. The entire second and third stages are then recompiled until the target performance measures are achieved. Finally, hardware/software co-simulation and co-verification are performed, and the SoC is realized.

3.2 Hardware/Software Estimation

The CDFG file is given as input to both the hardware and software estimation processes, along with the target technology files and processor specifications. The hardware execution is a parallel process, since the specifications are modeled with a VHDL library. The software execution is a sequential process, since the specifications are modeled in C code. The GA technique is used to optimize these parallel and sequential processes.

Hardware estimation is based on high-level synthesizable components, which share the control and data paths between hardware and software processes. The GA is used to optimize this resource sharing [11]. The quality measures are closely associated with performance metrics such as execution, implementation, transfer, and synchronization times, collectively called the reaction time. This reaction time is associated with each node in each execution of a local DFG. For convenience, the CDFG is split into several small DFGs called local DFGs.

The response times for the three statement classes are as follows:

  • Routine statements: \(T_{\text {RS}} = T_{\text {DFG}} \)

  • Conditional statements: \(T_{\text {CS}} =\sum \limits _{n} {P}_{n} {T}_{\text {DFGn}} \), where n indexes the possible outcomes and \({P}_{n} \) is the probability of outcome n

  • Looping statements: \({T}_{\text {LS}} ={nT}_{\text {DFG}} \), where n is the number of loop iterations
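These three response-time rules translate directly into code (a sketch; the argument names are hypothetical):

```python
def t_routine(t_dfg):
    """T_RS = T_DFG."""
    return t_dfg

def t_conditional(probs, t_dfgs):
    """T_CS = sum_n P_n * T_DFGn over the branch outcomes."""
    return sum(p * t for p, t in zip(probs, t_dfgs))

def t_loop(n, t_dfg):
    """T_LS = n * T_DFG for n loop iterations."""
    return n * t_dfg
```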

$$\begin{aligned} {T}_{\text {CDFG}}&= {F}({T}_{\text {DFG}1} , {F}_{\text {DFG}1} ,\ldots , {T}_{\text {DFGi}} , {F}_{\text {DFGi}} )\\&\quad + {F}({T}_{\text {DFG}1} , {F}_{\text {DFG}1} ,\ldots , {T}_{\text {DFGj}} , {F}_{\text {DFGj}} ) \end{aligned}$$
$$\begin{aligned} \text {MinT}=\alpha [(\text {MaxA}*{C}_{\mathrm{{i,j}}} )+\mathop \sum \limits _{i} {T}_{i} {N}_\mathrm{i,j} ] \end{aligned}$$
  • \({T}_{i}\)—Time delay for each node

  • \(\alpha \)—Co-estimation factor

$$\begin{aligned} \text {MaxT}=\text {MinT}+\beta \mathop \sum \limits _{i} [{T}_{i} \mathop \sum \limits _{{j}=1}^{{R}_{i} } {N}_\mathrm{i,j} ] \end{aligned}$$
  • \({R}_{i}\)—Required components of each node ‘i’

  • \(\beta \)—Constant, since MaxT is a higher-order term

  • \({F}_{i}\)—Number of fixed components for each node ‘i’

$$\begin{aligned} {T}_{\text {CDFG}} =\text {MinT}+\beta \mathop \sum \limits _{i} [\frac{{T}_{i} }{{F}_{i} }\mathop \sum \limits _{{j}={F}_{i} +1}^{{R}_{i} } {N}_\mathrm{i,j} ] \end{aligned}$$
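A hedged sketch of evaluating these latency bounds: T holds per-node delays \(T_i\), N holds per-node lists of execution counts \(N_{i,j}\), and R and F hold the required and fixed component counts, with \(\alpha\) and \(\beta\) the constants above. Since the paper leaves the j index of \(N_{i,j}\) in MinT open, the total count per node is assumed there.

```python
def min_T(alpha, MaxA, C_ij, T, N):
    """MinT = alpha * [MaxA * C_ij + sum_i T_i * N_i]; the total
    execution count per node is an assumption (see lead-in)."""
    return alpha * (MaxA * C_ij +
                    sum(T[i] * sum(N[i]) for i in range(len(T))))

def max_T(minT, beta, T, N, R):
    """MaxT = MinT + beta * sum_i T_i * sum_{j=1..R_i} N_ij."""
    return minT + beta * sum(T[i] * sum(N[i][:R[i]]) for i in range(len(T)))

def t_cdfg(minT, beta, T, N, R, F):
    """T_CDFG = MinT + beta * sum_i (T_i / F_i) * sum_{j=F_i+1..R_i} N_ij."""
    return minT + beta * sum((T[i] / F[i]) * sum(N[i][F[i]:R[i]])
                             for i in range(len(T)))
```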

Register estimation [12]:

  • Number of input multiplexers: \({i}\) MUXs

  • Control lines required by the state machine-based control logic: \(\text {log}_2 {i}\)

  • ROM size: \((\text {STA}^{*}[\left( {1+\text {log}_2 {i}} \right) \left( {\text {REG}+\sum \limits _{i} {F}_{i} } \right) +\text {log}_2 {S}])\) bits

where

  • STA—Number of states

  • REG—Number of registers
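The register-estimation formulas can be evaluated as follows (a sketch; S is kept as an opaque parameter, as in [12]):

```python
import math

def control_lines(i):
    """Control lines needed by the state machine for i input MUXs: log2(i)."""
    return math.log2(i)

def rom_size_bits(STA, REG, i, sum_F, S):
    """ROM size = STA * [(1 + log2 i) * (REG + sum_i F_i) + log2 S] bits."""
    return STA * ((1 + math.log2(i)) * (REG + sum_F) + math.log2(S))
```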

Software estimation is based on calculating the memory space occupied by the instruction set and by user-defined data types and data structures. The average queuing time for each memory access is modeled as \({T}_{q} \), and the number of accesses is represented by \({N}_{\text {mem}} \). This calculation is necessary to estimate \(\left( {\text {TT}_\mathrm{i,j} } \right) \) and \(\left( {\text {SynT}_\mathrm{i,j} } \right) \).

Hardware estimation \(\left( {{T}_{\text {HM}} } \right) =\left( {{T}_{(\text {CDFG},\text {HM})} } \right) +\alpha T_{q} ({N}_{\text {mem},\text {HM}} )\)

Software estimation \(\left( {{T}_{\text {SM}} } \right) =\left( {{T}_{(\text {CDFG},\text {SM})} } \right) +{T}_{q} ({N}_{(\text {mem},\text {SM})} )\)

Co-estimation \(\left( {{T}_{\text {HM}/\text {SM}} } \right) =\sigma \left( {{T}_{q} } \right) +\varphi (\frac{{N}_{\text {mem}} }{{T}_{q} })\); where \(\sigma \, \text {and}\,\varphi \) are complex structures.
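Reading \({T}_{q}({N}_{\text {mem}})\) as "queuing time per access times the access count" (an assumption), the hardware and software estimates become simple expressions; the co-estimation term is omitted here because \(\sigma\) and \(\varphi\) are not given in closed form.

```python
def t_hw(T_cdfg_hm, alpha, T_q, N_mem_hm):
    """Hardware estimation: T_HM = T_(CDFG,HM) + alpha * T_q * N_(mem,HM)."""
    return T_cdfg_hm + alpha * T_q * N_mem_hm

def t_sw(T_cdfg_sm, T_q, N_mem_sm):
    """Software estimation: T_SM = T_(CDFG,SM) + T_q * N_(mem,SM)."""
    return T_cdfg_sm + T_q * N_mem_sm
```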

Table 1 Design characteristics for ISPD’98 benchmark suite

4 Analysis of Results

All the hardware/software partitioning algorithms were evaluated on a set of benchmarks from the ISPD’98 suite, whose characteristics are shown in Table 1. The size and values of the system graph must be bounded within the design space. All the examples are given as directed acyclic graphs specifying coarse-grain tasks. Every example was tested under different constraints, always within the specified boundary conditions. The results are summarized in Table 2 and are analyzed from both qualitative and quantitative perspectives. The qualitative aspect is represented mainly by the resulting cost of the solutions obtained by each method under different constraints; the quantitative aspect is shown by the computation time of each technique.

Table 2 Results acquired with the ISPD’98 examples

5 Conclusion and Future Work

In this paper, a commonly used biologically inspired optimization algorithm addressing the hardware/software partitioning problem for SoC designs was implemented using a clustering approach, and its performance was evaluated. The evaluation process imposes no constraints on the cluster size or the number of clusters. Hence, this approach is well suited to reducing the design complexity of systems. This paper has shown how the problem can be solved by very different partitioning techniques at system run time (dynamic partial reconfiguration). The problem resolution is based on the definition of a common system model that allows different procedures to be compared. These extensions improve on previous implementations because they include issues not previously considered. The constraints of these algorithms have been integrated into the cost function in a general and efficient way. The proposed genetic algorithm-based dynamic partitioning technique produced an average accuracy improvement of 16.19% in hardware/software partitioning over [13] and [14].

A future study could extend the system model to encompass other quality attributes, such as power consumption, the influence of communications, and the degree of parallelism. Hybrids of these biologically inspired algorithms and their compilation are also currently under study.