1 Introduction

Cognitive Radio is an emerging technology that enables a wireless transceiver to cognitively manage its wireless spectrum for improved agility and efficiency. Flexibility and reconfigurability of the implementation at various layers, including RF, baseband, and MAC layers, with cross-layer modeling and control, will be important to realize the efficiency potential of spectrum sharing. Realizing the potential of cognitive radio will also require transceivers to dynamically reconfigure communication parameters based on multidimensional criteria, including channel conditions, link performance, and user requirements. Meanwhile, increasing bandwidths and data rates pose new challenges to the baseband (BB) processing chain, as well as to radio frequency (RF) processing.

The above challenges motivate important new research directions in both software- and hardware-based design of wireless communication systems. On the software side, software defined radio (SDR) now utilizes various kinds of high-performance computing devices, ranging from multi-core programmable digital signal processors (PDSPs) and processors with streaming SIMD extensions (SSE) to general purpose graphics processing units (GPGPUs). On the hardware side, programmable baseband computation and related design chains have developed significantly to enable more efficient control of computational resources and hardware. However, practical and systematic approaches to reconfiguration based on programmable paradigms are still lacking. For example, software-based adaptive configuration of radio frequency chains is still in its infancy, but is a key ingredient of the frequency agile radios needed for cognitive devices and flexible RF spectrum use. The trend of increasing diversity and flexibility in both the functionality and the computational platforms of wireless systems results in complex design spaces that must be considered during design and implementation. The complexity of these design spaces and their novel constraints strongly motivate the development of new design methodologies.

Dataflow models offer a promising foundation for such design methodologies in part because they provide scalable and retargetable representations of signal processing applications [3]. Designers can migrate a common dataflow model of an application across different types of computing platforms, while changes are localized to the implementations of individual actors (dataflow-based functional components). The scalability and retargetability of dataflow models reduce the designer’s effort in debugging, validating, and fine-tuning a complex signal processing application that must satisfy stringent, multi-dimensional constraints.

To express dynamics in complex signal processing applications, a number of dynamic dataflow models have been proposed, including parameterized synchronous dataflow (PSDF) [2], Boolean dataflow (BDF) [7], and core functional dataflow (CFDF) [15]. PSDF provides semantics to manipulate application parameters in dataflow models at run-time. BDF introduces special control actors to allow data-dependent invocation of actors. CFDF applies the concept of actor “modes”, where different modes can have differing dataflow behavior, and mode transitions can be data-dependent. CFDF is tailored to natural design of actors with dynamic functionality, and facilitates prototyping of dataflow applications, as well as identification of more specialized dataflow behaviors [16], such as BDF, cyclo-static dataflow (CSDF) [5] or synchronous dataflow (SDF) [13].

When using CFDF, a designer specifies the behavior of the different modes of each CFDF actor, and the transitions among these modes. However, as the number of modes grows and the mode transitions become more complex, CFDF formulations can become unwieldy in terms of actor specification, analysis and implementation. In this paper, we present a novel modeling method, called parameterized set of modes (PSM), which is a high-level abstraction that efficiently represents parameterized functionality within groups of related modes for CFDF actors. PSMs enable novel ways for representing, manipulating and applying related groups of actor modes that lead to more concise formulations of actor behavior, and a unified modeling methodology for applying a variety of techniques for efficient implementation. We develop the formal foundations of PSM-based modeling, and demonstrate its utility through two case studies involving the mapping of reconfigurable wireless communication functionality into efficient implementations.

2 Background and Related Work

2.1 Background

Dataflow modeling has proven to be valuable in allowing designers of signal processing systems to describe applications in an intuitive and structured manner [3]. As system complexity increases, coarse-grained, dynamic dataflow models have gained increasing significance for their flexibility and their power in exposing high level application structure that is relevant for deriving optimized implementations.

Core functional dataflow (CFDF) is a deterministic sub-class of enable-invoke dataflow (EIDF) [15] in which dynamic functionality in an actor is specified as a set of actor modes. In each mode, the actor possesses deterministic dataflow behavior, meaning that the production/consumption rates on all actor output/input ports are known, constant values. Upon each invocation (actor firing), the actor executes in its current mode, and in addition to consuming input tokens and producing output tokens, the actor selects one next mode from its set of modes. This next mode determines the mode in which the next actor invocation executes (unless the actor mode is reset or otherwise overridden by the controlling scheduler). The next mode determined during an actor invocation can be fixed (known at compile time) or data dependent.

As the level of dynamic behavior in each actor grows, the size of the actor’s mode set may increase significantly. For an actor with a large number of parameters and corresponding variations in functionality, a large set of modes can be difficult to specify, analyze, and map into efficient implementations. Such parameterized modes can arise when applying CFDF to model cognitive radio applications, for example, due to the handling of different communication modes, sample rates, or filter configurations. In this paper, we address the problem of efficient integration of parameterization into the mode-based structure of CFDF actor models.

2.2 Related Work

A technique called mode grouping for CFDF graphs has been developed in [14]. It is demonstrated that mode grouping can improve scheduling results by aiding the discovery of statically schedulable subgraphs. In [18], CFDF is applied in simulation of dynamic communication systems. CFDF modeling is also applied as the semantic basis for the lightweight dataflow design environment, which is introduced and applied to design and implementation of wireless communication systems in [19]. These works apply the CFDF model in various useful ways, but are unable to streamline their associated analysis or implementation when manipulating groups of modes that are related through parameterization. The mode-based parameterization techniques introduced in this paper are developed to bridge this gap.

Various research efforts have been directed towards integrating dynamic behavior into dataflow models. In [12], a design framework called SysteMoc is developed for applying dataflow structures, similar to those used in CFDF, involving guarded invocations and state transitions specified by finite state machines (FSMs). The work also includes design space exploration and code synthesis for FPGA platforms. In [21], a dataflow based analysis method is proposed for SDR applications. This method adopts the concept of “SDF scenarios” to incorporate some degree of dynamism for better estimation of system resource requirements and throughput. Moreover, methods for quasi-static scheduling of statically-schedulable sub-graphs within larger dynamic dataflow graphs are explored in [10].

In the context of the related work described above, the main contributions of this paper are described as follows. We enhance the CFDF model of computation by introducing the concept of parameterized set of modes (PSM), which incorporates dynamic parameterization into actor modes, thereby increasing the effectiveness with which designers can design and implement CFDF-based, dynamic dataflow models for signal processing systems. PSM-based modeling of actors provides a common framework for integrated specification, analysis and implementation that deeply integrates mode- and parameter-based actor characterizations. Although we develop the PSM model in the context of CFDF in this paper, we envision that the ideas can be adapted to related dataflow modeling and programming techniques, such as, for example, SysteMoc [12] and CAL [9]. Exploring and applying such adaptations is a useful direction for future work.

3 Parameterized Set of Modes

In this section, we define a new modeling concept, called parameterized set of modes (PSM), which is motivated in Section 2 as a method for incorporating dynamically parameterized modes efficiently into the CFDF modeling framework.

3.1 Notation

To develop the PSM concept precisely, we first introduce some notation and review the definition of the CFDF model of computation. For a given dataflow graph actor A, we denote the set of input ports of A by in(A). We also denote the set of nonnegative integers by N, and the set of Boolean values by B. We denote the values in B as true and false.

When using PSMs, we allow CFDF actors to have arbitrary sets of parameters. Following notation similar to that of parameterized dataflow graphs [2], we denote the set of parameters of a given actor A as params(A), and for each parameter p ∈ params(A), we denote the set of permissible values of p as domain(p). At any given point during dataflow graph execution, an actor parameter p has associated with it a unique parameter value v ∈ domain(p), which is referred to as the configuration of p at that point in time. A configuration for A can then be specified as a set of configurations for all of the parameters in params(A). Some combinations of possible parameter values may be considered invalid because they do not make sense together. The set of all valid configurations for A is denoted as domain(A). At a given point during execution, the specific configuration for A that is determined by its current parameter values is referred to as the active configuration of A. Similarly, the specific mode that a CFDF actor is in during a given firing is referred to as the active mode for the actor.

If $S_1$ and $S_2$ are sets, then by $S_1 \subseteq S_2$, we mean that $S_1$ is a subset (not necessarily a proper subset) of $S_2$. Thus, $S_1$ can be empty, equal to $S_2$, or a proper subset of $S_2$.

3.2 Review of CFDF Semantics

As introduced in [15], each CFDF actor A is characterized by a nonempty set $M_A$ of modes in which it can execute, and for any given mode $m \in M_A$, the actor A consumes a fixed number of tokens per firing on each input port, and produces a fixed number of tokens per firing on each output port. These production and consumption rates may vary across different modes, but must be constant for any given mode. Each CFDF actor A is also characterized by its enabling function $\varepsilon_A$, which determines whether or not, based on a given set of token populations on its input FIFOs, A is enabled. If A has at least one input port, then this enabling function can be viewed as a mapping

$$ \varepsilon_{A}: (T_{A} \times M_{A}) \rightarrow B, $$
(1)

where $T_A = N^{|\mathrm{in}(A)|}$ denotes the set of all possible buffer populations for input ports of A (assuming some underlying ordering of these ports) [15]. If A has no input ports, then its enabling function reduces simply to the Boolean constant true. The CFDF formulation of enabling functions can easily be generalized to take into account finite-capacity output buffers (i.e., by requiring sufficient free space on output buffers before allowing an actor to be fireable). For brevity and clarity, we suppress these details of bounded buffer CFDF execution in this paper, and we simply assume that FIFOs have unbounded token capacity, unless otherwise stated.
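
To make the enable/invoke structure of CFDF concrete, the following minimal C sketch shows an actor with a two-element mode set, an enabling function that checks token availability against the consumption rate of the active mode, and an invoking function that consumes tokens and selects the next mode. The actor, its modes, and the FIFO representation are hypothetical and are not taken from the LIDE API.

```c
/* Minimal sketch of a CFDF-style actor in plain C (not the LIDE API). */
#include <stdbool.h>
#include <stdio.h>

enum mode { MODE_HEADER, MODE_PAYLOAD };   /* the actor's mode set M_A */

typedef struct {
    int tokens;        /* current token population of the input FIFO */
    int data[64];
} fifo_t;

typedef struct {
    enum mode current; /* active mode */
} actor_t;

/* Consumption rate of each mode on the single input port. */
static int cons_rate(enum mode m) { return (m == MODE_HEADER) ? 1 : 4; }

/* Enabling function: true iff enough input tokens for the active mode. */
static bool enable(const actor_t *a, const fifo_t *in) {
    return in->tokens >= cons_rate(a->current);
}

/* Invoking function: consume tokens, then select the next mode. */
static void invoke(actor_t *a, fifo_t *in) {
    int n = cons_rate(a->current);
    in->tokens -= n;                       /* consume input tokens */
    /* Next-mode selection: a header firing is followed by a payload
     * firing, and vice versa (a fixed, data-independent transition). */
    a->current = (a->current == MODE_HEADER) ? MODE_PAYLOAD : MODE_HEADER;
}

int main(void) {
    actor_t a = { MODE_HEADER };
    fifo_t in = { 10, {0} };
    while (enable(&a, &in)) {
        printf("firing in mode %d, %d tokens available\n", a.current, in.tokens);
        invoke(&a, &in);
    }
    return 0;
}
```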

3.3 Motivation for Parameterized Sets of Modes

The CFDF formulation can become unwieldy when working with parameterized actors that have large parameter sets, especially if one or more actor parameters can affect the production and consumption rates of an actor. For example, consider a parameterized downsampler actor that provides an N:1 downsampling of its input signal. Such an actor requires N distinct modes in its CFDF specification even though all N alternative modes have closely related (parameterized) functionality. Using the PSM concept introduced in this section, we can group all of these related modes together into a single mode set σ, where the individual mode in σ that is active during any given actor firing is determined uniquely by the actor parameter set (in this case, by the parameter N).

As a slightly more elaborate example, consider an actor S that can function either as a downsampler or an upsampler depending on its configuration. Such an actor could be useful, for example, as part of a programmable, multistage subsystem for sample rate conversion. This actor can be parameterized with two parameters u and N, where u is Boolean-valued and indicates whether or not S functions as an upsampler (if u = false, then the actor functions as a downsampler), and N provides the upsampling or downsampling factor. Using the PSM concept, this actor can be specified precisely using two mode sets — one for the upsampling-related modes, and the other for the downsampling-related modes. In any given mode set, the production and consumption rates are determined uniquely by the actor parameters. For example, in the mode set associated with upsampling (u = true), N = 3 yields a consumption rate of 1 and a production rate of 3.
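
The following C fragment sketches how the parameters (u, N) of such a resampler determine a unique active mode, together with its dataflow rates, within each of the two mode sets; the names and data structures are hypothetical and serve only to illustrate the configuration-to-mode mapping.

```c
/* Sketch of the parameterized resampler actor S: two mode sets (PSMs),
 * with the active mode and its rates determined by the parameters (u, N). */
#include <stdbool.h>
#include <stdio.h>

typedef struct { bool u; int N; } config_t;   /* actor configuration (u, N) */

typedef struct {
    const char *psm;    /* which mode set (PSM) the active mode belongs to */
    int consume;        /* tokens consumed per firing */
    int produce;        /* tokens produced per firing */
} active_mode_t;

/* Mapping from the active configuration to the unique active mode
 * within the corresponding mode set. */
static active_mode_t active_mode(config_t c) {
    active_mode_t m;
    if (c.u) {          /* upsampling mode set: rates 1 -> N */
        m.psm = "upsample";
        m.consume = 1;
        m.produce = c.N;
    } else {            /* downsampling mode set: rates N -> 1 */
        m.psm = "downsample";
        m.consume = c.N;
        m.produce = 1;
    }
    return m;
}

int main(void) {
    config_t c = { true, 3 };   /* u = true, N = 3 */
    active_mode_t m = active_mode(c);
    printf("%s: consume %d, produce %d per firing\n", m.psm, m.consume, m.produce);
    return 0;
}
```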

Intuitively, a PSM-enhanced CFDF specification, or PSM-CFDF specification, allows an actor’s modes to be grouped into “clusters” or sets that have related functionality, and are therefore efficient to work with as distinct units — e.g., in terms of design tasks such as specification, analysis, optimization, profiling, and integration. In general, these mode groups may overlap, but collectively, they should “cover” the entire set of modes of the associated CFDF actor. Furthermore, the mode groups in a PSM-based specification should be related uniquely to the actor modes through the parameters of the given actor.

3.4 Formal Definition of PSM-CFDF

Given a PSM-CFDF actor A with mode set $M_A$, a PSM $\rho$ for A is a 3-tuple $\rho = (S, C, f)$, where $S \subseteq M_A$, $C \subseteq \mathrm{domain}(A)$, and $f: C \rightarrow S$. The set C, denoted as psa_domain($\rho$), can be viewed as the set of possible actor configurations when the actor is firing in mode set S. The set S, denoted psa_modeset($\rho$), is the set of modes in actor A that is associated with $\rho$ — i.e., whenever A fires in PSM $\rho$, it fires one of the modes within psa_modeset($\rho$). Finally, the mapping f, denoted $F_{\rho}$, specifies the unique mode within psa_modeset($\rho$) that is active whenever A executes in mode set S and a given actor configuration is active.

Given a PSM-CFDF actor A with mode set $M_A$, and a set R of PSMs for A, we say that R covers A if every mode in $M_A$ is contained in the mode set of at least one element of R — that is, if

$$ M_{A} = \bigcup\limits_{\rho \in R} \textit{psa\_modeset}(\rho). $$
(2)

A PSM-CFDF actor A is a CFDF actor with an associated set R of PSMs that covers A, and a family of mappings $\{\textit{psa\_next}_{r,c}: I(F_{r}(c)) \rightarrow R \mid r \in R \textrm{ and } c \in \mathrm{domain}(A)\}$. Here, for a given mode $m \in M_A$, $I(m)$ denotes the set of all possible combinations of inputs — i.e., all possible n-tuples of token vectors, where $n = |\mathrm{in}(A)|$, and the size of (number of elements in) each token vector is equal to the consumption rate of the corresponding port in mode m.

In other words, for each pair (r, c), there is a mapping $\textit{psa\_next}_{r,c}$, called the next PSM function of PSM r under actor configuration c, that determines uniquely a specific PSM $r'$ for any given input data set for the mode $F_{r}(c)$; this PSM $r'$ can be interpreted as the next PSM for the actor — i.e., the PSM that should be active for the next firing of A.

For a PSM-CFDF actor A, we denote the associated set of PSMs as PSMset(A), and the associated family of mappings as mappings(A).

The next PSM function is related to the invoking function of A, as defined by CFDF semantics. In particular, for a given actor firing, the next mode, as determined by the invoking function, should agree with (be an element of the mode set of) the next PSM, as determined by $\textit{psa\_next}_{r,c}$. For details on the CFDF invoking function, we refer the reader to [15].

The concept of PSM is a level of abstraction that helps the designer to better understand and expose connections between the actor’s parameters and modes. PSM analysis can be combined with various processes in a design framework, such as scheduling and processor selection, to name a few. By grouping into a single PSM the modes of an actor that share some common property, a designer can manipulate the associated modes and apply aspects of the property in an integrated and systematic way.

3.5 PSM Transition Graph

For a PSM-CFDF actor, the next PSM function defines the range of modes in which the actor executes in its next invocation. The structure of transitions among PSMs therefore can provide valuable information about the actor’s dynamic behavior. These transitions can be expressed formally by a construction that we call the PSM transition graph.

The PSM transition graph for a PSM-CFDF actor A is a directed graph $G_{psm} = (V_{psm}, E_{psm})$, where $V_{psm}$ is the set of vertices and $E_{psm}$ is the set of edges. The set of vertices is in one-to-one correspondence with the PSMs of A; the PSM transition graph vertex associated with PSM r is denoted as $v_{psm}(r)$. Two PSM transition graph vertices $v_{psm}(x)$ and $v_{psm}(y)$ are connected by a directed edge $e = (v_{psm}(x), v_{psm}(y))$ if there exists an input vector $\nu$ and a configuration c such that $y = \textit{psa\_next}_{x,c}(\nu)$. Such an edge e is annotated with a label, $\mathit{label}(e) = c$. Note that multiple edges can have the same label if different next PSMs are “reachable” from the same current PSM and the same configuration under different input vectors. Compared to finite state machine (FSM) representations of state transitions, the PSM transition graph contains higher level information on the structure of PSMs. Such higher level structure may be difficult to extract or intuitively understand from conventional FSM-style representations (i.e., where each mode corresponds to a separate FSM state), especially when the number of modes is large or their connections are irregular.

Figure 1(b) shows an example of a PSM transition graph. Further details about the actor in this example are discussed in Section 5.4.

Figure 1 An example of a PSM-CFDF actor: OFDM demapper example. (a) Actor interface. (b) PSM transition graph.

3.6 Implementation Considerations

When implementing a PSM-CFDF actor, we do not anticipate that designers will typically need to explicitly implement the mappings (mathematical functions) $F_{\rho}$ and $\textit{psa\_next}_{r,c}$. These mappings are useful as analytical tools, but their explicit realization in software is not in general essential for the PSM-CFDF model — e.g., an actor designer would not need to provide a software function/method or hardware description language module that is dedicated to implementing each of these mappings. Instead, for example, critical aspects of $F_{\rho}$ may be validated through unit testing, and the next PSM may be determined as a by-product of actor firing — e.g., through an actor-level application programming interface (API) that is used by schedulers to invoke the actor.

3.7 Application Example

In this section, we show an example of applying PSM-CFDF concepts in actor design for a reconfigurable OFDM demodulator that is geared towards cognitive radio systems. Such systems can involve significant amounts of parameterization in actor designs. Figure 1(a) shows a parameterized demapper actor that supports different operational modes, including QPSK and QAM16. The actor maps B input samples into a stream of $M \times B$ bits. This actor has two parameters: M for the number of bits per sample, and B for the vectorization degree (see [17] for fundamental developments on actor-level vectorization for signal processing dataflow graphs). Since M represents the number of bits for each symbol, M = 2 and M = 4 correspond, for example, to QPSK and QAM16, respectively. B can take on any integer value between 1 and $B_{max}$, where $B_{max}$ is the maximum vectorization degree (e.g., as a designer or design tool might set based on memory constraints). The parameter B allows symbols to be buffered and processed together in batches (block processing). For example, if B = 1, then each actor invocation processes a single input symbol; if B = 10, then 10 symbols are buffered and processed together in one invocation.

The demapper in Fig. 1(a) is modeled as a PSM-CFDF actor A as follows. Actor configurations are specified in the form (M, B). The set of modes of the actor is given as:

$$\begin{array}{@{}rcl@{}} M_{A} & = \{\textit{INIT},\textit{QPSK}_{1},\textit{QPSK}_{2},\ldots,\textit{QPSK}_{B_{\textit{max}}}, \\ & \textit{QAM}\textit{16}_{1},\textit{QAM}\textit{16}_{2},\ldots,\textit{QAM}\textit{16}_{B_{\textit{max}}}\} \end{array} $$
(3)

Based on the functionality, $M_A$ can be clustered into 3 PSMs, $\{\rho_i = (S_i, C_i, f_i) \mid i = 1, 2, 3\}$, where $S_1 = \{\mathit{QPSK}_n \mid 1 \leq n \leq B_{max}\}$, $C_1 = \{(2, n) \mid 1 \leq n \leq B_{max}\}$, $f_1(M, B) = \mathit{QPSK}_B$; $S_2 = \{\mathit{QAM16}_n \mid 1 \leq n \leq B_{max}\}$, $C_2 = \{(4, n) \mid 1 \leq n \leq B_{max}\}$, $f_2(M, B) = \mathit{QAM16}_B$; and $S_3 = \{\mathit{INIT}\}$, $C_3 = \{(m, n) \mid m = 2, 4;\ 1 \leq n \leq B_{max}\}$, $f_3(M, B) = \mathit{INIT}$.

Based on this decomposition into PSMs, Fig. 1(b) illustrates the PSM transition graph for the demapper actor. Upon initialization or reset, the actor enters the INIT mode, the only mode in $\rho_3$. After initialization, the actor enters a mode in $\rho_1$ or $\rho_2$, based on the configuration. For any mode in $\rho_1$, the ratio of the production rate prd(A) to the consumption rate cons(A) is 2. Similarly, for any mode in $\rho_2$, prd(A)/cons(A) = 4. To avoid clutter in the diagram, edge labels are not shown.
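
The sketch below illustrates, in C, one possible encoding of the demapper’s three PSMs as data together with the mode-selection functions $f_1$, $f_2$ and $f_3$; the type names and encodings are hypothetical and are not taken from an existing implementation.

```c
/* Sketch of the demapper's three PSMs as data plus mode-selection functions. */
#include <stdbool.h>
#include <stdio.h>

#define B_MAX 100

typedef struct { int M; int B; } config_t;              /* configuration (M, B) */
typedef struct { const char *scheme; int B; } mode_id;   /* element of M_A, e.g. QPSK_B */

typedef struct {
    const char *name;                 /* PSM identifier */
    bool (*in_domain)(config_t c);    /* membership test for psa_domain(rho) */
    mode_id (*f)(config_t c);         /* F_rho: active configuration -> active mode */
} psm_t;

static bool dom1(config_t c) { return c.M == 2 && c.B >= 1 && c.B <= B_MAX; }
static bool dom2(config_t c) { return c.M == 4 && c.B >= 1 && c.B <= B_MAX; }
static bool dom3(config_t c) { return (c.M == 2 || c.M == 4) && c.B >= 1 && c.B <= B_MAX; }

static mode_id f1(config_t c) { mode_id m = { "QPSK",  c.B }; return m; }
static mode_id f2(config_t c) { mode_id m = { "QAM16", c.B }; return m; }
static mode_id f3(config_t c) { mode_id m = { "INIT",  0   }; (void)c; return m; }

static const psm_t PSMS[3] = {
    { "rho1 (QPSK modes)",  dom1, f1 },
    { "rho2 (QAM16 modes)", dom2, f2 },
    { "rho3 (INIT)",        dom3, f3 },
};

int main(void) {
    config_t c = { 4, 10 };           /* QAM16, vectorization degree 10 */
    /* Within the operational PSMs (rho1, rho2), the active mode is
     * determined uniquely by the active configuration. */
    for (int i = 0; i < 2; i++)
        if (PSMS[i].in_domain(c))
            printf("%s -> mode %s_%d\n", PSMS[i].name,
                   PSMS[i].f(c).scheme, PSMS[i].f(c).B);
    return 0;
}
```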

3.8 Summary

In this section, we have presented an enhancement to the framework of CFDF modeling called parameterized set of modes (PSM), and we have introduced the PSM-CFDF approach to the modeling of dynamic dataflow actors with dynamically variable parameters. To illustrate the approach, we have presented a detailed example of an OFDM demapper actor that is modeled in terms of PSM-CFDF semantics. This example and its associated PSM transition graph representation concretely illustrate the novel form of higher level modeling structure that is exposed by the PSM modeling concept and the associated PSM-CFDF design methodology.

4 PSM-level Static Scheduling for CFDF Graphs

In this section, we demonstrate the application of PSM to efficient scheduling of CFDF-based programs.

A general scheduling approach for CFDF graphs is the so-called canonical scheduling approach discussed in [16]. In canonical scheduling, a sequential ordering L of the dataflow graph actors is constructed [16]. At run-time, the scheduler iteratively traverses the list L, and upon visiting each actor A, the scheduler checks the enabling condition (availability of sufficient input data) for A, and invokes A if the enabling condition is satisfied. This scheduling approach is useful in the sense that it is very general (applicable to any CFDF graph), easy to understand, and easy to implement. However, the efficiency of canonical scheduling can be relatively low because of the frequency with which enabling conditions must be checked.
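
A minimal sketch of canonical scheduling in C is shown below; the actor representation and the toy actors are hypothetical, and the loop terminates when a complete pass over the actor list fires nothing.

```c
/* Minimal sketch of CFDF canonical scheduling over a fixed actor ordering L. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct actor {
    const char *name;
    bool (*enable)(struct actor *self);   /* CFDF enabling function */
    void (*invoke)(struct actor *self);   /* CFDF invoking function */
    int tokens;                           /* stand-in for FIFO/mode state */
} actor_t;

/* Toy actor: enabled while it still has tokens to consume. */
static bool toy_enable(actor_t *a) { return a->tokens > 0; }
static void toy_invoke(actor_t *a) { a->tokens--; printf("fired %s\n", a->name); }

static void canonical_schedule(actor_t *L[], size_t n) {
    bool fired_any;
    do {
        fired_any = false;
        for (size_t i = 0; i < n; i++) {      /* visit actors in list order */
            if (L[i]->enable(L[i])) {         /* run-time enabling check */
                L[i]->invoke(L[i]);           /* fire once in the current mode */
                fired_any = true;
            }
        }
    } while (fired_any);                      /* stop when a full pass fires nothing */
}

int main(void) {
    actor_t a = { "A", toy_enable, toy_invoke, 2 };
    actor_t b = { "B", toy_enable, toy_invoke, 3 };
    actor_t *L[] = { &a, &b };
    canonical_schedule(L, 2);
    return 0;
}
```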

4.1 Statically Schedulable Regions

Static schedules, where the sequence of actor firings is deterministic and unconditional (not guarded by actor-level checking of enabling conditions) can be significantly more efficient and predictable compared to dynamic scheduling approaches, such as canonical scheduling. Even if the overall dataflow graph does not allow for static scheduling (due to the presence of dynamic dataflow), it may be possible to identify “statically schedulable regions” of the graph — i.e., parts of the graph that can be scheduled statically. Such regions can be scheduled using efficient static scheduling techniques, which have been developed extensively in the literature (e.g., see [3]), and then the static schedules for the different regions can be integrated through a “top-level” dynamic scheduling mechanism.

In this section, we develop PSM-based methods for constructing and applying statically schedulable regions for efficient implementation of CFDF graphs. The concept of statically schedulable regions itself is not new, and has been studied in depth, for example, in the implementation of CAL programs [11]. Our contribution in this section, which we refer to as PSM-level static scheduling, is to demonstrate methods for integrating the concepts of PSMs and statically schedulable regions, thereby combining the benefits of both approaches, and enabling structure exposed by PSMs to help guide the construction of efficient schedules. More specifically, in our development of PSM-level static scheduling, we utilize information about actor parameters to form hierarchical PSMs, where each hierarchical PSM is constructed based on combinations of actor modes that share common scheduling properties.

In the remainder of this section, we outline our proposed PSM-level static scheduling approach and present experimental results on an application example.

4.2 PSM-level Static Scheduling

PSM-level static scheduling is a hierarchical scheduling technique, where subgraphs within a dataflow specification are combined into hierarchical actors, and execution of a hierarchical actor corresponds to execution of a schedule for the associated subgraph. If H is a hierarchical actor with associated subgraph G, we say that H encompasses G, and G is the nested subgraph of H.

In the class of CFDF-PSM specifications addressed in this work, a hierarchical actor contains a set of modes, and can also contain a set of PSMs, just as non-hierarchical (leaf-level) actors do. In the case of a hierarchical actor H, each mode m of H corresponds to a mapping $Z_m: V_e \rightarrow \gamma$, where $G_e = (V_e, E_e)$ denotes the graph encompassed by H, $\gamma$ is the set of all actor modes across all actors in $V_e$, and $Z_m(v) \in M_v$ for all v. Recall here that $M_v$ represents the set of modes for a given actor v.

Intuitively, execution of H in a given mode $m \in M_H$ corresponds to execution of the encompassed graph with all actors operating in the modes specified by $Z_m$. The duration (termination criterion) of such an execution is a design issue associated with the construction of H, similar in some ways to the concept of “subsystem iteration” in parameterized dataflow [2]. In this paper, we assume that each execution of H in a given mode m corresponds to execution of a minimal static periodic schedule of the SDF graph, denoted $G_{sdf}(H, m)$, that results from fixing the actors in $G_e$ based on the mode assignments specified by $Z_m$. Exploration of other kinds of termination criteria in this context is a useful direction for further work.
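
As a small illustration, the following C sketch encodes two modes of a hypothetical hierarchical actor H as mappings $Z_m$ that assign a mode, and hence fixed SDF rates, to each of two encompassed actors; all names and rate values are invented for illustration.

```c
/* Sketch of hierarchical actor modes as mappings Z_m from encompassed
 * actors to their (leaf-level) modes. */
#include <stdio.h>

enum { ACTOR_A, ACTOR_B, NUM_ACTORS };               /* V_e of the nested subgraph */

typedef struct { int cons; int prd; } leaf_mode_t;   /* fixed rates of a leaf mode */

/* Mode tables of the encompassed actors (index = leaf mode id). */
static const leaf_mode_t A_MODES[] = { {1, 1}, {1, 2} };
static const leaf_mode_t B_MODES[] = { {1, 1}, {2, 1} };

/* Z_m for two modes of H: the leaf mode assigned to each encompassed actor. */
static const int Z[2][NUM_ACTORS] = {
    { 0, 0 },   /* H mode 0: A in mode A0, B in mode B0 */
    { 1, 1 },   /* H mode 1: A in mode A1, B in mode B1 */
};

int main(void) {
    for (int m = 0; m < 2; m++) {
        leaf_mode_t a = A_MODES[Z[m][ACTOR_A]];
        leaf_mode_t b = B_MODES[Z[m][ACTOR_B]];
        /* With all leaf modes fixed, the nested graph behaves as SDF and can
         * be given a minimal periodic schedule (see Section 4.3). */
        printf("H mode %d: A rates %d->%d, B rates %d->%d\n",
               m, a.cons, a.prd, b.cons, b.prd);
    }
    return 0;
}
```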

In our development of PSM-level static scheduling in this paper, we assume that the hierarchical actors employed are provided as part of the specification — i.e., as part of the design hierarchy. Another interesting direction for future work is in the development of automated methods to group (cluster) subgraphs into hierarchical actors for PSM-level static scheduling.

4.3 Construction of SDF Scheduling PSMs

Building on the concepts introduced in Section 4.2, we introduce a simple method to partition the mode set $M_H$ of a hierarchical actor H in a manner that facilitates construction of statically schedulable regions. This leads to a unique partitioning of $M_H$ into a set of PSMs that we refer to as SDF scheduling PSMs. The method is useful in systematically decomposing the structure of a hierarchical PSM-CFDF actor in a manner that captures subsystem-level, multi-mode behavior that is common in cognitive radio systems.

The process of constructing SDF scheduling PSMs operates by iterating through all modes of H, and dividing the modes into subsets (PSMs) $S_1, S_2, \ldots, S_k$, where all modes in a given $S_i$ correspond to the same SDF repetitions vector for the encompassed graph $G_e$. In other words, if $m_1, m_2 \in S_i$ and $a \in V_e$, then $q_1(a) = q_2(a)$, where $q_1$ and $q_2$ denote, respectively, the SDF repetitions vectors of $G_{sdf}(H, m_1)$ and $G_{sdf}(H, m_2)$. The resulting mode sets $S_1, S_2, \ldots, S_k$ are then parameterized with one or more scheduling parameters that can be configured and adapted based on considerations such as the given performance constraints, the repetitions vectors $q_i$, and the structure of $G_e$. This process depends on fundamental properties of the SDF repetitions vector and requires that the set of SDF graphs $\{G_{sdf}(H, m) \mid m \in M_H\}$ satisfy SDF consistency conditions. For details on SDF fundamentals and consistency conditions, we refer the reader to [13].
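
The following C sketch outlines this construction for a hypothetical two-actor encompassed chain: the repetitions vector of the SDF graph induced by each mode of H is computed from the balance equation on the single edge, and modes with identical repetitions vectors are grouped into the same scheduling PSM.

```c
/* Sketch of grouping the modes of a hierarchical actor H into SDF
 * scheduling PSMs, based on the repetitions vectors they induce. */
#include <stdio.h>

#define NUM_MODES 4
#define NUM_ACTORS 2   /* hypothetical encompassed chain A -> B */

static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

typedef struct { int prdA; int consB; } h_mode_t;   /* rates induced on edge A->B */

/* Repetitions vector q = (q(A), q(B)) of the induced SDF graph:
 * the minimal positive solution of q(A) * prdA == q(B) * consB. */
static void repetitions(h_mode_t m, int q[NUM_ACTORS]) {
    int g = gcd(m.prdA, m.consB);
    q[0] = m.consB / g;
    q[1] = m.prdA / g;
}

int main(void) {
    h_mode_t modes[NUM_MODES] = { {1,1}, {2,2}, {2,1}, {4,2} };
    int group[NUM_MODES];                   /* scheduling-PSM index of each mode */
    int q[NUM_MODES][NUM_ACTORS];
    int num_groups = 0;

    for (int i = 0; i < NUM_MODES; i++) {
        repetitions(modes[i], q[i]);
        group[i] = -1;
        /* Reuse an existing scheduling PSM if its repetitions vector matches. */
        for (int j = 0; j < i; j++)
            if (q[j][0] == q[i][0] && q[j][1] == q[i][1]) { group[i] = group[j]; break; }
        if (group[i] < 0) group[i] = num_groups++;
        printf("mode %d: q = (%d, %d) -> scheduling PSM S%d\n",
               i, q[i][0], q[i][1], group[i] + 1);
    }
    return 0;
}
```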

In cognitive radio systems, actors can often be configured statically or dynamically by various parameters, resulting in large sets of possible actor modes. If the actors’ mode spaces are viewed independently, the total number of possible mode combinations to consider can grow exponentially, making the system unwieldy and inefficient for scheduling analysis. The integration of PSM techniques into hierarchical CFDF modeling, as introduced in this section, provides an alternative, more compact design space — the design space of scheduling parameters for the PSMs $S_1, S_2, \ldots, S_k$ — that facilitates efficient scheduling, including the application of SDF scheduling techniques to statically schedulable regions.

4.4 Synthetic Example

To illustrate the PSM-level static scheduling technique introduced in Sections 4.2 and 4.3, Fig. 2 shows a synthetic CFDF graph with two parameters, $p_1$ and $p_2$. Intuitively, the parameters $p_1$ and $p_2$ control (select) the modes of A and C, respectively, and $p_1$ and $p_2$ together control the mode of B. The parameter values and their corresponding actor modes, production rates, and consumption rates are shown in Table 1. Here, the special actor ctrl reads parameter values from an input source (e.g., a file), checks their validity, and sends them as tokens to A, B and C.

Figure 2 A synthetic CFDF graph that is used to illustrate PSM-level static scheduling concepts.

Table 1 Details of actor parameters, modes, and dataflow rates.

Now suppose that H is a hierarchical actor that encompasses the subgraph associated with actors A, B, and C. The actors enter “initialization modes” $A_0$, $B_0$ and $C_0$, respectively, upon system reset, and wait for parameter tokens that are passed from ctrl. After receiving the parameter values, the actors continue to their respective operational modes, as specified by the received parameters, until all data from src has been processed.

Analyzing the repetitions vectors over the mode space $M_H$ of H, and constructing SDF scheduling PSMs, leads to the PSMs outlined in Table 2. The common repetitions vector shared by all modes in the same scheduling PSM allows a common static schedule to be applied across all modes in that PSM. For example, for $PSM_1$, the static schedule $\sigma_1 = A\,B\,C$ can be applied as the schedule for H. Similarly, for all modes in $PSM_2$, we can apply the static schedule $\sigma_2 = A\,B\,(2C)$. Here, we apply looped scheduling notation, where a parenthesized term of the form $(m\,X)$, in which m is a non-negative integer (or a symbolic expression that resolves to a non-negative integer) and X is a sequence of actor firings, represents the successive execution m times of the sequence X. For background on the construction and manipulation of looped schedules for synchronous and parameterized dataflow graphs, we refer the reader to [2, 4].

Table 2 Scheduling PSMs of the hierarchical actor H.

For the entire application graph in this example, we can apply the schedule $\sigma_{top} = \mathit{src}\ \sigma_H\ (n\ \mathit{snk})$, where n is the mode-dependent firing rate (iteration count) for snk, and $\sigma_H$ is configured dynamically as $\sigma_1$ or $\sigma_2$ based on the currently-active scheduling PSM.
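
The sketch below shows, in C, one way the quasi-static schedule $\sigma_{top}$ could be organized for this example, with $\sigma_H$ selected at run time between the static looped schedules $\sigma_1$ and $\sigma_2$; actor firings are stubbed out, and the PSM-selection values are hypothetical.

```c
/* Sketch of the quasi-static top-level schedule sigma_top = src sigma_H (n snk),
 * with sigma_H chosen between the static looped schedules sigma_1 and sigma_2
 * according to the currently active scheduling PSM. */
#include <stdio.h>

static void fire(const char *actor) { printf("fire %s\n", actor); }

/* Static looped schedules for the hierarchical actor H. */
static void sigma_1(void) { fire("A"); fire("B"); fire("C"); }
static void sigma_2(void) { fire("A"); fire("B"); for (int i = 0; i < 2; i++) fire("C"); }

int main(void) {
    int psm = 2;       /* active scheduling PSM, set from the current parameters */
    int n   = 2;       /* mode-dependent iteration count for snk */

    fire("src");
    if (psm == 1) sigma_1(); else sigma_2();   /* sigma_H selected quasi-statically */
    for (int i = 0; i < n; i++) fire("snk");
    return 0;
}
```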

We constructed the PSMs and schedules outlined here by hand, and based on these constructions, we implemented this synthetic application graph using the lightweight dataflow environment (LIDE), which is a tool for experimenting with dataflow techniques in arbitrary simulation- or platform-oriented languages, such as C, CUDA, MATLAB, and Verilog [19, 20]. Specifically, in our experiments we employed LIDE-C and LIDE-CUDA, which are C- and CUDA-oriented versions of the LIDE environment, respectively.

We implemented each actor as a simple sample rate converter that inserts or discards tokens to achieve the specified dataflow rates. The experiments were carried out on a desktop computer equipped with an Intel Core i7-2600K CPU (8 logical cores) and 16 GB of memory. Figure 3 shows the execution time of the graph using CFDF canonical scheduling and PSM-level static scheduling. For our implementation of PSM-level static scheduling, we used the hierarchy of schedules $\sigma_{top}$, $\sigma_1$, and $\sigma_2$ defined above. In this example, the average execution time improvement of PSM-level static scheduling across the different modes of H is 11.9%.

Figure 3 Execution time comparison between canonical scheduling and PSM-level static scheduling for the synthetic example of Fig. 2.

Although it is based on a synthetic dataflow graph, the simplicity of this example helps to demonstrate concisely and concretely the proposed PSM-level static scheduling approach, and the potential for performance improvement using the approach.

4.5 Application Example

In this section, we demonstrate a practical example of PSM-level static scheduling that is relevant to the cognitive radio domain. Figure 4 shows a dynamically configurable RPSK modulator that supports multiple source rates and multiple phase-shift keying (PSK) modulation schemes. The hierarchical actor R encompasses a subgraph that contains two CFDF actors, src (reusing this name with a minor abuse of notation) and T, whose modes are shown in Table 3. Here, r and m specify the source rate and the modulation scheme, respectively.

Figure 4 A dynamically configurable RPSK modulator in CFDF.

Table 3 Dynamic actors in the RPSK application.

Using PSM-level static scheduling, we derive 4 PSMs, as shown in Table 4. The static schedule for each PSM is then constructed by hand, implemented in LIDE, and compared with canonical scheduling, as in Section 4.4. We see from the results that in this example, the performance improvement from applying PSM-level static scheduling is higher compared to that of the small, synthetic example in Section 4.4. In terms of the execution time per graph iteration (i.e., per minimal periodic scheduling iteration of the derived SDF subgraphs), PSM-level static scheduling outperforms canonical scheduling by an average of 45.4%, as shown in Fig. 5. Here, the average is taken across the 6 operational modes for the hierarchical actor R.

Figure 5 Execution time comparison between canonical scheduling and PSM-level static scheduling for the RPSK application.

Table 4 PSMs of the hierarchical actor R in the RPSK application.

4.6 Summary of PSM-level Static Scheduling

In this section, we have demonstrated a specific method, called PSM-level static scheduling, for applying the PSM modeling approach. There are many possible ways of applying PSMs in the design process, and the method presented in this section can be viewed as a specific way that we have studied and experimented with to help validate the utility of the PSM model. Although the PSM-level static schedules experimented with in this section were constructed by hand, their foundation in the PSM and CFDF formalisms makes them amenable to derivation through general, automated techniques. Development of such automated tool support for PSM-level static scheduling and other applications of PSMs is a useful direction for further investigation.

5 PSM-level Actor Mapping on Heterogeneous Platforms

5.1 Overview

In this section, we demonstrate the application of PSMs to mapping actors in a CFDF-based dataflow program onto a heterogeneous platform. The targeted platform here consists of a general purpose CPU (called “host”), and a graphics processing unit (GPU) that is used to accelerate selected actors. The GPU is controlled by the host, and has a separate memory address space.

In this environment, the execution of an actor on the GPU device generally involves three steps: host-to-device data transfer, on-device execution, and device-to-host data transfer. The data transfers between processors can result in significant overhead, which makes GPU execution unfavorable in some scenarios, such as when the amount of data to be processed is relatively small. Thus, the selection of actors to execute on the GPU (processor assignment) is an important problem for performance optimization.

We first formulate a general version of the processor assignment problem that is addressed in this section, and we describe our PSM-level processor selection approach in this general context. Then we present experimental results for PSM-level processor selection on the specific CPU-GPU heterogeneous platform described above.

5.2 PSM-level Processor Selection

Suppose that we have a CFDF graph G = (V, E), and a target platform consisting of a (possibly heterogeneous) processor set $P = \{p_1, p_2, \ldots, p_n\}$. Also, for an actor A in G, let $M_A$ denote the set of CFDF modes of A. The objective of PSM-level processor selection is to derive a set of PSMs and a “top-level” quasi-static schedule with the goal of optimizing a pre-defined performance metric. More specifically, PSM-level processor selection involves the following tasks:

  • for each actor A, derivation of a set of n PSMs $\mathit{selection}(A) = \{\nu(A, 1), \nu(A, 2), \ldots, \nu(A, n)\}$, where each $\nu(A, i)$ represents the subset of modes in $M_A$ that are to be assigned (during graph execution) to processor $p_i$;

  • construction of a “top-level”, quasi-static schedule that executes actors in G based on the dynamically-determined processor assignment defined by $\{\mathit{selection}(A) \mid A \in V\}$ together with the current parameter values (actor configurations) of the actors in V.

In our development of PSM-level processor selection in the remainder of this section, our targeted performance metric is throughput. However, the proposed processor selection framework can be readily targeted to other metrics, such as latency or memory utilization, or to composite metrics, such as latency-constrained throughput optimization and memory-constrained latency optimization.

5.3 Profile-based Selection

In this section, we develop a profile-driven approach to PSM-level processor selection. We refer to this approach as profile- and PSM-based processor selection (PAPPS). In PAPPS, a three-dimensional “profile table” is used to characterize the performance of specific actor modes on specific processors. In particular, for a given mode $m \in M_A$ of an actor A, and a given processor $p \in P$, $\mathit{prof}(A, m, p)$ provides an estimate of the execution time of mode m for actor A on processor p. The profile table entries for a given actor can be obtained, for example, by iteratively (e.g., through appropriate simulation scripts) executing the actor on each processor in every mode and averaging the results for each mode.

After the profile table is constructed, PSMs for each actor A are formed by grouping together modes that perform best on a specific processor, with ties broken arbitrarily. Thus, for each actor A and each $i \in \{1, 2, \ldots, n\}$, we have that

$$ \nu(A, i) = \bigcup \{\{m\} \mid i = \mathit{argmin}_{j}\, \mathit{prof}(A, m, p_{j})\}. $$
(4)

In the PAPPS approach, ties with respect to the $\mathit{argmin}$ function in Eq. 4 are resolved arbitrarily (as implied earlier), although more sophisticated schemes can be envisioned that take ties or “near-ties” (multiple alternatives that have competitive performance) into account in strategic ways. Such exploration of more sophisticated PSM-based processor assignment schemes is an interesting direction for further work.
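
The following C sketch illustrates the grouping step of Eq. 4 for a hypothetical actor with four modes on a two-processor (CPU/GPU) platform; the profile values are invented, and ties are broken toward the lower processor index.

```c
/* Sketch of PAPPS PSM formation (Eq. 4): each mode is placed in the PSM
 * of the processor with the smallest profiled execution time. */
#include <stdio.h>

#define NUM_MODES 4
#define NUM_PROCS 2     /* p_1 = CPU, p_2 = GPU */

int main(void) {
    /* prof[m][p]: estimated execution time (us) of mode m on processor p. */
    double prof[NUM_MODES][NUM_PROCS] = {
        { 10.0, 40.0 },   /* mode 0: small workload, CPU wins */
        { 12.0, 41.0 },   /* mode 1 */
        { 90.0, 25.0 },   /* mode 2: large workload, GPU wins */
        { 95.0, 26.0 },   /* mode 3 */
    };
    int nu[NUM_PROCS][NUM_MODES];   /* nu[i]: modes assigned to processor p_{i+1} */
    int count[NUM_PROCS] = { 0, 0 };

    for (int m = 0; m < NUM_MODES; m++) {
        int best = 0;                                   /* argmin over processors */
        for (int p = 1; p < NUM_PROCS; p++)
            if (prof[m][p] < prof[m][best]) best = p;   /* ties kept on lower index */
        nu[best][count[best]++] = m;
    }
    for (int p = 0; p < NUM_PROCS; p++) {
        printf("nu(A, %d):", p + 1);
        for (int k = 0; k < count[p]; k++) printf(" mode%d", nu[p][k]);
        printf("\n");
    }
    return 0;
}
```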

Once the PSMs are constructed based on Eq. 4, a top-level, quasi-static scheduler is used to visit actors according to some scheduling policy, and to execute each visited actor A using a target processor that is (dynamically) selected based on the currently-active PSM for A. In other words, each time an actor A is visited by the scheduler, the current mode m of A is examined to determine the active PSM (i.e., the unique $\nu(A, i)$ that contains m), and then processor $p_i$ is selected as the processor on which to execute the next firing of A.
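
A minimal C sketch of this per-visit dispatch is shown below; the mode-to-PSM lookup table and the CPU/GPU firing stubs are hypothetical.

```c
/* Sketch of per-visit processor selection: the scheduler looks up which PSM
 * contains the actor's current mode and dispatches the firing accordingly. */
#include <stdio.h>

#define NUM_MODES 4

/* psm_of_mode[m] gives the processor index (0 = CPU, 1 = GPU) whose PSM
 * contains mode m (e.g., derived from the grouping defined by Eq. 4). */
static const int psm_of_mode[NUM_MODES] = { 0, 0, 1, 1 };

static void fire_on_cpu(int mode) { printf("fire mode %d on CPU\n", mode); }
static void fire_on_gpu(int mode) { printf("fire mode %d on GPU\n", mode); }

static void visit_actor(int current_mode) {
    if (psm_of_mode[current_mode] == 0)
        fire_on_cpu(current_mode);
    else
        fire_on_gpu(current_mode);
}

int main(void) {
    /* A short run in which the actor's active mode changes over time. */
    int trace[] = { 0, 2, 2, 1, 3 };
    for (int i = 0; i < 5; i++) visit_actor(trace[i]);
    return 0;
}
```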

Canonical scheduling, described in Section 4, is a general policy that can be used as the top-level scheduling policy in this context. However, in some cases, static analysis of the parameterized application structure can be applied to streamline the policy — for example, by statically fixing the order of schedule traversal in a way that eliminates or greatly reduces the need for run-time enable condition checking. We demonstrate a simple example of such static-analysis-based streamlining in Section 5.4.

5.4 OFDM Demodulation

To demonstrate the PAPPS approach, we have applied it to an OFDM demodulator and a heterogeneous CPU/GPU implementation platform, as described in Section 5.1. Orthogonal frequency division multiplexing (OFDM) is used extensively in high-speed wireless communication systems because of its spectral efficiency, robustness to multipath propagation, and high bandwidth efficiency [8]. The OFDM demodulator is one of the fundamental subsystems of LTE and WiMAX wireless communication systems.

Figure 6 illustrates a runtime-reconfigurable OFDM demodulator that is modeled as a CFDF graph. Here, actor SRC represents a data source that generates random values to simulate a sampler. In a wideband OFDM system, information is encoded on a large number of carrier frequencies, forming an OFDM symbol stream. In baseband processing, a symbol stream can be viewed in terms of consecutive vectors of length N. The symbol is usually padded with a cyclic prefix (CP) of length L to reduce inter-symbol interference (ISI) [1]. In Fig. 6, the CP is removed by actor RCP. Then, actor FFT performs a fast Fourier transform (FFT) to convert the symbol stream to the frequency domain.

Figure 6 PSM-CFDF model of a configurable OFDM demodulator. (a) Original dataflow graph. (b) Vectorized dataflow graph.

In practical systems, further processing, such as frequency domain synchronization and channel estimation, is required to remove various channel effects. In this case study, however, we use a simpler design that directly performs symbol demapping to illustrate the PAPPS methodology. Actor Demap is a parameterized symbol demapper with a configurable demapping scheme (QPSK for M = 2 or 16-QAM for M = 4). The output bits are collected by the data sink (actor SNK).

For the targeted CPU/GPU platform described in Section 5.1, all of the actors in our OFDM demodulation system have CPU implementations, and some of the actors have GPU implementations.

Each actor A has a parameter, called the vectorization degree and denoted by β(A), which is the number of OFDM symbols to be processed in a single activation (scheduler visit) of the actor. If the actor A is understood from context, then we sometimes drop the “(A)” and simply write β. Vectorization of signal processing dataflow graph actors, also referred to as “block processing”, is useful in optimizing throughput, which is the targeted objective in our development of PSM-level processor selection (see Section 5.2) [17]. Here we assume that the same demapping scheme can be applied to all symbols to be processed in one activation, so that SIMD processing can be applied in vectorized executions.

In addition to β, actors in this design have a parameter M, which prescribes the number of bits per symbol. For example, if M = 4 and β = 10, this means that the system is operating in a mode that uses QAM16 as the demapping scheme, and executes actors in blocks of 10 firings each. A third actor parameter is the OFDM symbol length, which we denote by N.

The parameter values in this example determine the mode of each actor, and the actor mode determines the production and consumption rates. Note that this is not always the case for CFDF actors, where, for example, the next mode for an actor can differ from the current mode even though there is no change in parameter settings (e.g., see [15]). However, because no such dynamics are involved in next mode determination in this example, the actors can be mapped into corresponding parameterized synchronous dataflow (PSDF) actors [2]. The example therefore demonstrates the applicability of PSM techniques to PSDF graphs.

Table 5 shows the valid parameter values for the actors in our OFDM demodulation system. The mode set of Demap is given by Eq. 3 in Section 3. Similarly, for other actors, valid combinations of parameter values lead uniquely to their mode settings. These details for the other actors are omitted here for brevity.

Table 5 Actor parameters in the OFDM demodulator system.

5.5 Application of PAPPS to the OFDM Demodulation System

The PSM-CFDF actors RCP, FFT and Demap are each implemented on both the CPU and GPU processors. Following the profiling approach described in Section 5.3, each actor A is profiled in every mode in its mode set $M_A$ for both the CPU and GPU implementations. The results are then used to construct the profile table $\mathit{prof}$.

In our experiments, an NVIDIA GTX680 GPU with 2 GB of memory and an Intel Core i7 3.4 GHz CPU with 8 GB of memory are used for the GPU and CPU implementations, respectively. Figure 7 illustrates the profile table $\mathit{prof}$ for the actors. The maximum latency for all vectorization degrees considered is less than 8 ms, which is tolerable in many software defined radio contexts. In the case of RCP, which removes the cyclic prefix from the received signal, the CPU implementation performs better in all settings. This is due to the small amount of computation performed in this actor compared to the large CPU-GPU memory transfer overhead. As a result, $\mathit{selection}(\mathit{RCP})$ contains only one non-empty PSM; the PSM associated with the GPU has no modes.

Figure 7 Actor profiles for application of PAPPS to the OFDM demodulator: (a) RCP actor; (b) FFT actor; (c) Demap actor in 16-QAM modes; (d) Demap actor in QPSK modes.

For the FFT actor, the GPU implementation always performs better than the CPU implementation in the same mode. Thus, for this actor, the PSM associated with the CPU has no modes. For the Demap actor in the 16-QAM modes (M = 4), the GPU implementation outperforms the CPU implementation for all values of the vectorization degree β. In the QPSK modes (M = 2), there is less difference in performance, and the CPU implementation generally performs better for lower β values, while the GPU implementation performs better for higher β values. The smaller computational load in the QPSK modes makes the memory transfer overhead more significant, which leads to a smaller performance gain from the GPU. In summary, the Demap actor has two non-empty PSMs, $\nu(\mathit{Demap}, p_1)$ and $\nu(\mathit{Demap}, p_2)$.

Table 6 shows the grouping of actor modes into PSMs when applying the PAPPS method based on the achieved profiling results illustrated in Fig. 7.

Table 6 PSM grouping based on CPU and GPU performance profiles for processor selection. $PSM_1$ and $PSM_2$ are the sets of modes that have shorter execution times for CPU- and GPU-based execution, respectively.

We have implemented the OFDM demodulator system on the targeted CPU/GPU platform using a PAPPS-based processor selection scheme based on the PSMs illustrated in Table 6. We streamlined the top-level scheduler (see Section 5.3) by observing that even though the production and consumption rates of actors can vary based on the active actor modes, the variations in this application are interdependent such that the dataflow graph exhibits SDF behavior, and furthermore, the repetitions vector remains constant. In particular, the repetitions vector is specified by q(A)=1 for each actor A regardless of what actor modes are active. This allows us to implement the top-level scheduler without any run-time checks for actor enabling conditions. Note, however, that even though SDF techniques are employed, the derived scheduler should not be viewed as a form of static scheduling because the processor assignment can change dynamically.

As in the case study of Section 4, we implemented the top-level scheduler by hand. This scheduler implementation incorporates the PAPPS method for dynamic processor selection based on the PSM decompositions illustrated in Table 6. Building on the developments of this section to construct automated scheduler derivation for PAPPS-based implementation is an interesting direction for future work.

5.6 Experimental Results

We compared the application throughput of alternative implementations in terms of the execution time per (vectorized) application iteration, where an application iteration in this context corresponds to the processing required for (β×N) symbols of the enclosing OFDM system. Because we compare alternative processor selection schemes with β fixed for each comparison point, this method of throughput comparison does not favor any particular kind of scheme.

Figure 8 shows the execution time per application iteration for three types of processor selection schemes: (1) all actors are assigned to the CPU (“CPU”), (2) RCP, FFT and Demap, the most computationally intensive actors, are assigned to the GPU (“GPU”), and (3) processor selection is performed dynamically using our implemented PAPPS-based scheduler (“PAPPS”). The speedups achieved by using PAPPS, compared to methods (1) and (2), are also shown in the figure. The average speedup achieved by PAPPS in this application over a CPU implementation is more than 1.5X. In the setting where the largest amount of data is present (1024-point FFT and 16-QAM), the average speedup is more than 2X over all vectorization degrees. The achieved speedup is limited by the cost of data transfer between CPU and GPU memory for each actor. This data transfer overhead has been taken into account in the reported speedup values.

Figure 8 Execution time and speedup under three types of processor selection schemes for the OFDM demodulator system: (a) 1024-pt FFT, 16-QAM; (b) 512-pt FFT, 16-QAM; (c) 1024-pt FFT, QPSK; (d) 512-pt FFT, QPSK. Solid lines represent execution times, while dashed lines represent the speedup obtained by using the PAPPS approach. The brown dashed line with an “up-triangle” marker represents the speedup of PAPPS over CPU (scheme (1)); the black dashed line represents the speedup of PAPPS over GPU (scheme (2)).

Compared to the GPU implementation scheme (scheme (2)), the PAPPS scheme achieves an average improvement of 20% in throughput. However, the vectorization step applied in our implementation generally results in increased latency for the system. In wireless communication applications, latency is a critical design constraint, and thus, vectorization should be applied carefully to ensure that excessive latency does not result.

In our experiments, the vectorization degree is set to be no more than 100. As shown in Fig. 8, this results in a maximum latency of 8 ms, which is reached when N = 1024 and β = 100. This is a tolerable level of latency for many kinds of software radio systems. For example, 8 ms is only a small fraction of the typical 250 ms end-to-end delay for data packets, which is described for the communication systems discussed in [6]. In cases where there are more stringent latency constraints, the vectorization degree can be bounded more tightly to trade off throughput performance for decreased latency.

The experiments presented in this section along with the other examples discussed in this paper are provided to give a concrete idea of the kind of approaches that are supported by the PSM framework. These can be viewed as representative examples that help to give a sense of the diverse possibilities for applying the proposed methods. Further study into applying these methods and developing design optimizations that build on them is a useful direction for future investigation.

6 Conclusions

In this paper, we have introduced a new dataflow modeling technique called parameterized set of modes (PSM) and demonstrated its relevance and application to design and implementation of signal processing systems for cognitive radio applications. PSMs enable novel ways for representing, manipulating and applying related groups of actor modes that lead to more concise formulations of actor behavior, and a unified modeling methodology for applying a variety of techniques for efficient implementation. To demonstrate the utility and versatility of PSMs in signal processing system design processes, we have developed two case studies involving mapping of important kinds of reconfigurable wireless communication subsystems into efficient implementations. The PSM methods introduced in this paper allow implementation techniques like those introduced in the case studies to be developed according to a common modeling framework, which allows such techniques to be better understood, integrated, and optimized. Several useful directions for future work have also emerged from the developments of this paper, including the investigation of automated techniques for applying PSMs to efficient static region derivation and to processor selection on heterogeneous platforms.