1 Introduction

While parallel programming is vital to effectively utilising modern computing power, it is challenging because it requires manual analysis of an algorithm to identify the computations that can be performed in parallel. Parallel program development often uses algorithmic skeletons to hide the complexity of implementing parallelism from the developer. In this work, we are particularly interested in data parallel skeletons such as map, map-reduce and accumulate [1]. While skeleton-based programming eases the developer's task, it still requires manual analysis of a program to identify computations that can be defined using the skeletons.

In the case of data parallel skeletons, a given operation is applied in parallel to each element of a given input. Consequently, there is a match between the structure of the input and the recursive structure of the skeleton definition. However, the structure of the data types and the algorithmic structure of a given program may not match those of the skeletons, a mismatch that is challenging to resolve [2, 3]. Further, programs that are defined using skeletons often introduce inefficient intermediate data structures [4]. Consider the programs shown in Examples 1 and 2, which compute the product of two matrices and the dot-product of two binary trees, respectively. In Example 1, (transpose yss) and (zipWith \((*)\) xs ys) are intermediate data structures that are subsequently decomposed by map and foldr, respectively.

Example 1

(Matrix multiplication—original program)

figure a

Example 2

(Dot-product of binary trees—original program)

figure b
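Since the figures are typeset separately, the two programs described above can be sketched in Haskell as follows; this is a plausible reconstruction of Examples 1 and 2, not the paper's exact code:

```haskell
import Data.List (transpose)

-- Plausible reconstruction of Example 1: note the intermediate data
-- structures (transpose yss) and (zipWith (*) xs ys), which are
-- subsequently decomposed by map and foldr respectively.
mMul :: Num a => [[a]] -> [[a]] -> [[a]]
mMul xss yss = map (\xs -> map (\ys -> foldr (+) 0 (zipWith (*) xs ys)) (transpose yss)) xss

-- Plausible reconstruction of Example 2: dot-product of two binary
-- trees of matching shape.
data BTree a = Leaf a | Node (BTree a) (BTree a)

dotP :: Num a => BTree a -> BTree a -> a
dotP (Leaf x)     (Leaf y)     = x * y
dotP (Node l1 r1) (Node l2 r2) = dotP l1 l2 + dotP r1 r2
```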

Even though these programs can be parallelised using the skeletons map, reduce and zipWith (defined over lists and binary trees with parallel implementations), Example 3 retains the original intermediate data structures from Example 1, while Example 4 introduces a new intermediate data structure (zipWith \((*)\) xt yt) due to the use of skeletons.

Example 3

(Matrix multiplication—hand-parallel program using skeletons)

figure c

Example 4

(Dot-product of binary trees—hand-parallel program using skeletons)

figure d

The dot-product of binary trees program can also be parallelised by evaluating the two recursive calls \((dotP\ xt_1\ yt_1)\) and \((dotP\ xt_2\ yt_2)\) simultaneously. Such a parallel definition of the dot-product program using Glasgow Parallel Haskell (GpH) is shown in Example 5; this definition avoids the use of intermediate data structures.

Example 5

(Dot-product of binary trees—hand-parallel program using GpH)

figure e
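A minimal sketch of this parallelisation strategy is shown below, using the `par`/`pseq` primitives that GpH exposes (GHC's base library re-exports them from GHC.Conc); the actual program in the figure may use other GpH constructs such as rpar:

```haskell
import GHC.Conc (par, pseq)  -- Control.Parallel (parallel package) exports the same primitives

data BTree a = Leaf a | Node (BTree a) (BTree a)

-- The two recursive calls are evaluated simultaneously: `par` sparks
-- the left sub-computation while `pseq` forces the right one first.
dotP :: Num a => BTree a -> BTree a -> a
dotP (Leaf x)     (Leaf y)     = x * y
dotP (Node l1 r1) (Node l2 r2) =
  let a = dotP l1 l2
      b = dotP r1 r2
  in a `par` (b `pseq` (a + b))
```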

However, such parallelisation requires manual analysis of a given program which is tedious for non-trivial programs. Also, note that there is no obvious skeleton corresponding to the definition of this program. Therefore, we observe that it is challenging to obtain a parallel version for a given program that uses fewer intermediate data structures and is defined using skeletons.

In this paper, we propose a transformation method with the following aspects to resolve these issues:

  1.

    Reduce the number of intermediate data structures in a given program using an existing transformation called distillation [5].

  2.

    Automatically create a new data type for the program so that the structure of the new data type matches the algorithmic structure of the program.

  3.

    Automatically create skeletons that operate over the new data type and identify potential instances of the skeletons in the program. The transformed program can then be executed in parallel using efficient implementations for the skeletons.

In our earlier works, we presented techniques to transform programs defined over polytypic [6] and list [7] inputs individually. In this paper, we integrate the two transformations and present them along with evaluation results. In Sect. 2, we introduce the language used in this work. We present the proposed parallelisation transformation in Sect. 3. In Sect. 4, we present the results of evaluating three example programs that are transformed using our proposed method. In Sect. 5, we present concluding remarks on our transformation method and discuss related work.

2 The Language

The higher-order language used in this work is shown in Definition 1 and uses call-by-name evaluation.

Definition 1

(Language grammar)

figure f

A program can contain data type declarations of the form shown in Definition 1. Here, T is the name of the data type, which can be polymorphic, with type parameters \({\alpha }_1, \ldots , {\alpha }_M\). A data constructor \(c_k\) may have zero or more components, each of which may be a type parameter or a type application. An expression e of type T is denoted by \(e :: T\).

A program can also contain an expression which can be a variable, constructor application, function definition, function call, application, let-expression or \(\lambda \)-abstraction. Variables introduced in a function definition, let-expression or \(\lambda \)-abstraction are bound, while all other variables are free. The free variables in an expression e are denoted by fv(e). Each constructor has a fixed arity. In an expression \(c\ e_1 \ldots e_N\), N must be equal to the arity of the constructor c. For ease of presentation, patterns in function definition headers are grouped into two – \(p^k_1 \ldots p^k_M\) are inputs that are pattern-matched, and \(x^k_{(M+1)} \ldots x^k_N\) are inputs that are not pattern-matched. The series of patterns \(p^k_1 \ldots p^k_M\) in a function definition must be non-overlapping and exhaustive. We use [] and (:) as shorthand notations for the Nil and Cons constructors of a cons-list.

Definition 2

(Context) A context E is an expression with holes in place of sub-expressions. \(E[e_1, \ldots , e_N]\) is the expression obtained by filling holes in context E with the expressions \(e_1, \ldots , e_N\).

3 Parallelisation Transformation

Given a program defined in the language presented in Definition 1, our proposed parallelisation transformation is performed in three stages:

  1.

    Apply the distillation transformation on the given program to reduce the number of intermediate data structures used. This produces a distilled program. (Sect. 3.1)

  2.

    Apply the encoding transformation on the distilled program to combine the inputs into a data structure that matches the algorithmic structure of the distilled program. This produces an encoded program. (Sect. 3.2)

  3.

    Identify potential skeleton instances in the encoded program. This produces an encoded parallel program. (Sect. 3.4)

3.1 Distillation Transformation

Given a program in the language from Definition 1, distillation [5] is an unfold/fold-based technique that transforms the program to remove intermediate data structures and yields a distilled program that can potentially have super-linear speedups. The syntax of a distilled program \(de^{\{\}}\) is shown in Definition 3. Here, \(\rho \) is the set of variables introduced by \(\mathbf {let}\)–expressions that are created during generalisation. The bound variables of let-expressions are not decomposed by pattern-matching in a distilled program. Consequently, \(de^{\{\}}\) is an expression that has fewer intermediate data structures.

Definition 3

(Distilled form grammar)

figure g

Example 6 presents the distilled version of the matrix multiplication program from Example 1, which does not use intermediate data structures.

Example 6

(Matrix multiplication—distilled program)

figure h

3.2 Encoding Transformation

A distilled program is defined over the original program data types. In order to transform these data types into a structure that reflects the structure of the distilled program, we apply the encoding transformation detailed in this section. In this transformation, we encode the pattern-matched arguments of each recursive function f in the distilled program into a single argument which is of a new data type \(T_f\) and whose structure reflects the algorithmic structure of function f.

Consider a recursive function f, with arguments \(x_1, \ldots , x_M, x_{(M+1)}, \ldots , x_N\), of the form shown in Definition 4 in a distilled program. Here, a function body \(e_k\) corresponding to function header \(f\ p^k_1 \ldots p^k_M\ x^k_{(M+1)} \ldots x^k_N\) in the definition of f may contain one or more recursive calls to function f.

Definition 4

(Recursive function in distilled program)

figure i

3.2.1 Encoding Into New Type

The three steps to encode the pattern-matched arguments of function f into a new type are as follows:

  1.

    Declare a new data type for the encoded argument:

    First, we declare a new data type \(T_f\) for the new encoded argument. This new data type corresponds to the data types of the original pattern-matched arguments of function f. The definition of the new data type \(T_f\) is shown in Definition 5.

Definition 5

(New encoded data type \(T_f\))

figure j

Here, a new constructor \(c_k\) of the type \(T_f\) is created for each set \(p^k_1 \ldots p^k_M\) of the pattern-matched arguments \(x_1 \ldots x_M\) of function f that are encoded. As stated above, our objective is to encode the arguments of function f into a new type whose structure reflects the recursive structure of f. To achieve this, the components bound by constructor \(c_k\) correspond to the variables in \(p^k_1 \ldots p^k_M\) that occur in the context \(E_k\) and the encoded arguments of the recursive calls to function f.

  2.

    Define a function to build encoded argument:

    Given a function f of the form shown in Definition 4, we define a function \(encode_f\), as shown in Definition 6, to build the encoded argument for function f.

Definition 6

(Definition of function \(encode_f\))

figure k

Here, the original arguments \(x_1 \ldots x_M\) of function f are pattern-matched and consumed by \(encode_f\) in the same way as in the definition of f. For each pattern \(p^k_1 \ldots p^k_M\) of the arguments \(x_1 \ldots x_M\), function \(encode_f\) uses the corresponding constructor \(c_k\) whose components are the variables \(z^k_1, \ldots , z^k_L\) in \(p^k_1 \ldots p^k_M\) that occur in the context \(E_k\) and the encoded arguments of the recursive calls to function f.

  3.

    Transform the distilled program:

    After creating the encoded data type \(T_f\) and the \(encode_f\) function for each function f, we transform the distilled program as shown in Definition 7 by defining a function \(f'\), which operates over the encoded argument, corresponding to function f.

Definition 7

(Definition of transformed function over encoded argument)

figure l

The two steps to transform function f into function \(f'\) that operates over the encoded argument are:

  (a)

    In each function definition header of f, replace the original pattern-matched arguments with the corresponding pattern of their encoded data type \(T_f\). For instance, a function header \(f\ p_1 \ldots p_M\ x_{(M+1)} \ldots x_N\) is transformed to the header \(f'\ p\ x_{(M+1)} \ldots x_N\), where p is the pattern created by \(encode_f\) corresponding to the pattern-matched arguments \(p_1, \ldots , p_M\).

  (b)

    In each call to function f, replace the original arguments with their corresponding encoded argument. For instance, a call \(f\ x_1 \ldots x_M\ x_{(M+1)} \ldots x_N\) is transformed to the function call \(f'\ x\ x_{(M+1)} \ldots x_N\), where x is the encoded argument corresponding to the original arguments \(x_1, \ldots , x_M\).
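To make the three steps above concrete, the following sketch applies them by hand to the dot-product function dotP from Example 2, whose encoded type \(T_{dotP}\) is binary (see Sect. 4.2); the constructor and function names here are our own illustrative choices, not the paper's generated names:

```haskell
data BTree a = Leaf a | Node (BTree a) (BTree a)

-- Step 1: a new encoded type with one constructor per pattern pair of
-- dotP. CLeaf carries the context variables (x, y); CNode carries the
-- encodings of the two recursive calls.
data TDotP a = CLeaf a a | CNode (TDotP a) (TDotP a)

-- Step 2: encode_dotP consumes the inputs exactly as dotP does.
encodeDotP :: BTree a -> BTree a -> TDotP a
encodeDotP (Leaf x)     (Leaf y)     = CLeaf x y
encodeDotP (Node l1 r1) (Node l2 r2) =
  CNode (encodeDotP l1 l2) (encodeDotP r1 r2)

-- Step 3: the transformed function pattern-matches on the encoded
-- argument instead of the two original trees.
dotP' :: Num a => TDotP a -> a
dotP' (CLeaf x y) = x * y
dotP' (CNode l r) = dotP' l + dotP' r
```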

3.2.2 Encoding Into List

From Definition 5, we observe that the new encoded data type \(T_f\) is essentially an n-ary tree. If \(T_f\) is a linear data type (i.e. \(n = 1\)), then we perform the encoding transformation such that the pattern-matched inputs are encoded into a cons-list instead of a new data type. This allows us to potentially parallelise the encoded program using existing parallel list-based skeletons, as discussed in Sect. 3.4. In this section, we present the version of our encoding transformation that combines the inputs into a list of type \([T'_f]\). Note that, if the type \(T_f\) created by Definition 5 is linear, then each function body \(e_k\) of the recursive function f contains at most one recursive call. The three steps to encode the pattern-matched inputs of function f are as follows:

  1.

    Declare a new data type for elements of the encoded list:

    First, we declare a new data type \(T'_f\) for the elements of the encoded list. This new data type corresponds to the data types of the original pattern-matched arguments of function f. The definition of the new data type \(T'_f\) is shown in Definition 8.

Definition 8

(New encoded data type \(T'_f\) for list)

figure m

Here, a new constructor \(c_k\) of the type \(T'_f\) is created for each set \(p^k_1 \ldots p^k_M\) of the pattern-matched arguments \(x_1 \ldots x_M\) of function f that are encoded. As stated above, our objective is to encode the arguments of function f into a list where each element contains pattern-matched variables consumed in an iteration of f. To achieve this, the variables bound by constructor \(c_k\) correspond to the variables \(z_1, \ldots , z_L\) in \(p^k_1 \ldots p^k_M\) that occur in the context \(E_k\) (if \(e_k\) contains a recursive call to f) or the expression \(e_k\) (otherwise). Consequently, the type components of constructor \(c_k\) are the data types of the variables \(z_1, \ldots , z_L\).

  2.

    Define a function \(encode_f\):

    Given a recursive function f, we define a function \(encode_f\) as shown in Definition 9 to build the encoded list in which each element is of type \(T'_f\).

Definition 9

(Definition of function \(encode_f\) for list)

figure n

Here, for each pattern \(p^k_1 \ldots p^k_M\) of the pattern-matched arguments, the \(encode_f\) function creates a list element. This element is composed of a new constructor \(c_k\) of type \(T'_f\) that binds \(z^k_1, \ldots , z^k_L\), which are the variables in \(p^k_1 \ldots p^k_M\) that occur in the context \(E_k\) (if \(e_k\) contains a recursive call to f) or the expression \(e_k\) (otherwise). The encoded input of the recursive call \(f\ x^k_1 \ldots x^k_M\ x^k_{(M+1)} \ldots x^k_N\) is then computed by \(encode_f\ x^k_1 \ldots x^k_M\) and appended to the element to build the complete encoded list for function f.

  3.

    Transform the distilled program:

    After creating the data type \(T'_f\) for the encoded list and the \(encode_f\) function for each function f, we transform the distilled program as shown in Definition 10 by defining a recursive function \(f'\), which operates over the encoded list, corresponding to function f.

Definition 10

(Definition of transformed function over encoded list)

figure o

The two steps to transform function f into function \(f'\) that operates over the encoded list are:

  • In each function definition header of f, replace the pattern-matched arguments with a pattern to decompose the encoded list, such that the first element in the encoded list is matched with the corresponding pattern of the encoded type \(T'_f\).

  • In each call to function f, replace the pattern-matched arguments with their encoding.
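As a hand-worked illustration of the list version (the names below are ours, not the paper's), consider a linear-recursive dot-product over lists: each pattern pair of the arguments contributes one element to the encoded list, and the transformed function consumes that list:

```haskell
-- A linear-recursive function with two pattern-matched inputs.
dot :: Num a => [a] -> [a] -> a
dot []     []     = 0
dot (x:xs) (y:ys) = x * y + dot xs ys

-- Element type T'_dot: one constructor per pattern set; its components
-- are the variables (x and y) consumed in one iteration.
data TDot a = CCons a a

-- encode_dot builds the encoded list, recursing as dot does; the base
-- case contributes no element.
encodeDot :: [a] -> [a] -> [TDot a]
encodeDot []     []     = []
encodeDot (x:xs) (y:ys) = CCons x y : encodeDot xs ys

-- The transformed function operates over the encoded list.
dot' :: Num a => [TDot a] -> a
dot' []              = 0
dot' (CCons x y : es) = x * y + dot' es
```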

3.3 Steps to Apply Encoding Transformation

In summary, the steps to apply the two versions of our encoding transformation on a given program are as follows:

  1.

    Apply the distillation transformation on a given program. This produces a distilled program.

  2.

    For each recursive function f in the distilled program,

    (a)

      Create an encoded data type \(T_f\) as described in Definition 5. This new data type \(T_f\) is essentially an n-ary tree.

    (b)

      IF \(n > 1\)

      THEN Apply the encoding transformation in Sect. 3.2.1 to combine the inputs into a new data type \(T_f\) whose structure matches that of function f.

      ELSE Apply the encoding transformation in Sect. 3.2.2 to combine the inputs into a cons-list of type \([T'_f]\).

    This produces an encoded program.

For the distilled matrix multiplication program in Example 6, we observe that the encoded data types \(T_{mMul_1}\), \(T_{mMul_2}\) and \(T_{mMul_3}\) for recursive functions \(mMul_1\), \(mMul_2\) and \(mMul_3\) are all linear (i.e. \(n = 1\)) as shown below.

figure p

Therefore, we apply the encoding transformation presented in Sect. 3.2.2 to obtain an encoded matrix multiplication program that operates over encoded lists of types \([T'_{mMul_1}]\), \([T'_{mMul_2}]\) and \([T'_{mMul_3}]\) as shown in Example 7.

Example 7

(Matrix multiplication—encoded program)

figure q

3.4 Parallelisation Using Skeletons

In our work, we are interested in identifying parallel computations that can be modelled using map- and reduce-based skeletons as they are versatile and widely applicable in parallel programming. In this section, we discuss the definitions and parallelisation of list-based and polytypic skeletons that are most relevant to the examples used in this paper.

Using the sequential definitions of these skeletons, we automatically identify their instances in an encoded program and replace them with suitable calls to their parallel counterparts. The rules to achieve this are available in our earlier related work [7], which also discusses in detail the properties of the transformation to encode inputs into a list.

3.4.1 List-Based Skeletons

In this paper, the sequential definitions of list-based map and map-reduce skeletons whose instances we identify in an encoded program are as follows:

figure r
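The likely shapes of these sequential definitions are sketched below; the exact versions appear in the figure, and the argument order of mapRedr here is our assumption:

```haskell
-- Sequential list-based map skeleton: applies f elementwise.
map' :: (a -> b) -> [a] -> [b]
map' _ []     = []
map' f (x:xs) = f x : map' f xs

-- Sequential map-reduce skeleton: maps f over the list and folds the
-- results with g from the right, starting from the unit value v.
mapRedr :: (b -> c -> c) -> c -> (a -> b) -> [a] -> c
mapRedr _ v _ []     = v
mapRedr g v f (x:xs) = g (f x) (mapRedr g v f xs)
```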

We identify instances of these definitions and replace them with suitable calls to their parallel counterparts in the Eden library [8], which provides implementations of parallel map and map-reduce skeletons in the following forms:

figure s

3.4.2 Polytypic Skeletons

For the discussions in this paper, we present polytypic reduce and map-reduce skeletons in Examples 8 and 9, respectively, defined over a data type \(T_f\) of the form shown in Definition 5.

Example 8

(Reduce skeleton for encoded data type \(T_f\))

figure t

Example 9

(Map-reduce skeleton for encoded data type \(T_f\))

figure u
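For intuition, the reduce skeleton of Example 8 can be hand-instantiated for a binary encoded type shaped like \(T_{dotP}\); the names below are illustrative, whereas the paper's definition is generic in \(T_f\):

```haskell
-- Encoded type for dotP (cf. Sect. 3.2.1): leaves carry the paired
-- values, nodes carry the encodings of the recursive calls.
data TDotP a = CLeaf a a | CNode (TDotP a) (TDotP a)

-- Sequential polytypic reduce: one operator per constructor shape.
reduceDotP :: (b -> b -> b) -> (a -> a -> b) -> TDotP a -> b
reduceDotP _ f (CLeaf x y) = f x y
reduceDotP g f (CNode l r) = g (reduceDotP g f l) (reduceDotP g f r)

-- The dot-product as an instance of the reduce skeleton.
dotP' :: Num a => TDotP a -> a
dotP' = reduceDotP (+) (*)
```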

A simplistic approach to parallelise these polytypic \(reduce_f\) and \(mapReduce_f\) definitions is by evaluating each recursive call simultaneously as shown in Examples 10 and 11, respectively.

Example 10

(Parallel reduce skeleton for type \(T_f\) using GpH)

figure v

Example 11

(Parallel map-reduce skeleton for type \(T_f\) using GpH)

figure w

Here, t is a threshold value to control the number of parallel threads created by the polytypic skeletons using the rpar construct. For an n-ary encoded data type \(T_f\) (where \(n > 1\)), the initial value of t can be determined using a simple rule-of-thumb based on P, the number of processor cores.

Based on our transformation steps presented in Sect. 3.3, if \(n = 1\), then the inputs will be encoded into a list to identify list-based parallel skeletons.
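The threshold mechanism described above can be sketched as follows; this is a simplified version using `par`/`pseq` from GHC's base library rather than the rpar construct used in the figures, over the same illustrative binary encoded type:

```haskell
import GHC.Conc (par, pseq)

data TDotP a = CLeaf a a | CNode (TDotP a) (TDotP a)

-- Parallel reduce with a depth threshold t: sparks are only created
-- for the top t levels of the encoded tree, bounding the number of
-- parallel threads at roughly 2^t; below the threshold we fall back
-- to the sequential reduction.
parReduce :: Int -> (b -> b -> b) -> (a -> a -> b) -> TDotP a -> b
parReduce _ _ f (CLeaf x y) = f x y
parReduce t g f (CNode l r)
  | t <= 0    = g (parReduce 0 g f l) (parReduce 0 g f r)
  | otherwise =
      let a = parReduce (t - 1) g f l
          b = parReduce (t - 1) g f r
      in a `par` (b `pseq` g a b)
```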

4 Evaluation

In this paper, we present the evaluation of three programs—matrix multiplication, dot-product of binary trees and maximum prefix sum—to illustrate interesting aspects of our transformation. The programs are evaluated on a machine with a 12-core Intel Xeon E5 processor clocked at 2.7 GHz and 64 GB of main memory. GHC version 7.10.2 is used for the sequential versions of the benchmark programs, the Eden language compiler based on GHC 7.8.2 is used for the list-based map and map-reduce skeletons, and the polytypic skeletons are parallelised using the Glasgow Parallel Haskell extension for GHC 7.10.2.

For all parallel versions of a benchmark program, only those skeletons that are present at the top level are executed using their parallel implementations; nested skeletons are executed using their sequential versions. This avoids the uncontrolled creation of too many threads, which we observe results in inefficient parallel execution where the cost of thread creation and management outweighs the benefit of parallel evaluation.

4.1 Example: Matrix Multiplication

The original sequential, hand-parallel, distilled and encoded versions of the matrix multiplication program were presented in Examples 1, 3, 6 and 7, respectively. Following this, we parallelise the encoded program using the list-based map and mapRedr skeletons presented in Sect. 3.4.1 and obtain the encoded parallel program shown in Example 12. Here, we observe that the encoded \(mMul'_1\) and \(mMul'_3\) functions are instances of the map and mapRedr skeletons and hence are defined using the farmB and parMapRedr1 skeletons from the Eden library. The farmB skeleton requires the degree of parallelism to be specified, which is usually given as the number of processors (noPe).

Example 12

(Matrix multiplication—encoded parallel program)

figure x

Figure 1 presents the speedups achieved by the encoded parallel program in comparison with the original, distilled and hand-parallel versions. An input of size NxM denotes the multiplication of matrices of sizes NxM and MxN. When compared to the original program, we observe that the encoded parallel version achieves a positive speedup for all input sizes. For input size 100x1000, the speedup achieved is 6x-25x more than the speedups achieved for the other input sizes. This is due to the intermediate data structure (transpose yss), which is of the order of 1000 elements for input size 100x1000 and of the order of 100 elements for the other inputs, and which is absent in the encoded parallel program. This can be verified from the comparison with the distilled version, which is also free of intermediate data structures: the encoded parallel program has a similar speedup for all input sizes when compared to the distilled version.

Fig. 1
figure 1

Speedup of matrix multiplication

We also observe that, even though the hand-parallel and encoded parallel versions parallelise the equivalent computations in the same fashion (as can be seen from Examples 3 and 12), the encoded parallel version is marginally faster for input sizes 100x100 and 1000x100, 4x-4.5x faster for input size 250x250 and 15x-32x faster for input size 100x1000. This is due to the use of intermediate data structures in the hand-parallel version, which have been removed in the encoded parallel version. Further, the hand-parallel version scales better with a higher number of cores for the input size 100x1000. This is because the encoded parallel version achieves better speedup even with fewer cores due to the elimination of intermediate data structures, and hence does not scale as impressively as the hand-parallel version.

4.2 Example: Dot-Product of Binary Trees

The original program to compute the dot-product of binary trees was presented in Example 2. The distilled version of this program remains the same as there are no intermediate data structures. A hand-parallel version of this program using GpH was shown in Example 5. By following the steps in Sect. 3.3, we observe that the encoded data type \(T_{dotP}\) created for function dotP is binary (i.e. \(n = 2\)). Hence, we transform the dot-product program to obtain an encoded program that operates over the encoded data type \(T_{dotP}\). Following this, we obtain an encoded parallel program using the \(reduce_{dotP}\) skeleton defined over the encoded type \(T_{dotP}\), as shown in Example 13.

Example 13

(Dot product of binary trees—encoded parallel program)

figure y

Figure 2 presents the speedups of the encoded parallel program compared to the original and the hand-parallel programs. An input size indicated by N denotes the dot-product of two identical balanced binary trees with N nodes each. When compared to the original program, we observe that the encoded parallel version achieves a positive speedup of 1.5x-2.6x for all input sizes. The speedup achieved for the different input sizes scales equally for varying numbers of cores. When compared to the hand-parallel version, we observe that both programs achieve similar speedups for all input sizes. This is primarily because both versions essentially parallelise the dot-product computation in a similar fashion, as can be observed from their definitions.

Fig. 2
figure 2

Speedup of dot-product of binary trees

Further, we also note from our experiments that if the input trees are not well-balanced, then the speedups achieved by both the encoded parallel and hand-parallel versions are reduced. This is because, as is evident from the definitions of these parallel versions, their parallelisation and the workloads of the parallel threads are dictated by the structure of the input tree. As discussed in other existing work [9], better implementations for n-ary tree skeletons can be obtained using tree-contraction [10, 11] approaches that can operate efficiently over unbalanced trees.

4.3 Example: Maximum Prefix Sum

The original program to compute the maximum prefix sum of a given list is presented in Example 14, where \(mps_1\) computes the maximum prefix sum of list xs using the recursive function \(mps_2\). Here, we consider max to be a built-in operator.

Example 14

(Maximum prefix sum—original program)

figure z
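A plausible reconstruction of this program is sketched below; the paper's definition is in the figure, and the accumulating-parameter form of \(mps_2\) here is our assumption:

```haskell
-- mps1 computes the maximum prefix sum of xs via the recursive
-- function mps2, which threads the maximum seen so far (m) and the
-- running sum (s); max is treated as a built-in operator.
mps1 :: (Num a, Ord a) => [a] -> a
mps1 xs = mps2 xs 0 0

mps2 :: (Num a, Ord a) => [a] -> a -> a -> a
mps2 []     m _ = m
mps2 (x:xs) m s = mps2 xs (max m (s + x)) (s + x)
```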

By applying the parallelisation transformation, we encode the input xs of the recursive function \(mps_2\) into a list of type \([T'_{mps_2}]\), and the resulting encoded parallel program is presented in Example 15. Here, \(mps''_2\) is found to be an instance of the list-based accumulate skeleton [1] and is defined using a call to its parallel implementation (parAccumulate).

Example 15

(Maximum prefix sum—encoded parallel program)

figure aa

Example 16 presents the hand-parallel version of the maximum prefix sum program where the recursive function \(mps_2\) is identified as an instance of the accumulate skeleton and is defined using a call to its parallel implementation.

Example 16

(Maximum prefix sum—hand-parallelised program (HPP))

figure ab

Figure 3 presents the speedups of the encoded parallel program compared to the original and hand-parallel programs. An input size indicated by N denotes the computation of the maximum prefix sum for a list of size N. In comparison to the original program, the encoded parallel version achieves a positive speedup ranging from 2.2x to 4.5x. The speedup achieved also increases, though marginally, for increasingly large input sizes, which shows that the encoded parallel version scales well for different input sizes. However, the overall speedup gained does not scale linearly with the number of cores used due to the significant cost of encoding, which is a sequential computation.

Fig. 3
figure 3

Speedup of maximum prefix sum

Further, the encoded parallel and hand-parallel versions parallelise the maximum prefix sum program in the same way using the accumulate skeleton, with the hand-parallel version defined over the original input list and the encoded parallel version defined over the encoded list. We observe that the hand-parallel program is marginally faster than the encoded parallel program for all input sizes due to the added cost of encoding inputs in the encoded program. This is an example where encoding inputs may be an overhead when the original input structures already match the algorithmic structure.

5 Conclusion

5.1 Summary

The proposed transformation facilitates automatic identification of polytypic and list-based skeletons in a given program using the existing distillation transformation and a newly proposed encoding transformation. Distillation provides efficiency by reducing the number of intermediate data structures used. Thus, the speedup from distillation is directly proportional to the sizes of the intermediate data structures eliminated. The encoding transformation targets and benefits programs defined over inputs whose structures do not match the algorithmic structure of the program. For instance, the speedups for matrix multiplication are impressive, while those for the dot-product of binary trees and maximum prefix sum are moderate. This is because the structures of the inputs of the latter programs match the algorithmic structures of the programs, while this is not the case for the matrix multiplication example used here.

Thus, the efficiency achieved for matrix multiplication is partly due to the mismatch between the list-of-lists representation of a matrix and the algorithmic structure of the program. Such a representation may be considered unnatural given that there exist methods to match the data type structure with the algorithmic structure [12], as well as optimised matrix representations such as those in Haskell's Repa library and in NESL. Therefore, we need to further evaluate the role of the encoding transformation in the efficiency achieved, in addition to its role in facilitating the identification of skeleton instances.

5.2 Related Work

A seminal work on deriving parallel programs using fusion and tupling methods is presented in [3], which highlights the need to extend them to polytypic and nested data structures. In [13], the authors propose a method to systematically derive parallel programs from sequential definitions and automatically create auxiliary functions to define the associative operators required for parallel evaluation. However, this method is restricted to a first-order language and to functions defined over a single recursive linear data type that has an associative decomposition operator. To address this, [9] presents the diffusion transformation to decompose recursive functions into functions that can be described using the accumulate skeleton. While diffusion can transform a wide range of functions and can be extended with a polytypic accumulate skeleton, it is only applicable to functions with one recursive input, and the properties and unit values of the skeleton operators must be verified and derived manually.

Calculational approaches to program parallelisation were initially based on homomorphisms and the third homomorphism theorem on lists [14]; this was extended to trees in [15] by decomposing a binary tree into a list of sub-trees called a zipper, and deriving parallel computations on the zipper structure. Though it addresses tree-based programs, this method requires multiple sequential versions of the program to derive the parallel version. To address polytypic parallel skeletons, [16] presents parallel map, reduce and accumulate skeletons for n-ary trees, which are implicitly converted to binary trees so that parallel skeletons for binary trees can be used. Despite introducing polytypic parallel skeletons, this approach requires the associativity of operators to be checked and their unit values to be derived manually. As an alternative, [17] proposes an analytical method to transform general recursive functions into a composition of polytypic data parallel skeletons. Even though this method is applicable to a wider range of problems, the transformed programs are defined by composing skeletons and employ multiple intermediate data structures.