
1 Introduction

Reachability analysis is a standard technique for safety verification, analysis, and synthesis of continuous and hybrid systems. Since exact computation of reachable states is in general intractable, set-based conservative computation methods have been proposed in the past with different choices of set representation [1, 2, 7–9, 12, 22]. The reachable states are represented as a collection of continuous sets (\(\subset \mathbb {R}^n\)), each with a symbolic representation. Precision and scalability have been the two main challenges for such set-based methods, and the symbolic set representation plays a key role in the efficiency of the reachability algorithm. Recently, algorithms using convex sets represented by support functions [12] and zonotopes [9] have shown promising scalability; systems of dimension as large as 100 have been analyzed efficiently with support-function-based algorithms.

The advent of multi-core architectures and many-core parallel co-processors such as graphics processing units (GPUs) has put tremendous computing power at our disposal. In this work, our goal is to leverage these parallel architectures to speed up reachability analysis and, possibly, to improve its precision as well. There has been prior work on parallel algorithms for discrete-state concurrent systems aimed at speeding them up on multi-core machines: parallel state-space-search algorithms for the model checker SPIN have been proposed in [6, 13], and a GP-GPU implementation of the SPIN algorithms is presented in [3]. However, apart from the preliminary results in [17], we are not aware of prior work on parallel state-space exploration of continuous and hybrid systems.

In particular, we consider the support-function-based reachability algorithm and propose two parallel versions of it. The first samples the support functions along the template directions in parallel; it applies to any system with linear dynamics and nondeterministic inputs (\(\dot{x} = Ax(t) + u(t),\ u(t) \in \mathcal {U},\ x(0) \in \mathcal {X} _0\)). The second computes the reachable sets in slices of the time horizon; it applies to the class of linear systems whose dynamics matrix A is invertible and whose input set \(\mathcal {U} \) is a point set. We also propose a GP-GPU implementation of the algorithm following a lazy evaluation strategy. Our current GP-GPU implementation requires \(\mathcal {X} _0\) and \(\mathcal {U} \) to be specified as hyperboxes.

The paper is organized as follows. In Sect. 2, we present preliminaries on support functions and a reachability analysis algorithm based on them. In Sect. 3, we present a parallel implementation scheme of the algorithm and our parallel state-space exploration algorithm. Section 4 describes our GPU implementation scheme for sampling support functions in parallel on GPU cores. Section 5 presents the experimental results, illustrating the achieved speed-up and precision. We conclude in Sect. 6.

2 Preliminaries

Since our work is focused on the support-function representation of compact convex sets, we recap the definition of support functions and template polytopes in Sect. 2.1. The reachability algorithm using support functions is discussed in Sect. 2.2.

2.1 Support Functions

Definition 1

[18] Given a nonempty compact convex set \(\mathcal {X} \subset \mathbb {R}^n\), the support function of \(\mathcal {X} \) is the function \(sup_{\mathcal {X}}:\mathbb {R}^n \rightarrow \mathbb {R}\) defined as:

$$\begin{aligned} sup_{\mathcal {X}}(\ell ) = max \{\ell \cdot x \mid x \in \mathcal {X} \} \end{aligned}$$
(1)

where \(\ell \cdot x\) is the scalar product of direction \(\ell \) and vector x, that is, the projection of x on direction \(\ell \). A compact convex set \(\mathcal {X} \) is uniquely determined by the intersection of the halfspaces generated by support functions in all possible directions \(\ell \in \mathbb {R}^n\).

$$\begin{aligned} \mathcal {X} = \bigcap _{\ell \in \mathbb {R}^n} \{x \in \mathbb {R}^n \mid \ell \cdot x \le sup_{\mathcal {X}}(\ell )\} \end{aligned}$$
(2)

Definition 2

Given the support function \(sup_{\mathcal {X}}\) of a compact convex set \(\mathcal {X} \) and a finite set of template directions \(\mathcal {D} \), the template polytope of the convex set \(\mathcal {X} \) is defined as:

$$\begin{aligned} Poly_\mathcal {D} (\mathcal {X}) = \bigcap _{\ell \in \mathcal {D}} \{x \in \mathbb {R}^n \mid \ell \cdot x \le sup_{\mathcal {X}}(\ell )\} \end{aligned}$$
(3)

Proposition 1

Given a polytope \(\mathcal {X} = \{ x \in \mathbb {R}^n \mid P \cdot x \le Q \}\), the support function of \(\mathcal {X} \) in the direction \(\ell \) is the solution to the Linear Program (LP):

$$\begin{aligned} sup_\mathcal {X} (\ell ) = {\left\{ \begin{array}{l} \textit{maximize } \ell \cdot x \\ \textit{subject to: } P \cdot x \le Q \end{array}\right. } \end{aligned}$$

Proposition 2

Given a hyperbox \(\mathcal {H} = \{ x \in \mathbb {R}^n \mid x \in [a_1,b_1] \times \ldots \times [a_n,b_n] \}\), the support function of \(\mathcal {H} \) in the direction \(\ell =(\ell _1,\ell _2,\ldots ,\ell _n)\) is given by:

$$\begin{aligned} sup_\mathcal {H} (\ell ) = \sum _{i=1}^{n} \ell _i \cdot h_i\text {, where } h_i = {\left\{ \begin{array}{ll} a_i \textit{ if } \ell _i < 0 \\ b_i \textit{ otherwise} \end{array}\right. } \end{aligned}$$

where \(a_i\) and \(b_i\) are the lower and upper bound respectively of \(\mathcal {H} \) in the dimension i.
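For illustration, Proposition 2 translates directly into code. The following C++ sketch evaluates the support function of a hyperbox; the function and variable names are ours, not XSpeed's API:

```cpp
#include <vector>

// Support function of the hyperbox H = [a1,b1] x ... x [an,bn] in direction l,
// following Proposition 2: pick a_i when l_i < 0, and b_i otherwise.
double support_hyperbox(const std::vector<double>& l,
                        const std::vector<double>& a,
                        const std::vector<double>& b) {
    double sup = 0.0;
    for (std::size_t i = 0; i < l.size(); ++i)
        sup += l[i] * (l[i] < 0.0 ? a[i] : b[i]);
    return sup;
}
```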

2.2 Reachability Analysis Using Support Functions

In this work, we consider continuous linear systems with constrained inputs and initial states. The dynamics of such systems is of the form:

$$\begin{aligned} \dot{x} = Ax(t) + u(t),\ u(t) \in \mathcal {U},\ x(0) \in \mathcal {X} _0 \end{aligned}$$
(4)

where \(\mathcal {X} _0\) and \(\mathcal {U} \) are the set of initial states and the set of inputs, respectively, both given as compact convex sets.

We now discuss the algorithm proposed in [12] for computing reachable states using support functions. The algorithm discretizes time with a time step \(\delta \) and computes an over-approximation of the reachable set over the time horizon T as a union of convex sets represented by their support functions, as shown in Eq. 5.

$$\begin{aligned} Reach_{[0, T]}(\mathcal {X} _0) \subseteq \bigcup _{i=0}^{N-1}(\varOmega _i) \end{aligned}$$
(5)

The convex sets \(\varOmega _i\) are given by the following equations:

$$\begin{aligned}&\varOmega _{i+1} = \varPhi _{\delta } \varOmega _i \oplus \mathcal {W} \\ \nonumber&\varOmega _{0} = CH(\mathcal {X} _0,\varPhi _{\delta } \mathcal {X} _0 \oplus \mathcal {V}) \end{aligned}$$
(6)

where \(\oplus \) and CH stand for the Minkowski sum and convex hull operations over sets, respectively, \(\varPhi _{\delta } = e^{\delta A}\), and \(\mathcal {W} \), \(\mathcal {V} \) are convex sets given as follows:

$$\begin{aligned} {\begin{matrix} &{}\mathcal {V} = \delta \mathcal {U} \oplus \alpha \mathcal {B} \\ &{}\mathcal {W} = \delta \mathcal {U} \oplus \beta \mathcal {B} \end{matrix}} \end{aligned}$$
(7)

\(\alpha , \beta \) are constants depending on \(\mathcal {X} _0\), \(\mathcal {U} \), \(\delta \), and the dynamics matrix A, and \(\mathcal {B} \) is the unit ball in the considered norm.

The support-function representation of \(\varOmega _i\) can be seen as an abstraction of its template polyhedral representations. A concretization can be obtained by computing template polyhedral approximations of \(\varOmega _i\) along a set of directions \(\mathcal {D} \). Such a concretization enables efficient intersection, plotting, and other polytope operations, at the expense of an approximation error that depends on the number and choice of template directions.

The algorithm considers a set of bounding directions, say \(\mathcal {D} \), samples the support functions of \(\varOmega _i\) along them, and obtains a set of template polyhedra \(Poly_\mathcal {D} (\varOmega _i)\) whose union over-approximates the reachable set. The support function of \(\varOmega _i\) is computed with the following equations, obtained using the properties of support functions:

$$\begin{aligned}&sup_{\varOmega _{i+1}}(\ell ) = sup_{\varOmega _{i}}(\varPhi _{\delta }^{T} \ell ) + sup_{\mathcal {W}}(\ell ) \\ \nonumber&sup_{\varOmega _0}(\ell ) = max \big (sup_{\mathcal {X} _0}(\ell ), sup_{\mathcal {X} _0}(\varPhi _{\delta }^{T}\ell ) + sup_{\mathcal {V}}(\ell )\big ) \end{aligned}$$
(8)

Unrolling the recurrence in Eq. 8 yields the following closed form:

$$\begin{aligned} sup_{\varOmega _i}(\ell ) = sup_{\varOmega _0}\big ((\varPhi _{\delta }^{T})^i \ell \big ) + \sum _{j=0}^{i-1}sup_{\mathcal {W}}\big ( (\varPhi _{\delta }^{T})^j \ell \big ) \end{aligned}$$
(9)
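For illustration, the recurrence of Eq. 8 (equivalently, the unrolled form of Eq. 9) can be sampled along one direction as in the following C++ sketch. The helper names are ours, and the support functions of \(\mathcal {X} _0\), \(\mathcal {V} \), \(\mathcal {W} \) are assumed given (Proposition 2 for hyperboxes, an LP for general polytopes):

```cpp
#include <algorithm>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using SupFn = std::function<double(const Vec&)>;

// r = Phi_delta^T * l, with Phi stored dense and row-major.
Vec transpose_mult(const std::vector<Vec>& Phi, const Vec& l) {
    Vec r(l.size(), 0.0);
    for (std::size_t i = 0; i < l.size(); ++i)
        for (std::size_t j = 0; j < l.size(); ++j)
            r[j] += Phi[i][j] * l[i];
    return r;
}

// Samples sup_{Omega_i}(l) for i = 0..N-1 via Eq. 9:
//   sup_{Omega_i}(l) = sup_{Omega_0}((Phi^T)^i l) + sum_{j<i} sup_W((Phi^T)^j l)
// with sup_{Omega_0}(v) = max(sup_{X0}(v), sup_{X0}(Phi^T v) + sup_V(v)) (Eq. 8).
std::vector<double> sample_direction(Vec l, const std::vector<Vec>& Phi,
                                     SupFn sup_X0, SupFn sup_V, SupFn sup_W,
                                     int N) {
    auto sup_Omega0 = [&](const Vec& v) {
        return std::max(sup_X0(v), sup_X0(transpose_mult(Phi, v)) + sup_V(v));
    };
    std::vector<double> out(N);
    double acc = 0.0;                    // running sum of sup_W((Phi^T)^j l)
    for (int i = 0; i < N; ++i) {
        out[i] = sup_Omega0(l) + acc;
        acc += sup_W(l);                 // contributes the j = i term next round
        l = transpose_mult(Phi, l);      // l <- Phi^T l
    }
    return out;
}
```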

3 Parallel State-Space Exploration

In this section, we present two approaches to parallelizing the support-function-based reachability algorithm, in Sects. 3.1 and 3.2 respectively.

3.1 Parallel Samplings over Template Directions

The LGG (Le Guernic–Girard) scenario of the state-of-the-art tool SpaceEx [8] computes reachable states using a support-function algorithm with the provision of templates in box, octagonal, and p uniform directions. A box template has 2n directions (\(x_i=\pm 1,\ x_k=0,\ k\ne i\)), whereas an octagonal template has \(2n^2\) directions (\(x_i=\pm 1,\ x_j=\pm 1,\ x_k=0,\ k\ne i,\ k \ne j\)), giving a more precise approximation. The support-function algorithm scales well when the number of template directions is linear in the dimension n of the system; when computing finer approximations with a number of directions quadratic in n, we trade scalability for precision. The box directions can be generated as in the sketch below.
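A minimal C++ sketch of box-template generation (names are ours); octagonal templates extend the same idea to coordinate pairs:

```cpp
#include <vector>

// Generates the 2n box template directions x_i = +/-1 (all other coordinates
// zero). Octagonal templates add the +/-1 combinations over every coordinate
// pair (i, j), giving 2n^2 directions in total.
std::vector<std::vector<double>> box_directions(int n) {
    std::vector<std::vector<double>> D;
    for (int i = 0; i < n; ++i)
        for (double s : {1.0, -1.0}) {
            std::vector<double> l(n, 0.0);
            l[i] = s;
            D.push_back(l);
        }
    return D;
}
```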

The support-function-based algorithm is easy to parallelize by sampling the template directions in parallel [12]; the implementation, however, poses challenges. In this work, we propose a multi-threaded implementation with a master thread spawning a worker thread for every direction in the template set \(\mathcal {D}\). The pseudocode of the master thread is shown in Algorithm 1. A global support-function matrix M with R rows and N columns is allocated to store the support functions computed by the worker threads, where R is the number of directions in \(\mathcal {D} \) and \(N = T/\delta \) is the number of iterations. Each worker thread \(t_{i}\) computes the support-function samples along a direction d(i) in parallel. The results of thread \(t_i\) are written to the row M[i], so there is no write contention among the threads. An entry M(i, j) stores the support function of \(\varOmega _j\) in the ith direction of \(\mathcal {D}\), as shown in lines 2–3 of Algorithm 2.

The master thread waits for all the worker threads to complete. Once they have finished, we have a template polytope \(Poly_{\mathcal {D}}(\varOmega _{i})\) for every convex set \(\varOmega _{i}\), obtained from the support-function samples as shown in lines 7–9 of Algorithm 1.

[Algorithm 1: master thread — parallel sampling over the template directions]
[Algorithm 2: worker thread — support-function samples along one direction]
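A minimal sketch of the master/worker scheme of Algorithms 1 and 2 using C++ threads, reusing sample_direction() from the sketch in Sect. 2.2 (names are ours, not XSpeed's):

```cpp
#include <thread>
#include <vector>

// Master thread of Algorithm 1 (a sketch). Worker i (Algorithm 2) samples the
// support functions along direction D[i] and fills row i of the global matrix
// M; rows are disjoint, so no locking is needed.
std::vector<std::vector<double>>
sample_all_directions(const std::vector<Vec>& D, const std::vector<Vec>& Phi,
                      SupFn sup_X0, SupFn sup_V, SupFn sup_W, int N) {
    std::vector<std::vector<double>> M(D.size());     // R rows, N columns
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < D.size(); ++i)
        workers.emplace_back([&, i] {
            M[i] = sample_direction(D[i], Phi, sup_X0, sup_V, sup_W, N);
        });
    for (auto& t : workers) t.join();    // master waits for all workers
    // Poly_D(Omega_j) is then { x | D[i] . x <= M[i][j] for all i }.
    return M;
}
```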

Sampling with Thread-Safe GLPK. When the initial set \(\mathcal {X} _0\) of the dynamics in Eq. 4 is given as a polytope, the convex sets \(\varOmega _i\) in Eq. 6 are also polytopes, and sampling the support function of a polytope is a linear programming problem. The SpaceEx LGG scenario expects the initial set \(\mathcal {X} _0\) and \(\mathcal {U} \) to be polytopes and samples their support functions using the GLPK (GNU Linear Programming Kit) library [15], an open-source and highly optimized linear programming solver. In our parallel implementation we also assume the initial and input sets to be polytopes and use GLPK to compute their support functions. However, the GLPK library is not thread safe: its implementation uses thread-shared data that suffers from race conditions in multi-threaded executions. To overcome this, we identified the thread-shared data and made it thread-local to ensure thread safety. The thread-safe GLPK objects are then used per thread to compute the support functions along different directions in parallel.
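A sketch of a per-thread support-function evaluation with GLPK, following Proposition 1. Each worker thread owns its glp_prob object, which captures the essence of keeping GLPK's state thread-local; the variable names are ours:

```cpp
#include <glpk.h>
#include <vector>

// Support function of the polytope P x <= Q in direction l, via one LP
// (Proposition 1). Each thread builds and solves its own glp_prob.
double support_polytope(const std::vector<std::vector<double>>& P,
                        const std::vector<double>& Q,
                        const std::vector<double>& l) {
    int m = (int)P.size(), n = (int)l.size();
    glp_prob* lp = glp_create_prob();
    glp_set_obj_dir(lp, GLP_MAX);
    glp_add_rows(lp, m);
    glp_add_cols(lp, n);
    for (int i = 1; i <= m; ++i)
        glp_set_row_bnds(lp, i, GLP_UP, 0.0, Q[i - 1]);   // P_i . x <= Q_i
    for (int j = 1; j <= n; ++j) {
        glp_set_col_bnds(lp, j, GLP_FR, 0.0, 0.0);        // x_j is free
        glp_set_obj_coef(lp, j, l[j - 1]);                // maximize l . x
    }
    // GLPK takes the constraint matrix as 1-based sparse triplets.
    std::vector<int> ia(1), ja(1);
    std::vector<double> ar(1);
    for (int i = 1; i <= m; ++i)
        for (int j = 1; j <= n; ++j) {
            ia.push_back(i); ja.push_back(j); ar.push_back(P[i - 1][j - 1]);
        }
    glp_load_matrix(lp, (int)ar.size() - 1, ia.data(), ja.data(), ar.data());
    glp_simplex(lp, nullptr);
    double sup = glp_get_obj_val(lp);
    glp_delete_prob(lp);
    return sup;
}
```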

3.2 Parallel Exploration of Reachable States

In addition to the parallelization introduced in the previous section, we propose another parallelization in which threads compute reachable states over disjoint intervals of the time horizon in parallel. Our key idea is to compute the reachable states at distinct time points in the time horizon and treat them as initial states for independent reachability computations; the reachable states from each of these initial states are then computed by an independent thread in parallel. For load balancing, the time horizon T is sliced into equal intervals of size \(T_p=T/N\), N being the degree of parallelism. The limitation of this approach is that the input set \(\mathcal {U} \) must be a point set and the dynamics matrix A must be invertible.

Proposition 3

Given linear dynamics of the form \(\dot{x} = Ax(t) + u(t),\ u(t) \in \mathcal {U} \), if the input set \(\mathcal {U} =\{v\}\) is a point set and the matrix A is invertible, the set of states reachable at time \(t_i=iT_p\) is defined as:

$$\begin{aligned} \mathcal {S} (t_i) = e^{AiT_p}.\mathcal {X} _0 \oplus A^{-1}(e^{AiT_p}-I)(v) \end{aligned}$$
(10)

where I is the identity matrix.

Proof

Solving the differential equation \(\dot{x} = Ax(t) + u(t),\ u(t) \in \mathcal {U} \) gives:

$$\begin{aligned} x(t)&= e^{tA}x_0 + \int _{0}^{t} e^{(t - y)A}u(y) dy \\&= e^{tA}x_0 + \int _{0}^{t} e^{(t - y)A}v dy \\&= e^{tA}x_0 + A^{-1} (e^{At}-I)(v) \end{aligned}$$

When \( x(0) \in \mathcal {X} _0\), we apply the Minkowski sum to get:

$$\begin{aligned} \mathcal {X} (t) = e^{tA}\mathcal {X} _0 \oplus A^{-1} (e^{At}-I)(v) \end{aligned}$$

Substituting \(t = i T_p\):

$$\begin{aligned} \mathcal {X} (i T_p) = \mathcal {S} (t_i) = e^{A(i T_p)}.\mathcal {X} _0 \oplus A^{-1}(e^{A (i T_p)} - I)(v) \square \end{aligned}$$

Let \(\varPhi _1 = e^{A(i T_p)}\) and \(\varPhi _2 = A^{-1}(e^{A (i T_p)} - I)\); the support function of \(\mathcal {S} (t_i)\) is then given by:

$$\begin{aligned} sup_{\mathcal {S} (t_i)}(\ell ) = sup_{\mathcal {X} _0}(\varPhi _{1}^{T}\ell ) + sup_{v}(\varPhi _{2}^{T}\ell ) \end{aligned}$$
(11)

The set of reachable states in each time interval \(I_i = [iT_p,(i+1)T_p]\), starting from the states \(x \in \mathcal {S} (t_i)\), is denoted \(R(\mathcal {S} _i)\) and can be computed sequentially using Eq. 6. The computation of \(R(\mathcal {S} _i)\) can also be parallelized over the template directions as proposed in Sect. 3.1; a sketch of the per-slice initialization follows.
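A C++ sketch of the per-slice initialization following Eq. 11, with \(\varPhi _1\) and \(\varPhi _2\) assumed precomputed (e.g., with a library matrix exponential); it reuses the helpers sketched in Sect. 2.2:

```cpp
// Support function of the slice-initial set S(t_i) via Eq. 11. For the
// singleton input {v}, sup_{v}(Phi2^T l) reduces to (Phi2^T l) . v.
double support_slice_initial(const std::vector<Vec>& Phi1,
                             const std::vector<Vec>& Phi2,
                             SupFn sup_X0, const Vec& v, const Vec& l) {
    Vec l1 = transpose_mult(Phi1, l);    // Phi1^T l
    Vec l2 = transpose_mult(Phi2, l);    // Phi2^T l
    double dot = 0.0;
    for (std::size_t k = 0; k < v.size(); ++k) dot += l2[k] * v[k];
    return sup_X0(l1) + dot;             // Eq. 11
}
```

Each of the N threads then uses \(\mathcal {S} (t_i)\) as the initial set \(\varOmega _0\) of its own run of the recurrence in Eq. 8 over the slice \(I_i\).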

Proposition 4

An over-approximation of the reachable states over the time horizon T can be computed by the following relation:

$$\begin{aligned} Reach_{[0,T]}(\mathcal {X} _0) \subseteq \bigcup _{i=0}^{N-1} R(\mathcal {S} _i) \end{aligned}$$
(12)

Proof

\(R(\mathcal {S} _i)\) is computed using Eq. 6 with discretization time step \(\delta \) and \(\mathcal {S} _i\) as the initial set. Since \(\mathcal {S} _i\) is the exact set of states reachable at time \(t=iT_p\), the correctness argument in [12] guarantees that \(Reach_{I_i}(\mathcal {X} _0) \subseteq R(\mathcal {S} _i)\). Therefore, we have:

$$\begin{aligned} Reach_{[0,T]}(\mathcal {X} _0) = \bigcup _{i=0}^{N-1}Reach_{I_i}(\mathcal {X} _0) \subseteq \bigcup _{i=0}^{N-1}R(\mathcal {S} _i) \quad \square \end{aligned}$$

In Sect. 5 we show that computing the reachable states using Proposition 4 gives, in some cases, more precise results than the sequential algorithm of [11, 12]. This is because the approximation error incurred in the computation of \(\varOmega _0\) in Eq. 6 propagates through the sequential algorithm. In our algorithm, since we compute the exact reachable states at the partition time points of the time horizon and recompute \(\varOmega _0^{t_i}\) from them, the propagation of this error may diminish.

4 Sampling Support Functions in GPU

It can be observed from Eqs. 8 and 9 that the support function of \(\varOmega _i\) can be computed from the support functions of \(\mathcal {X} _0\), \(\mathcal {V} \), and \(\mathcal {W} \), and the support functions of \(\mathcal {V}, \mathcal {W} \) can in turn be computed from those of \(\mathcal {U} \) and \(\mathcal {B} \). Therefore, to compute the support functions of \(\varOmega _0,\ldots ,\varOmega _{N-1}\) along a direction \(\ell \), it suffices to compute the support functions of \(\mathcal {X} _0\), \(\mathcal {U} \), and \(\mathcal {B} \) along the directions \(\ell ,\varPhi _{\delta }^{T} \ell ,(\varPhi _{\delta }^{T})^2 \ell , \ldots , (\varPhi _{\delta }^{T})^N \ell \). Unlike the support-function algorithm in [12], which computes the support functions iteratively using Eq. 8, we propose to compute them in a lazy fashion: evaluation is delayed until all the directional arguments have been computed. The support functions are then evaluated in parallel in lines 7–11 of Algorithm 3.

[Algorithm 3: lazy evaluation of support functions]
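A sketch of the direction-generation step of the lazy strategy, reusing transpose_mult() from Sect. 2.2; the returned batch is what lines 7–11 of Algorithm 3 evaluate in parallel:

```cpp
// Lazy strategy: materialize all directional arguments l, Phi^T l, ...,
// (Phi^T)^N l first; only then are sup_{X0}, sup_U, and sup_B evaluated over
// the whole batch (in parallel, or offloaded to the GPU as in Sect. 4.3).
std::vector<Vec> direction_list(Vec l, const std::vector<Vec>& Phi, int N) {
    std::vector<Vec> dirs;
    dirs.reserve(N + 1);
    for (int k = 0; k <= N; ++k) {
        dirs.push_back(l);
        l = transpose_mult(Phi, l);      // next argument (Phi^T)^{k+1} l
    }
    return dirs;
}
```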

Observe that we need to run the same support-function evaluation routine for a convex set, but on different directions, in parallel. We build on this observation and propose to compute the support functions on a SIMT (Single Instruction Multiple Threads) parallel architecture, wherein multiple instances of a procedure execute in parallel on different data. Modern GPU (Graphics Processing Unit) architectures are SIMT in nature, and we therefore propose to offload the support-function computations to the GPU. A brief introduction to the GPU architecture and the CUDA programming model is given in Sect. 4.1.

4.1 CUDA Programming Model

We now briefly present the GPU hardware architecture and the programming model used to implement our parallel algorithms. As illustrated in Fig. 1, the GPU architecture consists of a scalable array of N multithreaded Streaming Multiprocessors (SMs), each made up of M Stream Processor (SP) cores. Each core is equipped with a fully pipelined integer arithmetic logic unit (ALU) and a floating-point unit (FPU), executing one integer or floating-point instruction per clock cycle. In our experiments we used an NVIDIA GeForce GTX 670 with 7 SMs and 192 SPs per SM, for a total of 1344 SPs.

NVIDIA also provides a special Application Programming Interface (API) called Compute Unified Device Architecture (CUDA) that facilitates the development of efficient applications tuned for NVIDIA GPUs. CUDA extends the C and FORTRAN languages with keywords and language primitives suited to high-performance hardware-based multi-threading. We have implemented our parallel algorithms using the C extension.

Fig. 1. GPU architecture

The CUDA programming model uses thousands of lightweight threads arranged into one- to three-dimensional thread blocks. A thread executes a function, called the kernel, that contains the computations to be run in parallel on the GPU device. A CUDA program starts running on the Central Processing Unit (CPU), referred to as the host; whenever a kernel is launched from the host code, execution continues on the GPU. The maximum number of threads running a kernel is fixed at launch time (with some exceptions on modern GPU cards supporting dynamic parallelism).
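A minimal, purely illustrative CUDA sketch of the kernel/host split just described:

```cuda
#include <cuda_runtime.h>

// The kernel: each thread scales one vector element.
__global__ void scale(double* x, double c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) x[i] *= c;
}

// Host side: the <<<blocks, threads>>> launch configuration fixes the
// number of threads executing the kernel.
void scale_on_gpu(double* d_x, double c, int n) {
    int threads = 32, blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_x, c, n);
    cudaDeviceSynchronize();   // wait for the kernel to finish
}
```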

Each thread is assigned to an SP and each thread block is processed by an SM. The thread execution model in CUDA is Single Instruction Multiple Threads (SIMT). SIMT differs from classical Single Instruction Multiple Data (SIMD) in that the threads sharing the same instruction address and running synchronously are organised, within a thread block, into groups of 32 threads called warps. Within a warp, divergence of thread execution into different branches due to if-then-else constructs considerably reduces the level of parallelism and degrades the performance of the kernel execution.

Threads can access different types of memory, and their judicious use is key to performance. The most general is the off-chip global memory, which all threads can read and write. The host can also read and write the global memory, so it is commonly used for communication between the host and the GPU device. Global memory is slow, and it is very important to access it in a coalesced way, using a single memory transaction of 32, 64, or 128 bytes. The two caches L1 and L2 shown in Fig. 1 mitigate this bottleneck by storing copies of the most frequently accessed global-memory data. Significantly faster levels of memory are available within an SM, including 32–64 KB of on-chip registers partitioned among all threads; as such, using a large number of registers within a CUDA kernel limits the number of threads that can run concurrently, and local memory is invoked when a thread runs out of available registers. In addition, each SM has a shared memory region (16–48 KB). This level of memory, which can be accessed nearly as quickly as the registers, facilitates communication between threads and can be used as a programmer-controlled memory cache.

Threads located in the same thread block can cooperate in several ways. They can insert a synchronization point into the kernel, which requires all threads in the block to reach that point before execution can continue, and they can share data during execution. In contrast, threads located in different thread blocks cannot synchronize with each other and essentially operate independently. Although a small number of threads or blocks can be used to execute a kernel, such an arrangement would not fully exploit the computing potential of the GPU.

4.2 Computing Support Functions of Polytopes in GPU

As discussed in Sect. 2, when \(\mathcal {X} _0\) and \(\mathcal {U} \) are polytopes, evaluating their support functions is equivalent to solving a linear program (LP). The simplex algorithm [4, 5] is a well-known procedure that solves LPs efficiently in practice, and there is prior work on implementing it in CUDA on a CPU–GPU heterogeneous system: an efficient implementation of the revised simplex method is shown in [21], and a multi-GPU implementation of the simplex procedure is reported in [14]. However, the reported results show a speed-up over a sequential CPU implementation only when the LP size is at least \(500 \times 500\) (500 variables, 500 constraints). Performance is poor for small LPs because of the CPU–GPU memory-transfer latency incurred in copying the simplex tableau; the time gained by parallelization outweighs this latency only for large LP instances. Since the benchmarks known to us are of dimension much smaller than 500, we did not pursue a simplex implementation on the GPU.

4.3 Computing Support Functions of Hyperboxes in GPU

When the initial set \(\mathcal {X} _0\) and input set \(\mathcal {U} \) are given as hyperboxes, which are special cases of polytopes, their support functions can be computed using Proposition 2 instead of solving an LP with the simplex algorithm. This also avoids the expensive CPU-to-GPU transfer of the simplex tableau, a data structure used by the simplex algorithm. Building on this observation, we implemented a CUDA procedure to compute the support function of a hyperbox. Obtaining a speed-up from the GPU is nevertheless challenging due to issues such as warp divergence, memory-transfer latency, and GPU occupancy. We map each CUDA block to the computation of the support function along one sampling direction; since CUDA blocks are scheduled across SMs, this ensures that all the SMs are utilized whenever the number of sampling directions exceeds the number of SMs in the GPU. Each block is one-dimensional and contains only 32 threads, since instructions in the GPU are scheduled per warp, a group of 32 threads, and the task of computing the support function of a hyperbox is lightweight. The maximum number of support-function evaluations that can be performed in parallel is limited by the number of directions that can be transferred to the GPU global memory; we aim for maximum parallelism by offloading tasks in batches of the maximum possible size. The pseudocode of the GPU offloading routine is shown in Algorithm 4.

[Algorithm 4: GPU batch offloading routine]
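A CUDA sketch of the kernel and offloading routine described above: one block of 32 threads (one warp) computes the support function along one direction by Proposition 2. The layout and names are illustrative, not XSpeed's exact code:

```cuda
#include <cuda_runtime.h>

// One block per sampling direction. Threads stride over the n dimensions,
// accumulate l_i * h_i per Proposition 2, and reduce in shared memory.
__global__ void support_hyperbox_kernel(const double* dirs, // batch, row-major
                                        const double* a, const double* b,
                                        int n, double* out) {
    __shared__ double partial[32];
    const double* l = dirs + (size_t)blockIdx.x * n;
    double s = 0.0;
    for (int i = threadIdx.x; i < n; i += 32)
        s += l[i] * (l[i] < 0.0 ? a[i] : b[i]);
    partial[threadIdx.x] = s;
    __syncthreads();
    if (threadIdx.x == 0) {              // thread 0 folds the warp's partials
        double sup = 0.0;
        for (int k = 0; k < 32; ++k) sup += partial[k];
        out[blockIdx.x] = sup;
    }
}

// Host-side offload of one batch of R directions (Algorithm 4 sketch).
void offload_batch(const double* h_dirs, const double* h_a, const double* h_b,
                   int R, int n, double* h_out) {
    double *d_dirs, *d_a, *d_b, *d_out;
    cudaMalloc(&d_dirs, (size_t)R * n * sizeof(double));
    cudaMalloc(&d_a, n * sizeof(double));
    cudaMalloc(&d_b, n * sizeof(double));
    cudaMalloc(&d_out, R * sizeof(double));
    cudaMemcpy(d_dirs, h_dirs, (size_t)R * n * sizeof(double),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_a, h_a, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(double), cudaMemcpyHostToDevice);
    support_hyperbox_kernel<<<R, 32>>>(d_dirs, d_a, d_b, n, d_out);
    cudaMemcpy(h_out, d_out, R * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_dirs); cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
}
```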

5 Experiments

The parallel algorithms, including the CUDA implementation, are implemented as part of the tool XSpeed. To measure their performance, we experiment on two benchmarks and compare against SpaceEx's LGG scenario [8] and against an optimized sequential implementation of the support-function algorithm of [12].

5.1 Five Dimensional System

We consider a five-dimensional linear continuous system as a benchmark from [10]. Since our parallel state-space exploration algorithm requires the input set \(\mathcal {U}\) to be a point set, we take \(\mathcal {U} = \{(0.01, 0.01, 0.01, 0.01, 0.01)\}\). The initial set \(\mathcal {X} _0\) is a hyperbox with sides of length 0.02 centered at (1, 0, 0, 0, 0). For the matrix A, the reader may refer to [10].

Figure 2 illustrates the parallel exploration with a sliced time horizon: a time horizon of 5 units is sliced into five intervals, each of size 1 unit, and five threads compute the reachable sets in parallel starting from the initial sets \(\mathcal {S} (t=0)\), \(\mathcal {S} (t=1)\), \(\mathcal {S} (t=2)\), \(\mathcal {S} (t=3)\), and \(\mathcal {S} (t=4)\).

Fig. 2. Illustrating parallel state-space exploration in a sliced time horizon

5.2 Helicopter Controller

To measure the performance on a high-dimensional system, we consider the helicopter-controller benchmark from [8, 20]. It models the controller of a Westland Lynx military helicopter with 8 continuous variables; the controller itself is an LTI system with 20 variables, so the control system has 28 variables in total. We take the initial set \(\mathcal {X} _0\) to be a hyperbox and the input set \(\mathcal {U}\) to be the origin \(\{0\}\).

Table 1 shows the speed-up in computing the reachable states with our parallel direction sampling over the sequential support-function algorithm, on a 4-core and a 6-core machine with hyper-threading, namely an Intel Core i7-4770 (3.40 GHz, 8 GB RAM) and an Intel Xeon E5-2420 (1.2 GHz, 46.8 GB RAM), respectively. The results are an average of 10 runs for a time horizon of 5 units and a time step of \(1.7 \times 10^{-3}\) units. A speed-up of almost 7\(\times \) is observed for the helicopter model. The gain in CPU utilization shows how our parallel implementation effectively exploits the power of multi-core processors.

Fig. 3. Illustrating speed-up using parallel state-space exploration in time slices over the sequential algorithm and the SpaceEx LGG scenario

Fig. 4. Illustrating gain in precision with parallel state-space exploration over the sequential algorithm

Figure 3 shows the speed-up obtained with the parallel state-space exploration using octagonal directions and a time step of 0.0048, on an Intel Core i7-4770 (4 cores, 8 threads, 3.40 GHz, 8 GB RAM). The SpaceEx results were obtained by running the executable available at http://spaceex.imag.fr/ on the same machine with the same parameters. We show that selecting the right partition size is important for optimal speed-up: partitioning beyond a limit yields higher precision but degrades performance, as the threading overhead outruns the gain from parallelism; the threshold depends on the number of cores in the underlying multi-core architecture. Figure 4 shows the gain in precision with box directions and time steps of 0.01 and 0.0048 for the five-dimensional and helicopter benchmarks, respectively. The gain in precision arises because the time-sliced algorithm computes exact reachable states at time points in the horizon, diminishing the propagation of the approximation error introduced by the computation of \(\varOmega _0\) in the support-function algorithm.

Table 1. Speed-up and utilization gain with parallel support function samplings
Table 2. Performance speed-up with samplings in GPU

Table 2 shows the speed-up of the reachability analysis when the support functions are sampled in parallel on the GPU (Algorithms 3 and 4), compared to the sequential support-function algorithm using GLPK and to the SpaceEx LGG scenario. The column iters in the table refers to the discretization factor of the time horizon. The experiments were performed on an Intel Q9950 (2.84 GHz, 4 cores, no hyper-threading, 8 GB RAM) with a GeForce GTX 670 GPU card. We observe a speed-up of 9\(\times \) to 12\(\times \) over the sequential implementation, and a maximum speed-up over SpaceEx of 53\(\times \) for the five-dimensional model and 38\(\times \) for the helicopter model, for different parameters of the support-function algorithm. Observe that the performance of our algorithms improves with the number of cores in the machine, which suggests that the performance of XSpeed can scale automatically on future multi-core machines and on GPUs with a higher degree of parallelism.

6 Conclusion

We presented a parallel implementation of the support-function algorithm and a time-sliced parallel state-space exploration algorithm. A lazy strategy for evaluating support functions, which exposes parallelism, is illustrated and implemented in CUDA to offload the computation to the GPU. We show that the performance of reachability algorithms for linear dynamical systems can be improved considerably on modern multi-core processors. GP-GPUs have shown promising performance gains in many scientific applications, and we show that they can also substantially improve the performance of reachability analysis. The parallel algorithms and the GP-GPU task offloading are implemented in the tool XSpeed.