
1 Introduction

Objectives. In this paper, we apply the concept of black box checking, as originally presented by Peled et al. [21, 22], in the context of white box module testing. Given an LTL property \(\varphi \), tests are executed for learning the true behaviour of an implementation under test (IuT), which is a software module I; this behaviour is expressed by means of an initially unknown symbolic finite state machine B. While trying to learn the true representation of B from the test cases executed so far, violations of \(\varphi \) are detected either by means of a monitor checking the reactions of I to the test case inputs, or by model checking the model increments \(B = M_1, M_2, \dots \) learnt so far. If a complete test of I against \(B = M_k\) proves the language equivalence between I and B, the verification campaign terminates: I fulfils \(\varphi \) if and only if B fulfils this property. This “proof by testing and property checking” holds under certain hypotheses about the maximal number n of distinguishable states in I and about the guard expressions and output assignments used by I. This information can be extracted from the IuT by means of static analyses. Since these analyses are fairly simple and do not require a full understanding of the programming language semantics, this approach to property verification is a suitable method for testing modules programmed in complex programming languages like C++, Java, or C#, for which software model checkers accepting the complete syntax do not exist.

Background: Black Box Checking. In the original work by Peled et al. [21, 22], model learning was performed using Angluin’s \(L^*\) algorithm [1]: under the assumption that I has at most n distinguishable states, the black box B can be reconstructed incrementally by executing finitely many tests against I.

Some tests serve to elaborate a new hypothesis about B (say, \(B = M_i\)), other tests serve to verify or falsify that I is language-equivalent to the current version of B. For the latter task, the W-Method [9, 26] was used in [21, 22]. This is a complete testing method in the sense that, under the hypothesis that I has at most n states, I passes the tests generated by the W-Method if and only if it is language-equivalent to \(M_i\). Failed test cases can be used by the \(L^*\)-algorithm to modify and extend \(M_i\), in order to create a refined model version \(M_{i+1}\).

Using model checking, each new version of B is verified against \(\varphi \). To this end, the product of B and a Büchi-automaton P accepting \(\lnot \varphi \) is constructed. If the language of product automaton \(B\times P\) is non-empty, this indicates the existence of a counterexample, that is, an infinite input/output sequence \(\pi \) violating \(\varphi \) [2]. For safety properties, the violation of \(\varphi \) can already be demonstrated on a finite prefix \(\pi '\) of \(\pi \) [24]. For liveness properties, omega regularity implies that the infinite counterexample \(\pi \) can be written as \(\pi _1\pi _2^\omega \) (infinitely many copies of \(\pi _2\) are appended to \(\pi _1\)), with finite input/output sequences \(\pi _1, \pi _2\) [2]. Since I is assumed to have at most n states, it accepts \(\pi =\pi _1\pi _2^\omega \) if and only if it accepts \(\pi _1\pi _2^n\), since the latter already implies the existence of a “lasso” [4] starting with \(\pi _1\) and ending in a loop endlessly repeating \(\pi _2\). Therefore, either \(\pi '\) or \(\pi _1\pi _2^n\) is run against I. If the counterexample is accepted by I, an error has been found, and the combined learning and testing process can be aborted. If I does not accept the counterexample, this information can again be used to update B via continued learning. If the latest increment \(B = M_k\) passes the check against \(\varphi \), and the complete test suite proves that I and B are language-equivalent, the black box testing campaign has proven that I satisfies \(\varphi \), under the hypothesis that I has at most n distinguishable states.
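The lasso argument can be made concrete with a small sketch: under the assumption that the implementation behaves like a deterministic machine with a partial transition function (the `Dfa` type and names below are purely illustrative, not taken from [21, 22]), acceptance of the infinite word \(\pi_1\pi_2^\omega\) can be tested on the finite unrolling \(\pi_1\pi_2^n\).

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Purely illustrative deterministic acceptor with a partial transition
// function; "accepting" a finite word means the machine can execute it.
struct Dfa {
    int initial;
    std::map<std::pair<int, char>, int> delta;  // (state, symbol) -> next state

    bool accepts(const std::vector<char>& word) const {
        int s = initial;
        for (char x : word) {
            auto it = delta.find({s, x});
            if (it == delta.end()) return false;  // no transition: word rejected
            s = it->second;
        }
        return true;
    }
};

// Unroll the lasso pi1.pi2^omega into the finite test word pi1.pi2^n:
// with at most n states, some state repeats within the n copies of pi2,
// so acceptance of the unrolling implies acceptance of the infinite word.
std::vector<char> lassoTest(const std::vector<char>& pi1,
                            const std::vector<char>& pi2, int n) {
    std::vector<char> w = pi1;
    for (int i = 0; i < n; ++i) w.insert(w.end(), pi2.begin(), pi2.end());
    return w;
}
```

The finite word returned by `lassoTest` is what would be executed against I in place of the infinite counterexample.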

Contributions. In this paper, we refine and optimise the black box checking approach in several ways and specialise it for the purpose of white box software module testing, including tool support. For B, we admit (nondeterministic) symbolic finite state machines (SFSM) with finite state space, input and output variables over arbitrary primitive data types (including infinite types like \(\mathbb {Z}\) and \(\mathbb {R}\)) and transitions labelled by guard conditions over input variables and output expressions over output and input variables. We advocate a white-box approach which is quite realistic for software in safety-critical systems, where source code needs to be verified by independent verification teams [28]. This allows us to determine upper state bounds n and identify the guard and assignment expressions used in the code by means of static analyses. These static analyses ensure that a passed black box checking suite corresponds to an actual proof that I satisfies property \(\varphi \).

Regarding methodological contributions, the application of black box checking to software with conceptually infinite input and output types is enabled by an equivalence class partitioning method previously developed by the authors [14]. Without this abstraction, black box checking would be infeasible when interface variables of type double, float, or int are used directly, due to the enormous size of the alphabets involved.

Furthermore, we reduce the number of situations where tentative models \(B = M_i\) need to be checked by means of a complete testing method. In particular, our strategy allows tentative models to be checked later, after many distinguishable states (say, \(\ell \)) of the IuT have already been discovered. This significantly reduces the exponential term \(p^{n-\ell +1}\) influencing the size of the complete test suite, where \(n \ge \ell \) is the upper bound of potential distinguishable states in I, and p is the number of input/output equivalence classes derived from guard conditions and output expressions extracted from the code, as described below. Instead of the “classical” \(L^*\)-algorithm, we use a novel, highly effective state machine learning algorithm proposed by Vaandrager et al. [25]. For generating complete test suites, a variant of the complete H-Method [12] is used, which needs significantly fewer test cases than the W-Method in the average case [13]. We have modified the H-Method for online testing: this means that the test case generation is incremental and interleaved with the test execution, so that it is unnecessary to create a complete suite when tests of I against the current version of B fail early. We apply the monitor concept proposed by Bauer et al. [3] for detecting safety violations on the fly, during tests intended for model learning. This reduces the need to perform complete model checking runs of \(B\times P\) against \(\varphi \). To speed up the learning process and to avoid having to create complete suites for too many intermediate increments of B, we apply coverage-guided fuzz testing [5, 17] for finding many distinguishable states of the implementation at an early stage. Again, this leads to small exponents \(n-\ell +1\) in the term \(p^{n-\ell +1}\) dominating the number of test cases to perform for a complete language equivalence test.

While these techniques for effort reduction cannot improve the worst case complexity that was already calculated by Peled et al. [21, 22], their combination significantly improves black box checking performance in the average case. We confirm this by several experiments verifying control software from the automotive domain. These experiments also show that the property testing approach described in this paper is effectively applicable for testing modules performing control tasks of realistic size and complexity. Therefore, the approach advocated here is an interesting alternative to proving code correctness by means of code-based model checkers or proof assistants. From the perspective of standards for software development in safety-critical systems [8, 16, 28], our approach even has a significant advantage in comparison to “pure” code verification, since tests are actually executed against the IuT. The standards emphasise that verification may never be based on static analyses (model checking, formal proof) alone: it is always necessary to perform dynamic tests of the integrated HW/SW system as well.

To the best of our knowledge, the approach presented here is the first to use equivalence class abstractions for enabling complete property testing of source code with large interfaces, using black box checking in combination with fuzzing.

Regarding the implementation of the approach, we present the open source library libsfsmtest for complete model-based or property-oriented module testing, whose latest version supports the module testing strategy described in this paper. For users only interested in the application of the library for practical testing, a cloud interface is provided, supporting both test suite generation and module test execution.

Related Work. Meng et al. [18] confirm that fuzz testing can be effective for testing software against properties specified in LTL. However, their approach does not provide any completeness guarantees: the tool LTL-fuzzer created by the authors is to be used as an effective bug finder.

Pferscher et al. [23] also combine model learning and fuzzing, but with the objective to check whether an implementation conforms to a reference model, while our focus here is on property-oriented testing. Their fuzzer is not guided by the code coverage achieved, as in our approach, but by the coverage of a reference model. Since the latter has not been validated with respect to completeness and consistency, the testing process can only reveal discrepancies between reference model and implementation, but cannot provide a correctness proof.

The model learning aspect of black box checking has received much attention since Angluin’s seminal paper [1], and a comprehensive overview of improvements and alternative approaches to automata learning is given by Vaandrager et al. [25]. We could have made use of the LearnLib library [15] for the model learning part in our Algorithm 2 (see Sect. 3). However, we would not have used the W-Method or Wp-Method implemented there for equivalence testing and finding counterexamples, since our own library libfsmtest provides methods like the H-Method, which requires far less test effort in the average case. Moreover, the new data structure and associated learning algorithms proposed by Vaandrager et al. [25] are not yet available in LearnLib, and they seemed particularly attractive to us with respect to maintainability and performance.

An alternative to LearnLib is AALpy by Aichernig et al. [19]. While its Python implementation seems less attractive to us, due to the better performance of C++, AALpy uses a strategy for disproving conformance between preliminary model versions and an implementation that is an interesting alternative to our current implementation: AALpy tries to avoid the generation of unnecessary complete conformance test suites by combining random testing with the W-Method, expecting to find early discrepancies between a preliminary version of the model and the implementation by means of random testing. In our approach, we prefer to concentrate random testing in an initial phase of coverage-guided fuzzing, with the objective to find an initial candidate for machine B with as many states as possible. After that, we rely on conformance tests without randomisation, but create the test cases of the H-Method incrementally, which also avoids the creation of a full conformance test suite as long as B and I do not conform.

Waga [27] presents a black box checking approach that is complementary to ours in several ways. (1) The main objective is bug finding for cyber-physical systems, while we focus on complete property checks for software modules. (2) Waga applies signal temporal logic, while we apply LTL. (3) Waga does not use any means of abstraction comparable to the equivalence class abstractions we consider to be crucial for complete property checking. Summarising, Waga’s approach performs well for the purpose of bug finding on system level, while the method advocated here provides complete checks on module level.

Overview. In Sect. 2, we summarise the foundations required for the combined testing and black box checking approach described in this paper. In Sect. 3, the methodological main result is presented. In Sect. 4, a short summary of the available tool support is given. In Sect. 5, the application of our approach with this tool platform is described, and performance data is presented. Section 6 contains a conclusion.

2 Theoretical Foundations

2.1 Black Box Checking

The strategy for combined learning, model checking, and testing proposed by Peled et al. [22] is shown in Algorithm 1, with some adaptations for the notation used in this paper. The strategy uses two sub-functions for learning and testing: (1) As the first sub-function, Angluin’s \(L^*\)-algorithm [1] is invoked (lines 6, 25) for learning the internal structure of the black box B representing the true behaviour of implementation I. The \(L^*\)-algorithm is called in Algorithm 1 with three parameters \((I,M_i,\pi )\): I is the implementation, and the \(L^*\)-algorithm may execute additional tests against I, in order to produce a new model. Parameter \(M_i\) specifies the latest assumption for the representation of B, and \(\pi \) is a word representing a counterexample that is either accepted by \(M_i\), but not by B, or vice versa. Based on this information, \(L^*\) returns a more refined model \(M_{i+1}\).

(2) As the second sub-function, the W-Method [9, 26] \(\text {VC}(I,M_i,\ell , k)\) is used as a conformance test that is able to prove or disprove the language equivalence between \(M_i\) and I, under the hypothesis that I has no more than k distinguishable states. The algorithm is called with the implementation I to be used in the test, the currently learnt, minimised model \(M_i\) that may or may not be equivalent to I, the number \(\ell \) of distinguishable states in \(M_i\), and the currently assumed upper bound \(k\le n\) of distinguishable states in I. Note that the worst case estimate for the number of test steps to be executed for such a conformance test is \(O(\ell ^3p^{n-\ell +1})\) [9].

Initially, the \(L^*\)-algorithm is set up with the empty machine (line 6). Then the implementation is tested until (a) either the learnt model B satisfies \(\varphi \) and has been shown to be language-equivalent to I by means of complete tests, under the hypothesis that I has at most n states (line 18), or (b) an approximation \(M_i\) of B has been learnt that violates \(\varphi \) on an infinite word \(\pi _1\pi _2^\omega \), and this word is accepted by the implementation (line 22).

(Algorithm 1: listing not reproduced; the line numbers cited in the text refer to it.)

Once a hypothetical model \(M_i\) has been proposed by the \(L^*\)-algorithm, its product with the Büchi-automaton P accepting \(\lnot \varphi \) is constructed (line 10). If the language of this product is empty, this implies that \(M_i\) does not accept a word violating \(\varphi \). Therefore, it is checked whether \(M_i\) is language-equivalent to I, under the hypothesis that I does not have more than n states (lines 11–19). This is done incrementally over \(k=\ell ,\dots ,n\), in order to avoid superfluous tests if the non-equivalence can already be detected with a smaller value \(k < n\). Therefore, the full number of \(O(\ell ^3p^{n-\ell +1})\) test steps only needs to be executed if I conforms to \(M_i\). If language equivalence between \(M_i\) and I can be established by the conformance tests, the strategy terminates with verdict ‘\(\text {pass}\)’, since I conforms to a Mealy machine \(B = M_i\) that fulfils \(\varphi \).

If the language of the product \(X := M_i\times P\) is non-empty, this means that \(M_i\) accepts a word satisfying \(\lnot \varphi \). Omega regularity implies that such a word can be written as \(\pi _1\pi _2^\omega \), with finite prefix \(\pi _1\), followed by an infinite repetition of finite word segment \(\pi _2\). To test whether the implementation accepts \(\pi _1\pi _2^\omega \), it suffices to check whether it accepts \(\pi _1\pi _2^n\), since I is assumed to have at most n distinguishable states. If I accepts the finite test \(\pi _1\pi _2^n\), we know that it accepts a word violating \(\varphi \) and can stop the procedure by returning ‘\(\text {fail}\)’ (line 22). There is no further need to look for a more refined model \(B = M_{i+j}\) representing the true behaviour of I, since the implementation must be fixed anyway.

If, however, I rejects \(\pi _1\pi _2^n\), this implies that the implementation cannot be language-equivalent to the currently assumed representation \(M_i\) of B. Now we look for the shortest prefix \(\pi \) of \(\pi _1\pi _2^\omega \) that is rejected by I. This prefix is suitable as a “teacher’s response” for the \(L^*\)-algorithm, to be used to construct a more refined version \(M_{i+1}\) of the true implementation behaviour (line 25).

Peled et al. prove ([22, Theorem 3]) that if the implementation satisfies \(\varphi \), the worst-case time complexity of the strategy described above is \(O(\ell ^3p^\ell + \ell ^3p^{n-\ell +1} + \ell ^2mn)\); otherwise (error case), it is \(O(\ell ^3p^\ell + \ell ^2mn)\). The higher complexity in the no-error case, given by the term \(\ell ^3p^{n-\ell +1}\) in the complexity sum, is derived from the fact that the equivalence tests of the implementation against the learnt model B need to execute all test steps required for the conjecture that I has at most n states. In the error case, these tests can be aborted earlier.

2.2 Equivalence Class Construction for SFSM

We summarise here previously obtained results [14] that are relevant for the present paper. A symbolic finite state machine (SFSM) M is a state machine operating on a finite set of control states and input variables and output variables from a symbol set \( V = I \cup O\). Variables are typed by some (possibly infinite) set D. A variable valuation is a function \(\sigma \in D^ V \), associating a value \(\sigma (v)\) with each variable symbol \(v\in V \). Given a quantifier-free first order expression e with free variables in \( V \), we say that \(\sigma \) is a model for e (written \(\sigma \models e\)), if and only if the formula \(e[v/\sigma (v)~|~v\in V ]\), that is created from e by exchanging every occurrence of a variable symbol \(v\in V \) by its valuation \(\sigma (v)\), evaluates to true.

Table 1. Construction method for I/O equivalence classes (from [14]).

A transition relation \(s_1 \xrightarrow {g/a} s_2\) connects certain control states \(s_1, s_2\). The transition label g/a consists of a guard expression g, that is, a quantifier-free first order expression over variables from I, and update expressions a that are first order expressions over at least one output variable and optionally variables from I. The language of M is the set \(L(M)\subseteq (D^ V )^\omega \) of all infinite traces of valuations \(\sigma _1,\sigma _2,\dots \in D^ V \), such that there exists a sequence of states \(s_0,s_1,\dots \) starting in the initial state and guard and output expressions \((g_1/a_1)(g_2/a_2)\dots \), such that

$$ \forall i > 0\centerdot \; s_{i-1} \xrightarrow {g_i/a_i} s_i \;\wedge \; \sigma _i \models g_i\wedge a_i. $$
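The condition \(\sigma \models g \wedge a\) for a single transition label can be sketched in a few lines of C++; the types and names below are illustrative stand-ins, not the SFSM representation used in our tools. A valuation enables a transition if and only if it is a model of both the guard and the output expression.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Illustrative encoding (not the libsfsmtest types): a valuation sigma maps
// variable names to values; a quantifier-free expression is modelled as a
// predicate over valuations.
using Valuation = std::map<std::string, double>;
using Expr = std::function<bool(const Valuation&)>;

// A single SFSM transition s1 --g/a--> s2.
struct Transition {
    int src, dst;
    Expr guard;   // over input variables only
    Expr output;  // over at least one output variable, optionally inputs
};

// sigma |= g /\ a: the valuation is a model of both parts of the label.
bool enabled(const Transition& t, const Valuation& sigma) {
    return t.guard(sigma) && t.output(sigma);
}
```

For example, a transition with guard \(x > 0\) and output expression \(y = 2x\) is enabled under \(\{x \mapsto 1, y \mapsto 2\}\) but not under \(\{x \mapsto 1, y \mapsto 3\}\).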

The property testing approach described in this paper applies to all software modules whose input/output behaviour can be described by means of an SFSM. The class of real-world applications that can be modelled by SFSMs is quite large; examples include airbag control modules, anti-lock braking systems (see Sect. 5), and train control systems.

An input/output equivalence class \(\text {io}\) is a set of valuations \(\sigma \in D^ V \) constructed according to the specification in Table 1. The intuition behind this specification is that the input/output equivalence classes partition the set \(D^ V \) of valuations: two members of the same class are models for exactly the same conjunction over all guard conditions, output expressions, and atomic propositions occurring in the LTL property \(\varphi \) to be verified, each conjunct occurring either in positive or negated form. No valuation can be in more than one class, since two classes differ in the sign (positive/negated) of at least one conjunct.
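Under this reading of Table 1, the class of a valuation is fully determined by the vector of signs (satisfied or falsified) of all guard conditions, output expressions, and atomic propositions. The code below is a simplified illustration of that idea, with expressions again modelled as predicates over valuations; in the actual construction, only satisfiable sign combinations yield non-empty classes.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

using Valuation = std::map<std::string, double>;
using Expr = std::function<bool(const Valuation&)>;

// Sign vector of a valuation: one Boolean per conjunct (guard, output
// expression, or atomic proposition). Two valuations are I/O-equivalent iff
// their sign vectors coincide; distinct vectors give disjoint classes, so
// the classes partition the set of valuations.
std::vector<bool> signVector(const std::vector<Expr>& conjuncts,
                             const Valuation& sigma) {
    std::vector<bool> signs;
    for (const auto& e : conjuncts) signs.push_back(e(sigma));
    return signs;
}
```

Comparing sign vectors then decides class membership without enumerating the (possibly infinite) valuation domain.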

Two sequences \(\pi _1, \pi _2\in (D^ V )^\omega \) of valuations are equivalent if each pair of corresponding sequence elements \((\pi _1(i), \pi _2(i)),\ i = 1,2,\dots \) is contained in the same input/output equivalence class. The following properties of equivalent traces \(\pi _1, \pi _2\in (D^ V )^\omega \) are crucial in the context of this paper [14, Theorem 2]: (1) \(\pi _1\in L(M)\) if and only if \(\pi _2\in L(M)\). (2) \(\pi _1\) and \(\pi _2\), when contained in L(M), cover the same sequences of states in M (there is only one uniquely determined state sequence if M is deterministic). (3) \(\pi _1\models \varphi \) if and only if \(\pi _2\models \varphi \).

3 Optimisation of the Test Method

Based on black box checking (Algorithm 1), we propose the new white box module testing strategy specified in Algorithm 2 and incorporating several optimisations. This strategy is divided into three phases: (1) setup, (2) fuzzer-guided exploration, and (3) learning as explained below.

(Algorithm 2: listing not reproduced; the line numbers cited in the text refer to it.)

Phase 1: Setup. In the first phase, we exploit white box knowledge on the IuT in order to abstract from its possibly infinite input and output domains to finitely many equivalence classes. To this end, the algorithm uses the two input parameters \(\varSigma _I\) and \(\varSigma _O\), denoting the guard conditions and output expressions occurring in the IuT, respectively. Together with the atomic propositions AP occurring in the LTL property \(\varphi \) to check, these are employed in computing input/output classes \(\mathcal {A}\) (lines 10 and 11) using the techniques described in Sect. 2. These classes could then already serve as symbolic inputs. However, since multiple input/output classes may share the same input valuations, this could introduce superfluous inputs. Thus, line 12 of the algorithm attempts to minimise the number of inputs by only selecting sufficiently many input valuations \(\sigma \in \mathcal {H}\) to provide input representatives of all input/output classes. In the following, we use elements of \(\mathcal {H}\) both as symbols and as concrete input valuations.

The first phase concludes by initialising a tree T representing a prefix-closed set of symbolic traces observed in the IuT (line 13), as well as a property monitor P constructed as proposed by Bauer et al. [3] (line 14) to accept \(\lnot \varphi \). This monitor detects violations of safety properties \(\varphi \) observed during the subsequent execution of Algorithm 2. Since violations of liveness properties can only be determined on infinite traces, these are accepted, but do not lead to failure indications by the monitor.

Phase 2: Fuzzer-Guided Exploration. In the second phase, coverage-guided fuzzing is employed to quickly reach a large number of distinct states in the IuT and record observations on the behaviour of the IuT, with the aim of speeding up the subsequent learning phase. Experiments confirming the efficacy of this approach are discussed in Sect. 5.

Fuzzing is used for several iterations (lines 19–27). In each iteration, a non-empty sequence of integers \(\bar{b}\) is obtained from the fuzzer and translated into a sequence of input symbols \(\bar{x}\) by mapping each integer to an element of \(\mathcal {H}\) (lines 21 and 22). Thereafter, \(\bar{x}\) is applied to the IuT (line 24). All such invocations of the IuT in Algorithm 2 occur via calls to procedure \(\text {outputQuery}(I,T,P,\bar{x})\). These reset the IuT and P to their initial states and initialise a symbolic trace \(\gamma = \epsilon \). The following steps are then performed for each input symbol x in \(\bar{x}\) in turn: First, x is translated into the concrete input valuation \(\sigma _I\) it symbolises, which is then applied as input to the IuT. Next, the outputs \(\sigma _O\) observed in response are used to create a valuation \(\sigma _I\cup \sigma _O\) ranging over all input and output variables. This valuation belongs to exactly one input/output class \(y \in \mathcal {A}\), which is considered the output symbol observed for input symbol x. Thereafter, x/y is appended to observation \(\gamma \), which is then added to T. Finally, P is used to check whether \(\gamma \) violates \(\varphi \), in which case Algorithm 2 returns fail.
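The steps of \(\text {outputQuery}\) can be sketched as follows; all interfaces (`Iut`, `Monitor`, and the two helper functions) are hypothetical stand-ins for illustration and do not reflect the libsfsmtest API. For simplicity, valuations and class indices are encoded as plain integers.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the IuT and the runtime monitor for !phi.
struct Iut {
    std::function<void()> reset;
    std::function<int(int)> step;  // concrete input -> concrete output
};
struct Monitor {
    std::function<void()> reset;
    std::function<bool(int, int)> violates;  // (in, out) -> safety violation?
};

using SymTrace = std::vector<std::pair<int, int>>;  // sequence of x/y symbols

// Returns the observed symbolic trace and whether the monitor flagged fail.
std::pair<SymTrace, bool> outputQuery(
        Iut& I, Monitor& P, const std::vector<int>& xbar,
        const std::function<int(int)>& concretise,   // symbol -> valuation
        const std::function<int(int, int)>& classify) {  // I/O -> class y
    I.reset();
    P.reset();
    SymTrace gamma;
    for (int x : xbar) {
        int in = concretise(x);   // representative input valuation for x
        int out = I.step(in);     // run the IuT one step
        int y = classify(in, out);  // I/O valuation -> equivalence class
        gamma.push_back({x, y});
        if (P.violates(in, out)) return {gamma, true};  // safety violation
    }
    return {gamma, false};
}
```

In Algorithm 2, the returned trace would additionally be inserted into the observation tree T, which is omitted here.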

The fuzzer-guided exploration terminates as soon as one of the following conditions is satisfied: (1) Fuzzing has been performed for \(r_{max}\) iterations, where \(r_{max}\) is another input parameter of the algorithm, and the number of iterations is tracked in variable r (lines 17 and 25), or (2) fuzzing has identified n distinct states in the IuT. A lower bound on the number of identified distinct states in the IuT is tracked in variable \(l_T\) (lines 18 and 26), which is updated after each iteration. This is realised via function \(\text {maximalPairwiseDistinguishableSubsetOf}(T)\) as follows: First, pairs of traces are identified that are distinguishable in T. From these, a maximal set \(S \subseteq T\) is selected such that any pair of distinct traces in S is distinguishable. Variable \(l_T\) is then set to |S|, as distinct traces in S must reach distinct states in IuT I, and hence \(|I| \ge |S|\).
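A greedy sketch of this selection, assuming a predicate recording whether two traces have already been distinguished by some observed suffix: the returned set is maximal in the sense that it cannot be extended, though not necessarily of maximum cardinality, so its size is still a sound lower bound on the number of states of I.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Greedy sketch of maximalPairwiseDistinguishableSubsetOf(T). Traces are
// referenced by index; `distinguishable(i, j)` is assumed to hold iff the
// states reached by traces i and j have been told apart by some suffix.
std::vector<int> maximalPairwiseDistinguishable(
        int numTraces, const std::function<bool(int, int)>& distinguishable) {
    std::vector<int> chosen;
    for (int i = 0; i < numTraces; ++i) {
        bool ok = true;
        for (int j : chosen)
            if (!distinguishable(i, j)) { ok = false; break; }
        if (ok) chosen.push_back(i);  // i is distinguishable from all chosen
    }
    return chosen;  // maximal (non-extendable), not necessarily maximum
}
```

All chosen traces are pairwise distinguishable and hence reach pairwise distinct states, so \(|I| \ge |S|\) holds for the returned set S.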

Phase 3: Learning. The third phase (lines 30–50) finally implements learning in analogy to Algorithm 1. It differs from the latter in two aspects: First, instead of Angluin’s \(L^*\) algorithm [1], learning is performed using an adaptation of the efficient \(L^\#\) algorithm proposed by Vaandrager et al. [25] (lines 30 and 47). \(L^\#\) follows the same minimally adequate teacher framework as \(L^*\) in generating hypothesis state machines and providing these to the teacher to check for equivalence with the IuT; hence it can directly replace the original calls to \(L^*\) in Algorithm 1. In contrast to line 6 of Algorithm 1 and also differing from the original description of \(L^\#\), which starts without prior knowledge, line 30 of Algorithm 2 provides initial knowledge to the learning algorithm in the form of T, the previously observed traces.

The second difference consists in the conformance testing strategy employed to check whether the current hypothesis \(M_i\) is language-equivalent to the IuT (line 38). Instead of the W-Method [9, 26], we employ the H-Method [12]. While both strategies exhibit the same worst case behaviour in terms of test steps, the H-Method has been observed in practice to require on average significantly fewer test steps [13]. We adapted the H-Method for online testing. That is, instead of computing the entire test suite and only thereafter applying it to the IuT, we interleave test case generation and application in an attempt to find failures early. This is particularly effective if the current hypothesis contains fewer than n states, as the term \(|\mathcal {H}|^{n-\ell +1}\) dominates the number of test cases to consider.
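The online adaptation can be illustrated abstractly: rather than materialising the complete suite, a generator produces one test at a time, and execution stops at the first observed failure. The sketch below shows only this interleaving pattern, with hypothetical callbacks, and not the H-Method's actual test generation logic.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <vector>

// Online conformance testing skeleton: `nextTest` yields test input
// sequences one by one (empty optional = suite exhausted); `runTest`
// executes a test against the IuT and reports pass/fail.
bool onlineConformanceTest(
        const std::function<std::optional<std::vector<int>>()>& nextTest,
        const std::function<bool(const std::vector<int>&)>& runTest) {
    while (auto t = nextTest()) {
        if (!runTest(*t)) return false;  // counterexample found early: stop
    }
    return true;  // all generated tests passed: conformance established
}
```

If the hypothesis is wrong, the loop typically terminates long before the exponential number of tests has been generated, which is exactly the saving exploited in Algorithm 2.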

Finally, recall that all interactions with the IuT within Algorithm 2 are performed via function \(\text {outputQuery}\) first called in the second phase. Thus, for every query to the IuT performed by the H-Method or by \(L^\#\) (lines 30, 38, 47), property monitor P continues to check for violations of \(\varphi \).

4 Tool Support: libfsmtest and libsfsmtest

We have implemented the described approach as a C++ framework based on libFuzzer, the coverage-guided fuzzing engine distributed as part of the LLVM project, and on ltl3tools, which supports the generation of runtime monitors for LTL properties [3].

The implementation I is integrated into a test harness, which contains an implementation of Algorithm 2. Alphabets \(\varSigma _I\) and \(\varSigma _O\) and the LTL property \(\varphi \) are read from files using the libsfsmtest library, which also extracts the atomic propositions AP occurring in \(\varphi \). From these, the equivalence classes over \(\varSigma _I \cup \varSigma _O \cup AP\) and a propositional abstraction of \(\varphi \) are constructed, from which ltl3tools generates a runtime monitor in a specific pseudo code representation. Using libsfsmtest again, this monitor is transformed into an executable version reading and checking input/output valuations observed on I. We have implemented the \(L^\#\) algorithm in libfsmtest. The fuzzer invokes the test harness, which orchestrates the translation from fuzzer inputs to input equivalence classes, the application of inputs to I, and the feedback of the observations on I to the runtime monitor for \(\varphi \) and to the learning algorithm.

Libraries libfsmtest and libsfsmtest are available as open source under the MIT license. If users are not interested in obtaining the source code, they can perform the whole testing approach described here by using a cloud service.

5 Experiments

For evaluation of the property testing approach described in this paper, we re-implemented an anti-lock braking system (ABS) for cars with lane stability control, as designed and published by Bosch GmbH [11]. The full functionality described there has been reduced to ABS for the front-left wheel only, and we do not consider gravel road conditions.

The ABS system implements two fundamental tasks: (1) Wheel locking should be avoided if the driver brakes too hard or brakes on slippery roads. The ABS controller prevents wheel locking by alternately holding, reducing, and increasing the brake pressure for each wheel individually, so that each wheel rotates recurringly while braking, in order to keep the car steerable. (2) The ABS controller implements a lane stability control to prevent the car from swerving on asymmetric road conditions during braking with a straight steering angle. The ABS controller then adjusts the braking force for all wheels of the car, to facilitate the steering intervention by the driver, while still applying the maximal possible braking force. The ABS controller constantly measures the wheel velocity \(v_U\) and calculates the brake slip \(\lambda _B\) for each wheel, relative to the vehicle target speed \(v_R\), which in this example is measured at the car powertrain. The equation to calculate the slip is [11]

$$ \lambda _B = \frac{v_U - v_R}{v_R}. $$
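As a direct transcription of this equation (a locked wheel, \(v_U = 0\), yields \(\lambda _B = -1\); a freely rolling wheel, \(v_U = v_R\), yields \(\lambda _B = 0\)):

```cpp
#include <cassert>

// Brake slip as defined above; vU is the wheel velocity and vR the vehicle
// reference speed, assumed to be positive while braking.
double brakeSlip(double vU, double vR) {
    return (vU - vR) / vR;
}
```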

The ABS controller evaluates the momentary acceleration \(\alpha \) of each wheel to detect each wheel’s tendency to lock. If \(\alpha \) falls below the threshold \(-a < 0\), a possible wheel lock is detected and the input valve VI (in front of the brake fluid inlet of the wheel brake cylinder) is closed, as well as the output valve VO (after the brake fluid outlet of the wheel brake cylinder) to hold the current brake pressure. The additional brake pump P to artificially increase the brake pressure is set to mode OFF. Consequently, the negative wheel acceleration \(\alpha \) is not reduced any further. Then, if the brake slip falls below the maximum slip threshold, the output valve is opened again. Thus, the brake pressure decreases again, and \(\alpha \) and the slip increase.

When \(\alpha = -a\), the valves are switched to hold pressure (both valves closed, pump off). Now, the acceleration increases and can exceed two thresholds \(+a, +A\) satisfying \(+a < +A\). In the first iteration, the brake pressure is increased when \(\alpha > +A\), by setting \(\mathtt{VI := OPEN}\), \(\mathtt{VO := CLOSED}\), and \(\mathtt{P := ON}\). After a certain time, \(\alpha \) decreases and reaches \(+A\), so that the ABS controller switches again to hold pressure. The braking pressure is held until the \(+a\) threshold is reached. At this point, the second iteration begins (henceforth repeating), and the brake pressure is slowly increased until \(-a\) is reached. In the following cycles, \(\alpha \) is kept between the two thresholds \(-a\) and \(+a\) by the ABS controller. If the ABS controller receives a signal from the yaw sensor that the car rotates about the z-axis during braking, an asymmetric road condition is detected. If the car rotates towards the current wheel while the driver is braking and the steering angle is straight ahead, the controller tries to facilitate the steering intervention by the driver by alternately reducing and increasing the brake pressure of the current wheel, while still applying the maximal possible braking force, keeping \(\alpha \) between the thresholds \(-a_\text {GMA}\) (slightly higher than \(-a\)) and \(+a\), until the rotation is within the yaw threshold again.

The C++ implementation consists of one module with approximately 700 lines of code. It processes six input variables of type double and writes three output variables with small enumeration types. The module behaviour depends on 11 (not necessarily pairwise distinguishable) internal control states. Table 2 shows the LTL properties we tested on the example implementation.

The atomic propositions in the LTL formulae, together with the guards and update expressions contained in the code, result in up to 784 input/output equivalence classes \(\text {io}_i\in \mathcal {A}\), which were computed in about 203 s.
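The construction of such equivalence classes can be sketched as follows: two input valuations are equivalent iff they satisfy exactly the same subset of predicates drawn from the guards and atomic propositions. The predicates and variable names below are illustrative assumptions, not those of the ABS module; a real implementation would enumerate satisfiable predicate combinations with a solver rather than sample.

```python
from itertools import product

# Illustrative predicates standing in for guards and atomic propositions.
predicates = [
    lambda v: v["alpha"] < -1.0,   # e.g. alpha < -a
    lambda v: v["alpha"] > 3.0,    # e.g. alpha > +A
    lambda v: v["slip"] > 0.2,     # e.g. slip above a threshold
]

def truth_vector(valuation):
    # The class of a valuation is the tuple of predicate truth values.
    return tuple(p(valuation) for p in predicates)

def equivalence_classes(valuations):
    classes = {}
    for v in valuations:
        classes.setdefault(truth_vector(v), []).append(v)
    return classes

samples = [{"alpha": a, "slip": s}
           for a, s in product([-2.0, 0.0, 4.0], [0.1, 0.3])]
classes = equivalence_classes(samples)
```

With \(k\) predicates there are at most \(2^k\) classes, but many combinations are infeasible (here, `alpha < -1.0` and `alpha > 3.0` cannot hold simultaneously), so the actual number is typically smaller.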

For each LTL property, we created one mutant that violates it. For properties 1, 2 and 4, we did so by manually applying one of the five mutation operators described in [20] at random locations in the program until we found a mutant violating the property. For each mutant, we determined how fast it could be found with our approach for different numbers of fuzzing rounds. The time it took to kill each mutant in each run is shown in Table 3.
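As an illustration of this kind of mutation, the sketch below applies one classic operator, relational operator replacement, at a random site in a source string. This is an assumption for exposition: the actual five operators used in the experiments are those of [20], and a real tool would mutate the AST rather than raw text.

```python
import random

# Illustrative sketch of relational operator replacement (not the
# paper's mutation tooling). Longest operators first to avoid matching
# "<" inside "<=".
RELATIONAL_OPS = ["<=", ">=", "<", ">"]

def mutate_relational(source, rng):
    # Collect all occurrences of relational operators in the source text.
    sites = []
    i = 0
    while i < len(source):
        for op in RELATIONAL_OPS:
            if source.startswith(op, i):
                sites.append((i, op))
                i += len(op) - 1
                break
        i += 1
    if not sites:
        return source
    # Replace one randomly chosen operator with a different one.
    pos, op = rng.choice(sites)
    replacement = rng.choice([o for o in RELATIONAL_OPS if o != op])
    return source[:pos] + replacement + source[pos + len(op):]

rng = random.Random(1)
mutant = mutate_relational("if (alpha < -a) hold();", rng)
```

Repeating this with fresh random locations until the resulting program violates the target property mirrors the procedure described above.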

Table 2. The set of LTL properties checked on the example implementation.
Table 3. Program runtimes for the example described above. These were recorded on a Kubernetes cluster with 1 CPU core and 16 GiB of RAM allocated to the task. OOM denotes that the property violation was not found during fuzzing and that the learning approach ran out of memory. \(r_{max}\) denotes the maximal number of fuzzing rounds performed.

From this small data set we can already conclude that fuzzing can enable a learning-based approach on problems where it would otherwise not be a sensible choice for economic reasons: while a purely learning-based approach was able to find the property violations for properties 2 and 3, it ran out of memory in the other cases. This happens when equivalence queries are posed while the difference between the number of discovered states and the specified upper bound on the number of states is too large. In all cases, we noticed that some amount of fuzzing usually reduced the runtime of the approach drastically. However, there is a trade-off between starting the learning too early, as can be seen for property 1 with 100 fuzzing rounds, and spending too much time fuzzing, as can be seen for the other properties, where 100 fuzzing rounds found the violation faster than most other settings. Evidently, a violation of property 2 was rather easy to find on the corresponding mutant, and we attribute the runtime differences between the fuzzing round configurations to runtime noise in the execution setup.

To investigate the runtime of the approach when no property violation is found, we also ran it on the unmutated implementation, which satisfies all four properties. In this case, the full model of the implementation, which has 11 states, has to be learnt. Due to the differing numbers of input/output equivalence classes for the properties, the runtimes can differ significantly between runs with different properties. We ran the approach with 5000 fuzzing cycles once for each property and logged the runtimes, the number of applied input sequences, and the number of applied inputs separately for the fuzzing and learning portions of the approach. For the numbers of applied inputs and input sequences, we also noted separately how many were applied during equivalence queries. Table 4 shows the average, minimum and maximum numbers determined this way over all properties. We performed these experiments with the same fixed seed initialising the random choices of the fuzzer, starting with \(\text {seed} = 1\). The seed was incremented by one only when the fuzzer did not discover enough states for the learning to be able to learn the remaining states without posing equivalence queries with too few discovered states. For the four properties tested on the conforming IuT, we succeeded with seed 1 twice, and with seeds 2 and 3 once each. While this may seem inconvenient, we found that simply launching several fuzzing runs with different seeds was inexpensive and fast enough to remain practical.
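The seed selection procedure just described amounts to a simple retry loop. The sketch below makes this explicit; `fuzz` and the `required_states` criterion are hypothetical stand-ins for the real fuzzer and its state-coverage check.

```python
# Sketch of the seed-incrementing procedure described above: rerun fuzzing
# with the next seed until enough states are discovered to start learning.
# `fuzz` is a stand-in: it takes a seed and returns the number of
# discovered states.
def find_seed(fuzz, required_states, max_seed=100):
    seed = 1
    while seed <= max_seed:
        discovered = fuzz(seed)
        if discovered >= required_states:
            return seed, discovered
        seed += 1
    return None, 0

# Toy fuzzer for illustration: pretend higher seeds explore more states.
toy_fuzz = lambda seed: 8 + seed
seed, states = find_seed(toy_fuzz, required_states=11)
```

Since each run is independent, the retries could equally well be launched in parallel, which is what makes the multi-seed strategy cheap in practice.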

Table 4. Runtime, number of applied input sequences and number of inputs applied during testing all properties against the implementation satisfying all properties.

Compared to testing the mutants, testing the conforming implementation takes significantly longer, which matches the complexity results reported in Sect. 2.1. In our set of problems, the equivalence queries are consistently the most expensive part of the whole approach, which is also supported by these complexity results. Furthermore, some of the runtime variation can be explained by the variation in input/output equivalence classes caused by the atomic propositions of the respective properties.

6 Conclusion

In this paper, a novel white box module testing strategy based on learning has been presented. This strategy is complete: given an LTL property \(\varphi \) and an implementation under test I, it decides whether I satisfies \(\varphi \) under the assumption that a representation of I as a symbolic finite state machine contains at most n states and employs only guard and assignment expressions contained in a set of expressions \(\varSigma \). As n and \(\varSigma \) can be determined by static analysis of I, the strategy effectively performs a proof of whether I satisfies \(\varphi \).

The strategy improves previous checking strategies based on learning [22] in several aspects. First, it performs input/output abstraction and hence allows checking of implementations with possibly infinite input and output domains. Next, it employs fuzzing in order to quickly reach distinct states in I, speeding up subsequent learning. Thereafter, it applies the efficient \(L^\#\) learning algorithm [25] and reduces the number of test cases required for equivalence checks by using the H-Method [12]. Finally, throughout the algorithm, violations of \(\varphi \) observed in interactions with I are efficiently detected using a monitor [3]. The efficacy of these optimisations has been demonstrated in experiments with modules performing control tasks of significant size and complexity.
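To illustrate the monitoring step, the sketch below runs an online monitor for a safety property of the shape \(G(\text{VO}=\mathtt{OPEN} \rightarrow \text{P}=\mathtt{OFF})\) alongside a test execution. Both the property and the class are illustrative assumptions, not the monitor construction of [3]: a general LTL monitor would be an automaton derived from \(\varphi \), but for a state invariant it degenerates to a per-step check as shown here.

```python
# Minimal sketch of an online monitor for a safety invariant of the form
# G(VO = OPEN -> P = OFF). The property and variable names are illustrative.
class SafetyMonitor:
    def __init__(self):
        self.violated = False

    def observe(self, outputs):
        # The invariant must hold in every observed step; once violated,
        # the verdict "false" is permanent.
        if outputs["VO"] == "OPEN" and outputs["P"] == "ON":
            self.violated = True
        return not self.violated

m = SafetyMonitor()
ok = all(m.observe(o) for o in [
    {"VO": "CLOSED", "P": "ON"},
    {"VO": "OPEN",   "P": "OFF"},
])
```

During testing, every response of I would be fed to `observe`; a `False` verdict immediately ends the campaign with a detected violation of \(\varphi \), without waiting for the learnt model to be checked.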

For future work, we plan to fully implement the LTL model checking performed in Algorithm 2 in our tool and to pursue several further optimisations. These include the use of parallelisation within computationally intensive tasks such as the construction of input/output equivalence classes or applications of the H-Method. Furthermore, we plan to evaluate various heuristics for tasks such as the selection of input valuations (line 12 of Algorithm 2). Additionally, we plan to lift the restriction that our current implementation of Algorithm 2 supports only deterministic IuT (the underlying theory already covers nondeterministic IuT behaviour). Finally, we plan to develop an argument for the tool qualification [7] of our implementation, based on the idea that if the strategy claims that I satisfies \(\varphi \), then the final hypothesis \(B=M_i\) of I’s model representation can be used as a reference model for an independent model-based testing algorithm to be executed against I.