1 Introduction

Runtime verification (RV) [10, 26, 39, 55] has emerged as a field of computer science within the last couple of decades. RV is concerned with the rigorous monitoring and analysis of software and hardware system executions. The field, or parts of it, can be encountered under several other names, including runtime checking, monitoring, dynamic analysis, and runtime analysis. Since only single executions are analyzed, RV scales well compared to more comprehensive formal methods, but of course at the cost of coverage. Nonetheless, RV can be useful due to the rigorous methods involved. Conferences and workshops now focus specifically on this subject, including the Runtime Verification conference, which was initiated by the authors in 2001 as a workshop and became a conference in 2010; runtime verification is now also often listed as a subject of interest in calls for papers of other conferences.

The paper is written as a response to the ‘Test of Time Award’ given to the authors for the 2001 paper [45] (Monitoring Java Programs with Java PathExplorer), presented 17 years ago (at the time of writing) at the first Runtime Verification workshop (RV’01) in Paris, July 23, 2001.

This paper reports on our own RV work, with some references to related work that specifically inspired us, and discusses the lessons learned and our perspective on the future of this field. Note that we do not try to identify all the literature that inspired us; that task would be impossible. Previous publications of ours [26, 42, 44] have provided more technical, tutorial-like presentations of the field. This paper rather offers information about the motivations for our work, as well as philosophical considerations. As such, this paper is closer in spirit to the longer paper [43]. It should be mentioned that most of this work has been carried out in collaboration with other people, and has been inspired, initiated, or driven by other people. We have just been lucky to be in the midst of all this work.

The paper is organized according to the time line of events, first leading up to [45], then the work described in that paper, the work that followed, and finally some thoughts on the future of this field.

2 In the Beginning

The initial interest of the first author in formal methods stems from his involvement in the design of the Raise specification language Rsl [30] during the period 1984–1991, and even earlier work in the early 1980s on developing a parser and type checker for its predecessor VDM [14, 15, 28]. These are so-called wide-spectrum specification languages, permitting formal specification at a high level and “programming” at a low level, all within the same language, supported by a formal refinement relation between the different levels. These languages were impressively ahead of their time if one looks at them from a programming language perspective. For example, Vdm++ has many similarities with today’s Scala programming language.

However, these languages were fundamentally still specification languages, and not programming languages, in spite of the fact that they have a lot in common with modern high-level programming languages such as Ml. The thought therefore was: why not benefit from the evolution of modern high-level programming languages and focus on verification of such languages? This was the first step: the focus on programs rather than models. This led to the work [34] of the first author on attempting to develop a specification language for an actual programming language, namely Concurrent ML (Cml), an extension of Milner’s Ml with concurrency.

Later work with the very impressive Pvs theorem prover [35] helped us realize that theorem proving is hard after all, and that some form of more automated reasoning about programs would be useful as a less perfect alternative. Hence, thus far the realization was that automated verification of programs was a desirable objective. Note that at the time the main focus in the formal methods community was on models, not programs.

The next big move was the development of Java PathFinder (Jpf), a Java model checker, first as a translator from Java to the Promela modeling language of the Spin model checker [41] (often referred to as Jpf1), and later as a byte code model checker [50] (occasionally referred to as Jpf2). The goal of this work was to explore how far model checking could be taken with respect to real code verification, either using Java as just a better modeling language or, in the extreme case, for model checking real programs. A sub-objective was to explore the space between testing and full model checking.

Jpf1 suffered from the problem of translating a complex language such as Java to the much simpler language Promela, leaving the impression that this approach worked for some programs but not for all programs. It was hard to go the last 20%. Jpf2 solved part of this problem, but still suffered from the obvious problem of state space explosion. In addition, the model checker itself was a homemade JVM running on top of the real JVM, and hence slow.

At this time we came across two inspiring invited talks at the SPIN 2000 workshop, which we organized. The first was a presentation by Harrow from Compaq on the VisualThreads tool [33]. The purpose of this tool was to support Compaq’s customers in avoiding multithreading errors. Specifically, two algorithms appeared interesting: predictive data race detection and deadlock detection. These algorithms can detect the potential for a data race or deadlock by analyzing a run that does not necessarily encounter the error. The second invited talk was presented by Drusinsky, on the Temporal Rover [25] for monitoring temporal logic properties. We implemented the data race algorithm, also known as the Eraser algorithm [61], and a modification of the deadlock detection algorithm in Jpf2. The idea was to first execute the program to check for data races and deadlocks using the two very scalable algorithms, and then, only if error potentials were found between identified threads, to launch the model checker focusing specifically on those threads.

The two authors of [45] met at NASA Ames in 2000, when the second author started his first job right out of school, and this way, without knowing it, began a fruitful, life-long collaboration and friendship with the first author. Inspired by recent joint work with his PhD adviser, Joseph Goguen, the second author was readily convinced that otherwise heavy-weight specification-based analysis techniques can very well be applied to execution traces instead of whole systems, and thus achieve scalability by analyzing only what happens at runtime, as it happens. This, paired with provably correct recovery, gives the same level of assurance as formal verification of the whole system, but in a manner that allows us to divide and conquer the task. So the second author was “all in”, ready to use his fresh algebraic specification and formal verification knowledge to rigorously analyze execution traces.

At this point, the previously mentioned observations about scalability of the traditional verification approaches, the experiments with data race and deadlock detection mentioned above, and some other less technical issues, led to our research focusing just on observing program executions. A constraint was that it should not be based on test case generation, since so many people were studying this already. We wanted to follow the path less explored. This is where the Java PathExplorer project began, inspired by other work, but not too much other work.

3 Java PathExplorer

Our first pure runtime verification system was Java PathExplorer (JPaX), described in the award-winning paper [45], as well as in other papers [46,47,48,49, 60]. The system is briefly described below.

3.1 Architecture

JPaX was a general framework for analyzing execution traces. It supported three kinds of algorithms: propositional temporal logic conformance checking, data race detection, and deadlock detection, as discussed earlier. Figure 1 shows JPaX’s architecture. A Java program is instrumented (at byte code level) to issue events to the monitoring side, which is customizable, allowing the addition of new monitors. The temporal logic monitoring module was originally based on a propositional future time linear temporal logic, but was later extended to also cover past time.

Fig. 1. The JPaX architecture.

An interesting aspect of the system was the use of the Maude [21] rewriting system for implementing monitoring logics as deep DSLs. One could in very few lines implement, e.g., linear temporal logic (LTL), with its syntax and its monitoring algorithm, and have Maude function as the monitoring engine as well. There was a grander vision present at the time: to use a powerful Turing-complete language, such as Maude, for monitoring, and not be restricted to just, e.g., LTL. However, that vision did not evolve beyond the thought stage, and had to wait some additional years, as discussed in Sect. 4. Below we briefly discuss some of the algorithms developed during the JPaX project.

Future Time LTL. The future time LTL monitoring used Maude to rewrite formulas. Consider, e.g., the LTL formula \(p \ \mathcal{U}\ q\), meaning q eventually becomes true and until then p is true. The implementation of JPaX was based on classical equational laws for temporal operators, such as:

$$\begin{aligned} p \ \mathcal{U}\ q = q \vee (p \wedge {\bigcirc }(p \ \mathcal{U}\ q)) \quad \quad \mathrm {and}\quad \quad \Box p = p \wedge {\bigcirc }(\Box p) \end{aligned}$$
(1)

Consider the sample formula \(\Box (green \rightarrow {\bigcirc }(\lnot red \ \mathcal{U}\ yellow))\). Upon encountering a green in a trace, the formula is rewritten into the following formula, which must be true in the next state: \((\lnot red \ \mathcal{U}\ yellow) \wedge \Box (green \rightarrow {\bigcirc }(\lnot red \ \mathcal{U}\ yellow))\). In Maude this was realized by a few simple rewrite rules, including two equations for the until operator: one for a trace consisting of a single event E, and one for a trace E,T consisting of an event E followed by a non-empty rest trace T.

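To convey the idea in a programming language setting, the following Scala sketch (our own reconstruction for this presentation, not the original Maude code; the names and the simplification rules are ours) rewrites the monitored formula on each event into the obligation that must hold for the rest of the trace, following the expansion laws in (1):

```scala
// Monitoring future time LTL by formula rewriting (illustrative sketch).
object LtlRewriting {
  sealed trait Formula
  case object True                               extends Formula
  case object False                              extends Formula
  final case class Prop(name: String)            extends Formula
  final case class Not(f: Formula)               extends Formula
  final case class And(l: Formula, r: Formula)   extends Formula
  final case class Or(l: Formula, r: Formula)    extends Formula
  final case class Next(f: Formula)              extends Formula   // O f
  final case class Until(l: Formula, r: Formula) extends Formula   // l U r
  final case class Always(f: Formula)            extends Formula   // [] f

  // Rewrite a formula on one event (an event is the set of propositions holding in it).
  def step(f: Formula, event: Set[String]): Formula = simplify(f match {
    case True | False => f
    case Prop(p)      => if (event(p)) True else False
    case Not(g)       => Not(step(g, event))
    case And(l, r)    => And(step(l, event), step(r, event))
    case Or(l, r)     => Or(step(l, event), step(r, event))
    case Next(g)      => g
    case Until(l, r)  => Or(step(r, event), And(step(l, event), Until(l, r)))
    case Always(g)    => And(step(g, event), Always(g))
  })

  // Recursively apply basic Boolean laws to keep the rewritten formula small.
  def simplify(f: Formula): Formula = f match {
    case Not(g) => simplify(g) match {
      case True  => False
      case False => True
      case s     => Not(s)
    }
    case And(l, r) => (simplify(l), simplify(r)) match {
      case (False, _) | (_, False) => False
      case (True, s)               => s
      case (s, True)               => s
      case (sl, sr)                => And(sl, sr)
    }
    case Or(l, r) => (simplify(l), simplify(r)) match {
      case (True, _) | (_, True) => True
      case (False, s)            => s
      case (s, False)            => s
      case (sl, sr)              => Or(sl, sr)
    }
    case other => other
  }

  def main(args: Array[String]): Unit = {
    // [](green -> O(!red U yellow)), with the implication written as a disjunction.
    val property: Formula =
      Always(Or(Not(Prop("green")), Next(Until(Not(Prop("red")), Prop("yellow")))))
    // The second event violates the property: after green, !red U yellow must start,
    // but red occurs immediately. The residual formula reduces to False.
    println(Seq(Set("green"), Set("red")).foldLeft(property)(step))
  }
}
```

In the Maude encoding, the rule for a single-event trace plays the role of end-of-trace evaluation, deciding whether outstanding obligations, such as an unfulfilled until, count as failures.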

3.2 Past Time LTL

Later, an efficient dynamic programming algorithm for monitoring past time linear temporal logic was developed [48], inspired by an initial encoding in Maude described in [45]. Consider the following past time formula: \(red \rightarrow \blacklozenge green\) (whenever red is observed, there has been a green in the past). The algorithm for checking past time formulas like this uses two arrays, now and pre, recording the status of each sub-formula after the current and the previous event, respectively. Index 0 refers to the formula itself, with positions ordered by the sub-formula relation. Then, for this property, the arrays are updated as follows for each observed event.

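A minimal Scala sketch of this update (the code and names are ours), with one Boolean per sub-formula as described, index 0 for the whole formula and index 1 for \(\blacklozenge green\):

```scala
// Dynamic programming monitor for the past time formula red -> <*> green.
object PastTimeMonitor {
  def monitor(trace: Seq[String]): Boolean = {
    val pre = Array.fill(2)(false)   // sub-formula values after the previous event
    val now = Array.fill(2)(false)   // sub-formula values after the current event
    var ok  = true
    for (event <- trace) {
      val red   = event == "red"
      val green = event == "green"
      now(1) = pre(1) || green       // <*> green : green now, or at some earlier point
      now(0) = !red || now(1)        // red -> <*> green
      ok = ok && now(0)              // the formula must hold after every event
      Array.copy(now, 0, pre, 0, now.length)
    }
    ok
  }

  def main(args: Array[String]): Unit = {
    println(monitor(Seq("green", "red")))   // true
    println(monitor(Seq("red", "green")))   // false: red observed before any green
  }
}
```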

This dynamic programming algorithm was generalized and optimized in [49, 59], and later found its way into three other systems for monitoring parametric temporal formulas, namely Mop [57], MonPoly [11], and DejaVu [40].

3.3 Data Races and Deadlocks

When used for bug finding, the effectiveness of runtime verification depends on the choice of test suite. For concurrent systems this is critical, due to the many possible non-deterministic execution paths. Predictive runtime verification approaches this problem by replacing a target property P with a stronger property Q such that there is a high probability that the program satisfies P iff a random trace of the program satisfies Q. As already mentioned, one such algorithm was the Eraser algorithm [61], for detecting potentials for data races (where two threads can access a shared variable simultaneously). It is often referred to as the lock set algorithm, since each variable is associated with a set of locks protecting it. The lock graph algorithm [33] would detect “dining philosopher”-like deadlock potentials by building a simple lock graph, where a cycle indicates a deadlock potential. In [13] we augmented the original lock graph algorithm to reduce false positives in the presence of so-called guard locks (locks that prevent cyclic deadlocks). That paper was later followed by [12], which suggested a code instrumentation method (inserting wait statements) for confirming found deadlock potentials. Other forms of data races than those detected by Eraser are possible. In [3] a dynamic algorithm for detecting so-called high-level data races (races involving collections of variables) is described.
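To give a flavor of the lock set idea, the following much simplified Scala sketch (names are ours; Eraser's ownership and read-sharing state machine is omitted) associates each shared variable with the set of locks held on every access so far, and warns when that set becomes empty:

```scala
// Simplified lock set (Eraser-style) sketch: a variable whose candidate lock set
// becomes empty is not consistently protected by any single lock, which indicates
// a potential data race.
object LockSet {
  type Thread   = String
  type Lock     = String
  type Variable = String

  private var candidates = Map.empty[Variable, Set[Lock]]   // locks still protecting each variable
  private var held       = Map.empty[Thread, Set[Lock]]     // locks currently held per thread

  private def locksHeld(t: Thread): Set[Lock] = held.getOrElse(t, Set.empty)

  def acquire(t: Thread, l: Lock): Unit = held += t -> (locksHeld(t) + l)
  def release(t: Thread, l: Lock): Unit = held += t -> (locksHeld(t) - l)

  def access(t: Thread, v: Variable): Unit = {
    val locks = candidates.getOrElse(v, locksHeld(t)).intersect(locksHeld(t))
    candidates += v -> locks
    if (locks.isEmpty) println(s"potential data race on $v (thread $t)")
  }

  def main(args: Array[String]): Unit = {
    acquire("T1", "L"); access("T1", "x"); release("T1", "L")   // x protected by L
    access("T2", "x")                                           // T2 accesses x without L: warning
  }
}
```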

3.4 Code Instrumentation

JPaX code instrumentation was performed with Compaq’s JTrek [22], a Java byte code instrumentation tool. Operating at the byte code level offers expressive power, but makes writing code instrumentation instructions inconvenient. An attempt was later made to develop an easier-to-use code instrumentation tool named JSpy [31] on top of JTrek. In this tool, code instrumentation could be expressed as a set of high-level rules, formulated in Java (an internal Java DSL), each consisting of a predicate and an action.
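The following hypothetical Scala fragment illustrates the general shape of such predicate/action rules; it is not JSpy's actual API, and all names are ours:

```scala
// Hypothetical predicate/action instrumentation rules: the predicate selects
// program points, the action says what event to emit there.
object InstrumentationRules {
  final case class ProgramPoint(className: String, methodName: String, isFieldWrite: Boolean)

  final case class Rule(predicate: ProgramPoint => Boolean,
                        action: ProgramPoint => String)   // event reported to the monitor

  val rules = List(
    Rule(p => p.isFieldWrite && p.className == "Account",
         p => s"write(${p.className}.${p.methodName})"),
    Rule(p => p.methodName == "openFile",
         p => s"openFile called in ${p.className}")
  )

  // An instrumenter would walk the byte code and, at every matching point,
  // insert a call that emits the corresponding event.
  def eventsFor(p: ProgramPoint): List[String] =
    rules.filter(_.predicate(p)).map(_.action(p))

  def main(args: Array[String]): Unit =
    println(eventsFor(ProgramPoint("Account", "setBalance", isFieldWrite = true)))
}
```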

3.5 Trace Visualization

Execution trace visualization is a subject that in our opinion has promising potential, although our own involvement in this direction is limited to [4]. The advantage of visualization is that it can provide a free-of-charge abstract view of the trace, from which a user potentially may be able to conclude properties about the program, or at least the execution, without having to explicitly formulate these properties. We can distinguish between two forms of trace visualization as outlined in [4]: still visualization, where all events are visualized in one view, and animated visualization. In [4], an extension of Uml sequence diagrams with symbols is described for representing still visualizations of the execution of concurrent programs.

4 The Aftermath

The period after JPaX followed two tracks, which can be summarized as: experiments with aspect-oriented programming for program instrumentation, and so-called parametric monitoring of events carrying data.

4.1 Aspect-Oriented Programming

Whilst initial runtime verification frameworks targeted Java, the Rmor (Requirement Monitoring and Recovery) framework [36] targeted the monitoring of C programs against state machines, using a homegrown aspect-oriented framework to perform program instrumentation. Rmor was implemented in Ocaml using Cil (C Intermediate Language), a C program analysis and transformation system, itself written in Ocaml. Later it was attempted to “go all aspect-oriented”, meaning that aspects were no longer thought of as just the plumbing for performing code instrumentation, but rather that monitors themselves are aspects. Some of our experiments went in the direction of what today is called stateful aspects [1, 65]. Here one takes as a starting point an aspect-oriented language framework (such as AspectJ) and extends it with so-called tracecuts, denoting predicates on the execution trace. An advice can be associated with a tracecut, and executes when the tracecut is matched by the execution. We proposed this line of work already in [27]. Other later work included [16, 51, 62, 63]. The main observation in these works was that aspect-oriented programming can be extended vertically (allowing more pointcuts) and horizontally (allowing temporal advice, essentially monitoring temporal constraints).

4.2 Runtime Verification with Data

JPaX had a number of limitations. Perhaps the most important was the propositional nature of the temporal logics. One could not, for example, monitor parametric events carrying data, such as openFile(“data.txt”), where openFile is an event name and “data.txt” is data. It is perhaps of interest to note that we were not (and are still not) aware of any system that at the time was able to monitor such parametric events in a temporal logic.

4.3 The Beginning of Data

These considerations led to two different systems: Eagle [6] and Mop [19]. Eagle was a small and general logic with similarities to a linear time \(\mu \)-calculus, supporting monitoring of events with data and allowing user-defined temporal operators. The later Hawk system [23] was an attempt to tie Eagle to the monitoring of Java programs, with automated code instrumentation using aspect-oriented programming, specifically AspectJ [53].

The same JPaX limitations that motivated the development of Eagle also stimulated the emergence of monitoring-oriented programming (Mop) [18,19,20]. Mop proposed that runtime monitoring be supported and encouraged as a fundamental principle of software development, where monitors are automatically synthesized from formal specifications and integrated at appropriate places in the program. Violations and/or validations of specifications can trigger user-defined code at any point in the program, in particular recovery code, code outputting or sending messages, or code raising exceptions. Mop made three important early contributions. First, it proposed specification formalism independence, allowing users to insert their favorite or domain-specific requirements specification formalisms via logic plugin modules. Second, it proposed automated code instrumentation as a means to weave the monitoring code into the application; the first version in 2003 used Perl for instrumentation [19], while the subsequent versions starting with 2004 [18] used AspectJ [53]. Finally, it proposed a formalism-independent semantics and implementation for parametric specifications. Conceptually, execution traces are sliced according to each observed instance of the parameters, and each slice is checked by its own monitor instance in a manner that is independent of the employed specification formalism. The practical challenge is how to deal with the potentially huge number of monitor instances. JavaMop proposed several optimizations, presented in [58] together with the mathematical foundations of parametric monitoring.
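The following Scala sketch illustrates the basic slicing idea on a toy property (the property, names, and code are ours; JavaMop adds indexing and many other optimizations on top of this scheme):

```scala
// Parametric trace slicing sketch: events carry parameters, the trace is sliced
// per parameter instance, and each slice is checked by its own propositional monitor.
object TraceSlicing {
  final case class Event(name: String, param: String)   // single parameter for simplicity

  // Toy property, per parameter instance: "no 'read' after 'close'".
  final class FileMonitor {
    private var closed = false
    def step(eventName: String): Boolean = eventName match {
      case "close" => closed = true; true
      case "read"  => !closed          // violation if the file was already closed
      case _       => true
    }
  }

  def check(trace: Seq[Event]): Unit = {
    var monitors = Map.empty[String, FileMonitor]   // one monitor per parameter instance
    for (Event(name, param) <- trace) {
      val m = monitors.getOrElse(param, new FileMonitor)
      monitors += param -> m
      if (!m.step(name)) println(s"violation for parameter instance $param")
    }
  }

  def main(args: Array[String]): Unit =
    check(Seq(Event("open", "f1"), Event("open", "f2"), Event("close", "f1"),
              Event("read", "f2"), Event("read", "f1")))   // violation for f1 only
}
```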

The Eagle system mentioned earlier was considered quite an elegant system, but its implementation was complicated. The subsequent rule-based, lower-level Ruler system [9] was meant as an “assembler” into which other temporal specification languages could be compiled for efficient trace checking. However, it took on a life of its own as a specification language. Ruler was given a finite-trace semantics with four verdicts. The verdicts still_true and still_false are given if the rule system would accept, respectively reject, the trace if it were to end at the current event, whilst the verdicts true and false are reserved for traces where every extension would be accepted, respectively rejected. Ruler allowed for very complex rule systems that could be chained together, such that one rule system produced outputs for another rule system to consume as input events. Rule systems could be combined sequentially, in parallel, and conditionally.
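The four verdicts can be sketched as follows (the verdict names follow the text above; the example property and code are ours, not Ruler's implementation):

```scala
// Ruler-style four-valued finite-trace verdicts, illustrated on "eventually p":
// before p is seen the verdict is StillFalse (the trace would be rejected if it
// ended now), and afterwards it is True (every extension is accepted).
object Verdicts {
  sealed trait Verdict
  case object True       extends Verdict   // every extension is accepted
  case object False      extends Verdict   // every extension is rejected
  case object StillTrue  extends Verdict   // accepted if the trace ended now
  case object StillFalse extends Verdict   // rejected if the trace ended now

  def eventually(p: String)(trace: Seq[String]): Verdict =
    if (trace.contains(p)) True else StillFalse

  def main(args: Array[String]): Unit = {
    println(eventually("p")(Seq("a", "b")))        // StillFalse
    println(eventually("p")(Seq("a", "p", "b")))   // True
  }
}
```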

A project solidly rooted in an actual space mission was the development of the LogScope temporal logic for log analysis [7]. The purpose of the project was to assist the team testing the flight software for JPL’s Mars rover Curiosity, which successfully landed on Mars on August 6, 2012. The software produces rich log information. Traditionally, these logs are analyzed with complex Python scripts. The LogScope logic was developed to support notations comprehensible to test engineers, including a very simple and convenient data parameterized temporal logic. This logic was translated to a form of data parameterized automata, which could also be used directly for specifying more complex properties that the temporal logic could not express. LogScope was furthermore implemented in Python, allowing Python code fragments to be included in specifications, all in order to integrate with the existing Python scripting culture at JPL.

4.4 Internal DSLs

Earlier we mentioned a grander vision of using a powerful Turing-complete language for monitoring. The fundamental problem with a dedicated logic is that it may well turn out to be insufficient for practical purposes unless it is designed extremely carefully. Engineers are, for example, often observed using Python for monitoring tasks, partly for lack of a better notation, but also because it provides the expressive power to perform arbitrary computations, e.g. on observed data. This observation led to several experiments with so-called internal DSLs, where one extends a programming language with monitoring features. This allows the user to fall back on the features of the programming language when the features of the monitoring logic do not suffice. TraceContract [8, 37] is such an internal Scala DSL (effectively an API) for monitoring, based on a mixture of temporal logic and state machines. It is developed using Scala’s features for defining internal DSLs. TraceContract, although a research tool, was later used for analysis of command sequences sent to NASA’s LADEE (Lunar Atmosphere and Dust Environment Explorer) spacecraft throughout its mission.
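The following hypothetical Scala fragment conveys the flavor of such an internal DSL; it is not TraceContract's actual API, and all names are ours. Monitors are classes, states are partial functions over events, and arbitrary Scala code can be used in guards and transition bodies:

```scala
// A hypothetical internal monitoring DSL embedded in Scala (illustrative only).
object InternalDsl {
  sealed trait Event
  case object Grant   extends Event
  case object Release extends Event

  abstract class Monitor[E] {
    // A state wraps a partial function from events to successor states, so the
    // full host language is available in guards and transition bodies.
    final class State(val transitions: PartialFunction[E, State])
    protected def state(transitions: PartialFunction[E, State]): State = new State(transitions)
    protected def initial: State
    protected def error(msg: String): State = { println(s"violation: $msg"); initial }
    private var current: State = null
    def verify(event: E): Unit = {
      if (current == null) current = initial                 // enter the initial state lazily
      if (current.transitions.isDefinedAt(event)) current = current.transitions(event)
    }
  }

  // Requirement: a granted resource must be released before it is granted again.
  class GrantRelease extends Monitor[Event] {
    protected def initial: State = free
    private def free: State = state { case Grant => busy }
    private def busy: State = state {
      case Release => free
      case Grant   => error("resource granted twice without an intervening release")
    }
  }

  def main(args: Array[String]): Unit = {
    val m = new GrantRelease
    Seq(Grant, Grant, Release).foreach(m.verify)   // the second Grant triggers a violation
  }
}
```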

Another example of an internal Scala DSL is LogFire [38]. LogFire is a rule-based system similar to Ruler, but based on a modification of the Rete algorithm [24, 29], used in several rule-based systems. LogFire was part of an investigation of the Rete algorithm’s applicability to runtime verification. LogFire has become part of the software that daily processes telemetry data from JPL’s Mars Curiosity rover. LogFire’s ability to generate facts can be used for Complex Event Processing (CEP) [56], where higher-level events (abstractions) are generated from lower-level events. These abstractions can be used for further analysis and/or human comprehension, e.g. through visualization. Another CEP system is Nfer [52], which in part was influenced by our work on rule-based systems, and LogFire in particular. The result of applying an Nfer specification to an event stream is a set of time bounded intervals. The specification consists of rules of the form name :- body (a rule name followed by a rule body). The semantics is similar to that of Prolog (hence the :- symbol): when the body is true, an interval with that name is generated. A difference from Prolog is that rule bodies contain temporal constraints based on operators from Allen Temporal Logic [2]. Nfer was created due to a need for comprehending large telemetry streams from Mars rovers. Abstracting these to higher level intervals, compared to the low level raw event stream, should ease human comprehension.
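As a rough illustration of the interval idea (our own sketch, not Nfer's actual syntax or semantics), a rule can combine two existing intervals related by Allen's before operator into a new, higher-level interval:

```scala
// Interval abstraction sketch: rules combine labeled intervals, related by
// Allen-style operators such as "before", into new higher level intervals.
object Intervals {
  final case class Interval(name: String, start: Long, end: Long)

  // i1 "before" i2 in Allen's sense: i1 ends before i2 starts.
  def before(i1: Interval, i2: Interval): Boolean = i1.end < i2.start

  // Hypothetical rule: RISKY_OP :- DOWNLINK_OFF before MOVE
  // (a movement after the downlink goes off becomes one abstract interval).
  def applyRule(intervals: Seq[Interval]): Seq[Interval] =
    for {
      off  <- intervals if off.name == "DOWNLINK_OFF"
      move <- intervals if move.name == "MOVE"
      if before(off, move)
    } yield Interval("RISKY_OP", off.start, move.end)

  def main(args: Array[String]): Unit = {
    val telemetry = Seq(Interval("DOWNLINK_OFF", 10, 20), Interval("MOVE", 30, 40))
    applyRule(telemetry).foreach(println)   // Interval(RISKY_OP,10,40)
  }
}
```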

4.5 First-Order Beyond Slicing

Ruler, as a layer of syntactic sugar on top of the rule formalism, offered a sub-formalism resembling a data parameterized automaton language. Likewise, LogScope, inspired by Ruler, offered a data parameterized automaton notation (in addition to the temporal logic). Quantified event automata (Qea) [5] was an attempt to design a pure data parameterized automaton monitoring logic, using the efficient trace slicing approach previously introduced in the JavaMop tool [57], but addressing some of its limitations with respect to expressiveness. A Qea specification consists of a list of first-order quantifications (universal and existential) and an automaton. Qeas can be compared to extended state machines (allowing arbitrary guards and actions on transitions, operating on local state), but are more succinct due to the fact that automata are “spawned” according to parameters (there is a local state for each combination of parameter values).
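A rough Scala sketch of the idea (our names and property, not Qea's actual notation): one small automaton is run per value of the quantified variable, obtained by slicing the trace, and the verdict for a universal quantification is the conjunction of the verdicts of all slices at the end of the trace:

```scala
// Quantified automaton sketch: "forall f . every open(f) is followed by close(f)".
object Qea {
  sealed trait Event
  final case class Open(f: String)  extends Event
  final case class Close(f: String) extends Event

  // Per-binding automaton: Start --open--> Opened --close--> Start (accepting in Start).
  def accepts(slice: Seq[Event]): Boolean = {
    var opened = false
    var ok = true
    for (e <- slice) e match {
      case Open(_)  => if (opened) ok = false else opened = true
      case Close(_) => if (!opened) ok = false else opened = false
    }
    ok && !opened   // accepting iff every open was matched by a close
  }

  // forall f : the automaton must accept the slice of events mentioning f.
  def forallFiles(trace: Seq[Event]): Boolean =
    trace.groupBy {
      case Open(f)  => f
      case Close(f) => f
    }.values.forall(accepts)

  def main(args: Array[String]): Unit =
    println(forallFiles(Seq(Open("a"), Close("a"), Open("b"))))   // false: b never closed
}
```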

A different approach to optimizing monitoring of parametric data is implemented in the DejaVu tool [40], which uses BDDs [17] to efficiently represent data observed in the trace. Logic-wise, the system supports a standard past time temporal logic with quantification. The logic in itself is not the innovation; rather, it is the use of BDDs to represent the sets of values observed in the trace for the quantified variables. The representation of sets of assignments as BDDs allows a very simple algorithm that naturally extends the dynamic programming monitoring algorithm for propositional past time temporal logic shown in Sect. 3.2 and presented in [47], using two vectors now and pre. However, while in [47] the vectors contain Boolean values, here the values are BDDs.
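As an illustration of the algorithmic step from Booleans to sets of assignments, the following Scala sketch (ours) keeps, for the sub-formula \(\blacklozenge \, open(f)\), the explicit set of values f for which it holds; DejaVu represents such sets, including their complements, compactly as BDDs rather than as explicit sets:

```scala
// First order past time monitoring sketch for "forall f . close(f) -> <*> open(f)":
// instead of one Boolean per sub-formula, the monitor keeps the set of values for
// which the sub-formula holds. For simplicity the top-level implication is checked
// directly at each close event.
object FirstOrderPastTime {
  sealed trait Event
  final case class Open(file: String)  extends Event
  final case class Close(file: String) extends Event

  def monitor(trace: Seq[Event]): Unit = {
    var everOpened = Set.empty[String]               // values f for which "<*> open(f)" holds
    for (event <- trace) event match {
      case Open(f)  => everOpened += f
      case Close(f) =>
        if (!everOpened(f)) println(s"violation: $f closed but never opened")
    }
  }

  def main(args: Array[String]): Unit =
    monitor(Seq(Open("a.txt"), Close("a.txt"), Close("b.txt")))   // violation for b.txt
}
```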

5 Discussion

Numerous runtime verification logics have been developed over time. They include various forms of temporal logics, state machines, regular expressions, context-free grammars, rule systems, variations of the \(\mu \)-calculus, process algebras, stream processing, timed versions of these, and even statistical versions, where data can be computed as part of monitoring. It is clear that parametric/first-order versions of these logics are needed. Some efforts have been made to combine two or more of these logics, such as, e.g., combining temporal logic and regular expressions. An interesting trend is logics which produce not just a Boolean value but rather a data value of some arbitrary type. This leads to systems computing arbitrary data values from traces. It is, however, nearly impossible at this point to estimate which of these approaches will eventually be adopted in industrial settings.

Whether to develop a DSL as external or internal is a non-trivial decision. An external DSL is usually cleaner and more directly tuned towards the immediate needs of the user. In addition, it is easier to process and therefore to optimize for efficiency. However, the richer the DSL becomes (moving towards Turing-completeness), the harder the implementation effort becomes. An internal DSL can be very fast to implement and augment with new (even user-defined) operators, and can provide an expressiveness that would require a major effort to support in an external DSL. One also gains the advantage of IDEs for the host language. A hypothesis is that monitoring logics used in practice will need to support very expressive expression languages to process data, such as strings and numbers that are part of the observed events. Temporal logic could become part of a programming language's assertion language. This could be seen as part of a design-by-contract approach also supporting pre/post conditions and class invariants.

An important topic may be inferring specifications from execution traces, also referred to as specification mining. Our own limited work in this area includes [54, 64]. Related to specification mining is execution trace visualization (the visualization can be considered a learned model). As argued in Sect. 3.5, visualization can provide a free-of-charge abstract view of the trace, from which a user may be able to conclude properties about the program, or at least the execution, without having to explicitly formulate these properties.

Full verification is of course preferred over partial verification performed by a monitor. The combination of static and dynamic verification can provide the best of both worlds: prove as much as is feasible statically and verify the remaining proof obligations during runtime. To properly achieve this goal, we need formal specifications not only for the properties to verify, but also for the programming language itself. Moreover, we need provably correct monitor generation techniques, so we can put all the specification and proof artifacts together and assemble a proof of correctness for the entire system. Interestingly, once a specification of the programming language itself is available, then one can go even one step further and monitor the execution of the program even against the language specification. This may seem redundant at first, but it actually makes full sense for some languages with complex semantics, like C. For example, tools like Valgrind or UBSan detect undefined behaviors in C/C++ programs, which are essentially deviations from the intended language semantics. The RV-Match tool [32] is an attempt to push runtime verification in this direction.

In fault-protection strategies, the goal is to recover the system once it has failed. The general problem of how to recover from a bad program state is interesting and quite challenging. The ultimate solution to this problem can be found in planning and scheduling systems, where a planner creates a plan (straight-line program) to execute for a limited time period, an executive executes the plan, and a monitor monitors the execution. Upon failure detected by the monitor, a new plan (program) is generated online.