
1 Introduction

1.1 Timing Analysis

Timing analysis of embedded real-time programs attempts to determine tight upper bounds, and sometimes also lower bounds, on the execution times of the programs. Ideally, one would find out the worst-case and best-case execution times. This is possible in principle, since real-time programs are programmed in a way that guarantees termination, and since the execution platform has only finite resources. However, the complete exploration of the associated state space would take far too long to be practically feasible.

It was therefore clear that abstraction would need to be applied to arrive at sound execution-time bounds in acceptable time. Our entry into the timing-analysis area started with the (quite successful) attempt to predict the cache behavior by abstract interpretation [3, 10]. Abstract interpretation had not previously been applied to the timing-analysis problem. The existing approaches were rather ad hoc and of doubtful correctness. It turned out that using abstract interpretation was the recipe for success.

1.2 What is Different?

The standard textbook on static program analysis, authored by Flemming Nielson, Hanne Riis Nielson, and Chris Hankin, covers most needs of a designer of abstract interpretations. At least this was what we thought when we started out to design static analyses of the cache behavior of real-time programs. However, it turned out that timing analysis offers a number of challenges that were not foreseen by the existing theory or not encountered in previous practice. Here is a list of these challenges: some are more absent from the Nielson/Nielson/Hankin book than others, and some could be covered by adaptations of the theory. Some concern cache analysis, others timing analysis in general.

  • All traditional static program analyses we found in the literature were, in the best case, based on a semantics of the programming language that abstracted from the underlying execution platform. However, any timing analysis needs to talk about architectural behavior. Hence, the behavior of the execution platform must be an integral part of the semantics of the programming language, on which the static analysis is based.

  • Any timing analysis is composed of many component analyses, one for each architectural component contributing to the timing behavior of programs. These component analyses interact in possibly complex ways, essentially originating from the dependencies of the architectural components on each other. Worst are cyclic dependencies since they render separate analyses of component behavior more or less impossible [18]. An adequate design of the individual analyses and of the composition is needed to arrive at overall timing analyses that are both precise and efficient.

    This composition of a timing analysis from many component analyses is in contrast with the application of static program analysis in compilation, where typically one static analysis checks the applicability of one program transformation [26]. It is also different from the composition of abstract domains as in [5], used to increase the precision of one static analysis by information from another one.

  • Timing analyses need to analyze programs on the executable level since the source level does not contain the information on memory allocation of instructions and data, indispensable for cache analyses, nor the information on when memory is accessed, needed for the analysis of bus-access conflicts.

  • The replacement strategy of a cache architecture always needs some bookkeeping mechanism about past memory accesses. The state representation of this mechanism is optimized for the speed and the size of the update logic. Cache analysis, however, is interested in the state itself, not its representation. The design of cache analyses therefore starts with a lossless abstraction of the HW implementation of the cache.

    This is somewhat comparable to the situation in shape analysis [24], which is based on a storeless semantics abstracting from actual heap addresses, but keeping connectivity information.

  • The cache semantics shows many indirect effects of state changes of one object on another independent object, e.g. memory block a is loaded into the cache, thereby replacing memory block b.

    These indirect effects are different from the ones caused by manipulations through pointers: those also have side effects, but only on aliases, i.e., pointer expressions reaching the same object. Moreover, those indirect effects result from program execution.

    In contrast, cache loads have indirect effects on objects related by the execution platform, i.e., by the cache-set mapping.

  • A particularly hard problem is the static analysis of write-back caches. Here a modification of the contents of a memory block a residing in the cache leads to a temporary inconsistency in the value of a in the memory hierarchy. This inconsistency is repaired by a write back, possibly much later, when a is evicted by loading some other memory block b. These delayed cause-effect chains are quite unusual in the semantics of programming languages and therefore also in traditional static program analyses.

  • The invariant at a program point, computed by some static analysis, may have different expressivity (precision), depending on whether the invariant is to hold for all executions reaching this program point or only for a subset corresponding to a particular context and/or a particular control flow. Traditional static program analyses may therefore be context- and/or flow-sensitive or insensitive, depending on the desired precision of the results and the acceptable effort. The notion of context is defined by some abstraction of the set of call strings.

    Timing analyses must be flow-sensitive in order to obtain any precision at all. In addition, timing analyses need and use a generalized notion of context to be precise. Different iterations of a loop may have vastly different execution times. Hence, they have to be considered as contexts for the instructions in the body. This is an instance of trace partitioning, invented before trace partitioning was proposed in the literature [21].

  • Static analyses of concurrent systems focus on the interaction of the concurrently executed tasks on global variables. In contrast to this, cache analysis, and more generally timing analysis, has to determine safe approximations of the resource-occupancy interaction [1]. An additional complication, compared to the static analysis of concurrent systems, is the non-transparency of which objects compete with which other objects for resources.

  • Cache analysis, and more generally timing analysis, determines invariants about execution states at program points and derives safety properties from these invariants, e.g., that a certain timing accident, like a cache miss, will never happen at a program point. The proof of such a safety property allows reducing the execution-time bound by the timing penalty corresponding to the excluded timing accident. This use is different from that in traditional static program analyses used in verification, where such a safety property typically proves the absence of a run-time error.

  • The profitability of code optimizations involving static analyses as a check for their applicability is seldom clear. In contrast, excluding a timing accident by a strong invariant computed by a timing analysis is often associated with the elimination of a very clear penalty.

2 From Microarchitectures to Abstractions for Timing Analysis

When developing a timing analysis, the first task is to obtain a faithful model of the microarchitecture that the analysis targets. This can be very challenging because documentation at the required level of detail is seldom available. One promising approach is to start from cycle-accurate models in hardware description languages like VHDL or Verilog, if those are made available by the hardware manufacturer [25]. If such models are not provided by the manufacturer, they have to be constructed manually, based on processor manuals and extensive measurements on evaluation boards. For some microarchitectural features, such as caches, the modeling process can be partially automated [2]. In the remainder of this section, we assume that a cycle-accurate model of the microarchitecture has already been obtained in one of the ways described above.

Mathematically, a cycle-accurate model is a transition relation \(R \subseteq S \times S\) that captures the behavior of the processor in a single execution cycle. Programs and their input data are part of the states S of the processor. So the initial states of a program P under all possible inputs are a subset \(I_P\) of S. The goal of timing analysis is then to determine a bound on the number of cycles from any possible state in \(I_P\) to the program’s termination, i.e., until it reaches one of its final states \(F_P\). Brute-force exploration of all states reachable from the set of initial states is practically infeasible due to their sheer number. Therefore, a number of abstractions have been introduced to arrive at safe approximations of the worst-case execution time. In the following, we will discuss the two most important such abstractions.
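To make the objects \(R\), \(I_P\), and \(F_P\) concrete, here is a sketch of the brute-force exploration just dismissed as infeasible; all names are illustrative assumptions, with \(R\) given as a successor function:

```python
def wcet_by_exploration(initial_states, final_states, successors):
    """Maximum number of cycles from any state in I_P to a state in F_P,
    where successors(s) yields all s' with (s, s') in R. Guaranteed
    termination of the program makes this search finite, but the number
    of states explored is astronomically large in practice."""
    worst = 0
    stack = [(s, 0) for s in initial_states]
    while stack:
        state, cycles = stack.pop()
        if state in final_states:
            worst = max(worst, cycles)
        else:
            for succ in successors(state):   # one execution cycle
                stack.append((succ, cycles + 1))
    return worst
```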

2.1 Analysis Framework

Microarchitectures implement instruction set architectures (ISAs). The semantics of binary programs, in terms of the values computed in registers and memory, is governed by the instruction set architecture. In particular, it is independent of the ISA’s microarchitectural implementation.

As a consequence, analysis at the ISA level can be separated from analyses specific to a microarchitecture. This separation has led to the high-level structure of WCET analysis tools depicted in Fig. 1. For a given instruction set, in a preprocessing step, a value analysis [4] determines the possible values of registers and memory locations, usually based on interval and congruence abstractions. The control-flow graph of the program under analysis is annotated with the results of value analysis for use in the subsequent analysis steps. These results are required for precise data-cache analysis within microarchitectural analysis, as well as in control-flow analysis. Control-flow analysis [7, 9, 19, 27] determines loop bounds and other characterizations of the set of semantically feasible paths through the control-flow graph.

The task of microarchitectural analysis [6, 8, 11, 12, 29], which we will illuminate further in the following section, is to determine bounds on the execution times of small program fragments such as basic blocks. These bounds, together with the results of the control-flow analysis, are then used in path analysis [20, 28] to determine an upper bound on the execution time of the program as a whole. Path analysis is usually performed using integer linear programming formulations.
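For concreteness, here is a minimal sketch of such a formulation, commonly known as implicit path enumeration; the notation (execution counts \(x_b\) and \(x_e\) for blocks and edges, block bounds \(c_b\) from microarchitectural analysis, loop bounds \(n_\ell \) from control-flow analysis) is ours and merely illustrative:

$$\begin{aligned} \text {maximize } \sum _{b} c_b \cdot x_b \quad \text {subject to} \quad x_b = \sum _{e \in in(b)} x_e = \sum _{e \in out(b)} x_e \ \text { and } \ x_{back(\ell )} \le n_\ell \cdot x_{entry(\ell )} \end{aligned}$$

The optimal value of the objective is then an upper bound on the execution time of the whole program.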

Fig. 1. Main components of a timing-analysis framework and their interaction.

2.2 Separation into Value and Microarchitectural Analysis

Microarchitectural analysis first computes an overapproximation of the set of all reachable microarchitectural states. From this approximation and the transitions between the different reachable states, execution time bounds for basic blocks can be determined.

A relatively simple microarchitecture may consist of the following components: pipeline control, pipeline datapath, register file, branch predictor, cache, and main memory. The microarchitecture’s state space can then be modeled as the cartesian product of the state spaces of its components:

$$\begin{aligned} S = \textsc {PipelineControl}\times \textsc {PipelineDatapath}\times \textsc {RegisterFile}\times \textsc {BranchPredictor}\times \textsc {Cache}\times \textsc {Memory}\end{aligned}$$

The set of reachable states of program P is the least fixed point of the \(\textit{next}\) operator containing the initial states \(I_P\) of program P:

$$\begin{aligned} Col(P) = I_P \cup \textit{next}(I_P) \cup \textit{next}^2(I_P) \cup \dots , \end{aligned}$$

where \(\textit{next}\) captures the effect of one execution cycle:

$$\begin{aligned} \textit{next}(M) = \{s' \mid \exists s \in M: (s, s') \in R\} \end{aligned}$$
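Under the same assumptions as before (a successor function standing in for \(R\)), the collecting semantics corresponds to a standard fixed-point iteration, sketched below; like the exploration above, it is only of conceptual interest:

```python
def collecting_semantics(initial_states, successors):
    """Kleene-style iteration for Col(P) = I_P ∪ next(I_P) ∪ next²(I_P) ∪ ..."""
    reachable = set(initial_states)        # I_P
    frontier = set(initial_states)
    while frontier:                        # iterate until fixed point
        # next(M) = { s' | there is an s in M with (s, s') in R }
        image = {s2 for s in frontier for s2 in successors(s)}
        frontier = image - reachable
        reachable |= image
    return reachable                       # Col(P)
```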

The first abstraction, discussed informally in the previous section, is to perform value analysis separately, prior to microarchitectural analysis. Value analysis can be formalized as abstracting values in the register file and the memory.

Let \(\textsc {Value}^{\#}\) be the abstract domain used in value analysis. A concretization function \(\gamma _\textit{VA} : \textsc {Value}^{\#}\rightarrow \mathcal {P}(\textsc {RegisterFile}\times \textsc {Memory})\) provides the meaning of a result of value analysis. Such analyses are flow- or even context-sensitive so that information about registers and memory is available for each program location separately. Formally,

$$\textsc {Value}^{\#}= (\textsc {Loc}\times \textsc {Context}) \rightarrow (\textsc {Register}^\#\times \textsc {Memory}^\#),$$

where \(\textsc {Loc}\) and \(\textsc {Context}\) are sets of program locations and contexts, and \(\textsc {Register}^\#\) and \(\textsc {Memory}^\#\) are abstractions of the register file and memory, respectively. For reasons of brevity we cannot further elaborate on these abstractions. A correct abstract \(\textit{next}_\textit{VA}^\#: \textsc {Value}^{\#}\rightarrow \textsc {Value}^{\#}\) operator guarantees global correctness of the value analysis.

Given value analysis results, microarchitectural analysis can thus focus on the remaining parts of the microarchitecture: the pipeline control, the branch predictor, and the cache. The state of the pipeline datapath can be inferred from the state of the pipeline control and the values of registers and memory, and is thus not explicitly represented, neither by value analysis nor by microarchitectural analysis.

Let \(\mu \textsc {Arch}^\#\) be the abstract domain used in microarchitectural analysis, and let \(\gamma _{{\mu }A} : \mu \textsc {Arch}^\#\rightarrow \mathcal {P}(\textsc {PipelineControl}\times \textsc {BranchPredictor}\times \textsc {Cache})\) be its concretization function. While value analysis does not depend on microarchitectural analysis, the converse is not true. In particular, \(\textit{next}_{{\mu }A}^\#\) depends on the results of value analysis: \(\textit{next}_{{\mu }A}^\# : \mu \textsc {Arch}^\#\times \textsc {Value}^{\#}\rightarrow \mu \textsc {Arch}^\#\). For example, upon a memory access, microarchitectural analysis will query the results of value analysis, which have been annotated to the program’s control-flow graph, to determine which memory block is being accessed to be able to classify the access as a cache hit or a cache miss.

For a correctness argument, the abstract operators \(\textit{next}_\textit{VA}^\#\) and \(\textit{next}_{{\mu }A}^\#\) can be combined to obtain the abstract \(\textit{next}^\#\) operator as follows:

$$\begin{aligned} \textit{next}^\#(v^\#,m^\#) := (\textit{next}_\textit{VA}^\#(v^\#), \textit{next}_{{\mu }A}^\#(m^\#, v^\#)). \end{aligned}$$

Given correctness of \(\textit{next}_\textit{VA}^\#\) and \(\textit{next}_{{\mu }A}^\#\), it can be shown that \(\textit{next}^\#\)’s least fixed point,

$$\begin{aligned} Col^\#(P) = (i_v^\#, i_m^\#) \sqcup \textit{next}^\#(i_v^\#, i_m^\#) \sqcup {\textit{next}^\#}^2(i_v^\#, i_m^\#) \sqcup \dots , \end{aligned}$$

overapproximates the set of reachable states Col(P), with the combined concretization function

$$\begin{aligned}&\gamma (v^\#, m^\#) := \{(pc,pd,rf,bd,c,m) \in S \mid \\&\qquad \qquad \qquad \qquad \qquad \quad (rf, m) \in \gamma _\textit{VA}(v^\#) \wedge (pc, bd, c) \in \gamma _{{\mu }A}(m^\#)\}, \end{aligned}$$

given that \(\gamma (i_v^\#, i_m^\#) \supseteq I_P\).
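The following sketch shows how the two abstract operators could be iterated jointly, mirroring the definition of \(\textit{next}^\#\); the joins, the equality test on abstract values, and all names are assumptions about a concrete implementation of the two domains:

```python
def abstract_collecting(i_v, i_m, next_va, next_mua, join_va, join_mua):
    """Joint Kleene iteration of next#(v#, m#) = (nextVA#(v#), nextμA#(m#, v#)),
    starting from abstract initial values with γ(i_v, i_m) ⊇ I_P."""
    v, m = i_v, i_m
    while True:
        v_next = join_va(v, next_va(v))        # independent of m
        m_next = join_mua(m, next_mua(m, v))   # consults value results
        if (v_next, m_next) == (v, m):         # least fixed point reached
            return v, m                        # overapproximates Col(P)
        v, m = v_next, m_next
```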

From the formalization it is apparent that value analysis can be performed in a preprocessing step, as it does not depend upon the results of microarchitectural analysis. This preprocessing step produces a control-flow graph annotated with the results of value analysis, which is then used by microarchitectural analysis.

Fig. 2. Transitions from two different initial states of a simple processor consisting of an in-order pipeline, a 2-way fully-associative cache, and a 1-bit branch predictor.

2.3 Microarchitectural Analysis

Now let us turn to the internal structure of \(\mu \textsc {Arch}^\#\). Can the pipeline control, the branch predictor, and the cache be analyzed independently of each other? Unfortunately, this is not the case, due to the mutual dependencies of the three components. This is best explained with the help of Fig. 2. At the top, we see two microarchitectural states of a simple processor, consisting of an in-order pipeline (with an instruction fetch buffer, two integer units, and one floating-point unit), a cache, and a 1-bit branch predictor. The two processor states initially differ only in their cache states. The pipeline is about to dispatch the add instruction from the instruction fetch buffer to integer unit 1. Assume this add instruction adds the contents of a memory address to the contents of a register. Then, the instruction can be dispatched as soon as the memory operand is available. In the state on the left, the operand a is in the cache, and so add is dispatched immediately. In the state on the right, on the other hand, operand a needs to be fetched from memory, as it is not in the cache, and the add instruction cannot be dispatched yet. So the cache state has an influence on the pipeline state.

Now consider the successor states. On the left, the next instruction from the instruction fetch buffer to execute is a branch equal zero instruction. As the condition upon which the branch depends has not yet been evaluated, the pipeline queries the branch predictor to decide in which direction to speculate. The prediction influences which instruction to fetch next, which in turn will affect the cache contents. So the future cache state depends on the current state of the pipeline and the branch predictor. Due to this tight coupling of the three components, they need to be analyzed relationally to obtain reasonably precise results.

The example also demonstrates that pipeline states cannot easily be ordered in terms of “progress”: Intuitively, the successor state on the left has progressed further than the state on the right, as the add instruction has already been dispatched in this case. However, if speculative execution proceeds in the wrong direction, the pipeline state on the left may result in a longer execution time than the state on the right, which has no potential to speculate. Due to this lack of a natural ordering, which is usually the basis of abstractions, no efficient and precise abstractions are known so far for sets of pipeline states. In Sect. 3, we speculate about the design of abstractable pipelines and their abstraction.

The analysis essentially operates on the power-set domain over concrete pipeline states, where only the datapath is abstracted away, as discussed earlier. For the cache, however, precise and efficient abstractions have been found.

2.4 Two Abstractions for Caches

The abstraction described in the following applies to caches with least-recently-used (LRU) replacement and was originally proposed by Ferdinand and Wilhelm [12]. For simplicity we assume a fully-associative cache, i.e., a cache consisting of a single cache set, like the example cache in Fig. 2. The extension to set-associative caches is straightforward, as set-associative caches can be seen as cartesian products of multiple independent fully-associative caches, each of which can be abstracted independently of the others.
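A minimal sketch of this decomposition; the line size and the number of sets are illustrative parameters, not taken from any particular processor:

```python
LINE_SIZE = 64   # bytes per cache line (illustrative)
NUM_SETS = 128   # number of cache sets (illustrative)

def block_and_set(address: int) -> tuple[int, int]:
    """Map an address to its memory block and to the cache set this block
    maps to. Blocks mapping to different sets never compete for the same
    lines, so each set can be analyzed as a fully-associative cache."""
    block = address // LINE_SIZE
    return block, block % NUM_SETS
```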

A Lossless Logical Abstraction. The first abstraction to perform in the analysis of caches is from a cache’s physical implementation to a formal, logical model of its behavior. In physical implementations, caches consist of multiple memories containing data, tags, and status bits. In particular, each cache line is associated with a tag to keep track of which memory block is cached in the respective line. In addition, a number of status bits are maintained in each cache set to record the “logical” state of the replacement policy.

For instance, an implementation of least-recently-used replacement needs status bits to remember in which order the cache lines of each set have been used. Abstracting from the data stored in the cache, a model of a fully-associative cache with LRU replacement fairly close to the physical implementation might thus consist of two functions: (1) a function \(cl: \{1, \dots , k\} \rightarrow \mathcal{B}\) that captures which memory block is stored in each of the k cache lines, and (2) a function \(age_{cl} : \{1, \dots , k\} \rightarrow \{0, \dots , k-1\}\) that maintains the “age” of each cache line, i.e., the number of distinct cache lines that have been accessed since the last access to the given cache line.

For cache analysis it is irrelevant in which physical cache line a memory block is stored; only the relative ages of the cached memory blocks are required to predict the future cache hit behavior. Thus, a lossless abstraction can be applied that captures the age of each memory block, \(age \in \textit{Cache} = \mathcal{B} \rightarrow \{0, \dots , k-1, k\}\), where uncached blocks are assigned age k. One can relate the two models by an abstraction function \(\alpha \) defined as follows:

$$\alpha (cl, age_{cl}) := \lambda b \in \mathcal{B}: {\left\{ \begin{array}{ll} age_{cl}(i) &: \textit{if } cl(i) = b\\ k &: \textit{if } cl(i) \ne b\ \ \forall i \in \{1, \dots , k\} \end{array}\right. }$$

Upon a load of memory block b, the ages are updated as follows:

$$\textit{up}(age, b) := \lambda b' \in \mathcal{B}: {\left\{ \begin{array}{ll} 0 &: \textit{if } b' = b\\ age(b') &: \textit{if } age(b) \le age(b')\\ age(b')+1 &: \textit{if } age(b) > age(b') \end{array}\right. }$$
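The following sketch renders both the lossless abstraction \(\alpha \) and the update \(\textit{up}\) executable; representing memory blocks as integers, cl and age_cl as dictionaries, and age functions as Python functions is our encoding for illustration:

```python
K = 2  # associativity k of the example cache

def alpha(cl, age_cl):
    """Lossless abstraction from the implementation-level state
    (cl: line -> block, age_cl: line -> age) to the age function."""
    def age(b):
        for line, block in cl.items():
            if block == b:
                return age_cl[line]
        return K                   # b is cached in no line: age k
    return age

def up(age, b):
    """LRU update of the age function upon a load of memory block b."""
    age_b = age(b)                 # age of b before the access
    def new_age(b2):
        if b2 == b:
            return 0               # b becomes most recently used
        if age_b <= age(b2):
            return age(b2)         # b2 at least as old as b: unaffected
        return age(b2) + 1         # b2 ages by one (possibly to k)
    return new_age
```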

An Interval Abstraction. Cache analysis needs to represent sets of cache states. In particular, at program start no knowledge about the cache state may be available, and so cache analysis needs to represent all possible cache states. Obviously, explicit representations are practically infeasible in such cases. A further abstraction is required to compactly represent large sets of cache states with little precision loss.

Fortunately, such abstractions are possible in the case of LRU. This is because LRU exhibits a form of monotonicity. Intuitively, the “younger” a memory block, i.e., the lower its age, the better. Thus, it is sufficient to maintain upper and lower bounds on the age of each memory block independently of the ages of the other memory blocks. This yields the following abstract domain

$$\begin{aligned} \widehat{CacheInterval} = \{(l, u)&\mid l, u \in \mathcal{B} \rightarrow \{0, \dots , k-1, k\}\\&\wedge \forall b \in \mathcal{B}: l(b) \le u(b)\} \end{aligned}$$

storing a lower and an upper bound on the age of each memory block.

In the literature, the two analyses have been proposed separately, where lower bounds are maintained in what is called may analysis and upper bounds are maintained in must analysis. Lower bounds can be used to reason about which memory blocks may be cached, whereas upper bounds can be used to reason about which memory blocks must be cached.

Fig. 3. Must and may cache abstractions.

\(\widehat{CacheInterval}\) forms a join semi-lattice with the following order:

$$\begin{aligned} (\widehat{a}_\textit{may}, \widehat{a}_\textit{must}) \sqsubseteq (\widehat{a}_\textit{may}', \widehat{a}_\textit{must}')&:\Leftrightarrow \widehat{a}_\textit{may} \sqsubseteq _\textit{may} \widehat{a}'_\textit{may} \wedge \widehat{a}_\textit{must} \sqsubseteq _\textit{must} \widehat{a}'_\textit{must} \\ \widehat{a}_\textit{may} \sqsubseteq _\textit{may} \widehat{a}'_\textit{may}&:\Leftrightarrow \forall b \in \mathcal{B}: \widehat{a}_\textit{may}(b) \ge \widehat{a}'_\textit{may}(b)\\ \widehat{a}_\textit{must} \sqsubseteq _\textit{must} \widehat{a}'_\textit{must}&:\Leftrightarrow \forall b \in \mathcal{B}: \widehat{a}_\textit{must}(b) \le \widehat{a}'_\textit{must}(b) \end{aligned}$$

Abstract cache states are related to sets of concrete cache states by a Galois connection via the following abstraction and concretization functions:

$$\begin{aligned} \alpha (C)&:= (\alpha _\textit{may}(C), \alpha _\textit{must}(C)), \textit{with}\\ \alpha _\textit{may}(C)&:= \lambda b \in \mathcal{B}: \min _{age \in C} age(b)\\ \alpha _\textit{must}(C)&:= \lambda b \in \mathcal{B}: \max _{age \in C} age(b) \end{aligned}$$

and

$$\begin{aligned} \gamma (\widehat{age}_\textit{may}, \widehat{age}_\textit{must})&:= \gamma _\textit{may}(\widehat{age}_\textit{may}) \cap \gamma _\textit{must}(\widehat{age}_\textit{must}), \textit{with}\\ \gamma _\textit{may}(\widehat{age})&:= \{ age \mid \forall b \in \mathcal{B}: \widehat{age}(b) \le age(b)\}\\ \gamma _\textit{must}(\widehat{age})&:= \{ age \mid \forall b \in \mathcal{B}: age(b) \le \widehat{age}(b)\} \end{aligned}$$
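A sketch of this Galois connection for a finite set of memory blocks, with abstract states represented as pairs of dictionaries (lower bounds for may, upper bounds for must); the representation is our choice for illustration:

```python
def alpha_interval(cache_states, blocks):
    """α(C): pointwise minimum (may) and maximum (must) of ages over C."""
    lower = {b: min(age(b) for age in cache_states) for b in blocks}
    upper = {b: max(age(b) for age in cache_states) for b in blocks}
    return lower, upper

def join(abs1, abs2):
    """Least upper bound in CacheInterval: weaken both bounds pointwise."""
    (l1, u1), (l2, u2) = abs1, abs2
    lower = {b: min(l1[b], l2[b]) for b in l1}   # may: smaller lower bound
    upper = {b: max(u1[b], u2[b]) for b in u1}   # must: larger upper bound
    return lower, upper
```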

In Fig. 3, the two abstractions are illustrated using the example of the set of cache states found in the two initial states of Fig. 2. In the concrete cache states, the \(i^{th}\) row contains the memory block with age i. In the abstract cache states, the \(i^{th}\) row contains all memory blocks with age bound i.

The abstract update functions for the lower and upper bounds closely resemble the concrete update function and can be proven correct rather easily:

$$\textit{up}_\textit{may}(\widehat{age}, b) := \lambda b' \in \mathcal{B}: {\left\{ \begin{array}{ll} 0 &: \textit{if } b' = b\\ \widehat{age}(b') &: \textit{if } \widehat{age}(b) < \widehat{age}(b')\\ \widehat{age}(b')+1 &: \textit{if } \widehat{age}(b) \ge \widehat{age}(b') \ne k\\ k &: \textit{if } \widehat{age}(b') = k \end{array}\right. }$$
$$\textit{up}_\textit{must}(\widehat{age}, b) := \lambda b' \in \mathcal{B}: {\left\{ \begin{array}{ll} 0 &: \textit{if } b' = b\\ \widehat{age}(b') &: \textit{if } \widehat{age}(b) \le \widehat{age}(b')\\ \widehat{age}(b')+1 &: \textit{if } \widehat{age}(b) > \widehat{age}(b') \end{array}\right. }$$
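In executable form, on the dictionary representation and the constant K from the sketches above (again our illustrative encoding):

```python
def up_may(lower, b):
    """Abstract update of the lower age bounds upon an access to block b."""
    lb = lower[b]
    def f(b2):
        if b2 == b:
            return 0
        if lower[b2] == K:
            return K               # definitely uncached: stays at k
        if lb < lower[b2]:
            return lower[b2]       # b2's lower bound is unaffected
        return lower[b2] + 1       # b2 may have aged by one
    return {b2: f(b2) for b2 in lower}

def up_must(upper, b):
    """Abstract update of the upper age bounds upon an access to block b."""
    ub = upper[b]
    def f(b2):
        if b2 == b:
            return 0
        if ub <= upper[b2]:
            return upper[b2]       # b at least as young as b2: unaffected
        return upper[b2] + 1       # b2 ages, possibly to k (uncached)
    return {b2: f(b2) for b2 in upper}
```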

Integration of Cache Analysis Within Microarchitectural Analysis. As discussed earlier, no good abstractions for sets of pipeline control states are known, and so they are analyzed using a power-set domain. How can the analysis of the pipeline be integrated with the analysis of the cache behavior? Due to their mutual dependencies, they need to be analyzed in a relational manner. The idea is to associate one abstract cache state with each possible pipeline state:

$$\begin{aligned} \widehat{\mu \textsc {Architecture}} := \textsc {PipelineControl}\rightarrow (\widehat{CacheInterval} \cup \{\bot \}), \end{aligned}$$

where \(\bot \) is used to express that the respective pipeline state is not possible. For simplicity, here, we omit branch predictors, which can be treated similarly to caches.
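A sketch of the corresponding join, relational in the pipeline state; the dictionary encoding, hashable pipeline-control states, and the name BOTTOM are assumptions for illustration (join is the cache-interval join sketched above):

```python
BOTTOM = None  # marks a pipeline state as not possible

def join_muarch(m1, m2):
    """Pointwise join over pipeline-control states: a pipeline state is
    possible after the join if it is possible in either argument; its
    cache information is the join of the associated cache intervals."""
    result = {}
    for p in set(m1) | set(m2):
        c1, c2 = m1.get(p, BOTTOM), m2.get(p, BOTTOM)
        if c1 is BOTTOM:
            result[p] = c2
        elif c2 is BOTTOM:
            result[p] = c1
        else:
            result[p] = join(c1, c2)
    return result
```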

Other Replacement Policies. We have seen a precise and efficient abstraction for caches with LRU replacement. For other replacement policies, similarly efficient abstractions have been developed [13–17, 22]. However, they do not reach the same level of precision, as these replacement policies are less predictable [23].

3 An Abstractable Pipeline

We have seen in the previous section that pipelines in modern high-performance microprocessors do not admit compact abstractions similar to abstract cache states. This forced the pipeline analysis to work with a power-set domain [29]. We now speculate about an abstractable instruction pipeline, i.e., an instruction pipeline that has a compact abstract domain and simple update and join functions, and thereby admits efficient and precise pipeline analysis. The goal is to have an abstract instruction pipeline that looks much like a concrete instruction pipeline. The concrete state of an instruction pipeline contains a set of instructions of a given program, each in one particular pipeline stage. In addition, the pipeline is connected to a set of queues, buffers, and functional units holding instructions to be fetched next, stores still to be executed, or operations under execution.

The progress in executing a given program, as captured by a particular concrete pipeline state, consists in

  • which instructions of the program have already retired from the pipeline,

  • how far other instructions of the program have progressed in the pipeline,

  • how many instructions to be executed next have been prefetched into prefetch queues,

  • how far operations dispatched by instructions currently in the pipeline have progressed in the pipelined functional units,

  • how many outstanding stores are still in the store buffer.

An abstract state of an instruction pipeline, as we envision it, should look much like a concrete pipeline state. However, the interpretation (concretization) is different:

  • any progress of an instruction in the abstract pipeline state is guaranteed progress of this instruction,

  • the contents of the abstract prefetch queue is a sequence of instructions guaranteed to have been prefetched,

  • the progress of dispatched operations in the functional units is guaranteed progress, and

  • the stores removed from the store buffer have definitely been performed; the ones still in the store buffer may still be outstanding.

This notion of progress is the basis for defining a partial order of the abstract pipeline domain.
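As a purely speculative sketch of such an order, assume that the guaranteed progress of each instruction (and of each queue, buffer, and functional-unit entry) can be encoded as a number, with retirement being maximal; the join then keeps, per instruction, the least progress guaranteed by all incoming states:

```python
def join_pipeline(p1, p2):
    """Join of two abstract pipeline states, each mapping the instructions
    of the program fragment to their guaranteed progress; only what both
    states guarantee survives the join."""
    return {instr: min(p1[instr], p2[instr]) for instr in p1}
```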

Let us discuss the implications for the pipeline architecture. It means that the pipeline should be an in-order pipeline, i.e., one without reordering of instructions. An out-of-order pipeline admits several dynamically selected schedules of a given sequence of instructions. The join function would be applied to the different schedules, resulting in an abstract pipeline state where each instruction is recorded with its slowest possible progress. Thus, the effect of out-of-order execution would be completely lost in the pipeline analysis.

A first step towards abstractable pipelines has been taken in [18]. We proposed a strictly in-order pipeline, i.e., one where no phase of a later instruction can block execution of a phase of an earlier instruction. This restriction excludes timing anomalies, which, contrary to common belief, are still possible in ordinary in-order pipelines. The simple pipeline design admits a compact abstract domain based on the maximally guaranteeable progress.

Pipeline analysis is typically performed on basic blocks. For each predecessor of the basic block under analysis, the analysis has produced a final abstract state. These final states need to be combined into an initial state by applying the join function of the abstract domain. Different predecessor basic blocks will consist of different instruction sequences, so that their final abstract states will have different subsequences of instructions in the pipeline. Joining the set of abstract final states would roughly correspond to flushing the pipeline, a costly approach if the pipeline is deep; the efficiency gained by overlapping execution across basic-block boundaries would always be lost. A way out of this dilemma could consist in delaying the join at the beginning of a basic block until the remaining instructions of the predecessor blocks have retired.

4 Conclusions

We have shown how the architectural basis of timing analysis influences the character of static timing analysis, which includes a particular instance of abstract interpretation as its most important component. In particular, we have described how two important transformations of the underlying complex cartesian-product domain were needed and successfully used to arrive at efficient analyses.